# If needed, install chunknorris
%pip install chunknorris -q
Note: you may need to restart the kernel to use updated packages.
In-depth .pdf file parsing¶
This tutorial shows you how to use the PdfParser efficiently.
from chunknorris.parsers import PdfParser
from chunknorris import set_log_level
from IPython.display import Markdown
set_log_level("info")
# Use the following block to parse a pdf file from a URL
# import requests
# r = requests.get("myurl.pdf")
# data = r.content
# parser = PdfParser()
# parsed_doc = parser.parse_string(data)
path_to_pdf = "./example_data/sample.pdf" # Mitel phones user manual, 265 pages.
# Instantiate the parser and parse the file (should take around 2s)
parser = PdfParser()
parsed_doc = parser.parse_file(path_to_pdf)
INFO - ChunkNorris - Function "get_tables" took 0.6218 seconds
INFO - ChunkNorris - Function "parse_file" took 1.5131 seconds
As we can see, parsing the 265-page file takes around 1.5s in total (about 0.6s for the tables and 0.9s for the text). This will vary with your hardware, the number of tables in the document, and whether OCR is needed.
If you are certain that your documents contain no tables, or do not need OCR, you can make parsing even faster with the following code:
parser = PdfParser(
    extract_tables=False,
    use_ocr="never",
)
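To check how much these options actually save on your own documents, you can time both configurations. The helper below is a minimal sketch using only the standard library; `timed` is a hypothetical name introduced here and is not part of chunknorris. The commented lines show how it could be applied to the parsers from this tutorial.

```python
import time
from typing import Any, Callable, Tuple


def timed(fn: Callable[..., Any], *args: Any, **kwargs: Any) -> Tuple[Any, float]:
    """Call fn(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


# Usage in this notebook (uncomment once PdfParser and path_to_pdf are defined):
# full_doc, t_full = timed(PdfParser().parse_file, path_to_pdf)
# fast_parser = PdfParser(extract_tables=False, use_ocr="never")
# fast_doc, t_fast = timed(fast_parser.parse_file, path_to_pdf)
# print(f"default: {t_full:.2f}s, no tables / no OCR: {t_fast:.2f}s")
```

Timings from a single run are noisy, so repeat the measurement a few times before drawing conclusions.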
Get the markdown string¶
# Let's view a sample of the document
md_string = parsed_doc.to_string()
Markdown(
    "___________\n[...]" + md_string[26400:34000] + "[...]\n_______________"
)
[...] the network and contains advanced configuration instructions. The Administrator Guide is intended for the System Administrator and can be downloaded from http://www.miteldocs.com .
2.3 Phone Features¶
The following table describes the IP Phone features:
| Feature | 6930 IP Phone | 6930w IP Phone |
|---|---|---|
| Display | 4.3” WQVGA (480x272) color TFT LCD display with brightness controls | 4.3" WQVGA (480x272) color TFT LCD display with brightness controls |
| Programmable Keys | 12 top softkeys | 12 top softkeys |
| Context Sensitive Keys | 5 context-sensitive bottom softkeys | 5 context-sensitive bottom softkeys |
| Ethernet | Built-in-two-port, 10/100/1000 Gigabit Ethernet switch - lets you share a connection with your computer | Built-in-two-port, 10/100/1000 Gigabit Ethernet switch - lets you share a connection with your computer 802.3az (EEE) |
| Power-over-Ethernet (PoE) - LAN | 802.3af, 802.3at | 802.3af, 802.3at |
| POE Class | Class 3 with auto change to 4 when PKMs are attached. | Class 3 with auto change to 4 when PKMs are attached. If an accessory is installed in the sidecar accessory port, the phone must be powered using a 48v power brick. |
| Bluetooth Support | Embedded Bluetooth 4.1 | Embedded Bluetooth 5.2 |
| External USB Port | 1x USB 2.0 (100mA) Host | 1x USB 2.0 (500mA) Host |
| PC Link / Mobile Link | Yes | Yes |
| 802.11n Wi-Fi | - | Yes (built-in) |
| Antimicrobial Plastics | No | No |
| DHSG Headset Support (H20/40) | Yes | Yes |
| Feature | 6930 IP Phone | 6930w IP Phone |
| :--- | :--- | :--- |
| USB Headset Support (H10/30/40) | Yes | Yes |
| S720 BT Speakerphone | Yes | Yes |
| Integrated DECT Headset | Yes | Yes |
| M695 Programmable Key Module | Yes (3 max) | Yes (3 max) |
| Press-and-hold Speed dial key configuration feature | Yes | Yes |
| Call Lines | Supports up to 24 call lines with LEDs | Supports up to 24 call lines with LEDs |
| AC power adapter | Yes. Sold separately | Yes. Sold separately |
| Supports Cordless Bluetooth handset | Yes | Yes |
Note : The 6930L and 6930Lt IP Phone variants do not contain Bluetooth circuitry and so do not support the related wireless functions. Any information within this document related to radio performance or functionality only relates to the fully functional 6930 IP Phone with Bluetooth capability.
2.4 Requirements¶
The 6930 requires the following environment:
- SIP-based IP PBX system or network installed and running with a SIP account created for the 6930
phone
- Access to a Trithroughl File Transfer Protocol (TFTP), File Transfer Protocol (FTP), Hypertext Transfer
Protocol (HTTP) server, or Hyper Text Transfer Protocol over Secure Sockets Layer (SSL) (HTTPS)
- Category 5/5e straight-through cabling (Category 6 straight-through cabling required for optimum
Gigabit Ethernet performance)
- Power source:
- For Ethernet networks that supply inline power to the phone (IEEE 802.3af) use an Ethernet cable to
connect from the phone directly to the network for power (no 48V AC power adapter required if using Power-over-Ethernet [PoE])
- For Ethernet networks that DO NOT supply power to the phone:
- Use only the GlobTek Inc. Limited Power Source [LPS] adapter model no. GT-41080-1848(sold
separately) to connect from the DC power port on the phone to a power source or
- Use a PoE power injector or a PoE switch
2.5 Installation and Setup¶
If your System Administrator has not already setup your 6930 phone, please refer to the Mitel 6930 Installation Guide for basic installation and physical setup information. For more advanced administration and configuration information, System Administrators should refer to the Mitel SIP IP Phones Administrator Guide .
IP Phone Keys 3¶
This chapter contains the following sections:
- Key Description
- Dialpad Keys
- E.164 support
Key Panel
The following sections describe the various 6930 phone key functions and how they can help you make and manage your calls and caller information.
| q | Handset | a | Goodbye Key |
|:---|:---|:---|:---|
| w | Speaker | s | Redial Key |
| e | Message Waiting Indicator (MWI) | d | Hold Key |
| r | Contacts Key | f | Mute Key |
| t | Call History Key | g | Speaker/Headset Key |
| y | Voicemail Key | h | Navigation Keys/Select Button |
| u | Settings Key | j | State-Sensitive Softkeys |
|:---|:---|:---|:---|
| i | Volume Control | k | Programmable Keys |
| o | Dialpad | l | LCD Screen |
3.1 Key Description¶
The following table describes the keys on the 6930:
| Key | Description |
|---|---|
| Directory key - Displays a list of your contacts. For more information, see Directory . | |
| Call History key - Call History key displays All folder list which includes the list of your missed, outgoing, and received calls. For more information, see Call History Key . | |
| Voicemail key - Provides access to your voicemail service (if configured). For more information, see Voicemail . | |
| Settings key- Provides services and static settings that allow you to customize your phone. For more information, see Customizing Your Phone . | |
| Volume controls - Adjusts the volume for the ringer, handset, headset, and speakerphone. Press the volume control keys while the phone is ringing to adjust the ringer volume. Pressing these keys during an active call adjusts the volume of the audio device being used (handset, headset, or speaker). | |
| Goodbye key- Ends an active call. The Goodbye key also exits an open list (such as Call History) and menus (such as the Static Settings menu) without saving changes. | |
| Key | Description |
| :--- | :--- |
| Redial key - Displays a list of your previously dialed calls. Pressing the Redial key twice redials the last dialed number displayed on the Home screen. For more information, see Outgoing Redial Key . | |
| Hold key - Places an active call on hold. To retrieve a held call, press the applicable Linekey. For more information, see Placing a Call on Hold . | |
| Mute key- Mutes the microphone so that your caller cannot hear you (the LED beside the key turns on when the microphone is on mute). For more information, see Mute . | |
| Speaker/Headset key - Transfers the active call to the speaker or headset, allowing handsfree use of the phone. For more information, see Audio . | |
| Navigation keys and select button - Multi-directional navigation keys that allow you to navigate through the phone’s User Interface (UI). Pressing the center Selectbutton sets options as well as performs actions such as dialing out from the Contacts or Call History. On the Home screen, the left and right navigation keys can be used to access the additional pages of programmable softkeys. For more information, see UI Navigation . | |
| Bottom softkeys- Five state-sensitive bottom softkeys that allow you to perform different functions during specific states (i.e. when the phone is an idle, connected, incoming, outgoing, or busy state). | |
| Top softkeys- 12 multi-function self-labeling keys that allow you to use up to 44 specific functions. For more information see Configuring Softkeys |
3.2 Dialpad Keys¶
The 6930 has a dialpad with digits from 0 through 9, a * key, and a # key. Keys 2 through 9 contain the letters of the alphabet. The 6930 phone dialpad includes the following:
| Dialpad Key | Description |
|---|---|
| 0 | Dials 0 Dials the Operator on a registered phone |
| 1 | Dials 1 |
[...]
As we can see, the output markdown looks fine. The titles have been recognized and converted to markdown headers, and the tables have been converted to markdown tables.
Concerning the headers, chunknorris will attempt to find the table of contents:
- In the file's metadata.
- If it is not available there, it will attempt to find it in the document using regular expressions.
- If it is still not available, it will infer the headers based on font size.
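The fallback order above can be pictured as a chain of strategies, where the first one that returns a non-empty result wins. The sketch below is purely illustrative and is not chunknorris's implementation: `first_non_empty`, `toc_from_text`, and the `TOC_LINE` pattern are hypothetical names, and the regex assumes TOC lines of the form `2.3 Phone Features ..... 12`.

```python
import re
from typing import Callable, Dict, List, Sequence

# Hypothetical pattern: section number, title, dot leaders, page number.
TOC_LINE = re.compile(
    r"^(?P<num>\d+(?:\.\d+)*)\s+(?P<title>.+?)\s*\.{2,}\s*(?P<page>\d+)$"
)


def toc_from_text(lines: Sequence[str]) -> List[Dict[str, object]]:
    """Collect the lines that look like table-of-contents entries."""
    entries: List[Dict[str, object]] = []
    for line in lines:
        match = TOC_LINE.match(line.strip())
        if match:
            entries.append({
                "text": f"{match['num']} {match['title']}",
                "page": int(match["page"]),
                # "2.3" has one dot, so it is a level-2 title.
                "level": match["num"].count(".") + 1,
            })
    return entries


def first_non_empty(strategies: Sequence[Callable[[], list]]) -> list:
    """Run each strategy in order and return the first non-empty result."""
    for strategy in strategies:
        result = strategy()
        if result:
            return result
    return []


page_lines = ["Contents", "1 Welcome .... 8", "2.3 Phone Features ........ 12"]
toc = first_non_empty([
    lambda: [],                         # stand-in for "nothing in the metadata"
    lambda: toc_from_text(page_lines),  # regex fallback
])
print(toc)
```

A font-size strategy would simply be a third callable appended to the list.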
Observe the table of contents¶
# Let's display the first elements of the table of contents:
print(f"Amount of TOC items detected : {len(parser.toc)}.\nSample : ")
parser.toc[:4]
Amount of TOC items detected : 139.
Sample :
[TocTitle(text='Contents', source='metadata', page=3, level=1, x_offset=None, source_page=None, found=True),
 TocTitle(text='What’s New', source='metadata', page=7, level=1, x_offset=None, source_page=None, found=True),
 TocTitle(text='Welcome', source='metadata', page=8, level=1, x_offset=None, source_page=None, found=True),
 TocTitle(text='About this Guide', source='metadata', page=10, level=2, x_offset=None, source_page=None, found=True)]
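With 139 items, the raw list is hard to scan. A small sketch like the following prints an indented outline instead. It relies only on the `text`, `page`, and `level` attributes visible in the output above; `TocItem` is a hypothetical stand-in so the example runs on its own, and `format_outline` is a name introduced here, not a chunknorris function.

```python
from typing import Iterable, NamedTuple


class TocItem(NamedTuple):
    """Hypothetical stand-in exposing the attributes used below."""
    text: str
    page: int
    level: int


def format_outline(items: Iterable, indent: str = "  ") -> str:
    """Render TOC items as a nested bullet list, one indent step per level."""
    return "\n".join(
        f"{indent * (item.level - 1)}- {item.text} (p. {item.page})"
        for item in items
    )


sample = [TocItem("Welcome", 8, 1), TocItem("About this Guide", 10, 2)]
print(format_outline(sample))
```

In the notebook, `print(format_outline(parser.toc))` should work the same way, assuming every item exposes those three attributes.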
Observe the detected tables¶
# You may also want to look at the tables detected
print(f"Amount of tables detected : {len(parser.tables)}")
table_idx = 2 # Choose the idx you want
Markdown(parser.tables[table_idx].to_markdown())
Amount of tables detected : 80
| Feature | 6930 IP Phone | 6930w IP Phone |
|---|---|---|
| USB Headset Support (H10/30/40) | Yes | Yes |
| S720 BT Speakerphone | Yes | Yes |
| Integrated DECT Headset | Yes | Yes |
| M695 Programmable Key Module | Yes (3 max) | Yes (3 max) |
| Press-and-hold Speed dial key configuration feature | Yes | Yes |
| Call Lines | Supports up to 24 call lines with LEDs | Supports up to 24 call lines with LEDs |
| AC power adapter | Yes. Sold separately | Yes. Sold separately |
| Supports Cordless Bluetooth handset | Yes | Yes |
To parse tables, chunknorris uses the lines visible on the pages. Note that tables whose structure is only implied (i.e. drawn without lines) may not be detected as tables. Most of the time this is not a problem, as such tables are rendered as text that LLMs still understand well.
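Once a table has been detected, its markdown form can be turned into plain Python rows for inspection or further processing. The helper below is a minimal sketch that works on any simple pipe table string, such as the one returned by `to_markdown()` above. `markdown_table_to_rows` and `split_row` are hypothetical names, and escaped pipes inside cells are not handled.

```python
import re
from typing import Dict, List

# Matches a delimiter cell such as "---", ":---" or ":---:".
SEPARATOR_CELL = re.compile(r"^:?-+:?$")


def split_row(line: str) -> List[str]:
    """Split one pipe-table line into stripped cell strings."""
    line = line.strip()
    if line.startswith("|"):
        line = line[1:]
    if line.endswith("|"):
        line = line[:-1]
    return [cell.strip() for cell in line.split("|")]


def markdown_table_to_rows(md: str) -> List[Dict[str, str]]:
    """Convert a markdown table into a list of {header: cell} dicts."""
    lines = [line for line in md.splitlines() if line.strip()]
    if not lines:
        return []
    header = split_row(lines[0])
    body = lines[1:]
    # Drop the delimiter row that follows the header, if present.
    if body and all(SEPARATOR_CELL.match(c) for c in split_row(body[0])):
        body = body[1:]
    return [dict(zip(header, split_row(line))) for line in body]


table_md = (
    "| Feature | 6930 IP Phone | 6930w IP Phone |\n"
    "|---|---|---|\n"
    "| 802.11n Wi-Fi | - | Yes (built-in) |"
)
rows = markdown_table_to_rows(table_md)
print(rows[0]["6930w IP Phone"])
```

Only the row directly under the header is treated as a delimiter, so a data row whose cells are all `-` is kept.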
Visualize the parsed pdf¶
You may want to "plot" the parsed pdf file. This can help you:
- debug
- visualize which tables have been detected
- visualize which elements have been detected as page headers/footers
- ...
# Let's plot the parsed pdf
parser.plot_pdf(page_start=75, page_end=78, dpi=100) # avoid displaying more than 20 pages at a time, as matplotlib uses a lot of RAM
The plotted pdf elements are:
- headers/footers in red
- tables in blue
- text in yellow, including:
  - text lines surrounded by thin black lines
  - text blocks (= groups of lines) surrounded by dashed black lines
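To inspect a long document while keeping each plot to 20 pages or fewer, you can iterate over page ranges. The generator below is a minimal sketch; `page_batches` is a hypothetical helper, and it yields half-open `(start, end)` ranges. Whether `plot_pdf` treats `page_end` as inclusive is not stated in this tutorial, so adjust the bounds if needed.

```python
from typing import Iterator, Tuple


def page_batches(
    n_pages: int, batch_size: int = 20, first_page: int = 0
) -> Iterator[Tuple[int, int]]:
    """Yield (start, end) page ranges covering n_pages, end excluded."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    last = first_page + n_pages
    for start in range(first_page, last, batch_size):
        yield start, min(start + batch_size, last)


print(list(page_batches(45)))

# Usage in this notebook (uncomment once the parser has parsed a file):
# for start, end in page_batches(265):
#     parser.plot_pdf(page_start=start, page_end=end, dpi=100)
```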