# If needed, install chunknorris
%pip install chunknorris -q
PDF file chunking
This notebook aims at showing a simple example of chunking for PDF files.
Note: For more details about the features of the PdfParser, see the tutorial In-depth .pdf file parsing.
Pipeline setup
from chunknorris.parsers import PdfParser
from chunknorris.chunkers import MarkdownChunker
from chunknorris.pipelines import PdfPipeline
from IPython.display import Markdown
Here we import the PdfPipeline.
Note that BasePipeline would work as well, but the PdfPipeline handles more advanced mechanics specific to PDF files. For example, it will:
- split by page any document derived from PowerPoint in which no table of contents was found.
- clean up cached objects to avoid memory leaks.
As the PdfParser outputs a MarkdownDoc, we use the MarkdownChunker to chunk the parsed document.
# Setup the pipe. Feel free to play with the parser and chunker's arguments.
pipeline = PdfPipeline(
    PdfParser(),
    MarkdownChunker(),
)
chunks = pipeline.chunk_file("./data/sample.pdf")
print(f"Got {len(chunks)} chunks !")
2024-12-17 17:09:ChunkNorris:INFO:Function "_create_spans" took 0.4265 seconds
2024-12-17 17:09:ChunkNorris:INFO:Function "get_tables" took 1.3100 seconds
2024-12-17 17:09:ChunkNorris:INFO:Function "parse_file" took 2.1334 seconds
2024-12-17 17:09:ChunkNorris:INFO:Function "chunk" took 0.0536 seconds
Got 217 chunks !
As we can see, chunking this 165-page document took around:
- 2.1s for parsing (including 1.3s for parsing the tables)
- 0.05s for chunking
That is around 2.2s in total, and it produced 217 chunks.
(Hardware: CPU - i7-13620H, 2.40 GHz, RAM - 16 GB)
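The timing breakdown can be recomputed from the log lines themselves. The sketch below is only an illustration and is not part of chunknorris: the `parse_timings` helper and its regular expression are assumptions inferred from the log format printed by the run above. Following the breakdown given here, the `parse_file` time is taken to already include table parsing, so only `parse_file` and `chunk` are summed.

```python
import re

# Log lines copied from the run above.
LOG = """\
2024-12-17 17:09:ChunkNorris:INFO:Function "_create_spans" took 0.4265 seconds
2024-12-17 17:09:ChunkNorris:INFO:Function "get_tables" took 1.3100 seconds
2024-12-17 17:09:ChunkNorris:INFO:Function "parse_file" took 2.1334 seconds
2024-12-17 17:09:ChunkNorris:INFO:Function "chunk" took 0.0536 seconds
"""

# Hypothetical pattern, inferred from the log lines above (not a chunknorris API).
TIMING_RE = re.compile(r'Function "(?P<name>\w+)" took (?P<seconds>[\d.]+) seconds')


def parse_timings(log: str) -> dict[str, float]:
    """Map each logged function name to its duration in seconds."""
    return {m["name"]: float(m["seconds"]) for m in TIMING_RE.finditer(log)}


timings = parse_timings(LOG)

# Sum only the two top-level steps: parsing (which includes tables) and chunking.
total = timings["parse_file"] + timings["chunk"]
print(f"parsing: {timings['parse_file']:.4f}s")
print(f"chunking: {timings['chunk']:.4f}s")
print(f"total: {total:.4f}s")
```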
View the chunks
To look at a chunk's text, you may use the Chunk.get_text() method.
Also, for PDF file chunking, each chunk records the pages it comes from.
for chunk_idx in [10, 11]:  # choose any
    chunk = chunks[chunk_idx]
    print(f"\n===== Start page: {chunk.start_page} --- End page: {chunk.end_page} ======\n")
    print(chunk.get_text())
===== Start page: 11 --- End page: 12 ======

# Mitel 6930/6930w SIP Phone User Guide

## **Welcome**

### 2.4 Requirements

The 6930 requires the following environment:
- SIP-based IP PBX system or network installed and running with a SIP account created for the 6930 phone
- Access to a Trithroughl File Transfer Protocol (TFTP), File Transfer Protocol (FTP), Hypertext Transfer Protocol (HTTP) server, or Hyper Text Transfer Protocol over Secure Sockets Layer (SSL) (HTTPS)
User Guide 6
- Ethernet/Fast Ethernet LAN (10/100 Mbps) (Gigabit Ethernet LAN [1000 Mbps] recommended)
- Category 5/5e straight-through cabling (Category 6 straight-through cabling required for optimum Gigabit Ethernet performance)
- Power source:
- For Ethernet networks that supply inline power to the phone (IEEE 802.3af) use an Ethernet cable to connect from the phone directly to the network for power (no 48V AC power adapter required if using Power-over-Ethernet [PoE])
- For Ethernet networks that DO NOT supply power to the phone:
- Use only the GlobTek Inc. Limited Power Source [LPS] adapter model no. GT-41080-1848(sold separately) to connect from the DC power port on the phone to a power source
or
- Use a PoE power injector or a PoE switch

===== Start page: 12 --- End page: 13 ======

# Mitel 6930/6930w SIP Phone User Guide

## **Welcome**

### 2.5 Installation and Setup

If your System Administrator has not already setup your 6930 phone, please refer to the **Mitel 6930 Installation Guide** for basic installation and physical setup information. For more advanced administration and configuration information, System Administrators should refer to the **Mitel SIP IP Phones Administrator Guide .**

**IP Phone Keys 3**

This chapter contains the following sections:
- Key Description
- Dialpad Keys
- E.164 support

**Key Panel**

The following sections describe the various 6930 phone key functions and how they can help you make and manage your calls and caller information.

| q    | Handset                         | a    | Goodbye Key                   |
|:-----|:--------------------------------|:-----|:------------------------------|
| w    | Speaker                         | s    | Redial Key                    |
| e    | Message Waiting Indicator (MWI) | d    | Hold Key                      |
| r    | Contacts Key                    | f    | Mute Key                      |
| t    | Call History Key                | g    | Speaker/Headset Key           |
| y    | Voicemail Key                   | h    | Navigation Keys/Select Button |

User Guide 8
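Because each chunk records the pages it spans, chunks can be looked up by page number. The sketch below is a minimal illustration under stated assumptions: `FakeChunk` is a stand-in defined here so the example runs without a PDF, and `chunks_on_page` is a hypothetical helper, not a chunknorris function. It relies only on the `start_page` and `end_page` attributes and the `get_text()` method used in this notebook.

```python
from dataclasses import dataclass


@dataclass
class FakeChunk:
    """Stand-in for a chunk; mimics only the attributes used in this notebook."""

    start_page: int
    end_page: int
    text: str

    def get_text(self) -> str:
        return self.text


def chunks_on_page(chunks, page):
    """Return the chunks whose page range includes `page` (hypothetical helper)."""
    return [c for c in chunks if c.start_page <= page <= c.end_page]


# Illustrative data only; page ranges mirror the two chunks printed above.
demo_chunks = [
    FakeChunk(11, 12, "### 2.4 Requirements"),
    FakeChunk(12, 13, "### 2.5 Installation and Setup"),
    FakeChunk(14, 14, "placeholder text"),
]

hits = chunks_on_page(demo_chunks, 12)
for chunk in hits:
    print(chunk.start_page, chunk.end_page, chunk.get_text())
```

With real pipeline output, the same helper would presumably be called as `chunks_on_page(chunks, 12)`.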
Save the chunks
In order to save the chunks in a JSON file, just use this:
pipeline.save_chunks(chunks, "mychunks.json")
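To check what was written, the file can be read back with the standard `json` module. This sketch deliberately makes no assumption about the structure chunknorris writes: `summarize_json` is a hypothetical helper that only reports the type and size of the top-level JSON value.

```python
import json
from pathlib import Path


def summarize_json(path):
    """Load a JSON file and describe its top-level value without assuming a schema."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    # Lists and objects have a length; scalars (string, number, null) do not.
    size = len(data) if isinstance(data, (list, dict)) else None
    return type(data).__name__, size


# Example usage (assumes "mychunks.json" exists, as saved above):
# kind, size = summarize_json("mychunks.json")
# print(f"Top-level {kind} with {size} entries")
```

Inspecting the top-level type first is a cautious way to discover the layout of the saved file before writing code that depends on specific keys.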