# If needed, install chunknorris
%pip install chunknorris -q
PDF file chunking
This notebook shows a simple example of chunking a PDF file.
Note: For more details about the functionalities of the PdfParser, see the tutorial "In-depth .pdf file parsing".
Pipeline setup
from chunknorris.parsers import PdfParser
from chunknorris.chunkers import MarkdownChunker
from chunknorris.pipelines import BasePipeline
from IPython.display import Markdown
# Set up the pipeline. Feel free to play with the parser's and chunker's arguments.
pipeline = BasePipeline(
    PdfParser(),
    MarkdownChunker(),
)
chunks = pipeline.chunk_file("./example_data/sample.pdf")
print(f"Got {len(chunks)} chunks !")
2025-07-01 15:39:ChunkNorris:INFO:Function "get_tables" took 1.4941 seconds
2025-07-01 15:39:ChunkNorris:INFO:Function "parse_file" took 2.3959 seconds
2025-07-01 15:39:ChunkNorris:INFO:Function "chunk" took 0.0955 seconds
Got 218 chunks !
As the logs show, chunking this 165-page document took roughly:
- 2.4 s for parsing (including about 1.5 s for extracting the tables)
- 0.1 s for chunking
--> around 2.5 s in total
It produced 218 chunks.
(Hardware: CPU i7-13620H, 2.40 GHz; RAM 16 GB)
View the chunks
To view a chunk's text, use the Chunk.get_text() method.
Also, for PDF files, each chunk records the pages it comes from (start_page and end_page).
for chunk_idx in [10, 11]:  # choose any indices
    chunk = chunks[chunk_idx]
    print(f"\n===== Start page: {chunk.start_page} --- End page: {chunk.end_page} ======\n")
    print(chunk.get_text())
===== Start page: 9 --- End page: 11 ======

## **Welcome**

### 2.3 Phone Features

The following table describes the IP Phone features:

User Guide 4

**Welcome**

| Feature | 6930 IP Phone | 6930w IP Phone |
|:---|:---|:---|
| Display | 4.3” WQVGA (480x272) color TFT LCD display with brightness controls | 4.3" WQVGA (480x272) color TFT LCD display with brightness controls |
| Programmable Keys | 12 top softkeys | 12 top softkeys |
| Context Sensitive Keys | 5 context-sensitive bottom softkeys | 5 context-sensitive bottom softkeys |
| Ethernet | Built-in-two-port, 10/100/1000 Gigabit Ethernet switch - lets you share a connection with your computer | Built-in-two-port, 10/100/1000 Gigabit Ethernet switch - lets you share a connection with your computer 802.3az (EEE) |
| Power-over-Ethernet (PoE) - LAN | 802.3af, 802.3at | 802.3af, 802.3at |
| POE Class | Class 3 with auto change to 4 when PKMs are attached. | Class 3 with auto change to 4 when PKMs are attached. If an accessory is installed in the sidecar accessory port, the phone must be powered using a 48v power brick. |
| Bluetooth Support | Embedded Bluetooth 4.1 | Embedded Bluetooth 5.2 |
| External USB Port | 1x USB 2.0 (100mA) Host | 1x USB 2.0 (500mA) Host |
| PC Link / Mobile Link | Yes | Yes |
| 802.11n Wi-Fi | - | Yes (built-in) |
| Antimicrobial Plastics | No | No |
| DHSG Headset Support (H20/40) | Yes | Yes |

5 User Guide

**Welcome**

| Feature | 6930 IP Phone | 6930w IP Phone |
|:---|:---|:---|
| USB Headset Support (H10/30/40) | Yes | Yes |
| S720 BT Speakerphone | Yes | Yes |
| Integrated DECT Headset | Yes | Yes |
| M695 Programmable Key Module | Yes (3 max) | Yes (3 max) |
| Press-and-hold Speed dial key configuration feature | Yes | Yes |
| Call Lines | Supports up to 24 call lines with LEDs | Supports up to 24 call lines with LEDs |
| AC power adapter | Yes. Sold separately | Yes. Sold separately |
| Supports Cordless Bluetooth handset | Yes | Yes |

**Note** : The **6930L** and **6930Lt** IP Phone variants do not contain Bluetooth circuitry and so do not support the related wireless functions. Any information within this document related to radio performance or functionality only relates to the fully functional 6930 IP Phone with Bluetooth capability.

===== Start page: 11 --- End page: 12 ======

## **Welcome**

### 2.4 Requirements

The 6930 requires the following environment:

- SIP-based IP PBX system or network installed and running with a SIP account created for the 6930 phone
- Access to a Trivial File Transfer Protocol (TFTP), File Transfer Protocol (FTP), Hypertext Transfer Protocol (HTTP) server, or Hyper Text Transfer Protocol over Secure Sockets Layer (SSL) (HTTPS)

User Guide 6

- Ethernet/Fast Ethernet LAN (10/100 Mbps) (Gigabit Ethernet LAN [1000 Mbps] recommended)
- Category 5/5e straight-through cabling (Category 6 straight-through cabling required for optimum Gigabit Ethernet performance)
- Power source:
- For Ethernet networks that supply inline power to the phone (IEEE 802.3af) use an Ethernet cable to connect from the phone directly to the network for power (no 48V AC power adapter required if using Power-over-Ethernet [PoE])
- For Ethernet networks that DO NOT supply power to the phone:
- Use only the GlobTek Inc. Limited Power Source [LPS] adapter model no. GT-41080-1848(sold separately) to connect from the DC power port on the phone to a power source or
- Use a PoE power injector or a PoE switch
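Because each chunk carries its page range, it is easy to look up which chunks cover a given page. The following is a minimal sketch, not part of chunknorris: `StubChunk` and `chunks_on_page` are hypothetical names introduced here for illustration, and the stub only mirrors the `start_page`, `end_page` and `get_text()` members used in this notebook, so the example runs without a PDF.

```python
from dataclasses import dataclass


@dataclass
class StubChunk:
    """Hypothetical stand-in for a chunk (not a chunknorris class)."""
    text: str
    start_page: int
    end_page: int

    def get_text(self) -> str:
        return self.text


def chunks_on_page(chunks, page):
    """Return the chunks whose start_page..end_page range includes `page`."""
    return [c for c in chunks if c.start_page <= page <= c.end_page]


# Demo data loosely modelled on the page ranges printed in this notebook.
demo = [
    StubChunk("2.3 Phone Features", 9, 11),
    StubChunk("2.4 Requirements", 11, 12),
    StubChunk("Some later section", 13, 13),
]

for chunk in chunks_on_page(demo, 11):
    print(chunk.start_page, chunk.end_page, chunk.get_text())
```

Assuming the real chunks expose the same attributes as used in the cell above, `chunks_on_page(chunks, 11)` should work on the pipeline's output as well.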
Save the chunks
To save the chunks to a JSON file, use:
pipeline.save_chunks(chunks, "mychunks.json")
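To check what was written, the file can be read back with the standard library. This is a small sketch under one assumption only: that the saved file is valid UTF-8 JSON. The layout of the saved data is not described in this notebook, so the code inspects the top-level type rather than relying on specific keys. `load_saved_chunks` is a hypothetical helper name, not a chunknorris function.

```python
import json
from pathlib import Path


def load_saved_chunks(path):
    """Parse a saved chunk file; assumes only that it is valid UTF-8 JSON."""
    with Path(path).open(encoding="utf-8") as f:
        return json.load(f)


saved_path = Path("mychunks.json")
if saved_path.exists():  # skip quietly if the file has not been saved yet
    data = load_saved_chunks(saved_path)
    # Inspect the top-level structure before relying on any particular schema.
    print(type(data).__name__, len(data))
```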