# If needed, install chunknorris
%pip install chunknorris -q
# utility functions
def print_chunking_result(chunks):
    print(f"\n======= Got {len(chunks)} chunks ! ========\n")
    for i, chunk in enumerate(chunks):
        print(f"--------------------- chunk {i} ---------------------")
        print(chunk.get_text())
# Influence chunking behavior
One may want to influence how the chunks are built by passing parameters to the MarkdownChunker. This notebook intends to give a feeling of "which parameter does what". Happy chunking! 🔪
from chunknorris.parsers import MarkdownParser # <- you can use any parser you want as long as it is compatible with MarkdownChunker
from chunknorris.chunkers import MarkdownChunker # <- tutorial is essentially about this guy
from chunknorris.pipelines import BasePipeline
from IPython.display import Markdown
For this tutorial we will consider this simple Markdown:
md_string = """
# This is header 1
This is some introduction text after header 1
## This is SUBheader 1
This is some introduction text after header 1
### This is an h3 header
This is the content of the h3 subsection
### This is ANOTHER h3 header
This is the other content of the h3 subsection
"""
Markdown(md_string)
# Pipeline with default parameters:
pipeline = BasePipeline(MarkdownParser(), MarkdownChunker())
chunks = pipeline.chunk_string(md_string)
print_chunking_result(chunks)
2024-12-17 10:31:ChunkNorris:INFO:Function "chunk" took 0.0002 seconds
======= Got 1 chunks ! ========

--------------------- chunk 0 ---------------------
# This is header 1
This is some introduction text after header 1
## This is SUBheader 1
This is some introduction text after header 1
### This is an h3 header
This is the content of the h3 subsection
### This is ANOTHER h3 header
This is the other content of the h3 subsection
## Impact of each argument
### max_chunk_word_count
We can see that we got only one chunk, despite the presence of headers! This is due to MarkdownChunker's max_chunk_word_count parameter. Its default value is 200, meaning that the chunker will try to make chunks of approximately 200 words. Only if a chunk is bigger than this will it be split using its headers.
This may sound weird, but embedding models are still sensitive to the length of the text. Consequently, "I have a dog" may be more similar to "I love electronic music" than to a whole paragraph about dogs. 💡 By ensuring the resulting chunks are of similar sizes, we minimize the influence of the chunk's size on the embedding and make it more about the chunk's meaning.
Let's play with max_chunk_word_count a bit:
chunker = MarkdownChunker(
    max_chunk_word_count=50,
    min_chunk_word_count=0  # we set this to 0 because the chunker automatically discards chunks below 15 words by default
)
pipeline = BasePipeline(MarkdownParser(), chunker)
chunks = pipeline.chunk_string(md_string)
print_chunking_result(chunks)
2024-12-17 10:32:ChunkNorris:INFO:Function "chunk" took 0.0003 seconds
======= Got 2 chunks ! ========

--------------------- chunk 0 ---------------------
# This is header 1
This is some introduction text after header 1
--------------------- chunk 1 ---------------------
# This is header 1
## This is SUBheader 1
This is some introduction text after header 1
### This is an h3 header
This is the content of the h3 subsection
### This is ANOTHER h3 header
This is the other content of the h3 subsection
As we can see, with max_chunk_word_count=50 the introduction part is split from the rest to make sure both chunks are below 50 words. We could even decrease that number to split the second chunk even more.
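To check this, we can count the words in each chunk. A minimal sketch, assuming a simple whitespace split is a close enough approximation of the chunker's internal word counting:

# Approximate word count of each chunk (whitespace split)
for i, chunk in enumerate(chunks):
    print(f"chunk {i}: ~{len(chunk.get_text().split())} words")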
💡 Pro tip: Want to make sure all the headers are used to build chunks? Just set max_chunk_word_count=0 and the chunker will aim for chunks of 0 words, hence using every available header.
chunker = MarkdownChunker(
    max_chunk_word_count=0,
    min_chunk_word_count=0  # we set this to 0 because the chunker automatically discards chunks below 15 words by default
)
pipeline = BasePipeline(MarkdownParser(), chunker)
chunks = pipeline.chunk_string(md_string)
print_chunking_result(chunks)
2024-12-17 10:36:ChunkNorris:INFO:Function "chunk" took 0.0003 seconds
======= Got 4 chunks ! ========

--------------------- chunk 0 ---------------------
# This is header 1
This is some introduction text after header 1
--------------------- chunk 1 ---------------------
# This is header 1
## This is SUBheader 1
This is some introduction text after header 1
--------------------- chunk 2 ---------------------
# This is header 1
## This is SUBheader 1
### This is an h3 header
This is the content of the h3 subsection
--------------------- chunk 3 ---------------------
# This is header 1
## This is SUBheader 1
### This is ANOTHER h3 header
This is the other content of the h3 subsection
### max_headers_to_use
You may want to get chunks as small as possible, but avoid using header levels that are too low. Indeed, it is common for list items in HTML to be h5 headers, and you wouldn't want each item in the list to be a chunk.
By default, MarkdownChunker will only use headers up to h4, and won't use h5 and h6. But let's change this and see how it affects the behavior.
chunker = MarkdownChunker(
    max_chunk_word_count=0,
    max_headers_to_use="h2",  # <- only use h1 and h2 to split chunks
    min_chunk_word_count=0  # we set this to 0 because the chunker automatically discards chunks below 15 words by default
)
pipeline = BasePipeline(MarkdownParser(), chunker)
chunks = pipeline.chunk_string(md_string)
print_chunking_result(chunks)
2024-12-17 10:41:ChunkNorris:INFO:Function "chunk" took 0.0002 seconds
======= Got 2 chunks ! ========

--------------------- chunk 0 ---------------------
# This is header 1
This is some introduction text after header 1
--------------------- chunk 1 ---------------------
# This is header 1
## This is SUBheader 1
This is some introduction text after header 1
### This is an h3 header
This is the content of the h3 subsection
### This is ANOTHER h3 header
This is the other content of the h3 subsection
Now we only have 2 chunks, as h3 headers were not allowed to be used to split the chunks.
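If you hesitate between levels, you can sweep over a few values and compare the number of chunks you get. A quick sketch, assuming max_headers_to_use accepts any level from "h1" to "h4" (only "h2" is shown above; the default is "h4"):

# Compare how many chunks each allowed header level yields
for level in ["h1", "h2", "h3", "h4"]:
    chunker = MarkdownChunker(
        max_chunk_word_count=0,
        max_headers_to_use=level,
        min_chunk_word_count=0
    )
    chunks = BasePipeline(MarkdownParser(), chunker).chunk_string(md_string)
    print(f"max_headers_to_use={level!r} -> {len(chunks)} chunks")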
### hard_max_chunk_word_count
Now, we saw that forbidding the MarkdownChunker to use headers forces chunks to be bigger than what the max_chunk_word_count parameter requests. The same happens if no header is available in the markdown: the chunker will try to make chunks of the requested size, but without headers to split on it will leave the chunk "as is".
This may lead to veeeery big chunks, and we don't want that (most embedding APIs will trigger an error if your chunk is bigger than the model's context window).
That's when hard_max_chunk_word_count comes into play. This parameter lets you set a kind of hard limit on chunk size: chunks bigger than the limit will be split to fit it. Why "kind of"? Because MarkdownChunker will avoid splitting in the middle of a code block or a table, so you may still get resulting chunks that are slightly bigger than the limit.
chunker = MarkdownChunker(
    max_chunk_word_count=0,
    max_headers_to_use="h2",  # <- only use h1 and h2 to split chunks
    hard_max_chunk_word_count=20,
    min_chunk_word_count=0  # we set this to 0 because the chunker automatically discards chunks below 15 words by default
)
pipeline = BasePipeline(MarkdownParser(), chunker)
chunks = pipeline.chunk_string(md_string)
print_chunking_result(chunks)
2024-12-17 10:52:ChunkNorris:INFO:Function "chunk" took 0.0004 seconds
======= Got 3 chunks ! ========

--------------------- chunk 0 ---------------------
# This is header 1
This is some introduction text after header 1
--------------------- chunk 1 ---------------------
# This is header 1
## This is SUBheader 1
This is some introduction text after header 1
### This is an h3 header
This is the content of the h3 subsection
--------------------- chunk 2 ---------------------
# This is header 1
## This is SUBheader 1
### This is ANOTHER h3 header
This is the other content of the h3 subsection
There we go, the second chunk from before has been split in 2!
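To see the "kind of" part of the limit in action, here is a minimal sketch feeding the chunker a headerless document whose only content is a code block far longer than the limit. Per the behavior described above, the block should not be cut in the middle (the exact outcome may vary between chunknorris versions):

# A headerless markdown document whose only content is a ~90-word code block
md_with_code = "```python\n" + "\n".join(
    f"value_{i} = 'some fairly long line of example code'" for i in range(10)
) + "\n```\n"

chunker = MarkdownChunker(
    hard_max_chunk_word_count=20,
    min_chunk_word_count=0
)
chunks = BasePipeline(MarkdownParser(), chunker).chunk_string(md_with_code)
print_chunking_result(chunks)  # the code block should remain in a single chunk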
### min_chunk_word_count
When all this chunking is happening, you may have small chunks remaining. This is frequent with web scraping for example, where some pages are empty or just have a title. As said before, these small chunks can be a pain: in the context of query-based information retrieval, they tend to come up because their size is similar to that of the queries.
To avoid having these small chunks lost in a database full of great chunks, the min_chunk_word_count argument is here. Chunks with fewer words than this limit are automatically discarded. The default value is 15, but you may set it to 0 if you absolutely wish to keep every chunk.
chunker = MarkdownChunker(
    max_chunk_word_count=0,
    max_headers_to_use="h2",
    hard_max_chunk_word_count=20,
    min_chunk_word_count=10  # discard chunks with fewer than 10 words (excluding headers)
)
pipeline = BasePipeline(MarkdownParser(), chunker)
chunks = pipeline.chunk_string(md_string)
print_chunking_result(chunks)
2024-12-17 10:59:ChunkNorris:INFO:Function "chunk" took 0.0003 seconds
======= Got 2 chunks ! ========

--------------------- chunk 0 ---------------------
# This is header 1
## This is SUBheader 1
This is some introduction text after header 1
### This is an h3 header
This is the content of the h3 subsection
--------------------- chunk 1 ---------------------
# This is header 1
## This is SUBheader 1
### This is ANOTHER h3 header
This is the other content of the h3 subsection
There you go. The first chunk from before has been discarded.
## Work with the TOC tree
To build the chunks according to the markdown headers, the MarkdownChunker uses a TocTree object. The TocTree represents the table of contents, along with the content of each part.
Whether it is for debugging, or because you want to implement some custom chunking strategy, you may want to have a look at the table of contents that has been parsed from your document.
# Get the MarkdownDoc that can be fed to the chunker
parser = MarkdownParser()
md_doc = parser.parse_string(md_string)
# Now, get the TocTree
chunker = MarkdownChunker()
toc_tree = chunker.get_toc_tree(md_doc.content)
# Save the TocTree to have a look at it
toc_tree.to_json(output_path="toc_tree.json")
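If you would rather inspect it directly in the notebook, here is a small sketch that reads the saved file back (the exact JSON structure may vary between chunknorris versions):

import json

# Load and pretty-print the saved table of contents
with open("toc_tree.json", encoding="utf-8") as f:
    toc = json.load(f)
print(json.dumps(toc, indent=2, ensure_ascii=False))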
## Conclusion
🧪 Feel free to experiment with these parameters to get the chunks that suit your data.
If this still seems a bit obscure, don't worry: the default parameters have been tested on multiple custom datasets and have proven to work well 😉!