# If needed, install chunknorris
%pip install chunknorris -q
# utility functions
def print_chunking_result(chunks):
    print(f"\n======= Got {len(chunks)} chunks ! ========\n")
    for i, chunk in enumerate(chunks):
        print(f"--------------------- chunk {i} ---------------------")
        print(chunk.get_text())
# Influence chunking behavior
One may want to influence how the chunks are built by passing parameters to the MarkdownChunker. This notebook intends to give a feeling of "which parameter does what". Happy chunking! 🔪
from chunknorris.parsers import MarkdownParser # <- you can use any parser you want as long as it is compatible with MarkdownChunker
from chunknorris.chunkers import MarkdownChunker # <- tutorial is essentially about this guy
from chunknorris.pipelines import BasePipeline
from IPython.display import Markdown
For this tutorial we will consider this simple Markdown:
md_string = """
# This is header 1
This is some introduction text after header 1
## This is SUBheader 1
This is some introduction text after header 1
### This is an h3 header
This is the content of the h3 subsection
### This is ANOTHER h3 header
This is the other content of the h3 subsection
"""
Markdown(md_string)
# Pipeline with default parameters:
pipeline = BasePipeline(MarkdownParser(), MarkdownChunker())
chunks = pipeline.chunk_string(md_string)
print_chunking_result(chunks)
2024-12-17 10:31:ChunkNorris:INFO:Function "chunk" took 0.0002 seconds
======= Got 1 chunks ! ========

--------------------- chunk 0 ---------------------
# This is header 1
This is some introduction text after header 1
## This is SUBheader 1
This is some introduction text after header 1
### This is an h3 header
This is the content of the h3 subsection
### This is ANOTHER h3 header
This is the other content of the h3 subsection
## Impact of each argument
### max_chunk_word_count
We can see that we got only one chunk, despite the presence of headers! This is due to MarkdownChunker's max_chunk_word_count parameter. Its default value is 200, meaning that the chunker will try to make chunks of approximately 200 words. Only if a chunk is bigger than this will it be split using its headers.
This may sound weird, but embedding models are still sensitive to the length of the text. Consequently, "I have a dog" may be more similar to "I love electronic music" than to a whole paragraph about dogs. 💡 By ensuring the resulting chunks are of similar sizes, we minimize the influence of the chunk's size on the embedding and make it more about the chunk's meaning.
Let's play with max_chunk_word_count a bit:
chunker = MarkdownChunker(
    max_chunk_word_count=50,
    min_chunk_word_count=0  # we set this to 0 because the chunker automatically discards chunks below 15 words by default
)
pipeline = BasePipeline(MarkdownParser(), chunker)
chunks = pipeline.chunk_string(md_string)
print_chunking_result(chunks)
2024-12-17 10:32:ChunkNorris:INFO:Function "chunk" took 0.0003 seconds
======= Got 2 chunks ! ========

--------------------- chunk 0 ---------------------
# This is header 1
This is some introduction text after header 1
--------------------- chunk 1 ---------------------
# This is header 1
## This is SUBheader 1
This is some introduction text after header 1
### This is an h3 header
This is the content of the h3 subsection
### This is ANOTHER h3 header
This is the other content of the h3 subsection
As we can see, with max_chunk_word_count=50 the introduction part is split from the rest to make sure both chunks are below 50 words. We could even decrease that number to split the second chunk even more.
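To check this, we can count the words in each chunk. A minimal sketch, assuming a simple whitespace split is a close enough approximation of the chunker's internal word counting:

# Approximate word count of each chunk (whitespace split)
for i, chunk in enumerate(chunks):
    print(f"chunk {i}: ~{len(chunk.get_text().split())} words")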
💡 Pro tip: Want to make sure all the headers are used to build chunks? Just set max_chunk_word_count=0 and the chunker will aim for chunks of 0 words, hence using every available header.
chunker = MarkdownChunker(
    max_chunk_word_count=0,
    min_chunk_word_count=0  # we set this to 0 because the chunker automatically discards chunks below 15 words by default
)
pipeline = BasePipeline(MarkdownParser(), chunker)
chunks = pipeline.chunk_string(md_string)
print_chunking_result(chunks)
2024-12-17 10:36:ChunkNorris:INFO:Function "chunk" took 0.0003 seconds
======= Got 4 chunks ! ========

--------------------- chunk 0 ---------------------
# This is header 1
This is some introduction text after header 1
--------------------- chunk 1 ---------------------
# This is header 1
## This is SUBheader 1
This is some introduction text after header 1
--------------------- chunk 2 ---------------------
# This is header 1
## This is SUBheader 1
### This is an h3 header
This is the content of the h3 subsection
--------------------- chunk 3 ---------------------
# This is header 1
## This is SUBheader 1
### This is ANOTHER h3 header
This is the other content of the h3 subsection
### max_headers_to_use
You may want to get chunks as small as possible, but avoid using header levels that are too low. Indeed, it is common for list items in HTML to be h5 headers, and you wouldn't want each item in the list to be a chunk.
By default, MarkdownChunker will only use headers up to h4, and won't use h5 and h6. But let's change this and see how it affects the behavior.
chunker = MarkdownChunker(
    max_chunk_word_count=0,
    max_headers_to_use="h2",  # <- only use h1 and h2 to split chunks
    min_chunk_word_count=0  # we set this to 0 because the chunker automatically discards chunks below 15 words by default
)
pipeline = BasePipeline(MarkdownParser(), chunker)
chunks = pipeline.chunk_string(md_string)
print_chunking_result(chunks)
2024-12-17 10:41:ChunkNorris:INFO:Function "chunk" took 0.0002 seconds
======= Got 2 chunks ! ========

--------------------- chunk 0 ---------------------
# This is header 1
This is some introduction text after header 1
--------------------- chunk 1 ---------------------
# This is header 1
## This is SUBheader 1
This is some introduction text after header 1
### This is an h3 header
This is the content of the h3 subsection
### This is ANOTHER h3 header
This is the other content of the h3 subsection
Now we only have 2 chunks, as h3 headers were not allowed to be used to split the chunks.
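If you hesitate between levels, you can sweep over a few values and compare the number of chunks you get. A quick sketch, assuming max_headers_to_use accepts any level from "h1" to "h4" (only "h2" is shown above; the default is "h4"):

# Compare how many chunks each allowed header level yields
for level in ["h1", "h2", "h3", "h4"]:
    chunker = MarkdownChunker(
        max_chunk_word_count=0,
        max_headers_to_use=level,
        min_chunk_word_count=0
    )
    chunks = BasePipeline(MarkdownParser(), chunker).chunk_string(md_string)
    print(f"max_headers_to_use={level!r} -> {len(chunks)} chunks")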
### hard_max_chunk_word_count
Now, we saw that forbidding the MarkdownChunker to use headers forces chunks to be bigger than what the max_chunk_word_count parameter requests. The same happens if no header is available in the markdown: the chunker will try to make chunks of the requested size, but without headers to split on it will leave the chunk "as is".
This may lead to veeeery big chunks, and we don't want that (most embedding APIs will trigger an error if your chunk is bigger than the model's context window).
That's when hard_max_chunk_word_count comes into play. This parameter lets you set a kind of hard limit on chunk size: chunks bigger than the limit will be split to fit it. Why "kind of"? Because MarkdownChunker will avoid splitting in the middle of a code block or a table, so you may still get resulting chunks that are slightly bigger than the limit.
chunker = MarkdownChunker(
    max_chunk_word_count=0,
    max_headers_to_use="h2",  # <- only use h1 and h2 to split chunks
    hard_max_chunk_word_count=20,
    min_chunk_word_count=0  # we set this to 0 because the chunker automatically discards chunks below 15 words by default
)
pipeline = BasePipeline(MarkdownParser(), chunker)
chunks = pipeline.chunk_string(md_string)
print_chunking_result(chunks)
2024-12-17 10:52:ChunkNorris:INFO:Function "chunk" took 0.0004 seconds
======= Got 3 chunks ! ========

--------------------- chunk 0 ---------------------
# This is header 1
This is some introduction text after header 1
--------------------- chunk 1 ---------------------
# This is header 1
## This is SUBheader 1
This is some introduction text after header 1
### This is an h3 header
This is the content of the h3 subsection
--------------------- chunk 2 ---------------------
# This is header 1
## This is SUBheader 1
### This is ANOTHER h3 header
This is the other content of the h3 subsection
There we go, the second chunk from before has been split in 2!
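To see the "kind of" part of the limit in action, here is a minimal sketch feeding the chunker a headerless document whose only content is a code block far longer than the limit. Per the behavior described above, the block should not be cut in the middle (the exact outcome may vary between chunknorris versions):

# A headerless markdown document whose only content is a ~90-word code block
md_with_code = "```python\n" + "\n".join(
    f"value_{i} = 'some fairly long line of example code'" for i in range(10)
) + "\n```\n"

chunker = MarkdownChunker(
    hard_max_chunk_word_count=20,
    min_chunk_word_count=0
)
chunks = BasePipeline(MarkdownParser(), chunker).chunk_string(md_with_code)
print_chunking_result(chunks)  # the code block should remain in a single chunk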
### min_chunk_word_count
When all this chunking is happening, you may have small chunks remaining. This is frequent with web scraping for example, where some pages are empty or just have a title. As said before, these small chunks can be a pain: in the context of query-based information retrieval, they tend to come up because their size is similar to that of the queries.
To avoid having these small chunks lost in a database full of great chunks, the min_chunk_word_count argument is here. Chunks with fewer words than this limit are automatically discarded. The default value is 15, but you may set it to 0 if you absolutely wish to keep every chunk.
chunker = MarkdownChunker(
    max_chunk_word_count=0,
    max_headers_to_use="h2",
    hard_max_chunk_word_count=20,
    min_chunk_word_count=10  # discard chunks with fewer than 10 words (excluding headers)
)
pipeline = BasePipeline(MarkdownParser(), chunker)
chunks = pipeline.chunk_string(md_string)
print_chunking_result(chunks)
2024-12-17 10:59:ChunkNorris:INFO:Function "chunk" took 0.0003 seconds
======= Got 2 chunks ! ========

--------------------- chunk 0 ---------------------
# This is header 1
## This is SUBheader 1
This is some introduction text after header 1
### This is an h3 header
This is the content of the h3 subsection
--------------------- chunk 1 ---------------------
# This is header 1
## This is SUBheader 1
### This is ANOTHER h3 header
This is the other content of the h3 subsection
There you go. The first chunk from before has been discarded.
## Work with the TOC tree
To build the chunks according to the markdown headers, the MarkdownChunker uses a TocTree object. The TocTree represents the table of contents, along with the content of each part.
Whether it is for debugging, or because you want to implement some custom chunking strategy, you may want to have a look at the table of contents that has been parsed from your document.
# Get the MarkdownDoc that can be fed to the chunker
parser = MarkdownParser()
md_doc = parser.parse_string(md_string)
# Now, get the TocTree
chunker = MarkdownChunker()
toc_tree = chunker.get_toc_tree(md_doc.content)
# Save the TocTree to have a look at it
toc_tree.to_json(output_path="toc_tree.json")
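If you would rather inspect it directly in the notebook, here is a small sketch that reads the saved file back (the exact JSON structure may vary between chunknorris versions):

import json

# Load and pretty-print the saved table of contents
with open("toc_tree.json", encoding="utf-8") as f:
    toc = json.load(f)
print(json.dumps(toc, indent=2, ensure_ascii=False))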
## Conclusion
🧪 Feel free to experiment with these parameters to get the chunks that suit your data.
If this still seems a bit obscure, don't worry: the default parameters have been tested on multiple custom datasets and have proven to work well 😉!