Reference for MarkdownChunker
The MarkdownChunker chunks documents that have been parsed into a MarkdownDoc object.
Bases: AbstractChunker
__init__(*, max_headers_to_use='h4', max_chunk_word_count=200, hard_max_chunk_word_count=400, min_chunk_word_count=15, hard_max_chunk_token_count=None, tokenizer=None)
Initialize a Markdown chunker
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| max_headers_to_use | MaxHeadersToUse | The maximum header level to consider (inclusive). Headers deeper than this level won't be used to split chunks. For example, if 'h4' is set, then 'h5' and 'h6' headers won't be used. Must be a string of the form 'hx', where x is the header level, from 1 to 6. | 'h4' |
| max_chunk_word_count | int | The maximum size of a chunk (in words). It is a SOFT limit, meaning that a chunk bigger than this is split further only if lower-level headers are available. | 200 |
| hard_max_chunk_word_count | int | The true maximum size of a chunk (in words). It is a HARD limit, meaning that chunks bigger than this limit will be split into subchunks. | 400 |
| min_chunk_word_count | int | The minimum size of a chunk (in words). Chunks smaller than this will be discarded. | 15 |
| hard_max_chunk_token_count | None \| int | The true maximum size of a chunk (in tokens). If None, no token-based splitting is done. It is a HARD limit, meaning that chunks bigger than this limit will be split into subchunks of roughly equal token count. | None |
| tokenizer | SupportsEncode \| None | The tokenizer to use. Can be an instance of any class that has an 'encode' method, such as a tiktoken encoding. | None |
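To make the max_headers_to_use rule concrete, here is a minimal sketch of how an 'hx' value could be validated and used to decide whether a line counts as a splitting header. The helper names header_level and is_split_header are hypothetical and not part of the library, and the sketch assumes ATX-style headers (lines starting with '#').

```python
import re


def header_level(max_headers_to_use: str) -> int:
    """Convert an 'hx' string (x from 1 to 6) into its integer level."""
    match = re.fullmatch(r"h([1-6])", max_headers_to_use)
    if match is None:
        raise ValueError("expected a string of the form 'hx' with x from 1 to 6")
    return int(match.group(1))


def is_split_header(line: str, max_headers_to_use: str = "h4") -> bool:
    """Return True if the line is a header shallow enough to split on.

    Assumption: headers are ATX-style, i.e. 1 to 6 '#' followed by whitespace.
    """
    match = re.match(r"(#{1,6})\s", line)
    if match is None:
        return False
    # The limit is inclusive: with 'h4', levels 1 to 4 are used.
    return len(match.group(1)) <= header_level(max_headers_to_use)


if __name__ == "__main__":
    for line in ["# Title", "#### Detail", "##### Too deep", "plain text"]:
        print(repr(line), is_split_header(line, "h4"))
```

With 'h4', the 'h5' line is treated as ordinary content, which matches the example given in the parameter description.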
build_chunks(toc_tree_element, parent_headers=None)
Uses the toc tree to build the chunks. Method:
- if the section (title + content + all descendants) fits within max_chunk_word_count:
  - save it as a single chunk
- otherwise:
  - save the section's own content as a chunk (if non-empty)
  - subdivide into children recursively
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| toc_tree_element | TocTree | the TocTree for which the chunk should be built. | required |
| parent_headers | list[MarkdownLine] \| None | ancestor header lines. Defaults to None (root call). | None |
Returns:
| Type | Description |
|---|---|
| list[Chunk] | list of chunks. |
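The recursion described for build_chunks can be sketched on a toy tree. This is an illustration under stated assumptions, not the library's implementation: Section is a hypothetical stand-in for TocTree, chunks are plain strings rather than Chunk objects, words are counted by whitespace splitting, and prepending ancestor titles to each chunk is assumed from the parent_headers parameter.

```python
from dataclasses import dataclass, field


@dataclass
class Section:
    """Hypothetical stand-in for a TocTree node."""
    title: str
    content: str = ""
    children: list["Section"] = field(default_factory=list)


def full_text(section: Section) -> str:
    """Title + own content + all descendants, as one string."""
    parts = [section.title, section.content]
    parts += [full_text(child) for child in section.children]
    return "\n".join(part for part in parts if part)


def build_chunks(section, max_chunk_word_count, parent_titles=None):
    parent_titles = parent_titles or []
    text = full_text(section)
    # Case 1: the whole section fits, so it becomes a single chunk.
    if len(text.split()) <= max_chunk_word_count:
        return ["\n".join(parent_titles + [text])]
    # Case 2: keep the section's own content (if any), then recurse.
    headers = parent_titles + [section.title]
    chunks = []
    if section.content.strip():
        chunks.append("\n".join(headers + [section.content]))
    for child in section.children:
        chunks.extend(build_chunks(child, max_chunk_word_count, headers))
    return chunks


if __name__ == "__main__":
    doc = Section("# Doc", "intro words here", [
        Section("## A", "alpha " * 5),
        Section("## B", "beta " * 5),
    ])
    for chunk in build_chunks(doc, max_chunk_word_count=10):
        print(repr(chunk))
```

Note that a section without children that exceeds the soft limit is still emitted as one oversized chunk here; in the reference above, that case is what the hard limits are for.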
chunk(content)
Chunks a parsed Markdown document.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| content | MarkdownDoc | the markdown document to chunk. Might be the output of a chunknorris.Parser. | required |
Returns:
| Type | Description |
|---|---|
| list[Chunk] | the chunks. |
get_chunks(toc_tree)
Wrapper that builds the chunks' texts, checks that they fit the size limits, and replaces link formatting.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| toc_tree | TocTree | the toc tree of the document. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| Chunks | list[Chunk] | the chunks, with their text formatted. |
get_toc_tree(md_lines)
Builds the table of contents tree based on the headers.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| md_lines | list[MarkdownLine] | the markdown lines. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| TocTree | TocTree | the table of contents. |
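One common way to build such a tree from header lines is a stack of open sections. The sketch below illustrates that general technique and is not the library's code: Node is a hypothetical stand-in for TocTree, lines are plain strings rather than MarkdownLine objects, and only ATX-style headers are recognized. It also does not skip '#' lines inside fenced code blocks, which a real parser would have to handle.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for a TocTree node."""
    level: int
    title: str
    lines: list[str] = field(default_factory=list)
    children: list["Node"] = field(default_factory=list)


def build_toc_tree(md_lines: list[str], max_level: int = 4) -> Node:
    root = Node(level=0, title="")
    stack = [root]  # path from the root to the section currently open
    for line in md_lines:
        match = re.match(r"(#{1,6})\s+(.*)", line)
        if match and len(match.group(1)) <= max_level:
            level = len(match.group(1))
            # Close every open section at the same or a deeper level.
            while stack[-1].level >= level:
                stack.pop()
            node = Node(level=level, title=match.group(2).strip())
            stack[-1].children.append(node)
            stack.append(node)
        else:
            # Plain content, or a header deeper than max_level.
            stack[-1].lines.append(line)
    return root


if __name__ == "__main__":
    tree = build_toc_tree(["intro", "# A", "text a", "## A1", "text a1", "# B"])
    print([child.title for child in tree.children])
```

Headers deeper than max_level stay in the content of their enclosing section, which mirrors the max_headers_to_use behavior described for the constructor.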
remove_small_chunks(chunks)
Removes chunks that have fewer words than the specified limit.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| chunks | Chunks | the list of chunks. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| Chunks | list[Chunk] | the chunks with more words than the specified threshold. |
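As an illustration, the filtering can be sketched in a few lines. This is a simplified stand-in, assuming chunks are plain strings and words are counted by whitespace splitting; whether a chunk of exactly min_chunk_word_count words is kept is also an assumption, based on the constructor's "chunks smaller than this will be discarded".

```python
def remove_small_chunks(chunks: list[str], min_chunk_word_count: int = 15) -> list[str]:
    """Keep only chunks with at least min_chunk_word_count words.

    Assumption: a 'word' is any whitespace-separated token.
    """
    return [chunk for chunk in chunks if len(chunk.split()) >= min_chunk_word_count]


if __name__ == "__main__":
    chunks = ["too short", " ".join(["word"] * 15), " ".join(["word"] * 30)]
    kept = remove_small_chunks(chunks)
    print(len(kept))
```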
split_big_chunks_tokenbased(chunks)
Splits the chunks that are too big considering the provided tokenizer.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| chunks | list[Chunk] | the chunks to split. | required |
Raises:
| Type | Description |
|---|---|
| ValueError | if the tokenizer is not provided. |
| ValueError | if the tokenizer does not have an 'encode' method. |
Returns:
| Type | Description |
|---|---|
| list[Chunk] | the chunks, with big chunks split into smaller chunks. |
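The following sketch shows one way a token-based hard limit could work with any object exposing an 'encode' method. It is not the library's algorithm: WhitespaceTokenizer is a toy tokenizer made up for the example, chunks are plain strings, and the text is re-packed word by word, which normalizes whitespace. The two ValueError checks mirror the ones listed above.

```python
import math


class WhitespaceTokenizer:
    """Toy tokenizer (hypothetical): one token per whitespace-separated word."""

    def encode(self, text: str) -> list[int]:
        return [len(word) for word in text.split()]


def split_by_tokens(text: str, tokenizer, hard_max_chunk_token_count: int) -> list[str]:
    if tokenizer is None:
        raise ValueError("a tokenizer is required for token-based splitting")
    if not hasattr(tokenizer, "encode"):
        raise ValueError("the tokenizer must have an 'encode' method")
    total = len(tokenizer.encode(text))
    if total <= hard_max_chunk_token_count:
        return [text]
    # Aim for parts of similar token count rather than one small leftover.
    n_parts = math.ceil(total / hard_max_chunk_token_count)
    target = math.ceil(total / n_parts)
    pieces, current = [], []
    for word in text.split():
        candidate = current + [word]
        if current and len(tokenizer.encode(" ".join(candidate))) > target:
            pieces.append(" ".join(current))
            current = [word]
        else:
            current = candidate
    if current:
        pieces.append(" ".join(current))
    return pieces


if __name__ == "__main__":
    text = " ".join(f"w{i}" for i in range(10))
    for piece in split_by_tokens(text, WhitespaceTokenizer(), 4):
        print(piece)
```

Re-encoding the candidate on every word is quadratic in the chunk length; it keeps the sketch short and relies only on 'encode', since a 'decode' method is not guaranteed by the tokenizer requirement.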
split_big_chunks_wordbased(chunks)
Splits the chunks that are too big. Consider setting "hard_max_chunk_word_count" to specify the maximum size of a chunk (in words).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| chunks | Chunks | The chunks obtained from the get_chunks() method. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| Chunks | list[Chunk] | the chunks, with big chunks split into smaller chunks. |
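A word-based hard limit can be sketched as follows. This is a simplified illustration, not the library's implementation: it assumes plain-string chunks, counts words by whitespace splitting, and cuts into parts of similar size without regard to sentence or line boundaries.

```python
import math


def split_by_words(text: str, hard_max_chunk_word_count: int = 400) -> list[str]:
    """Split text into parts of at most hard_max_chunk_word_count words."""
    words = text.split()
    if len(words) <= hard_max_chunk_word_count:
        return [text]
    # Choose the number of parts first so that sizes come out balanced.
    n_parts = math.ceil(len(words) / hard_max_chunk_word_count)
    size = math.ceil(len(words) / n_parts)
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


if __name__ == "__main__":
    text = " ".join(["word"] * 1000)
    print([len(part.split()) for part in split_by_words(text, 400)])
```

Balancing the parts avoids producing a tiny trailing subchunk that a later minimum-size filter might discard.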