Reference for MarkdownChunker
The MarkdownChunker splits documents that have been parsed into a MarkdownDoc
object into chunks.
Bases: AbstractChunker
__init__(max_headers_to_use='h4', max_chunk_word_count=200, hard_max_chunk_word_count=400, min_chunk_word_count=15)
Initialize a Markdown chunker
Parameters:
Name | Type | Description | Default |
---|---|---|---|
max_headers_to_use | MaxHeadersToUse | The deepest header level to consider (inclusive). Headers below this level are not used to split chunks. For example, if 'h4' is set, 'h5' and 'h6' headers are not used. Must be a string of the form 'hx', where x is the header level, from 1 to 6. | 'h4' |
max_chunk_word_count | int | The maximum size of a chunk (in words). This is a SOFT limit: chunks bigger than this size are split further using lower-level headers, if any are available. | 200 |
hard_max_chunk_word_count | int | The true maximum size of a chunk (in words). This is a HARD limit: chunks exceeding it are split into subchunks. ChunkNorris tries to balance the sizes of the resulting subchunks. It should be greater than max_chunk_word_count. | 400 |
min_chunk_word_count | int | The minimum size of a chunk (in words). Chunks smaller than this are discarded. | 15 |
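To make the constraints on these arguments concrete, here is a minimal sketch of a settings container that checks them. `ChunkerSettings` and `splitting_levels` are hypothetical names used for illustration only, and raising on an invalid hard limit is an assumption: the table above only says the hard limit "should be" greater than the soft one.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class ChunkerSettings:
    """Hypothetical container mirroring the documented constructor arguments."""

    max_headers_to_use: str = "h4"
    max_chunk_word_count: int = 200
    hard_max_chunk_word_count: int = 400
    min_chunk_word_count: int = 15

    def __post_init__(self) -> None:
        # The header setting must look like 'h1' ... 'h6'.
        if not re.fullmatch(r"h[1-6]", self.max_headers_to_use):
            raise ValueError("max_headers_to_use must be one of 'h1' to 'h6'")
        # Assumption: treat a hard limit that is not above the soft limit as an error.
        if self.hard_max_chunk_word_count <= self.max_chunk_word_count:
            raise ValueError(
                "hard_max_chunk_word_count should exceed max_chunk_word_count"
            )

    @property
    def splitting_levels(self) -> list[int]:
        """Header levels that may be used to split chunks (inclusive)."""
        return list(range(1, int(self.max_headers_to_use[1]) + 1))
```

With the defaults, `ChunkerSettings().splitting_levels` covers levels 1 to 4, so 'h5' and 'h6' headers stay inside their parent chunk.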
build_chunks(toc_tree_element, already_ok_chunks=None)
Uses the TOC tree to build the chunks recursively. Method:
- build the chunk (= titles from the sections above + section content + content of subsections)
- if the chunk is too big:
  - save the section as title + content (if the section has content)
  - subdivide the section recursively using its subsections
- else save it as is
Parameters:
Name | Type | Description | Default |
---|---|---|---|
toc_tree_element | TocTree | the TocTree element for which the chunks should be built | required |
already_ok_chunks | Chunks | the chunks already built. Used for recursion. | None |
Returns:
Name | Type | Description |
---|---|---|
Chunks | list[Chunk] | list of the chunks' texts |
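The recursive method described above can be sketched as follows. This is a simplified illustration, not the library's implementation: `Section` is a hypothetical stand-in for a TocTree element, chunks are plain strings rather than Chunk objects, and `max_words` plays the role of the soft limit.

```python
from dataclasses import dataclass, field


@dataclass
class Section:
    """Hypothetical stand-in for a TocTree element."""

    title: str
    content: str = ""
    children: list["Section"] = field(default_factory=list)


def full_text(section: Section) -> str:
    """Section title + own content + content of all subsections."""
    parts = [section.title, section.content]
    parts += [full_text(child) for child in section.children]
    return "\n".join(p for p in parts if p)


def build_chunks(section, parent_titles=(), max_words=200, chunks=None):
    """Recursively build chunk texts, descending only when a chunk is too big."""
    chunks = [] if chunks is None else chunks
    header = "\n".join(parent_titles)
    text = "\n".join(p for p in (header, full_text(section)) if p)
    if len(text.split()) > max_words and section.children:
        # Too big: keep the section's own content as one chunk, if it has any ...
        if section.content:
            own = (header, section.title, section.content)
            chunks.append("\n".join(p for p in own if p))
        # ... then subdivide using the subsections.
        for child in section.children:
            build_chunks(child, (*parent_titles, section.title), max_words, chunks)
    else:
        # Small enough (or nothing left to subdivide): save it as is.
        chunks.append(text)
    return chunks
```

A section with no subsections is saved even when it exceeds `max_words`; in this sketch, enforcing a hard limit is left to a later splitting step.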
chunk(content)
Chunks a parsed Markdown document.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
content | MarkdownDoc | the markdown document to chunk. Might be the output of a chunknorris.Parser. | required |
Returns:
Type | Description |
---|---|
list[Chunk] | the chunks |
get_chunks(toc_tree)
Wrapper that builds the chunks' texts, checks that they fit the size limits, and replaces link formatting.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
toc_tree | TocTree | the TOC tree of the document | required |
Returns:
Name | Type | Description |
---|---|---|
Chunks | list[Chunk] | the chunks' texts, formatted |
get_parents_headers(toc_tree_element)
staticmethod
Gets the list of titles that are parents of the provided TOC tree element. The list is ordered in descending order in terms of header level.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
toc_tree_element | TocTree | the TOC tree element | required |
Returns:
Type | Description |
---|---|
list[MarkdownLine] | the list of lines that represent the parents' headers |
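As an illustration of collecting parent headers, the sketch below walks up a chain of parent links. `Node` and `parent_headers` are hypothetical names, titles are plain strings instead of MarkdownLine objects, and returning the outermost header first is an assumption about the ordering.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """Hypothetical TOC tree element holding a link to its parent."""

    title: str
    level: int
    parent: Optional["Node"] = None


def parent_headers(node: Node) -> list[str]:
    """Collect the titles of all ancestors of `node`."""
    titles = []
    current = node.parent
    while current is not None:
        titles.append(current.title)
        current = current.parent
    # Assumption: outermost header first (h1, then h2, ...).
    return titles[::-1]
```

Prepending these titles to a section's text keeps each chunk readable on its own, since it carries the headings it was nested under.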
get_toc_tree(md_lines)
Builds the table of contents tree based on the headers.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
md_lines | list[MarkdownLines] | the markdown lines | required |
Returns:
Name | Type | Description |
---|---|---|
TocTree | TocTree | the table of contents |
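One common way to build such a tree from header lines is a stack of currently open sections. The sketch below is an assumption about the general technique, not the library's code: it uses plain dictionaries and strings, recognizes only ATX headers (`#` to `######`), ignores code fences, and `build_toc` and `max_level` are hypothetical names.

```python
import re

# ATX header: 1 to 6 '#' characters, whitespace, then the title.
HEADER = re.compile(r"^(#{1,6})\s+(.*)$")


def build_toc(lines, max_level=4):
    """Build a nested table of contents from markdown lines."""
    root = {"title": None, "level": 0, "content": [], "children": []}
    stack = [root]  # sections that are currently open, outermost first
    for line in lines:
        match = HEADER.match(line)
        if match and len(match.group(1)) <= max_level:
            level = len(match.group(1))
            # Close every open section at the same or a deeper level.
            while stack[-1]["level"] >= level:
                stack.pop()
            node = {
                "title": match.group(2),
                "level": level,
                "content": [],
                "children": [],
            }
            stack[-1]["children"].append(node)
            stack.append(node)
        else:
            # Plain text, or a header deeper than max_level: keep as content.
            stack[-1]["content"].append(line)
    return root
```

Headers deeper than `max_level` are kept as ordinary content of the enclosing section, which mirrors the role of max_headers_to_use.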
remove_small_chunks(chunks)
Removes chunks that have fewer words than the specified limit.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
chunks | Chunks | the list of chunks | required |
Returns:
Name | Type | Description |
---|---|---|
Chunks | list[Chunk] | the chunks that have enough words to pass the specified threshold |
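A minimal sketch of this filtering step, assuming chunks are plain strings and words are whitespace-separated. Keeping chunks whose word count equals the threshold exactly is an assumption, and `drop_small_chunks` is a hypothetical name.

```python
def drop_small_chunks(chunks, min_words=15):
    """Keep only the chunks that contain at least `min_words` words."""
    return [chunk for chunk in chunks if len(chunk.split()) >= min_words]
```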
split_big_chunks(chunks)
Splits the chunks that are too big. The size limit (in words) is set by the hard_max_chunk_word_count argument passed to the constructor.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
chunks | Chunks | the chunks obtained from the get_chunks() method | required |
Returns:
Name | Type | Description |
---|---|---|
Chunks | list[Chunk] | the chunks, with big chunks split into smaller chunks |
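The idea of splitting an oversized chunk into subchunks of balanced size can be sketched as below. This is an illustrative assumption, not the library's algorithm: it splits on whitespace-separated words and discards line structure, and `split_balanced` is a hypothetical name.

```python
import math


def split_balanced(text, hard_max_words=400):
    """Split `text` into the fewest parts that fit the limit, of near-equal size."""
    words = text.split()
    if len(words) <= hard_max_words:
        return [text]
    # Fewest parts such that every part can stay within the hard limit.
    n_parts = math.ceil(len(words) / hard_max_words)
    # Spread the words evenly: the first `extra` parts get one more word.
    base, extra = divmod(len(words), n_parts)
    parts, start = [], 0
    for i in range(n_parts):
        size = base + (1 if i < extra else 0)
        parts.append(" ".join(words[start:start + size]))
        start += size
    return parts
```

Balancing avoids ending up with one full-size subchunk followed by a tiny remainder, which a minimum-size filter might otherwise discard.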