Reference for MarkdownChunker
The MarkdownChunker is used to process documents that have been parsed into a MarkdownDoc object.
Bases: AbstractChunker
__init__(max_headers_to_use='h4', max_chunk_word_count=200, hard_max_chunk_word_count=400, min_chunk_word_count=15, hard_max_chunk_token_count=None, tokenizer=None)
Initialize a Markdown chunker.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
max_headers_to_use | MaxHeadersToUse | The maximum header level to consider (included). Headers of a deeper level than this won't be used to split chunks. For example, if 'h4' is set, 'h5' and 'h6' headers won't be used. Must be a string of the form 'hx', with x being the header level, ranging from 1 to 6. | 'h4' |
max_chunk_word_count | int | The maximum size a chunk can be (in words). This is a SOFT limit: chunks bigger than this will be split only if lower-level headers are available. | 200 |
hard_max_chunk_word_count | int | The true maximum size a chunk can be (in words). This is a HARD limit: chunks bigger than this will be split into subchunks. | 400 |
min_chunk_word_count | int | The minimum size a chunk can be (in words). Chunks smaller than this will be discarded. | 15 |
hard_max_chunk_token_count | None \| int | The true maximum size a chunk can be (in tokens). If None, no token-based splitting is done. This is a HARD limit: chunks bigger than this will be split into subchunks of roughly equivalent token count. | None |
tokenizer | Any \| None | The tokenizer to use. Can be any instance of a class that has an 'encode' method, such as tiktoken. | None |
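For illustration, here is a hedged instantiation sketch using the parameters documented above. The import path is an assumption based on chunknorris' documentation, and tiktoken is used only because its encodings expose an 'encode' method:

```python
# A minimal sketch, assuming the chunker is importable from
# chunknorris.chunkers; check this against your installed version.
import tiktoken  # any object exposing an 'encode' method works as tokenizer

from chunknorris.chunkers import MarkdownChunker

chunker = MarkdownChunker(
    max_headers_to_use="h3",         # ignore h4-h6 headers when splitting
    max_chunk_word_count=150,        # SOFT word limit
    hard_max_chunk_word_count=300,   # HARD word limit
    min_chunk_word_count=10,         # discard chunks smaller than this
    hard_max_chunk_token_count=512,  # HARD token limit; requires a tokenizer
    tokenizer=tiktoken.get_encoding("cl100k_base"),
)
```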
build_chunks(toc_tree_element, already_ok_chunks=None)
Uses the TOC tree to build the chunks, recursively (a sketch of this recursion follows the tables below):

- Build the chunk (titles of the sections above + section content + content of the subsections).
- If the chunk is too big:
  - Save the section as title + content (if the section has content).
  - Subdivide the section recursively using its subsections.
- Else, save the chunk as is.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
toc_tree_element | TocTree | the TocTree for which the chunks should be built. | required |
already_ok_chunks | Chunks | the chunks already built, used for recursion. Defaults to None. | None |
Returns:
Name | Type | Description |
---|---|---|
Chunks | list[Chunk] | the list of the chunks' texts. |
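A toy sketch of the recursion described above. The `Section` type and the `full_text` helper are hypothetical; this is not chunknorris' actual implementation, only an illustration of the splitting logic:

```python
from dataclasses import dataclass, field

@dataclass
class Section:
    title: str
    content: str
    children: list["Section"] = field(default_factory=list)

def full_text(section: Section) -> str:
    """Section content followed by the content of all its subsections."""
    return "\n".join([section.content] + [full_text(c) for c in section.children])

def build_chunks(section: Section, parent_titles: list[str] | None = None,
                 max_words: int = 200) -> list[str]:
    titles = (parent_titles or []) + [section.title]
    chunk = "\n".join(titles) + "\n" + full_text(section)
    if len(chunk.split()) <= max_words or not section.children:
        return [chunk]  # small enough, or no subsections left to split on
    chunks: list[str] = []
    if section.content.strip():
        # Too big: save this section's own content as title + content...
        chunks.append("\n".join(titles) + "\n" + section.content)
    for child in section.children:
        # ...and subdivide recursively using the subsections.
        chunks.extend(build_chunks(child, titles, max_words))
    return chunks
```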
chunk(content)
Chunks a parsed Markdown document.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
content | MarkdownDoc | the Markdown document to chunk. Might be the output of a chunknorris.Parser. | required |
Returns:
Type | Description |
---|---|
list[Chunk] | the chunks. |
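A hedged end-to-end sketch: parse a Markdown string, then chunk it. The import paths, the parser's parse_string() method, and the chunks' get_text() accessor are assumptions; check them against your installed version of chunknorris:

```python
from chunknorris.parsers import MarkdownParser
from chunknorris.chunkers import MarkdownChunker

md = "# Title\n\nSome content.\n\n## Subtitle\n\nMore content."
markdown_doc = MarkdownParser().parse_string(md)  # -> MarkdownDoc (assumed API)
chunks = MarkdownChunker().chunk(markdown_doc)    # -> list[Chunk]
for chunk in chunks:
    print(chunk.get_text())  # assumed accessor for the chunk's text
```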
get_chunks(toc_tree)
Wrapper that builds the chunks' texts, checks that they fit the size limits, and replaces the links' formatting.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
toc_tree | TocTree | the TOC tree of the document. | required |
Returns:
Name | Type | Description |
---|---|---|
Chunks | list[Chunk] | the chunks' texts, formatted. |
get_parents_headers(toc_tree_element)
staticmethod
Gets the list of titles that are parents of the provided TOC tree element. The list is ordered by descending header level (h1 first).
Parameters:
Name | Type | Description | Default |
---|---|---|---|
toc_tree_element | TocTree | the TOC tree element. | required |
Returns:
Type | Description |
---|---|
list[MarkdownLine] | the list of lines that represent the parents' headers. |
get_toc_tree(md_lines)
Builds the table of contents tree based on the headers.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
md_lines | list[MarkdownLine] | the markdown lines. | required |
Returns:
Name | Type | Description |
---|---|---|
TocTree | TocTree | the table of contents. |
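A toy illustration of the idea: building a tree from ATX headers with a stack. The `TocNode` type is hypothetical; chunknorris' own TocTree works on MarkdownLine objects rather than raw strings:

```python
import re
from dataclasses import dataclass, field

@dataclass
class TocNode:
    level: int
    title: str
    children: list["TocNode"] = field(default_factory=list)

def get_toc_tree(md_lines: list[str]) -> TocNode:
    root = TocNode(level=0, title="<root>")
    stack = [root]
    for line in md_lines:
        match = re.match(r"^(#{1,6})\s+(.+)", line)
        if not match:
            continue  # only headers shape the tree
        node = TocNode(level=len(match.group(1)), title=match.group(2))
        # Pop back to the nearest shallower header, then attach.
        while stack[-1].level >= node.level:
            stack.pop()
        stack[-1].children.append(node)
        stack.append(node)
    return root
```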
remove_small_chunks(chunks)
Removes chunks that have fewer words than the specified limit.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
chunks | Chunks | the list of chunks. | required |
Returns:
Name | Type | Description |
---|---|---|
Chunks | list[Chunk] | the chunks with more words than the specified threshold. |
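The filtering described here amounts to a word-count threshold. A hypothetical str-based sketch (the real Chunk type is richer than a string):

```python
def remove_small_chunks(chunks: list[str], min_words: int = 15) -> list[str]:
    # Keep only chunks that reach the minimum word count.
    return [c for c in chunks if len(c.split()) >= min_words]
```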
split_big_chunks_tokenbased(chunks)
Splits the chunks that are too big according to the provided tokenizer.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
chunks | list[Chunk] | the chunks to split. | required |
Raises:
Type | Description |
---|---|
ValueError | if the tokenizer is not provided. |
ValueError | if the tokenizer does not have an 'encode' method. |
Returns:
Type | Description |
---|---|
list[Chunk] | the chunks, with big chunks split into smaller chunks. |
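A sketch of the token-balancing idea described above: a too-big text is cut into the smallest number of pieces that each fit the hard limit, with pieces of roughly equal token count. This is not chunknorris' implementation, which likely respects line boundaries; decoding arbitrary token slices can also split multi-token characters:

```python
import math
import tiktoken

def split_tokenbased(text: str, hard_max_tokens: int = 512) -> list[str]:
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    if len(tokens) <= hard_max_tokens:
        return [text]  # already fits the hard limit
    # Smallest number of parts that fit, then equalize their sizes.
    n_parts = math.ceil(len(tokens) / hard_max_tokens)
    part_size = math.ceil(len(tokens) / n_parts)
    return [
        enc.decode(tokens[i : i + part_size])
        for i in range(0, len(tokens), part_size)
    ]
```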
split_big_chunks_wordbased(chunks)
Splits the chunks that are too big. You may consider passing the kwarg "hard_max_chunk_word_count" to specify the chunk size limit (in words).
Parameters:
Name | Type | Description | Default |
---|---|---|---|
chunks | Chunks | the chunks obtained from the get_chunks() method. | required |
Returns:
Name | Type | Description |
---|---|---|
Chunks | list[Chunk] | the chunks, with big chunks split into smaller chunks. |
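A hypothetical sketch of word-based hard splitting: lines are accumulated until the word limit would be exceeded, then a new subchunk starts. chunknorris' actual implementation may differ:

```python
def split_wordbased(text: str, hard_max_words: int = 400) -> list[str]:
    subchunks: list[str] = []
    current: list[str] = []
    count = 0
    for line in text.splitlines():
        n_words = len(line.split())
        if current and count + n_words > hard_max_words:
            # Adding this line would exceed the limit: flush the subchunk.
            subchunks.append("\n".join(current))
            current, count = [], 0
        current.append(line)
        count += n_words
    if current:
        subchunks.append("\n".join(current))
    return subchunks
```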