Reference for MarkdownChunker

The MarkdownChunker splits documents that have been parsed into a MarkdownDoc object into chunks.

Bases: AbstractChunker

__init__(max_headers_to_use='h4', max_chunk_word_count=200, hard_max_chunk_word_count=400, min_chunk_word_count=15)

Initialize a Markdown chunker

Parameters:

Name Type Description Default
max_headers_to_use MaxHeadersToUse

The maximum header level to consider (inclusive). Headers deeper than this level won't be used to split chunks. For example, if 'h4' is set, then 'h5' and 'h6' headers won't be used. Must be a string of the form 'hx', with x being the title level ranging from 1 to 6.

'h4'
max_chunk_word_count int

The maximum size a chunk can be (in words). It is a SOFT limit, meaning that chunks bigger than this size will be chunked using lower-level headers if any are available.

200
hard_max_chunk_word_count int

The true maximum size a chunk can be (in words). It is a HARD limit, meaning that chunks bigger than this limit will be split into subchunks. ChunkNorris will try to balance the sizes of the resulting subchunks. It should be greater than max_chunk_word_count.

400
min_chunk_word_count int

The minimum size a chunk can be (in words). Chunks with fewer words than this will be discarded.

15
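The interplay between the three word-count limits and the 'hx' header setting can be sketched as follows. This is a minimal illustration written against the parameter descriptions above, not ChunkNorris code: the helper names `parse_header_level`, `splitting_levels`, and `classify_by_size` are hypothetical, and counting words is left to the caller.

```python
import re


def parse_header_level(max_headers_to_use: str) -> int:
    """Turn an 'hx' string into its integer level (1 to 6). Hypothetical helper."""
    match = re.fullmatch(r"h([1-6])", max_headers_to_use)
    if match is None:
        raise ValueError(f"expected 'h1' to 'h6', got {max_headers_to_use!r}")
    return int(match.group(1))


def splitting_levels(max_headers_to_use: str) -> list:
    """Header levels allowed to split chunks, shallowest first."""
    return list(range(1, parse_header_level(max_headers_to_use) + 1))


def classify_by_size(word_count: int, soft_max: int = 200,
                     hard_max: int = 400, minimum: int = 15) -> str:
    """Describe what the documented limits imply for a chunk of this size."""
    if word_count < minimum:
        return "discard"                      # below min_chunk_word_count
    if word_count > hard_max:
        return "force-split"                  # above the HARD limit
    if word_count > soft_max:
        return "split-on-subheaders-if-any"   # above the SOFT limit only
    return "keep"
```

With the defaults, a 300-word chunk exceeds only the soft limit, so it is subdivided only if deeper headers are available, while a 500-word chunk is split regardless.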

build_chunks(toc_tree_element, already_ok_chunks=None)

Uses the toc tree to build the chunks. Uses recursion. Method:
- build the chunk (= titles from sections above + section content + content of subsections)
- if the chunk is too big:
  - save the section as title + content (if section has content)
  - subdivide section recursively using subsections
- else save it as is

Parameters:

Name Type Description Default
toc_tree_element TocTree

the TocTree for which the chunk should be built

required
already_ok_chunks Chunks

the chunks already built. Used for recursion. Defaults to None.

None

Returns:

Name Type Description
Chunks list[Chunk]

list of chunk's texts
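The recursive method described for build_chunks can be sketched on a simplified tree. Everything here is an assumption made for illustration: `Section` is a hypothetical stand-in for a TocTree node, chunks are plain strings rather than Chunk objects, words are counted by whitespace splitting, and a section that is too big but has no subsections is saved as is (the hard limit would handle it later).

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Section:
    """Hypothetical stand-in for a TocTree node."""
    title: str
    content: str = ""
    children: list[Section] = field(default_factory=list)


def full_text(section: Section) -> str:
    """Section title + its own content + the content of all subsections."""
    parts = [section.title, section.content]
    parts += [full_text(child) for child in section.children]
    return "\n".join(p for p in parts if p)


def build_chunks_sketch(section, parent_titles=(), max_words=200, chunks=None):
    """Recursive sketch: keep a section whole if it fits, else subdivide it."""
    chunks = [] if chunks is None else chunks
    prefix = "\n".join(parent_titles)
    text = "\n".join(p for p in (prefix, full_text(section)) if p)
    if len(text.split()) > max_words and section.children:
        # Too big: save the section's own title + content, if it has content...
        if section.content:
            own = (prefix, section.title, section.content)
            chunks.append("\n".join(p for p in own if p))
        # ...then recurse into the subsections, passing the titles down.
        for child in section.children:
            build_chunks_sketch(child, (*parent_titles, section.title),
                                max_words, chunks)
    else:
        chunks.append(text)
    return chunks


doc = Section("# Guide", "intro words here", [
    Section("## A", "alpha alpha alpha alpha alpha"),
    Section("## B", "beta beta beta beta beta"),
])
chunks = build_chunks_sketch(doc, max_words=10)
```

Each resulting chunk carries the titles of the sections above it, so a chunk produced from a deep subsection still records where it sits in the document.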

chunk(content)

Chunks a parsed Markdown document.

Parameters:

Name Type Description Default
content MarkdownDoc

the Markdown document to chunk. May be the output of a chunknorris.Parser.

required

Returns:

Type Description
list[Chunk]

list[Chunk]: the chunks

get_chunks(toc_tree)

Wrapper that builds the chunks' texts, checks that they fit within the size limits, and replaces link formatting.

Parameters:

Name Type Description Default
toc_tree TocTree

the toc tree of the document

required

Returns:

Name Type Description
Chunks list[Chunk]

the formatted chunk texts

get_parents_headers(toc_tree_element) staticmethod

Gets the list of titles that are parents of the provided toc tree element. The list is ordered in descending order of header level.

Parameters:

Name Type Description Default
toc_tree_element TocTree

the toc tree element

required

Returns:

Type Description
list[MarkdownLine]

list[MarkdownLine]: the list of lines that represent the parents' headers
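Collecting parent headers amounts to walking up the tree from an element to the root. The sketch below assumes each node keeps a reference to its parent; `Node` and `parent_headers` are hypothetical names, titles are plain strings rather than MarkdownLine objects, and the outermost-first ordering is one reading of the ordering described above.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Node:
    """Hypothetical toc tree element holding a link to its parent."""
    title: str
    level: int
    parent: Node | None = None


def parent_headers(node: Node) -> list[str]:
    """Collect ancestor titles, outermost header first (assumed ordering)."""
    titles = []
    current = node.parent
    while current is not None:
        titles.append(current.title)
        current = current.parent
    # The walk goes from the nearest parent to the root, so reverse it.
    return titles[::-1]


h1 = Node("# Guide", 1)
h2 = Node("## Install", 2, parent=h1)
h3 = Node("### Linux", 3, parent=h2)
```

Calling `parent_headers(h3)` on this small tree walks through `h2` and then `h1`, and a root element has no parents to report.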

get_toc_tree(md_lines)

Builds the table-of-contents tree based on the headers

Parameters:

Name Type Description Default
md_lines list[MarkdownLine]

the markdown lines

required

Returns:

Name Type Description
TocTree TocTree

the table of contents
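One common way to build such a tree from header lines is to keep a stack of currently open sections. The following is a sketch of that general technique, not the library's implementation: `TocNode` and `build_toc_tree` are hypothetical, lines are plain strings, only ATX-style `#` headers are recognized, and `#` lines inside fenced code blocks are not handled.

```python
from __future__ import annotations

import re
from dataclasses import dataclass, field

HEADER = re.compile(r"^(#{1,6})\s+(.*)$")


@dataclass
class TocNode:
    """Hypothetical tree node: a header, its own lines, and its subsections."""
    title: str
    level: int
    content: list = field(default_factory=list)
    children: list = field(default_factory=list)


def build_toc_tree(lines: list[str], max_level: int = 4) -> TocNode:
    """Nest sections by header level using a stack of open sections."""
    root = TocNode(title="", level=0)
    stack = [root]
    for line in lines:
        match = HEADER.match(line)
        if match and len(match.group(1)) <= max_level:
            node = TocNode(title=match.group(2), level=len(match.group(1)))
            # Close every open section at the same or a deeper level.
            while stack[-1].level >= node.level:
                stack.pop()
            stack[-1].children.append(node)
            stack.append(node)
        else:
            # Plain text, or a header deeper than max_level, stays as content.
            stack[-1].content.append(line)
    return root


tree = build_toc_tree(
    ["# Guide", "intro", "## Install", "pip", "##### Deep", "## Usage", "run"]
)
```

Headers deeper than `max_level` are treated as ordinary content of the enclosing section, which mirrors the max_headers_to_use behaviour described for the constructor.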

remove_small_chunks(chunks)

Removes chunks that have fewer words than the specified limit

Parameters:

Name Type Description Default
chunks Chunks

the list of chunks

required

Returns:

Name Type Description
Chunks list[Chunk]

the chunks with more words than the specified threshold
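The filtering step can be pictured as a simple word-count filter. In this sketch chunks are plain strings, words are counted by whitespace splitting, and a chunk with exactly the minimum number of words is kept; all three are assumptions, and `drop_small_chunks` is a hypothetical name.

```python
def drop_small_chunks(chunks: list, min_chunk_word_count: int = 15) -> list:
    """Keep only the chunks that have at least the minimum number of words."""
    return [
        chunk for chunk in chunks
        if len(chunk.split()) >= min_chunk_word_count  # whitespace word count
    ]


kept = drop_small_chunks(
    ["too short", "this chunk has enough words"], min_chunk_word_count=3
)
```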

split_big_chunks(chunks)

Splits the chunks that are too big. Consider passing the kwarg "hard_max_chunk_word_count" to specify the size limit of a chunk (in words).

Parameters:

Name Type Description Default
chunks Chunks

The chunks obtained from the get_chunks() method

required

Returns:

Name Type Description
Chunks list[Chunk]

the chunks, with big chunks split into smaller chunks
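Splitting an oversized chunk into subchunks of roughly equal size can be sketched as below. This is a simplified illustration under stated assumptions: `split_balanced` is a hypothetical helper, the text is split on whitespace (so line breaks inside the chunk are lost), and the library may well split on line or sentence boundaries instead.

```python
import math


def split_balanced(text: str, hard_max: int = 400) -> list:
    """Split text into the fewest parts under hard_max, sized as evenly as possible."""
    words = text.split()
    if len(words) <= hard_max:
        return [text]
    n_parts = math.ceil(len(words) / hard_max)
    # Spread the remainder over the first parts so sizes differ by at most one.
    base, extra = divmod(len(words), n_parts)
    parts, start = [], 0
    for i in range(n_parts):
        size = base + (1 if i < extra else 0)
        parts.append(" ".join(words[start:start + size]))
        start += size
    return parts


sizes = [len(p.split()) for p in split_balanced(" ".join(["w"] * 1000))]
```

Balancing avoids the alternative of cutting every 400 words and leaving a short trailing subchunk, which might then fall below the minimum size and be discarded.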