Reference for Chunk

Chunk is the entity returned by chunknorris's chunkers. It holds the elements that describe a chunk: its text content, its headers, the pages it comes from (for paginated documents), and so on. In most cases you only need Chunk.get_text(), which returns the cleaned content of the chunk as text, preceded by its headers.

Bases: BaseModel

word_count: int property

Gets the number of words in the chunk's content (headers not included).

get_text(remove_links=False, prepend_headers=True)

Gets the text of the chunk.

Parameters:

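Name | Type | Description | Default
remove_links | bool | If True, markdown links are removed (the text of the link is kept). | False
prepend_headers | bool | If True, the chunk's headers are placed before its content in the returned text. | True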

Returns:

Name | Type | Description
str | str | the text
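
The behaviour described above can be sketched without installing the library. SketchChunk below is a hypothetical stand-in written for illustration, not chunknorris's implementation: the field names headers and content, the blank-line separator between headers and content, and the link-stripping regular expression are all assumptions. Only word_count and the get_text(remove_links, prepend_headers) signature come from this reference.

```python
import re
from dataclasses import dataclass, field


@dataclass
class SketchChunk:
    """Hypothetical stand-in for chunknorris's Chunk (not the library's code)."""

    # Assumed field: header lines, outermost first.
    headers: list[str] = field(default_factory=list)
    # Assumed field: the chunk body as markdown.
    content: str = ""

    @property
    def word_count(self) -> int:
        # Whitespace-separated words of the body only; headers are excluded.
        return len(self.content.split())

    def get_text(self, remove_links: bool = False, prepend_headers: bool = True) -> str:
        text = self.content
        if remove_links:
            # Assumed approach: replace [label](target) with label.
            text = re.sub(r"\[([^\]]*)\]\([^)]*\)", r"\1", text)
        if prepend_headers and self.headers:
            # Assumed separator: one blank line between headers and body.
            text = "\n\n".join(self.headers) + "\n\n" + text
        return text


chunk = SketchChunk(
    headers=["# Guide", "## Install"],
    content="Run the [installer](https://example.com/setup) first.",
)
print(chunk.word_count)
print(chunk.get_text())
print(chunk.get_text(remove_links=True, prepend_headers=False))
```

With a real Chunk returned by a chunker, the equivalent calls would be chunk.word_count and chunk.get_text(remove_links=True); the exact separator and cleaning rules are whatever the library applies.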

Removes the markdown formatting of the links in the text.

Parameters:

Name | Type | Description | Default
text | str | the text to find the links in | required

Returns:

Name | Type | Description
str | str | the formatted text
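
As a rough illustration of what removing the markdown format of links can look like, here is a minimal sketch built on Python's re module. The function name strip_markdown_links and the pattern are hypothetical, and the choice to leave image syntax (![alt](src)) untouched is an assumption; the library may treat these cases differently.

```python
import re

# [label](target), skipped when preceded by "!" so images are left alone.
# Assumption: nested brackets and reference-style links are out of scope.
LINK_PATTERN = re.compile(r"(?<!!)\[([^\]]+)\]\([^)]+\)")


def strip_markdown_links(text: str) -> str:
    """Replace each inline markdown link with its label (hypothetical helper)."""
    return LINK_PATTERN.sub(r"\1", text)


sample = "See the [docs](https://example.com/docs) and ![logo](logo.png)."
print(strip_markdown_links(sample))
```

The negative lookbehind is a design choice made for this sketch: it keeps image references intact so that only navigational links are flattened to plain text.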