Reference for Chunk
The Chunk is the entity returned by chunknorris's chunkers. It holds the elements related to a chunk: its text content, its headers, the pages it comes from (for paginated documents), etc.
In most cases you will only need Chunk.get_text(), which returns the cleaned chunk content as text, preceded by its headers.
Bases: BaseModel
word_count: int
property
Gets the number of words in the chunk's content (headers not included).
get_text(remove_links=False, prepend_headers=True)
Gets the text of the chunk.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
remove_links | bool | If True, Markdown links are removed (the text of the link is kept). | False |
prepend_headers | bool | If True, the chunk's headers are prepended to the text. | True |
Returns:
Name | Type | Description |
---|---|---|
str | str | The text of the chunk. |
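The behaviour described above can be sketched with a small stand-in class. This is not the chunknorris implementation: the class name SketchChunk, its fields headers and content, the header join format, and the link regex are all assumptions made for illustration. Only the method name, its two parameters, and the word_count property come from this reference.

```python
import re
from dataclasses import dataclass, field


@dataclass
class SketchChunk:
    """Hypothetical stand-in for Chunk; field names are assumptions."""

    headers: list[str] = field(default_factory=list)
    content: str = ""

    @property
    def word_count(self) -> int:
        # Count whitespace-separated words in the content only (headers excluded).
        return len(self.content.split())

    def get_text(self, remove_links: bool = False, prepend_headers: bool = True) -> str:
        text = self.content
        if prepend_headers and self.headers:
            # Assumed layout: headers first, separated by blank lines.
            text = "\n\n".join(self.headers + [text])
        if remove_links:
            # Keep the link text, drop the target: [text](url) -> text
            text = re.sub(r"\[([^\]]*)\]\([^)]*\)", r"\1", text)
        return text


chunk = SketchChunk(
    headers=["# Guide", "## Setup"],
    content="See the [docs](https://example.com) first.",
)
full_text = chunk.get_text()
body_only = chunk.get_text(remove_links=True, prepend_headers=False)
```

Here full_text carries both headers above the content, while body_only holds just the content with the link reduced to its text. With a real Chunk the call pattern would be the same, but the exact output format may differ.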
remove_links(text)
staticmethod
Removes the Markdown formatting of the links in the text, keeping the link text.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
text | str | The text to find the links in. | required |
Returns:
Name | Type | Description |
---|---|---|
str | str | The formatted text. |
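As a rough illustration of what stripping link formatting means, the sketch below uses a regular expression on inline links. The function name strip_markdown_links and the pattern are hypothetical, and the choice to leave images untouched is an assumption; the library's own remove_links may handle these cases differently.

```python
import re

# Inline links such as [text](url). The negative lookbehind skips
# images (![alt](src)); that choice is an assumption of this sketch.
_INLINE_LINK = re.compile(r"(?<!!)\[([^\]]*)\]\([^)]*\)")


def strip_markdown_links(text: str) -> str:
    """Replace each inline Markdown link with its link text."""
    return _INLINE_LINK.sub(r"\1", text)


cleaned = strip_markdown_links(
    "Read the [guide](https://example.com/guide) and the [FAQ](faq.md)."
)
image_kept = strip_markdown_links("![logo](logo.png)")
```

Because Chunk.remove_links is documented as a static method, it can presumably be called on the class directly, without a chunk instance, in the same way this free function is used.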