Reference for Chunk
The Chunk is the entity returned by chunknorris's chunkers. It contains various elements related to the chunk: its text content, its headers, the pages it comes from (for paginated documents), etc.
In most cases, Chunk.get_text() is all you need: it returns the cleaned content of the chunk as text, preceded by its headers.
Bases: BaseModel
Show JSON schema:
{
  "$defs": {
    "MarkdownLine": {
      "properties": {
        "text": {
          "description": "the text content of the line",
          "title": "Text",
          "type": "string"
        },
        "line_idx": {
          "description": "the index of the line in the markdown string",
          "title": "Line Idx",
          "type": "integer"
        },
        "isin_code_block": {
          "description": "whether or not the line belongs to a code block",
          "title": "Isin Code Block",
          "type": "boolean"
        },
        "page": {
          "anyOf": [
            {
              "type": "integer"
            },
            {
              "type": "null"
            }
          ],
          "description": "the page the line belongs to (if markdown comes from converted paginated document)",
          "title": "Page"
        }
      },
      "required": [
        "text",
        "line_idx",
        "isin_code_block",
        "page"
      ],
      "title": "MarkdownLine",
      "type": "object"
    }
  },
  "properties": {
    "headers": {
      "items": {
        "$ref": "#/$defs/MarkdownLine"
      },
      "title": "Headers",
      "type": "array"
    },
    "content": {
      "items": {
        "$ref": "#/$defs/MarkdownLine"
      },
      "title": "Content",
      "type": "array"
    },
    "start_line": {
      "title": "Start Line",
      "type": "integer"
    }
  },
  "required": [
    "headers",
    "content",
    "start_line"
  ],
  "title": "Chunk",
  "type": "object"
}
Config:
- arbitrary_types_allowed: True
Fields:
- headers (list[MarkdownLine])
- content (list[MarkdownLine])
- start_line (int)
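To make the schema concrete, here is a minimal stand-in built with standard-library dataclasses. It only mirrors the field names and types listed above. The real `Chunk` and `MarkdownLine` are pydantic models, and the sample values are made up for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MarkdownLine:
    """Stand-in for the MarkdownLine entries of the schema (not the library class)."""
    text: str
    line_idx: int
    isin_code_block: bool
    page: Optional[int]  # None when the markdown does not come from a paginated document


@dataclass
class Chunk:
    """Stand-in for Chunk, holding only the three documented fields."""
    headers: list[MarkdownLine]
    content: list[MarkdownLine]
    start_line: int


# Hypothetical sample data: one header line and two content lines.
chunk = Chunk(
    headers=[MarkdownLine("# Guide", 0, False, None)],
    content=[
        MarkdownLine("Some introductory text.", 1, False, None),
        MarkdownLine("print('hi')", 3, True, None),
    ],
    start_line=0,
)
```

Headers and content are both lists of `MarkdownLine`, so each line keeps its own index, code-block flag and page.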
word_count
property
Gets the number of words in the chunk's content (headers not included).
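As a rough illustration of what counting the content's words without the headers can look like, the sketch below splits each content line on whitespace. The helper name `count_words` is hypothetical, and whitespace splitting is an assumption: the library may count words differently.

```python
def count_words(content_lines: list[str]) -> int:
    """Count whitespace-separated words across content lines.

    Assumption: a "word" is any run of non-whitespace characters.
    """
    return sum(len(line.split()) for line in content_lines)


# Header lines are kept apart and deliberately not passed to the counter.
headers = ["# Guide"]
content = ["Some introductory text.", "", "A second paragraph here."]

n_words = count_words(content)
```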
get_text(remove_links=False, prepend_headers=True)
Gets the text of the chunk.
Parameters:
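| Name | Type | Description | Default |
|---|---|---|---|
| remove_links | bool | If True, the markdown links will be removed (text of the link is kept). | False |
| prepend_headers | bool | Not described here; by its name, it presumably controls whether the headers are prepended to the text. | True |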
Returns:
| Name | Type | Description |
|---|---|---|
| str | str | the text |
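The sketch below shows one plausible way the two flags could interact. It works on plain strings rather than `MarkdownLine` objects, and the function name `get_text_sketch`, the newline separator and the link pattern are all assumptions, not the library's implementation.

```python
import re

# Assumed pattern for inline markdown links: [label](target)
LINK_PATTERN = re.compile(r"\[([^\]]*)\]\([^)]*\)")


def get_text_sketch(
    headers: list[str],
    content: list[str],
    remove_links: bool = False,
    prepend_headers: bool = True,
) -> str:
    """Join header and content lines into a single string (illustrative only)."""
    lines = (list(headers) if prepend_headers else []) + list(content)
    text = "\n".join(lines).strip()
    if remove_links:
        # Keep the link label, drop the target.
        text = LINK_PATTERN.sub(r"\1", text)
    return text


headers = ["# Guide", "## Install"]
content = ["See the [docs](https://example.com/docs) first."]

full = get_text_sketch(headers, content)
no_links = get_text_sketch(headers, content, remove_links=True)
body_only = get_text_sketch(headers, content, prepend_headers=False)
```

With the defaults, the header lines come first and links are left untouched; `no_links` keeps only the link labels, and `body_only` drops the header lines.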
remove_links(text)
staticmethod
Removes the markdown format of the links in the text.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| text | str | the text to find the links in | required |
Returns:
| Name | Type | Description |
|---|---|---|
| str | str | the formatted text |
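Removing the markdown format of a link while keeping its text can be approximated with a single regular expression. The sketch below is an assumption about the behaviour described above, not the library's code, and `strip_markdown_links` is a hypothetical name.

```python
import re

# Assumed pattern: [label](target), capturing the label in group 1.
MD_LINK = re.compile(r"\[([^\]]+)\]\(([^)]+)\)")


def strip_markdown_links(text: str) -> str:
    """Replace every [label](target) with just its label."""
    return MD_LINK.sub(r"\1", text)


sample = "Read the [guide](https://example.com/guide) and the [FAQ](faq.md)."
cleaned = strip_markdown_links(sample)
untouched = strip_markdown_links("no links here")
```

This simple pattern does not special-case images (`![alt](src)`) or labels containing nested brackets, so a production version would likely need a more careful expression.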