Reference for MarkdownDoc
The MarkdownDoc is the entity returned by chunknorris's parsers. Its main purpose is to be fed to the MarkdownChunker.
Bases: BaseModel
A parsed Markdown-formatted string, resulting in a list of MarkdownLine.
Features:
- ATX header formatting.
- Removal of base64 images.
Show JSON schema:
{
  "$defs": {
    "MarkdownLine": {
      "properties": {
        "text": {
          "description": "the text content of the line",
          "title": "Text",
          "type": "string"
        },
        "line_idx": {
          "description": "the index of the line in the markdown string",
          "title": "Line Idx",
          "type": "integer"
        },
        "isin_code_block": {
          "description": "whether or not the line belongs to a code block",
          "title": "Isin Code Block",
          "type": "boolean"
        },
        "page": {
          "anyOf": [
            {
              "type": "integer"
            },
            {
              "type": "null"
            }
          ],
          "description": "the page the line belongs to (if markdown comes from converted paginated document)",
          "title": "Page"
        }
      },
      "required": [
        "text",
        "line_idx",
        "isin_code_block",
        "page"
      ],
      "title": "MarkdownLine",
      "type": "object"
    }
  },
  "description": "A parsed Markdown Formatted-String,\nresulting in a list of MarkdownLine.\nFeats :\n- ATX header formatting.\n- Remove base64 images",
  "properties": {
    "content": {
      "items": {
        "$ref": "#/$defs/MarkdownLine"
      },
      "title": "Content",
      "type": "array"
    },
    "metadata": {
      "additionalProperties": true,
      "default": {},
      "title": "Metadata",
      "type": "object"
    }
  },
  "required": [
    "content"
  ],
  "title": "MarkdownDoc",
  "type": "object"
}
Config:
- arbitrary_types_allowed: True
Fields:
- content (list[MarkdownLine])
- metadata (dict[str, Any])
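To make the shape of the schema and field list concrete, here is a minimal sketch that mirrors it with standard-library dataclasses. The names `Line` and `Doc` are hypothetical stand-ins for MarkdownLine and MarkdownDoc; they are not the library's Pydantic classes and perform no validation.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class Line:
    """Hypothetical stand-in for MarkdownLine (same four required fields)."""
    text: str
    line_idx: int
    isin_code_block: bool
    page: Optional[int]  # None when the source document is not paginated


@dataclass
class Doc:
    """Hypothetical stand-in for MarkdownDoc."""
    content: list[Line]
    # Mirrors the schema's default of {} for metadata.
    metadata: dict[str, Any] = field(default_factory=dict)


doc = Doc(content=[
    Line(text="# Title", line_idx=0, isin_code_block=False, page=None),
    Line(text="Some text.", line_idx=1, isin_code_block=False, page=None),
])
```

As in the schema, `content` is the only field that must be supplied for the document, while every field of a line is required, including `page`, which may be null.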
from_string(md_string)
staticmethod
Get the MarkdownDoc object from a Markdown-formatted string.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| md_string | str | the markdown string | required |
Returns:
| Name | Type | Description |
|---|---|---|
| MarkdownDoc | MarkdownDoc | the markdown document |
to_string()
Get the Markdown string corresponding to the document's content.
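The round trip between a Markdown string and a list of lines can be sketched as follows. This is a simplified stand-in, not the library's implementation: `Line`, `lines_from_string`, and `lines_to_string` are hypothetical names, the sketch skips ATX header formatting and base64 image removal, and it assumes that fence delimiters themselves count as part of a code block.

```python
from dataclasses import dataclass
from typing import Optional

FENCE = "`" * 3  # the three-backtick code fence marker


@dataclass
class Line:
    """Hypothetical stand-in for MarkdownLine."""
    text: str
    line_idx: int
    isin_code_block: bool
    page: Optional[int] = None


def lines_from_string(md_string: str) -> list[Line]:
    """Simplified stand-in for from_string: split lines and flag fenced code."""
    lines: list[Line] = []
    in_code = False
    for idx, text in enumerate(md_string.split("\n")):
        is_fence = text.lstrip().startswith(FENCE)
        # Assumption: the fence delimiters are flagged as code block lines too.
        lines.append(Line(text, idx, in_code or is_fence))
        if is_fence:
            in_code = not in_code  # toggle on opening and closing fences
    return lines


def lines_to_string(lines: list[Line]) -> str:
    """Simplified stand-in for to_string: join the line texts back together."""
    return "\n".join(line.text for line in lines)


md = "# Title\n\n" + FENCE + "\nx = 1\n" + FENCE + "\nDone."
parsed = lines_from_string(md)
restored = lines_to_string(parsed)
```

In this sketch the restored string equals the input because no line is rewritten; the actual from_string may normalize headers or drop images, so an exact round trip should not be assumed for the library itself.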