Reference for Chunk

A Chunk is the object returned by chunknorris's chunkers. It holds the elements that make up a chunk: its text content, its headers, the pages it comes from (for paginated documents), and so on. In most cases, calling Chunk.get_text() is all you need: it returns the chunk's cleaned content as text, preceded by its headers.

Bases: BaseModel

JSON schema:
{
  "$defs": {
    "MarkdownLine": {
      "properties": {
        "text": {
          "description": "the text content of the line",
          "title": "Text",
          "type": "string"
        },
        "line_idx": {
          "description": "the index of the line in the markdown string",
          "title": "Line Idx",
          "type": "integer"
        },
        "isin_code_block": {
          "description": "whether or not the line belongs to a code block",
          "title": "Isin Code Block",
          "type": "boolean"
        },
        "page": {
          "anyOf": [
            {
              "type": "integer"
            },
            {
              "type": "null"
            }
          ],
          "description": "the page the line belongs to (if markdown comes from converted paginated document)",
          "title": "Page"
        }
      },
      "required": [
        "text",
        "line_idx",
        "isin_code_block",
        "page"
      ],
      "title": "MarkdownLine",
      "type": "object"
    }
  },
  "properties": {
    "headers": {
      "items": {
        "$ref": "#/$defs/MarkdownLine"
      },
      "title": "Headers",
      "type": "array"
    },
    "content": {
      "items": {
        "$ref": "#/$defs/MarkdownLine"
      },
      "title": "Content",
      "type": "array"
    },
    "start_line": {
      "title": "Start Line",
      "type": "integer"
    }
  },
  "required": [
    "headers",
    "content",
    "start_line"
  ],
  "title": "Chunk",
  "type": "object"
}
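
The schema above describes two nested record types. As a rough illustration, the sketch below mirrors it with standard-library dataclasses. It is a stand-in for reading the schema, not chunknorris's actual pydantic model, and `chunk_from_dict` is a hypothetical helper that does not exist in the library.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass
class MarkdownLine:
    # Stand-in mirroring the "MarkdownLine" definition in the schema.
    text: str
    line_idx: int
    isin_code_block: bool
    page: Optional[int]  # None when the source document is not paginated


@dataclass
class Chunk:
    # Stand-in mirroring the three required "Chunk" properties.
    headers: list[MarkdownLine]
    content: list[MarkdownLine]
    start_line: int


def chunk_from_dict(payload: dict) -> Chunk:
    """Hypothetical helper: build the stand-in Chunk from a schema-shaped dict."""
    return Chunk(
        headers=[MarkdownLine(**line) for line in payload["headers"]],
        content=[MarkdownLine(**line) for line in payload["content"]],
        start_line=payload["start_line"],
    )


# Illustrative payload; all values are made up for the example.
payload = {
    "headers": [
        {"text": "# Title", "line_idx": 0, "isin_code_block": False, "page": None}
    ],
    "content": [
        {"text": "Some body text.", "line_idx": 1, "isin_code_block": False, "page": None}
    ],
    "start_line": 0,
}
chunk = chunk_from_dict(payload)
```

Note that `page` is required by the schema but may be null, so a non-paginated source still has to provide it explicitly.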

Config:

  • arbitrary_types_allowed: True

Fields:

  • headers (list[MarkdownLine])
  • content (list[MarkdownLine])
  • start_line (int)

word_count property

Gets the number of words in the chunk's content (headers not included).
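
To illustrate what "headers not included" means, here is a minimal sketch that counts whitespace-separated words over the content lines only. The function name `count_content_words` and the whitespace-splitting rule are assumptions for the example; the reference does not state how the library tokenizes words.

```python
def count_content_words(content_lines: list[str]) -> int:
    """Count whitespace-separated words across the content lines.

    Assumption: a "word" is any run of non-whitespace characters.
    """
    return sum(len(line.split()) for line in content_lines)


headers = ["# Installation"]  # deliberately left out of the count
content = ["Run the installer.", "", "Then restart."]

total = count_content_words(content)
print(total)  # 5
```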

get_text(remove_links=False, prepend_headers=True)

Gets the text of the chunk.

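Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| remove_links | bool | If True, markdown links are removed (the link text is kept). | False |
| prepend_headers | bool | If True, the chunk's headers are placed before its content. | True |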

Returns:

| Type | Description |
|---|---|
| str | the text |
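
The effect of the two flags can be sketched on plain strings. This is a stand-in under stated assumptions, not the library's implementation: `get_text_sketch` is a hypothetical function, lines are assumed to be joined with newlines, and links are assumed to use the inline `[text](target)` form.

```python
import re

# Assumed pattern for inline markdown links: [text](target)
LINK = re.compile(r"\[([^\]]*)\]\([^)]*\)")


def get_text_sketch(headers, content, remove_links=False, prepend_headers=True):
    """Stand-in for Chunk.get_text that operates on lists of strings."""
    lines = (headers if prepend_headers else []) + content
    text = "\n".join(lines)  # the separator is an assumption
    if remove_links:
        # Keep only the visible link text (capture group 1).
        text = LINK.sub(r"\1", text)
    return text


headers = ["# Guide", "## Setup"]
content = ["See the [docs](https://example.com/docs) first."]

full = get_text_sketch(headers, content)
bare = get_text_sketch(headers, content, remove_links=True, prepend_headers=False)
```

With the defaults, `full` carries the header lines followed by the untouched content; `bare` drops the headers and reduces the link to its visible text.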

Removes the markdown format of the links in the text.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| text | str | the text to find the links in | required |

Returns:

| Type | Description |
|---|---|
| str | the formatted text |
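
One way to remove link formatting while keeping the link text is a regular-expression substitution. The sketch below is an assumption about how this could be done, not the library's code: `strip_markdown_links` is a hypothetical name, only inline links are handled (reference-style links are ignored), and skipping image syntax is a choice made for this example.

```python
import re

# Inline link: [text](target) or [text](target "title").
# Group 1 captures the visible text. The (?<!!) lookbehind leaves image
# syntax ![alt](src) untouched; the reference does not say how the
# library treats images, so this is an assumption.
_INLINE_LINK = re.compile(r'(?<!!)\[([^\]]+)\]\([^)\s]+(?:\s+"[^"]*")?\)')


def strip_markdown_links(text: str) -> str:
    """Replace each inline markdown link with its visible text."""
    return _INLINE_LINK.sub(r"\1", text)


sample = 'Read the [guide](https://example.com/guide) and the [FAQ](faq.md "Help").'
cleaned = strip_markdown_links(sample)
```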