
Reference for PdfParser

PdfParser's functionality is split across several classes. PdfParser itself combines them into a single document parsing pipeline.

Bases: PdfLinkExtraction, PdfTableExtraction, PdfTocExtraction, PdfPlotter, PdfExport, DocSpecsExtraction, PdfParserState

Class that parses the document.

__init__(*, extract_tables=True, table_finder=TableFinder(), add_headers=True, use_ocr='auto', ocr_language='fra+eng', body_line_spacing=None)

Initializes a PDF parser.

Parameters:

Name Type Description Default
extract_tables bool

whether or not tables should be extracted. Defaults to True.

True
add_headers bool

if True, the parser will try to find a table of contents, either in the document itself or in its metadata, and style the headers accordingly. Defaults to True.

True
use_ocr str

whether or not OCR should be used. OCR makes it possible to detect text in images, but keep in mind that this may include text you do not actually want, such as text from screenshots. Must be one of ["always", "auto", "never"]. Defaults to "auto".

'auto'
ocr_language str, optional

the languages to consider for OCR. Must be a string of 3-letter language codes separated by "+". Example: "fra+eng+ita". Defaults to "fra+eng".

'fra+eng'
body_line_spacing float, optional

the size of the space between 2 lines of the body of the document. Generally around 1. If None, an automatic method will try to find it. Tweak this parameter for better merging of lines into blocks.

None
table_finder TableFinder | None

the table finder to use for parsing the tables. If None, defaults to a TableFinder with default parameters.

TableFinder()
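The constraints stated above for use_ocr and ocr_language can be sketched as a small standalone check. This is only an illustration of the documented formats: validate_ocr_options is a hypothetical helper, not part of PdfParser, and it does not claim to reproduce what check_ocr_config_is_valid() does internally. It also assumes lowercase codes, which the examples above suggest but the description does not state.

```python
import re

# Allowed values for use_ocr, as listed in the parameter description.
ALLOWED_USE_OCR = ("always", "auto", "never")

# One or more 3-letter codes joined by "+", e.g. "fra+eng+ita".
# Assumption: codes are lowercase, as in the documented examples.
_OCR_LANGUAGE_PATTERN = re.compile(r"[a-z]{3}(\+[a-z]{3})*")


def validate_ocr_options(use_ocr: str = "auto", ocr_language: str = "fra+eng") -> list[str]:
    """Hypothetical helper: validate both options and return the language codes."""
    if use_ocr not in ALLOWED_USE_OCR:
        raise ValueError(f"use_ocr must be one of {ALLOWED_USE_OCR}, got {use_ocr!r}")
    if not _OCR_LANGUAGE_PATTERN.fullmatch(ocr_language):
        raise ValueError(
            f"ocr_language must be 3-letter codes separated by '+', got {ocr_language!r}"
        )
    # Split the validated string into its individual language codes.
    return ocr_language.split("+")
```

Running such a check before constructing the parser surfaces a malformed language string early, rather than during parsing.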

check_ocr_config_is_valid()

Check that the OCR configuration is valid.

cleanup_memory()

Cleans up memory by resetting all objects created to parse the document.

parse_file(filepath, page_start=0, page_end=None)

Parses a PDF document and returns the parsed MarkdownDoc object.

Parameters:

Name Type Description Default
filepath str

the path to the file to parse.

required
page_start int

the page to start parsing from. Defaults to 0.

0
page_end int

the page at which to stop parsing. If None, parses until the last page. Defaults to None.

None

Returns:

Name Type Description
MarkdownDoc MarkdownDoc

The MarkdownDoc to be passed to MarkdownChunker.

parse_string(string, page_start=0, page_end=None)

Parses a byte string obtained from a PDF document and returns the parsed MarkdownDoc object.

Parameters:

Name Type Description Default
string bytes

the content of a PDF document, as bytes.

required
page_start int

the page to start parsing from. Defaults to 0.

0
page_end int

the page at which to stop parsing. If None, parses until the last page. Defaults to None.

None

Returns:

Name Type Description
MarkdownDoc MarkdownDoc

The MarkdownDoc to be passed to MarkdownChunker.
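As a minimal sketch of how parse_string() might be called, the helper below reads a PDF from disk as raw bytes and forwards them with the documented page_start and page_end arguments. parse_pdf_bytes is a hypothetical wrapper introduced for illustration. The parser argument is assumed to be an already constructed PdfParser (or any object exposing the same parse_string signature), so the sketch does not depend on a particular import path.

```python
from pathlib import Path
from typing import Any, Optional


def parse_pdf_bytes(
    parser: Any,
    path: str,
    page_start: int = 0,
    page_end: Optional[int] = None,
) -> Any:
    """Hypothetical wrapper: read a PDF as bytes and pass it to parser.parse_string()."""
    # parse_string() expects the document content as bytes, not a file path.
    data = Path(path).read_bytes()
    # Forward the page range unchanged; page_end=None means "until the last page".
    return parser.parse_string(data, page_start=page_start, page_end=page_end)
```

The same call shape applies when the bytes come from somewhere other than disk, for example an upload or a network response. In that case the read_bytes() step is simply replaced by whatever produced the bytes.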