Reference for PdfParser
The functionality of PdfParser is split among several classes. PdfParser itself wraps that functionality into a document parsing pipeline.
Bases: PdfLinkExtraction, PdfTableExtraction, PdfTocExtraction, PdfPlotter, PdfExport, DocSpecsExtraction, PdfParserState
Class that parses the document.
__init__(*, extract_tables=True, table_finder=TableFinder(), add_headers=True, use_ocr='auto', ocr_language='fra+eng', body_line_spacing=None)
Initializes a PDF parser.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
extract_tables | bool | Whether tables should be extracted. | True |
add_headers | bool | If True, the parser tries to find a table of contents, either in the document or in its metadata, and styles the headers accordingly. | True |
use_ocr | str | Whether OCR should be used. OCR detects text in images, but keep in mind that this may include text you do not want, such as text in screenshots. Must be one of "always", "auto" or "never". | 'auto' |
ocr_language | str, optional | The languages to consider for OCR, given as 3-letter language codes separated by "+". Example: "fra+eng+ita". | 'fra+eng' |
body_line_spacing | float, optional | The spacing between two lines of the document body, generally around 1. If None, an automatic method tries to find it. Tweak this parameter to improve the merging of lines into blocks. | None |
table_finder | TableFinder \| None | The table finder used to parse tables. If None, defaults to a TableFinder with default parameters. | TableFinder() |
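The constraints on use_ocr and ocr_language stated in the table can be checked before the parser is constructed. The following is a minimal sketch: `build_parser_kwargs` is a hypothetical helper, not part of PdfParser, and the lowercase-only pattern and the positivity check on body_line_spacing are assumptions of this example rather than documented rules.

```python
import re

# Allowed values for use_ocr, as listed in the parameter table.
USE_OCR_MODES = ("always", "auto", "never")

# One or more 3-letter codes joined by "+", e.g. "fra+eng+ita".
# Restricting to lowercase ASCII letters is an assumption of this sketch.
_OCR_LANGUAGE_PATTERN = re.compile(r"[a-z]{3}(\+[a-z]{3})*")


def build_parser_kwargs(use_ocr="auto", ocr_language="fra+eng", body_line_spacing=None):
    """Hypothetical helper: validate settings and return keyword arguments."""
    if use_ocr not in USE_OCR_MODES:
        raise ValueError(f"use_ocr must be one of {USE_OCR_MODES}, got {use_ocr!r}")
    if not _OCR_LANGUAGE_PATTERN.fullmatch(ocr_language):
        raise ValueError(
            f"ocr_language must be 3-letter codes separated by '+', got {ocr_language!r}"
        )
    # Assumption: a line spacing only makes sense when it is strictly positive.
    if body_line_spacing is not None and body_line_spacing <= 0:
        raise ValueError("body_line_spacing must be positive or None")
    return {
        "use_ocr": use_ocr,
        "ocr_language": ocr_language,
        "body_line_spacing": body_line_spacing,
    }
```

The returned dictionary is meant to be unpacked into the keyword-only constructor, for example `PdfParser(**build_parser_kwargs(use_ocr="never"))`.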
check_ocr_config_is_valid()
Check that the OCR configuration is valid.
cleanup_memory()
Cleans up memory by resetting all objects created to parse the document.
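When one parser instance is reused for several documents, cleanup_memory() can be called after each one. The sketch below is one possible pattern, not a documented requirement: `parse_many` is a hypothetical helper, and this reference does not state whether the parser already releases its state internally.

```python
def parse_many(parser, filepaths):
    """Hypothetical helper: parse several files with a single parser instance.

    `parser` is expected to expose parse_file() and cleanup_memory(),
    as PdfParser does according to this reference.
    """
    docs = []
    for filepath in filepaths:
        try:
            docs.append(parser.parse_file(filepath))
        finally:
            # Release per-document state even if parsing raised an error.
            parser.cleanup_memory()
    return docs
```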
parse_file(filepath, page_start=0, page_end=None)
Parses a PDF document and returns the parsed MarkdownDoc object.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
filepath | str | The path to the file to parse. | required |
page_start | int | The page to start parsing from. | 0 |
page_end | int | The page to stop parsing at. None parses until the last page. | None |
Returns:
Name | Type | Description |
---|---|---|
MarkdownDoc | MarkdownDoc | The MarkdownDoc to be passed to MarkdownChunker. |
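A call to parse_file with an explicit page range might be wrapped as follows. This is a sketch under stated assumptions: `parse_pdf_pages` is a hypothetical helper, the argument checks are this example's own, and whether page_end is inclusive or exclusive is not stated in this reference.

```python
from pathlib import Path


def parse_pdf_pages(parser, filepath, page_start=0, page_end=None):
    """Hypothetical helper: check the arguments, then delegate to parser.parse_file()."""
    path = Path(filepath)
    if not path.is_file():
        raise FileNotFoundError(f"No such file: {path}")
    if page_start < 0:
        raise ValueError("page_start must be >= 0")
    # Assumption: a range that ends before it starts is a caller error.
    if page_end is not None and page_end < page_start:
        raise ValueError("page_end must be None or >= page_start")
    # filepath is documented as a str, so the Path is converted back.
    return parser.parse_file(str(path), page_start=page_start, page_end=page_end)
```

With a real parser instance, `parse_pdf_pages(parser, "report.pdf", page_start=2)` would return the MarkdownDoc described above; the file name is only an illustration.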
parse_string(string, page_start=0, page_end=None)
Parses a byte string obtained from a PDF document and returns the corresponding parsed MarkdownDoc object.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
string | bytes | The content of the PDF document, as bytes. | required |
page_start | int | The page to start parsing from. | 0 |
page_end | int | The page to stop parsing at. None parses until the last page. | None |
Returns:
Name | Type | Description |
---|---|---|
MarkdownDoc | MarkdownDoc | The MarkdownDoc to be passed to MarkdownChunker. |
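parse_string is useful when the PDF is already in memory, for instance after a download. The sketch below assumes a hypothetical helper, `parse_pdf_bytes`, that checks for the `%PDF-` marker at the start of the data before delegating. This strict check is a choice of the example, not documented behaviour of PdfParser.

```python
def parse_pdf_bytes(parser, data, page_start=0, page_end=None):
    """Hypothetical helper: sanity-check in-memory PDF bytes, then call parse_string()."""
    if not isinstance(data, (bytes, bytearray)):
        raise TypeError("data must be bytes, not " + type(data).__name__)
    # PDF files start with a "%PDF-" header; requiring it at offset 0 is an
    # assumption of this sketch and may be stricter than some PDF readers.
    if not data.startswith(b"%PDF-"):
        raise ValueError("data does not look like a PDF document")
    return parser.parse_string(bytes(data), page_start=page_start, page_end=page_end)
```

A typical call would read the file in binary mode first, for example `parse_pdf_bytes(parser, open(path, "rb").read())`.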