Reference for PdfPipeline
Compared to the BasePipeline, the PdfPipeline handles extra functionality specific to PDF documents. For example, it switches between chunking by headers and chunking by page, depending on whether headers have been found and on whether the PDF is derived from PowerPoint.
This pipeline is meant to be used for PDF files. First, it uses a parser to convert the PDF to markdown. Second, it builds chunks based on headers if any have been found, or on pages if not.
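The choice between the two chunking modes can be sketched as a small decision function. This is an illustration only, not the pipeline's actual code: `Strategy` and `choose_strategy` are hypothetical names, and the rule that a PowerPoint-derived PDF always falls back to page chunking is an assumption drawn from the description above.

```python
from enum import Enum


class Strategy(Enum):
    """Hypothetical labels for the two chunking modes described above."""
    HEADERS = "headers"
    PAGES = "pages"


def choose_strategy(headers_found: bool, from_powerpoint: bool) -> Strategy:
    """Pick a chunking mode (hypothetical helper, not part of PdfPipeline)."""
    # Assumption: slide decks exported to PDF are chunked one page at a time,
    # and so is any document in which no usable headers were detected.
    if from_powerpoint or not headers_found:
        return Strategy.PAGES
    return Strategy.HEADERS


print(choose_strategy(headers_found=True, from_powerpoint=False))
print(choose_strategy(headers_found=True, from_powerpoint=True))
```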
chunk_by_page()
chunk_file(filepath, page_start=0, page_end=None)
Chunks a PDF file. It first converts the PDF file to markdown using the PdfParser, and then chunks the resulting markdown.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
filepath | str | the path to the pdf file to chunk | required |
page_start | int | the page to start parsing from. Defaults to 0. | 0 |
page_end | int | the page to stop parsing. None to parse until last page. Defaults to None. | None |
Returns:
Name | Type | Description |
---|---|---|
Chunks | list[Chunk] | the chunks |
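The page-range defaults behave much like a Python slice, which the sketch below illustrates on a plain list of page strings. `select_pages` is a hypothetical helper. Whether `page_end` is inclusive or exclusive is not stated in this reference, so the exclusive behaviour shown here is an assumption.

```python
from typing import Optional, Sequence


def select_pages(
    pages: Sequence[str],
    page_start: int = 0,
    page_end: Optional[int] = None,
) -> list[str]:
    """Return the pages to parse (hypothetical helper).

    Assumption: page_end is exclusive, as in Python slicing.
    None means "until the last page".
    """
    return list(pages[page_start:page_end])


pages = ["page 0", "page 1", "page 2", "page 3"]
print(select_pages(pages))        # defaults: every page
print(select_pages(pages, 1, 3))  # pages 1 and 2 under the exclusive-end assumption
```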
chunk_string(string, page_start=0, page_end=None)
Chunks a PDF byte stream. It first converts the string to markdown using the PdfParser, and then chunks the resulting markdown.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
string | str | the PDF byte stream to chunk | required |
page_start | int | the page to start parsing from. Defaults to 0. | 0 |
page_end | int | the page to stop parsing. None to parse until last page. Defaults to None. | None |
Returns:
Name | Type | Description |
---|---|---|
Chunks | list[Chunk] | the chunks |
chunk_with_headers()
Uses the MarkdownChunker to chunk the document based on headers.
Returns:
Name | Type | Description |
---|---|---|
Chunks | list[Chunk] | the list of chunks |
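To give a feel for header-based chunking, here is a minimal sketch that splits markdown on ATX headers (`#` to `######`). It is not the MarkdownChunker's implementation: `split_on_headers` and the dictionary layout are hypothetical, and the sketch ignores details such as `#` characters inside fenced code blocks.

```python
import re

# ATX header: one to six '#' characters, whitespace, then the title.
HEADER_RE = re.compile(r"^(#{1,6})\s+(.*)$")


def split_on_headers(markdown: str) -> list[dict]:
    """Split markdown into one section per header (illustrative only)."""
    sections: list[dict] = []
    current = {"title": None, "level": 0, "lines": []}
    for line in markdown.splitlines():
        match = HEADER_RE.match(line)
        if match:
            # Close the running section before opening a new one.
            if current["title"] is not None or current["lines"]:
                sections.append(current)
            current = {
                "title": match.group(2).strip(),
                "level": len(match.group(1)),
                "lines": [],
            }
        else:
            current["lines"].append(line)
    if current["title"] is not None or current["lines"]:
        sections.append(current)
    return [
        {"title": s["title"], "level": s["level"], "text": "\n".join(s["lines"]).strip()}
        for s in sections
    ]


doc = "# Intro\nHello\n## Details\nMore text\n"
for section in split_on_headers(doc):
    print(section)
```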
headers_have_been_found()
Determines whether enough headers have been found to chunk the document using the MarkdownChunker. Returns True if more than half of the headers have been found.
Returns:
Name | Type | Description |
---|---|---|
bool | bool | True if enough headers have been found in the document. |
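The "more than half" rule can be sketched as a ratio test. This reference does not say where the list of expected headers comes from; the sketch assumes one is available (for example from the document's table of contents) and uses hypothetical names throughout.

```python
def enough_headers_found(expected_headers: list[str], found_headers: list[str]) -> bool:
    """Return True when more than half of the expected headers were found.

    Hypothetical helper: the source of `expected_headers` is an assumption.
    """
    if not expected_headers:
        # Assumption: with nothing to match, header-based chunking is not used.
        return False
    expected = set(expected_headers)
    matched = expected & set(found_headers)
    return len(matched) / len(expected) > 0.5


print(enough_headers_found(["Intro", "Methods", "Results"], ["Intro", "Methods"]))
print(enough_headers_found(["Intro", "Methods"], ["Intro"]))
```

Note that exactly half is not enough under this reading, since the comparison is strict.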
save_chunks(chunks, output_filename, remove_links=False)
Saves the chunks at the designated location as a JSON list of chunks.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
chunks | Chunks | the chunks. | required |
output_filename | str | the JSON file in which to save the chunks. Must be a .json file. | required |
remove_links | bool | Whether or not links should be removed from the chunk's text content. | False |
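A rough sketch of what saving chunks as a JSON list with optional link removal could look like. The `Chunk` dataclass, `save_chunks_sketch`, and the link-stripping regex are stand-ins invented for this example. The real Chunk type and the exact link handling are not described in this reference.

```python
import json
import re
from dataclasses import asdict, dataclass
from pathlib import Path

# Inline markdown link: [label](target). Only the label is kept.
LINK_RE = re.compile(r"\[([^\]]*)\]\([^)]*\)")


@dataclass
class Chunk:
    """Hypothetical stand-in for the library's Chunk type."""
    text: str


def strip_links(text: str) -> str:
    """Replace inline markdown links with their label text."""
    return LINK_RE.sub(r"\1", text)


def save_chunks_sketch(chunks, output_filename, remove_links=False):
    """Write chunks as a JSON list (illustrative, not the library's code)."""
    path = Path(output_filename)
    if path.suffix.lower() != ".json":
        raise ValueError("output_filename must end with .json")
    payload = []
    for chunk in chunks:
        record = asdict(chunk)
        if remove_links:
            record["text"] = strip_links(record["text"])
        payload.append(record)
    path.write_text(json.dumps(payload, ensure_ascii=False, indent=2), encoding="utf-8")
```

Calling `save_chunks_sketch(chunks, "chunks.json", remove_links=True)` would write one JSON object per chunk, with link targets dropped from the text.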