
Reference for PdfPipeline

Compared to the BasePipeline, the PdfPipeline adds functionality specific to PDF documents. For example, it switches between chunking by headers and chunking by page, depending on whether headers have been found and whether the PDF was derived from a PowerPoint file.

This pipeline is meant to be used on PDF files. First, it uses a parser to convert the PDF to markdown. Second, it builds chunks based on headers if any have been found, or based on pages if not.
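The choice between the two strategies can be sketched as follows. This is a minimal illustration, not the library's implementation: the function name `choose_strategy`, the `from_powerpoint` flag, and the rule that PowerPoint-derived PDFs are always chunked by page are assumptions made for the example.

```python
import re

# ATX-style markdown header: one to six '#' followed by whitespace and text.
HEADER_RE = re.compile(r"^#{1,6}\s+\S", re.MULTILINE)


def choose_strategy(markdown: str, from_powerpoint: bool = False) -> str:
    """Return "headers" or "page" for a parsed PDF (hypothetical helper)."""
    # Assumption: slide decks have one topic per page, so page chunks fit best.
    if from_powerpoint:
        return "page"
    # Otherwise fall back to pages only when no header was detected at all.
    return "headers" if HEADER_RE.search(markdown) else "page"
```

In this sketch a single detected header is enough to select header-based chunking; the actual pipeline may apply a stricter threshold.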

chunk_by_page()

Build one chunk per page.

Returns:

| Name | Type | Description |
| --- | --- | --- |
| Chunks | list[Chunk] | The list of chunks. |
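As a rough illustration of page-based chunking, the sketch below builds one chunk per page from a list of page texts. The `Chunk` dataclass here is a hypothetical stand-in with assumed fields (`page`, `content`); the library's own `Chunk` type may look different.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    """Hypothetical stand-in for the library's Chunk type."""
    page: int
    content: str


def chunk_by_page(pages: list[str]) -> list[Chunk]:
    """Build exactly one chunk per page, numbering pages from 0."""
    return [Chunk(page=i, content=text.strip()) for i, text in enumerate(pages)]
```

Every page yields a chunk, even an empty one, so chunk indices always match page numbers.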

chunk_file(filepath, page_start=0, page_end=None)

Chunks a PDF file. It first converts the PDF file to markdown using the PdfParser, and then chunks the markdown.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| filepath | str | The path to the PDF file to chunk. | required |
| page_start | int | The page to start parsing from. Defaults to 0. | 0 |
| page_end | int | The page at which to stop parsing. None to parse until the last page. Defaults to None. | None |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| Chunks | list[Chunk] | The chunks. |
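The `page_start` and `page_end` defaults behave much like the bounds of a Python slice. The sketch below shows that reading; the helper `select_pages` is hypothetical, and treating `page_end` as exclusive is an assumption, since the reference does not say whether the stop page itself is parsed.

```python
from typing import Optional


def select_pages(pages: list[str], page_start: int = 0,
                 page_end: Optional[int] = None) -> list[str]:
    """Return the pages to parse (hypothetical helper)."""
    # Assumption: page_end is exclusive, like the end of a slice.
    # None means "until the last page".
    return pages[page_start:page_end]
```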

chunk_string(string, page_start=0, page_end=None)

Chunks a PDF byte stream. It first converts the byte stream to markdown using the PdfParser, and then chunks the markdown.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| string | str | The PDF content to chunk, passed as a byte stream. | required |
| page_start | int | The page to start parsing from. Defaults to 0. | 0 |
| page_end | int | The page at which to stop parsing. None to parse until the last page. Defaults to None. | None |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| Chunks | list[Chunk] | The chunks. |

chunk_with_headers()

Uses the MarkdownChunker to chunk the document based on headers.

Returns:

| Name | Type | Description |
| --- | --- | --- |
| Chunks | list[Chunk] | The list of chunks. |
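To give an idea of what header-based chunking involves, here is a minimal sketch that splits markdown on ATX headers (`#` to `######`). It is not the MarkdownChunker itself: the function name `split_on_headers` and the dictionary fields are assumptions for illustration.

```python
import re

# Capture the header level (number of '#') and the header title.
HEADER_RE = re.compile(r"^(#{1,6})\s+(.+)$", re.MULTILINE)


def split_on_headers(markdown: str) -> list[dict]:
    """Split markdown into one section per header (hypothetical helper)."""
    matches = list(HEADER_RE.finditer(markdown))
    sections = []
    for i, match in enumerate(matches):
        # A section runs until the next header, or to the end of the document.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(markdown)
        sections.append({
            "level": len(match.group(1)),
            "title": match.group(2).strip(),
            "content": markdown[match.end():end].strip(),
        })
    # Note: text located before the first header is not returned by this sketch.
    return sections
```

This simple version does not distinguish real headers from `#` lines inside fenced code blocks, which a production chunker would need to handle.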

headers_have_been_found()

Determines whether enough headers have been found to chunk the document using the MarkdownChunker. Returns True if more than half of the headers have been found.

Returns:

| Name | Type | Description |
| --- | --- | --- |
| bool | bool | True if enough headers have been found in the document. |
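The "more than half" rule can be sketched as a simple ratio check. The reference does not say where the list of expected headers comes from, so the sketch assumes it is supplied by the caller (for example, taken from the PDF outline); the function name `enough_headers_found` and its parameters are hypothetical.

```python
def enough_headers_found(expected_headers: list[str], markdown: str) -> bool:
    """Return True if more than half of the expected headers appear as
    markdown headers in the parsed text (hypothetical helper)."""
    if not expected_headers:
        return False
    # Normalize detected header titles: drop leading '#', trim, lowercase.
    detected = {
        line.lstrip("#").strip().lower()
        for line in markdown.splitlines()
        if line.startswith("#")
    }
    found = sum(1 for title in expected_headers
                if title.strip().lower() in detected)
    # Strictly more than half, so exactly 50% is not enough.
    return found / len(expected_headers) > 0.5
```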

save_chunks(chunks, output_filename, remove_links=False)

Saves the chunks to the designated location as a JSON list of chunks.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| chunks | Chunks | The chunks to save. | required |
| output_filename | str | The file in which to save the chunks. Must be a JSON file. | required |
| remove_links | bool | Whether links should be removed from the chunks' text content. | False |
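A minimal sketch of this behavior is shown below, under several assumptions: chunks are represented as plain dictionaries with a `content` key, "removing links" means replacing a markdown link with its visible text, and the `.json` requirement is enforced with an extension check. The names `strip_links` and `save_chunks_sketch` are hypothetical and not part of the library.

```python
import json
import re

# Markdown link [text](url); the lookbehind leaves images ![alt](src) untouched.
LINK_RE = re.compile(r"(?<!!)\[([^\]]+)\]\([^)]+\)")


def strip_links(text: str) -> str:
    """Replace each markdown link with its visible text."""
    return LINK_RE.sub(r"\1", text)


def save_chunks_sketch(chunks: list[dict], output_filename: str,
                       remove_links: bool = False) -> None:
    """Write chunks to a JSON file as a list (hypothetical helper)."""
    if not output_filename.endswith(".json"):
        raise ValueError("output_filename must end with .json")
    if remove_links:
        # Build new dicts so the caller's chunks are not modified.
        chunks = [{**c, "content": strip_links(c["content"])} for c in chunks]
    with open(output_filename, "w", encoding="utf-8") as f:
        json.dump(chunks, f, ensure_ascii=False, indent=2)
```

Calling `save_chunks_sketch(chunks, "chunks.json", remove_links=True)` would write the list with link targets dropped and link text kept.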