Reference for `PdfPipeline`

Compared to the BasePipeline, the PdfPipeline adds functionality specific to PDF documents. For example, it switches between chunking by headers and chunking by page, depending on whether headers have been found and whether the PDF is derived from a PowerPoint file.

This pipeline is meant to be used on PDF files. First, it uses a parser to convert the PDF to Markdown. Second, it builds chunks based on headers if any have been found, or on pages if not.
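The choice between the two chunking strategies can be pictured with the minimal sketch below. It is not the library's code: the function `choose_strategy` and its two boolean inputs are hypothetical, and the rule that a PowerPoint-derived PDF is always chunked by page is an assumption drawn from the description above.

```python
def choose_strategy(headers_found: bool, from_powerpoint: bool) -> str:
    """Return the name of the chunking strategy to apply (hypothetical helper)."""
    # Assumption: slides exported to PDF are treated as one chunk per page,
    # whether or not headers were detected.
    if from_powerpoint or not headers_found:
        return "chunk_by_page"
    return "chunk_with_headers"


print(choose_strategy(headers_found=True, from_powerpoint=False))
print(choose_strategy(headers_found=False, from_powerpoint=False))
```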

`chunk_by_page()`

Builds one chunk per page.

Returns:

Name	Type	Description
`Chunks`	`list[Chunk]`	the list of chunks.
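As an illustration of page-based chunking, here is a minimal sketch that assumes the parsed document is already available as a list of per-page Markdown strings. The `PageChunk` dataclass and the `chunk_pages` function are hypothetical stand-ins, not the library's `Chunk` type or method.

```python
from dataclasses import dataclass


@dataclass
class PageChunk:
    """Hypothetical stand-in for the library's chunk type."""
    text: str
    page: int


def chunk_pages(pages: list[str]) -> list[PageChunk]:
    """Build exactly one chunk per page, numbering pages from 0."""
    return [PageChunk(text=text, page=index) for index, text in enumerate(pages)]


chunks = chunk_pages(["# Title\nFirst page.", "Second page."])
print(len(chunks), chunks[1].page)
```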

`chunk_file(filepath, page_start=0, page_end=None)`

Chunks a PDF file. It first converts the PDF file to Markdown using the PdfParser, and then chunks the resulting Markdown.

Parameters:

Name	Type	Description	Default
`filepath`	`str`	the path to the pdf file to chunk	required
`page_start`	`int`	the page to start parsing from. Defaults to 0.	`0`
`page_end`	`int`	the page to stop parsing. None to parse until last page. Defaults to None.	`None`

Returns:

Name	Type	Description
`Chunks`	`list[Chunk]`	the chunks
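The effect of `page_start` and `page_end` can be sketched as a slice over the document's pages. This is only an illustration: `select_pages` is a hypothetical helper, and treating `page_end` as exclusive (as in Python slicing) is an assumption, since the reference does not say whether the last page is included.

```python
from typing import Optional


def select_pages(
    pages: list[str], page_start: int = 0, page_end: Optional[int] = None
) -> list[str]:
    """Keep the pages from page_start up to page_end (hypothetical helper)."""
    # Assumption: page_end is exclusive; None means "until the last page".
    return pages[page_start:page_end]


pages = ["page 0", "page 1", "page 2"]
print(select_pages(pages))           # all pages
print(select_pages(pages, 1))        # from the second page to the end
print(select_pages(pages, 0, 2))     # the first two pages
```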

`chunk_string(string, page_start=0, page_end=None)`

Chunks a PDF byte stream. It first converts the byte stream to Markdown using the PdfParser, and then chunks the resulting Markdown.

Parameters:

Name	Type	Description	Default
`string`	`str`	the PDF byte stream to chunk	required
`page_start`	`int`	the page to start parsing from. Defaults to 0.	`0`
`page_end`	`int`	the page to stop parsing. None to parse until last page. Defaults to None.	`None`

Returns:

Name	Type	Description
`Chunks`	`list[Chunk]`	the chunks

`chunk_with_headers()`

Uses the MarkdownChunker to chunk the document based on headers.

Returns:

Name	Type	Description
`Chunks`	`list[Chunk]`	the list of chunks
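To give a rough idea of header-based chunking, the sketch below splits Markdown at each ATX header line (`#` to `######`). It is a simplified stand-in under that assumption, not the MarkdownChunker's actual algorithm, and `split_on_headers` is a hypothetical name.

```python
import re

# Matches ATX headers such as "# Title" or "### Section".
HEADER_RE = re.compile(r"^#{1,6}\s+\S")


def split_on_headers(markdown: str) -> list[str]:
    """Split markdown into sections, starting a new one at each header line."""
    sections: list[str] = []
    current: list[str] = []
    for line in markdown.splitlines():
        if HEADER_RE.match(line) and current:
            # A new header closes the section collected so far.
            sections.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current).strip())
    # Drop sections that contain only whitespace.
    return [section for section in sections if section]


for section in split_on_headers("# A\ntext a\n## B\ntext b"):
    print(repr(section))
```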

`headers_have_been_found()`

Determines whether enough headers have been found to chunk the document using MarkdownChunker. Returns True if more than half of the headers have been found.

Returns:

Name	Type	Description
`bool`	`bool`	True if enough headers have been found in the document.
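The "more than half" rule can be sketched as follows. The reference does not say which set of headers the half is measured against, so this example assumes a list of expected header titles (for instance, taken from a table of contents) that is compared with the header lines present in the Markdown. The function `enough_headers_found` and both of its parameters are hypothetical.

```python
def enough_headers_found(expected_headers: list[str], markdown: str) -> bool:
    """Return True if more than half of the expected headers appear as header lines."""
    if not expected_headers:
        # Nothing to look for, so header-based chunking is not justified.
        return False
    # Collect the titles of all header lines, without the leading '#' characters.
    found_titles = {
        line.lstrip("#").strip().lower()
        for line in markdown.splitlines()
        if line.startswith("#")
    }
    matched = sum(
        1 for title in expected_headers if title.strip().lower() in found_titles
    )
    return matched > len(expected_headers) / 2


markdown = "# Intro\nSome text.\n## Methods\nMore text."
print(enough_headers_found(["Intro", "Methods", "Results"], markdown))
```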

`save_chunks(chunks, output_filename, remove_links=False)`

Saves the chunks at the designated location as a JSON list of chunks.

Parameters:

Name	Type	Description	Default
`chunks`	`Chunks`	the chunks.	required
`output_filename`	`str`	the file in which to save the chunks. Must be a JSON file.	required
`remove_links`	`bool`	whether or not links should be removed from the chunks' text content.	`False`
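A minimal sketch of what saving chunks as a JSON list might look like is given below. It is an illustration under several assumptions, not the library's implementation: `TextChunk` and `save_chunks_sketch` are hypothetical names, each chunk is assumed to carry a single `text` field, and link removal is assumed to keep the link label while dropping the URL.

```python
import json
import re
from dataclasses import asdict, dataclass


@dataclass
class TextChunk:
    """Hypothetical stand-in for the library's chunk type."""
    text: str


# Matches Markdown links like [label](url), but not images like ![alt](url).
LINK_RE = re.compile(r"(?<!!)\[([^\]]*)\]\([^)]*\)")


def save_chunks_sketch(
    chunks: list[TextChunk], output_filename: str, remove_links: bool = False
) -> None:
    """Write the chunks to output_filename as a JSON list of objects."""
    if not output_filename.endswith(".json"):
        raise ValueError("output_filename must end with .json")
    records = []
    for chunk in chunks:
        record = asdict(chunk)
        if remove_links:
            # Assumption: keep the link label, drop the URL.
            record["text"] = LINK_RE.sub(r"\1", record["text"])
        records.append(record)
    with open(output_filename, "w", encoding="utf-8") as file:
        json.dump(records, file, ensure_ascii=False, indent=2)
```

Calling `save_chunks_sketch([TextChunk("see [docs](https://example.com)")], "chunks.json", remove_links=True)` would write a one-element list whose `text` no longer contains the URL.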
