
Reference for ExcelParser

The ExcelParser parses spreadsheets, such as .xlsx files. All sheets in the workbook are parsed.

Bases: AbstractSheetParser[bytes]

Parser for spreadsheets, such as Excel workbooks (.xlsx). For the list of supported file types, refer to https://pandas.pydata.org/docs/reference/api/pandas.read_excel.html

__init__(output_format='auto')

Initializes an Excel parser

Parameters:

Name Type Description Default
output_format Literal["markdown_table", "json_lines", "auto"]

The output format of the parsed document:
- markdown_table: uses tabula to build a Markdown-formatted table.
- json_lines: each row of the table is output as a JSON line. Better for chunking, as headers are preserved.
- auto: detects which format is more suitable. CSV-like sheets are converted to JSON lines.
Defaults to "auto".

'auto'
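
To make the difference between the two formats concrete, here is a minimal sketch that uses only the standard library. The helpers `rows_to_json_lines` and `rows_to_markdown_table` are hypothetical names introduced for illustration; they are not part of ExcelParser, and the exact strings the parser emits may differ.

```python
import json

# Hypothetical helpers for illustration only; not ExcelParser's implementation.


def rows_to_json_lines(header, rows):
    """Emit one JSON object per row, repeating the column names on every line."""
    return "\n".join(json.dumps(dict(zip(header, row))) for row in rows)


def rows_to_markdown_table(header, rows):
    """Emit a single Markdown table whose header appears only once, at the top."""
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    lines += ["| " + " | ".join(str(cell) for cell in row) + " |" for row in rows]
    return "\n".join(lines)


header = ["name", "qty"]
rows = [["apple", 3], ["pear", 5]]

json_lines = rows_to_json_lines(header, rows)
markdown_table = rows_to_markdown_table(header, rows)

print(json_lines)
print(markdown_table)
```

Because every JSON line carries its own column names, a chunk cut anywhere in the output still says what each value means. A chunk cut from the middle of the Markdown table loses the header row.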

convert_sheets_to_output_format(sheets)

Converts the sheets obtained from pandas.read_excel() to the specified output format.

Parameters:

Name Type Description Default
sheets dict[str, DataFrame]

the sheets returned from pd.read_excel(sheet_name=None).

required

Returns:

Name Type Description
str str

the formatted string
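
As a rough illustration of the input and output shapes, the sketch below takes a `dict[str, DataFrame]` and returns one string with each sheet rendered as JSON lines. The function name `sheets_to_json_lines` and the per-sheet heading layout are assumptions made for this example; the parser's actual formatting is not documented here.

```python
import pandas as pd


def sheets_to_json_lines(sheets: dict[str, pd.DataFrame]) -> str:
    """Hypothetical sketch: render every sheet as JSON lines under its own heading."""
    parts = []
    for sheet_name, frame in sheets.items():
        # orient="records" with lines=True writes one JSON object per row.
        body = frame.to_json(orient="records", lines=True).strip()
        # The "## <sheet name>" heading is an assumed layout, not the parser's.
        parts.append(f"## {sheet_name}\n\n{body}")
    return "\n\n".join(parts)


# Same shape as the mapping returned by pd.read_excel(sheet_name=None).
sheets = {
    "Fruits": pd.DataFrame({"name": ["apple", "pear"], "qty": [3, 5]}),
    "Tools": pd.DataFrame({"name": ["saw"], "qty": [1]}),
}

formatted = sheets_to_json_lines(sheets)
print(formatted)
```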

parse_file(filepath)

Parses an Excel-like file to Markdown.

Parameters:

Name Type Description Default
filepath str

the path to the Excel-like file.

required

Returns:

Name Type Description
MarkdownDoc MarkdownDoc

the Markdown-formatted Excel file.

parse_string(string)

Parses a byte string representing an Excel file.

Parameters:

Name Type Description Default
string bytes

the Excel file as a byte string.

required

Returns:

Name Type Description
MarkdownDoc MarkdownDoc

the Markdown-formatted Excel file.
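
One plausible way to load a workbook from bytes is to wrap them in an in-memory buffer before handing them to pandas. The sketch below shows that pattern; `read_sheets_from_bytes` is a hypothetical helper, and whether ExcelParser does exactly this internally is an assumption. Reading .xlsx content also requires an Excel engine such as openpyxl to be installed.

```python
import io

import pandas as pd


def read_sheets_from_bytes(data: bytes) -> dict[str, pd.DataFrame]:
    """Hypothetical sketch: load every sheet of an in-memory Excel workbook."""
    # pandas.read_excel accepts file-like objects, so the bytes are wrapped
    # in a BytesIO buffer instead of being written to disk first.
    buffer = io.BytesIO(data)
    # sheet_name=None returns a mapping {sheet_name: DataFrame} for all sheets.
    return pd.read_excel(buffer, sheet_name=None)
```

Bytes that do not form a valid workbook make pandas raise an exception, so callers handling untrusted uploads may want to catch errors around this call.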

read_file(filepath)

Reads the file at the provided path.

Parameters:

Name Type Description Default
filepath str

path to the file.

required

Returns:

Type Description
dict[str, DataFrame]

dict[str, pd.DataFrame]: a mapping of the form {sheet_name: corresponding_dataframe}.