
Reference for ExcelParser

The ExcelParser parses spreadsheets, such as .xlsx files. All sheets in the workbook are parsed.

Bases: AbstractParser

Parser for spreadsheets, such as Excel workbooks (.xlsx). For a list of supported file types, refer to https://pandas.pydata.org/docs/reference/api/pandas.read_excel.html

__init__(output_format='auto')

Initializes an Excel parser

Parameters:

Name Type Description Default
output_format Literal["markdown_table", "json_lines", "auto"]

the output format of the parsed document. Defaults to "auto".
- markdown_table: builds a markdown-formatted table with pd.DataFrame.to_markdown(), which relies on the tabulate package.
- json_lines: each row of the table is output as a JSON line. Better for chunking, as headers are preserved in every row.
- auto: detects which format is more suitable. CSV-like sheets are converted to JSON lines.

'auto'

convert_df_to_json_lines(df) staticmethod

Converts a DataFrame to json lines.

Parameters:

Name Type Description Default
df DataFrame

the dataframe to convert.

required

Returns:

Name Type Description
str str

the json lines.
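
To illustrate the JSON lines format, here is a minimal sketch built on pandas' DataFrame.to_json(). The helper name df_to_json_lines and the sample data are hypothetical, and this is not necessarily how convert_df_to_json_lines is implemented.

```python
import pandas as pd


def df_to_json_lines(df: pd.DataFrame) -> str:
    """Serialize each row as one JSON object per line (hypothetical helper)."""
    # orient="records" with lines=True emits one {"column": value, ...} object
    # per row, so every line carries its own column headers.
    return df.to_json(orient="records", lines=True, force_ascii=False)


df = pd.DataFrame({"name": ["bolt", "nut"], "qty": [10, 25]})
json_lines = df_to_json_lines(df)
```

Because each line repeats the column names, a chunk cut at any line boundary still tells the reader which value belongs to which column.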

convert_df_to_markdown_table(df) staticmethod

Converts a DataFrame to markdown. Wraps pandas' pd.DataFrame.to_markdown() method (which relies on the tabulate package) between pre- and post-processing steps.
- Preprocess: Remove in text columns
- Postprocess: Replace multiple spaces with 2 spaces.

Parameters:

Name Type Description Default
df DataFrame

the dataframe to convert.

required

Returns:

Name Type Description
str str

a markdown formatted table.
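
The pre- and post-processing around to_markdown() could look roughly like the sketch below. The description does not say exactly what the preprocessing removes, so the sketch assumes it is line breaks inside text cells, which would otherwise break a markdown table row. All function names are hypothetical, and index=False is also an assumption.

```python
import re

import pandas as pd


def strip_newlines(df: pd.DataFrame) -> pd.DataFrame:
    """Assumed preprocessing: replace line breaks inside text cells with a space."""
    def clean(value):
        return value.replace("\n", " ") if isinstance(value, str) else value

    # Apply the cleaner cell by cell; non-string values pass through unchanged.
    return df.apply(lambda column: column.map(clean))


def collapse_spaces(markdown: str) -> str:
    """Postprocessing: shrink every run of two or more spaces to exactly two."""
    return re.sub(r" {2,}", "  ", markdown)


def df_to_markdown_table(df: pd.DataFrame) -> str:
    """Hypothetical wrapper; DataFrame.to_markdown() needs the tabulate package."""
    return collapse_spaces(strip_newlines(df).to_markdown(index=False))
```

Collapsing the padding that tabulate adds for column alignment keeps the table valid markdown while reducing the character count of the output.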

convert_sheets_to_output_format(sheets)

Converts the sheets obtained from pandas.read_excel() to the specified output format.

Parameters:

Name Type Description Default
sheets dict[str, DataFrame]

the sheets returned from pd.read_excel(sheet_name=None).

required

Returns:

Name Type Description
str str

the formatted string
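
As a rough sketch of turning the {sheet_name: DataFrame} mapping into a single string, the example below renders every sheet as JSON lines under a heading. The helper name sheets_to_text, the "## sheet name" heading, and the blank-line separators are assumptions made for illustration; the actual layout produced by convert_sheets_to_output_format is not specified here.

```python
import pandas as pd


def sheets_to_text(sheets: dict[str, pd.DataFrame]) -> str:
    """Render every sheet as a titled block of JSON lines (hypothetical layout)."""
    blocks = []
    for sheet_name, df in sheets.items():
        body = df.to_json(orient="records", lines=True).strip()
        # Assumption: a "## <sheet name>" heading separates the sheets.
        blocks.append(f"## {sheet_name}\n\n{body}")
    return "\n\n".join(blocks)


sheets = {
    "stock": pd.DataFrame({"item": ["bolt"], "qty": [10]}),
    "prices": pd.DataFrame({"item": ["bolt"], "price": [0.5]}),
}
text = sheets_to_text(sheets)
```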

parse_file(filepath)

Parses an Excel-like file to markdown.

Parameters:

Name Type Description Default
filepath str

the path to the excel-like file.

required

Returns:

Name Type Description
MarkdownDoc MarkdownDoc

the markdown formatted excel file.

parse_string(string)

Parses a byte string representing an Excel file.

Parameters:

Name Type Description Default
string bytes

the excel as a byte string.

required

Returns:

Name Type Description
MarkdownDoc MarkdownDoc

the markdown formatted excel file
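
A byte string can be handed to pandas by wrapping it in a file-like object. The sketch below shows that step only; the helper name read_workbook_bytes is hypothetical, and reading .xlsx content assumes the openpyxl package is installed.

```python
import io

import pandas as pd


def read_workbook_bytes(data: bytes) -> dict[str, pd.DataFrame]:
    """Load every sheet of an in-memory workbook (hypothetical helper)."""
    # sheet_name=None asks pandas for all sheets, returned as {sheet_name: DataFrame}.
    return pd.read_excel(io.BytesIO(data), sheet_name=None)
```

This is useful when the workbook comes from an upload or a network response and never exists as a file on disk.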

read_file(filepath)

Reads the file at the provided filepath.

Parameters:

Name Type Description Default
filepath str

path to the file.

required

Returns:

Type Description
dict[str, DataFrame]

a mapping of {sheet_name: corresponding_dataframe}.