Reference for ExcelParser
The ExcelParser enables parsing spreadsheets, such as .xlsx files. All sheets in the workbook are parsed.
Bases: AbstractParser
Parser for spreadsheets, such as Excel workbooks (.xlsx). For the list of supported file types, refer to https://pandas.pydata.org/docs/reference/api/pandas.read_excel.html
__init__(output_format='auto')
Initializes an Excel parser
Parameters:
Name | Type | Description | Default |
---|---|---|---|
output_format | Literal["markdown_table", "json_lines", "auto"] | the output format of the parsed document. markdown_table: uses tabulate (through pandas) to build a markdown-formatted table. json_lines: each row of the table is output as a JSON line; better for chunking, as headers are preserved. auto: detects which format is more suitable; CSV-like sheets are converted to JSON lines. Defaults to "auto". | 'auto' |
convert_df_to_json_lines(df)
staticmethod
Converts a DataFrame to json lines.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | DataFrame | the dataframe to convert. | required |
Returns:
Name | Type | Description |
---|---|---|
str | str | the json lines. |
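To illustrate why JSON lines keep headers attached to every row, here is a minimal standard-library sketch. It is not the library's implementation: `rows_to_json_lines` is a hypothetical helper, and a list of dicts stands in for the DataFrame rows.

```python
import json


def rows_to_json_lines(rows):
    """Serialize each row (a dict keyed by column header) as one JSON object per line."""
    return "\n".join(json.dumps(row, ensure_ascii=False) for row in rows)


# Stand-in for the rows of a DataFrame: one dict per row, keyed by header.
rows = [
    {"product": "apple", "price": 1.2},
    {"product": "pear", "price": 0.8},
]

json_lines = rows_to_json_lines(rows)
print(json_lines)
```

Because every line repeats the column names, a chunk cut at any line boundary still carries its own headers, which is the property the `json_lines` format relies on for chunking.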
convert_df_to_markdown_table(df)
staticmethod
Converts a DataFrame to markdown. Wraps pd.DataFrame.to_markdown() (which pandas delegates to tabulate) between pre- and post-processing steps.
Preprocessing:
- Remove in text columns
Postprocessing:
- Replace multiple spaces with 2 spaces.
Args:
df (pd.DataFrame): the dataframe to convert.
Returns:
str: a markdown formatted table.
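The post-processing step described above (replacing multiple spaces with 2 spaces) could look like the following sketch. It assumes a padded table like the ones tabulate produces; `collapse_spaces` is a hypothetical name, not a function of the library, and the actual implementation may differ.

```python
import re


def collapse_spaces(markdown_table: str) -> str:
    """Replace every run of three or more spaces with exactly two spaces."""
    return re.sub(r" {3,}", "  ", markdown_table)


# A padded table, similar in shape to aligned markdown output.
padded = (
    "| product      | price |\n"
    "|:-------------|------:|\n"
    "| apple        |   1.2 |\n"
    "| pear         |   0.8 |"
)

compact = collapse_spaces(padded)
print(compact)
```

Collapsing the alignment padding keeps the table valid markdown while shortening the string, which matters when the output is later split into chunks or counted in tokens.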
convert_sheets_to_output_format(sheets)
Handles the conversion of the sheets obtained from pandas.read_excel() to the specified output format.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
sheets | dict[str, DataFrame] | the sheets returned from pd.read_excel(sheet_name=None). | required |
Returns:
Name | Type | Description |
---|---|---|
str | str | the formatted string. |
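As a rough sketch of how a mapping of sheets could be turned into a single string, the example below dispatches on an explicit output format. Everything in it is an assumption made for illustration: the helper names, the use of lists of dicts in place of DataFrames, and the `## sheet name` heading placed before each sheet. The "auto" detection is not modeled, since the source does not describe its heuristic.

```python
import json


def to_json_lines(rows):
    """One JSON object per row, headers repeated on every line."""
    return "\n".join(json.dumps(row, ensure_ascii=False) for row in rows)


def to_markdown_table(rows):
    """A plain markdown table; headers are taken from the first row."""
    if not rows:
        return ""
    headers = list(rows[0])
    lines = [
        "| " + " | ".join(headers) + " |",
        "| " + " | ".join("---" for _ in headers) + " |",
    ]
    for row in rows:
        lines.append("| " + " | ".join(str(row[h]) for h in headers) + " |")
    return "\n".join(lines)


# Hypothetical dispatch table: output format name -> converter.
CONVERTERS = {"json_lines": to_json_lines, "markdown_table": to_markdown_table}


def convert_sheets(sheets, output_format):
    """Convert every sheet and join the results, one titled section per sheet."""
    convert = CONVERTERS[output_format]
    parts = [f"## {name}\n\n{convert(rows)}" for name, rows in sheets.items()]
    return "\n\n".join(parts)


sheets = {
    "Prices": [{"product": "apple", "price": 1.2}],
    "Stock": [{"product": "apple", "qty": 3}],
}
output = convert_sheets(sheets, "json_lines")
print(output)
```

Keeping the sheet name next to its content lets a reader, or a retrieval system, tell which sheet a given row came from once all sheets are merged into one document.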
parse_file(filepath)
Parses an Excel-like file to markdown.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
filepath | str | the path to the Excel-like file. | required |
Returns:
Name | Type | Description |
---|---|---|
MarkdownDoc | MarkdownDoc | the markdown-formatted Excel file. |
parse_string(string)
Parses a byte string representing an Excel file.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
string | bytes | the Excel file as a byte string. | required |
Returns:
Name | Type | Description |
---|---|---|
MarkdownDoc | MarkdownDoc | the markdown-formatted Excel file. |
read_file(filepath)
Reads the file at the provided filepath.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
filepath | str | path to the file. | required |
Returns:
Type | Description |
---|---|
dict[str, DataFrame] | a mapping of {sheet_name: corresponding_dataframe}. |