Reference for CSVParser
The CSVParser parses comma-separated values (.csv) files. By default, it attempts to infer the delimiter used (comma, semicolon, ...). Alternatively, you may specify the delimiter it should use.
Bases: AbstractParser
Parser for comma-separated values (.csv) files.
__init__(csv_delimiter=None, output_format='json_lines')
Initializes a CSV parser.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| csv_delimiter | str \| None | The delimiter to use when parsing .csv files. If None, the parser tries to infer the delimiter. Defaults to None. | None |
| output_format | Literal["markdown_table", "json_lines"] | The output format of the parsed document. markdown_table: uses tabulate to build a Markdown-formatted table. json_lines: each row of the table is output as a JSON line. NOTE: this consumes considerably more tokens because column names are repeated on every row, but it is easier for LLMs to read. Defaults to "json_lines". | 'json_lines' |
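The delimiter inference mentioned for `csv_delimiter=None` can be illustrated with the standard library. This is a minimal sketch, not CSVParser's actual implementation (which this reference does not show): the helper name `guess_delimiter`, the candidate delimiter set, and the comma fallback are all assumptions, and `csv.Sniffer` is used as a stand-in heuristic.

```python
import csv


def guess_delimiter(sample: str, candidates: str = ",;\t|", default: str = ",") -> str:
    """Guess the delimiter of a CSV sample (hypothetical helper, not part of CSVParser)."""
    try:
        # Sniffer checks how consistently each candidate character appears per line.
        return csv.Sniffer().sniff(sample, delimiters=candidates).delimiter
    except csv.Error:
        # Sniffer could not decide; fall back to a plain comma (an assumption).
        return default


print(guess_delimiter("name;age\nAda;36\nAlan;41\n"))
```

When the delimiter is known in advance, passing it explicitly avoids relying on any such heuristic.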
convert_df_to_json_lines(df)
staticmethod
Converts a DataFrame to json lines.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | the dataframe to convert. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| str | str | the json lines. |
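To make the JSON lines format concrete, here is a minimal sketch using only the standard library. It is not the body of `convert_df_to_json_lines`, whose exact serialization is not shown in this reference; the helper `rows_to_json_lines` and the sample data are hypothetical. With pandas, `DataFrame.to_json(orient="records", lines=True)` produces output of a comparable shape.

```python
import json


def rows_to_json_lines(columns: list[str], rows: list[list]) -> str:
    """Serialize tabular rows as JSON lines (hypothetical helper, not part of CSVParser)."""
    # One JSON object per row; the column names are repeated on every line.
    return "\n".join(json.dumps(dict(zip(columns, row))) for row in rows)


print(rows_to_json_lines(["name", "age"], [["Ada", 36], ["Alan", 41]]))
# {"name": "Ada", "age": 36}
# {"name": "Alan", "age": 41}
```

The repeated keys on every line are what make this format more token-hungry than a Markdown table.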
convert_df_to_markdown_table(df)
staticmethod
Converts a DataFrame to a Markdown table. Wraps the pandas method pd.DataFrame.to_markdown() (which relies on tabulate) between a pre-processing and a post-processing step. Pre-processing: - Remove in text columns. Post-processing: - Replace multiple spaces with 2 spaces.
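Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df | pd.DataFrame | the dataframe to convert. | required |

Returns:
| Name | Type | Description |
|---|---|---|
| str | str | a markdown formatted table. |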
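The post-processing step (replacing multiple spaces with 2 spaces) could look like the sketch below. The helper name `collapse_spaces` is hypothetical, and reading "multiple spaces" as any run of two or more spaces is an assumption; the sample table only imitates the padded layout that `to_markdown()` typically emits.

```python
import re


def collapse_spaces(markdown_table: str) -> str:
    """Replace every run of two or more spaces with exactly two spaces (hypothetical helper)."""
    return re.sub(r" {2,}", "  ", markdown_table)


# A padded table similar in shape to to_markdown() output (hand-written sample).
padded = "| name      | age   |\n|:----------|------:|\n| Ada       |    36 |"
print(collapse_spaces(padded))
```

Trimming the alignment padding keeps the table valid Markdown while reducing the number of characters, and therefore tokens, in the output.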
parse_file(filepath)
Parses a csv file to markdown.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| filepath | str | the path to the csv file. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| MarkdownDoc | MarkdownDoc | the markdown-formatted csv. |
parse_string(string)
Parses a string representing a csv file to markdown.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| string | str | the csv-formatted string. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| MarkdownDoc | MarkdownDoc | the markdown-formatted csv. |
read_file(filepath)
Reads the file at the provided path. For a list of handled file types, refer to https://pandas.pydata.org/docs/reference/api/pandas.read_excel.html.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| filepath | str | path to the file. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| str | str | the csv file content as a string. |