Reference for HTMLParser
Bases: AbstractParser[str]
apply_markdownify(html_string)
staticmethod
Applies markdownify to the HTML string, iterating until the output stabilises (handles nested HTML structures such as tables within tables).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| html_string | str | an HTML-formatted string. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| str | str | the markdownified string. |
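
The "iterate until the output stabilises" behaviour described for apply_markdownify() can be pictured as a fixed-point loop. The sketch below is only an illustration under stated assumptions: `apply_until_stable`, `convert_innermost_bold` and the `max_rounds` cap are hypothetical names that are not part of HTMLParser, and the toy regex converter stands in for the real HTML-to-markdown conversion so the example runs with the standard library alone.

```python
import re
from typing import Callable


def apply_until_stable(text: str, convert: Callable[[str], str], max_rounds: int = 10) -> str:
    """Re-apply `convert` until its output stops changing (or max_rounds is reached)."""
    for _ in range(max_rounds):
        converted = convert(text)
        if converted == text:  # fixed point: another pass would change nothing
            break
        text = converted
    return text


def convert_innermost_bold(text: str) -> str:
    """Toy stand-in converter: rewrites only <b> elements that contain no other tags."""
    return re.sub(r"<b>([^<>]*)</b>", r"**\1**", text)


nested = "<b>outer <b>inner</b></b>"

# A single pass only converts the innermost element.
one_pass = convert_innermost_bold(nested)

# Repeating the pass until nothing changes converts the outer element too.
stable = apply_until_stable(nested, convert_innermost_bold)

print(one_pass)  # <b>outer **inner**</b>
print(stable)    # **outer **inner****
```

The cap on the number of rounds is a defensive choice in this sketch, so that a converter which never settles cannot loop forever; whether the actual method uses such a limit is not stated in this reference.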
cleanup_string(md_string)
staticmethod
Cleans up the markdownified string.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| md_string | str | the markdown string, output from apply_markdownify(). | required |
Returns:
| Name | Type | Description |
|---|---|---|
| str | str | the cleaned up string. |
parse_file(filepath)
Reads and parses an HTML file. Ensures that the resulting formatting is suitable for passing to the MarkdownChunker.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| filepath | str | the path to a .html or .htm file. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| MarkdownDoc | MarkdownDoc | the parsed document. Can be fed to chunker. |
parse_string(string)
Parses an HTML-formatted string. Ensures that the resulting formatting is suitable for passing to the MarkdownChunker.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| string | str | the HTML-formatted string. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| MarkdownDoc | MarkdownDoc | the parsed document. Can be fed to chunker. |
read_file(filepath)
staticmethod
Reads an HTML file.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| filepath | str | the path to the HTML file. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| str | str | the HTML string. |
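
For orientation, a reader with the same contract as read_file() (a path in, the HTML string out) might look roughly like the sketch below. It is not the library's implementation: the function name `read_html_file`, the UTF-8 default and the extension check are all assumptions made for the example, the last one borrowed from the ".html or .htm" wording used for parse_file().

```python
import tempfile
from pathlib import Path


def read_html_file(filepath: str, encoding: str = "utf-8") -> str:
    """Return the content of an HTML file as a string.

    Assumptions (not confirmed by this reference): UTF-8 decoding and
    rejection of files that do not end in .html or .htm.
    """
    path = Path(filepath)
    if path.suffix.lower() not in {".html", ".htm"}:
        raise ValueError(f"Expected a .html or .htm file, got: {path.name}")
    return path.read_text(encoding=encoding)


# Usage: write a small file to a temporary directory and read it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    sample = Path(tmp_dir) / "page.html"
    sample.write_text("<h1>Title</h1>", encoding="utf-8")
    html = read_html_file(str(sample))

print(html)  # <h1>Title</h1>
```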