Reference for HTMLParser

Bases: AbstractParser[str]

apply_markdownify(html_string) staticmethod

Applies markdownify to the HTML string, iterating until the output stabilises (handles nested HTML structures such as tables within tables).

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| html_string | str | An HTML-formatted string. | required |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| str | str | The markdownified string. |
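
The "iterate until the output stabilises" idea can be sketched as a fixed-point loop. This is a minimal illustration, not the library's implementation: `apply_until_stable`, `toy_convert`, and the `max_passes` guard are hypothetical names introduced here, and the toy converter only stands in for a real HTML-to-Markdown conversion.

```python
import re
from typing import Callable


def apply_until_stable(text: str, convert: Callable[[str], str], max_passes: int = 10) -> str:
    """Re-apply `convert` until the output stops changing (hypothetical helper)."""
    for _ in range(max_passes):  # assumption: a cap to guard against endless loops
        converted = convert(text)
        if converted == text:
            # A pass that changes nothing means the output has stabilised.
            break
        text = converted
    return text


def toy_convert(html: str) -> str:
    """Stand-in converter: rewrites only the innermost <b>...</b> pairs per pass."""
    return re.sub(r"<b>([^<]*)</b>", r"**\1**", html)


# Nested tags need two passes: the inner pair first, then the outer pair.
result = apply_until_stable("<b>outer <b>inner</b></b>", toy_convert)
```

A single pass of `toy_convert` leaves the outer `<b>` pair in place, which is the same kind of leftover that nested tables would produce; the loop removes it on the second pass and stops on the third, when nothing changes.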

cleanup_string(md_string) staticmethod

Cleans up the markdownified string.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| md_string | str | The Markdown string output by apply_markdownify(). | required |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| str | str | The cleaned-up string. |
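
The reference does not say which clean-up steps are applied. As a rough sketch only, the function below assumes three common ones (stripping trailing whitespace, collapsing extra blank lines, trimming the document); `cleanup_markdown` is a hypothetical name and the real cleanup_string() may behave differently.

```python
import re


def cleanup_markdown(md_string: str) -> str:
    """Hypothetical clean-up pass; the real cleanup_string() may do more or less."""
    # Drop trailing spaces and tabs at the end of every line.
    md_string = re.sub(r"[ \t]+$", "", md_string, flags=re.MULTILINE)
    # Collapse runs of three or more newlines into a single blank line.
    md_string = re.sub(r"\n{3,}", "\n\n", md_string)
    # Trim leading and trailing whitespace from the whole document.
    return md_string.strip()


cleaned = cleanup_markdown("# Title   \n\n\n\nSome text.\t\n\n")
```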

parse_file(filepath)

Reads and parses an HTML file. Ensures that the output is formatted so that it can be passed to the MarkdownChunker.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| filepath | str | The path to a .html or .htm file. | required |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| MarkdownDoc | MarkdownDoc | The parsed document, which can be passed to the MarkdownChunker. |

parse_string(string)

Parses an HTML-formatted string. Ensures that the output is formatted so that it can be passed to the MarkdownChunker.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| string | str | The HTML-formatted string. | required |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| MarkdownDoc | MarkdownDoc | The parsed document, which can be passed to the MarkdownChunker. |

read_file(filepath) staticmethod

Reads an HTML file.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| filepath | str | The path to the HTML file. | required |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| str | str | The HTML string. |
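
For illustration, reading an HTML file into a string could look like the sketch below. `read_html_file` is a hypothetical stand-in, and both the suffix check and the UTF-8 encoding are assumptions; the reference does not state how read_file() validates or decodes its input.

```python
import tempfile
from pathlib import Path


def read_html_file(filepath: str) -> str:
    """Hypothetical reader: returns the raw HTML text of a .html/.htm file."""
    path = Path(filepath)
    # Assumption: reject anything that is not .html or .htm (case-insensitive).
    if path.suffix.lower() not in {".html", ".htm"}:
        raise ValueError(f"Expected a .html or .htm file, got: {path.name}")
    # Assumption: the file is UTF-8 encoded.
    return path.read_text(encoding="utf-8")


# Usage: write a small sample file to a temporary directory and read it back.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "page.html"
    sample.write_text("<h1>Hello</h1>", encoding="utf-8")
    html = read_html_file(str(sample))
```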