Implement a custom Parser (NotebookParser)¶
This tutorial is designed to provide you with additional tools for utilizing chunknorris in your specific applications. All components, including the Parser, Chunker, and Pipelines, can be tailored to meet your requirements.
In this tutorial, we will focus on how to implement a custom parser.
⚠️ Important note: Since this tutorial was written, the JupyterNotebookParser has been implemented. It's a more robust implementation than what's presented here, so if your aim is to parse Jupyter notebooks, it's advisable to use the JupyterNotebookParser.
Goal¶
In this tutorial, let's consider you want to implement a custom Notebook parser.
As we still want to leverage the ability of chunknorris to chunk efficiently, we must implement a parser that can be plugged into the MarkdownChunker through a pipeline. The MarkdownChunker takes a MarkdownDoc object as input, so our parser has to output the markdown content in that format.
# Import components
from typing import Any
import json
from IPython.display import Markdown
from chunknorris.parsers import AbstractParser # <-- our custom parser must inherit from this
from chunknorris.core.components import MarkdownDoc # <-- object to be fed to the chunker
Starting point¶
We start by importing the AbstractParser. Every parser in chunknorris must inherit from it. This class only needs you to implement two methods, which will enable your parser to fit well with chunknorris' pipelines:
- parse_string(string: str) to parse a string.
- parse_file(filepath: str) to parse a file given a filepath.
Both must return a MarkdownDoc object.
# Base of our class
class NotebookParser(AbstractParser): # inherit from abstract parser

    def parse_file(self, filepath: str) -> MarkdownDoc:
        pass

    def parse_string(self, string: str) -> MarkdownDoc:
        pass # We have to fill this
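To make the interface concrete, here is a minimal toy sketch (not part of the tutorial's parser) that satisfies it. It assumes its input is already markdown and simply wraps it in a MarkdownDoc using MarkdownDoc.from_string, the same helper our NotebookParser will rely on below:
# Toy sketch: a parser that assumes its input is already markdown
class PassthroughParser(AbstractParser):

    def parse_string(self, string: str) -> MarkdownDoc:
        return MarkdownDoc.from_string(string) # build the object expected by the MarkdownChunker

    def parse_file(self, filepath: str) -> MarkdownDoc:
        with open(filepath, "r", encoding="utf8") as file:
            return self.parse_string(file.read())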
Add functionality¶
Let's add some functionality to read and parse the file!
We will implement 2 methods:
- read_file() to read the file
- parse_notebook_content() that parses the "markdown" and "code" cells of the notebook.
Much more parsing work could be done, but we will limit ourselves to this for the tutorial. Let's have a look at our NotebookParser class now:
class NotebookParser(AbstractParser): # inherit from abstract parser

    def __init__(self, include_code_cells_outputs: bool = False) -> None:
        self.include_code_cells_outputs = include_code_cells_outputs

    def parse_file(self, filepath: str) -> MarkdownDoc:
        """Parses a notebook .ipynb file."""
        file_content = self.read_file(filepath)
        md_string = self.parse_notebook_content(file_content)
        return MarkdownDoc.from_string(md_string) # we don't return the markdown string directly, but build a MarkdownDoc with it

    def parse_string(self, string: str) -> MarkdownDoc:
        raise NotImplementedError # We won't implement this as it is unlikely that the notebook content will be passed as a string.

    @staticmethod
    def read_file(filepath: str) -> dict[str, Any]:
        """Reads a .ipynb file and returns its
        content as a json dict.

        Args:
            filepath (str): path to the file

        Returns:
            dict[str, Any]: the json content of the ipynb file
        """
        if not filepath.endswith(".ipynb"):
            raise ValueError("Only .ipynb files can be passed to NotebookParser.")
        with open(filepath, "r", encoding="utf8") as file:
            content = json.load(file)

        return content

    def parse_notebook_content(self, notebook_content: dict[str, Any]) -> str:
        """Parses the notebook's cells into a markdown string.

        Args:
            notebook_content (dict[str, Any]): the content of the notebook, as a json dict.
                It should be a dict of structure:
                {'cells': [{
                    'cell_type': 'markdown',
                    'metadata': {},
                    'source': <list of lines>
                    }...]

        Returns:
            str: the markdown string parsed from the notebook content
        """
        # language declared at notebook level, used as a fallback for code cells
        kernel_language = notebook_content.get("metadata", {}).get("kernelspec", {}).get("language", "")
        md_string = ""
        for cell in notebook_content["cells"]:
            match cell["cell_type"]:
                case "markdown" | "raw":
                    md_string += "".join(cell["source"]) + "\n\n"
                case "code":
                    language = cell.get("metadata", {}).get("kernelspec", {}).get("language", kernel_language)
                    md_string += "```" + language + "\n" + "".join(cell["source"]) + "\n```\n\n"
                    if self.include_code_cells_outputs:
                        # 'outputs' is a list of output dicts: keep only their plain-text data
                        for output in cell.get("outputs", []):
                            md_string += "".join(output.get("data", {}).get("text/plain", "")) + "\n\n"
                case _:
                    pass

        return md_string
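Before wiring the parser into a pipeline, we can sanity-check parse_notebook_content on a small hand-built notebook dict (the dict below is purely illustrative):
# Illustrative sanity check on a minimal, hand-built notebook dict
toy_notebook = {
    "metadata": {"kernelspec": {"language": "python"}},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# A title\n", "\n", "Some text."]},
        {"cell_type": "code", "metadata": {}, "source": ["print('hello')"], "outputs": []},
    ],
}
print(NotebookParser().parse_notebook_content(toy_notebook)) # prints the markdown equivalent of the two cells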
Use our parser to get chunks¶
Now that the parser is ready, let's use it!
path_to_notebook = "./custom_parser.ipynb" # as an example we will use... this notebook !
notebook_parser = NotebookParser(include_code_cells_outputs=False)
md_doc = notebook_parser.parse_file(path_to_notebook)
# Before feeding the parsed result to the chunker, **let's have a look** at the markdown it outputs.
Markdown(md_doc.to_string()[:1400] + " [...]") # only print out the first 1400 characters
Implement a custom Parser (NotebookParser)¶
This tutorial is designed to provide you with additional tools for utilizing chunknorris
in your specific applications. All components, including the Parser, Chunker, and Pipelines, can be tailored to meet your requirements.
In this tutorial, we will focus on how to implement a custom parser.
⚠️ Important note: Following the implementation of this tutorial, the JupyterNotebookParser
has been implemented. It's a more robust implementation than what's presented here, so if your aim is to parse jupyter notebooks, it's advisable to use the JupyterNotebookParser
.
Goal¶
In this tutorial let's consider you want to implement a custom Notebook parser.
As we still want to leverage the ability of chunknorris
to chunk efficiently, we must implement a parser that can be plugged into the MarkdownChunker
through a pipeline. The MarkdownChunker
takes as input a MarkdownDoc
object, our parser has to output the markdown content in that format.
# Import components
from typing import Any
import json
from IPython.display import Markdown
from chunknorris.parsers import AbstractParser # <-- our custom parser must inherit from this
from chunknorris.parsers.markdown.components import MarkdownDoc # <-- object ot be fed in chunker
Starting point¶
We start by importing the AbstractParser
[...]
That parsed result looks great! Now let's chunk it!
You can directly feed the MarkdownDoc to the MarkdownChunker.chunk() method, as sketched just below. But I would suggest using the BasePipeline to do this, as it enables extra functionality such as saving the chunks.
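For reference, the direct route would look like this (a sketch based on the statement above):
# Direct route: feed the MarkdownDoc straight to the chunker, without a pipeline
from chunknorris.chunkers import MarkdownChunker

chunks_direct = MarkdownChunker().chunk(md_doc)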
from chunknorris.chunkers import MarkdownChunker
from chunknorris.pipelines import BasePipeline

pipe = BasePipeline(
    parser=NotebookParser(),
    chunker=MarkdownChunker(max_chunk_word_count=100)
)

chunks = pipe.chunk_file(path_to_notebook)
print(f"Got {len(chunks)} chunks !")
for i, chunk in enumerate(chunks[:3]):
    print(f"============ chunk {i} ============")
    print(chunk)
2024-12-20 10:11:ChunkNorris:INFO:Function "chunk" took 0.0014 seconds
Got 6 chunks !
============ chunk 0 ============
# Implement a custom Parser (NotebookParser)

This tutorial is designed to provide you with additional tools for utilizing ``chunknorris`` in your specific applications. All components, including the Parser, Chunker, and Pipelines, can be tailored to meet your requirements.

In this tutorial, we will focus on how to implement a **custom parser**.

⚠️ **Important note**: Following the implementation of this tutorial, the ``JupyterNotebookParser`` has been implemented. It's a more robust implementation than what's presented here, so **if your aim is to parse jupyter notebooks, it's advisable to use the ``JupyterNotebookParser``**.
============ chunk 1 ============
# Implement a custom Parser (NotebookParser)

## Goal

In this tutorial let's consider you want to implement a **custom Notebook parser**. As we still want to leverage the ability of ``chunknorris`` to chunk efficiently, we must implement a parser that can be plugged into the ``MarkdownChunker`` through a pipeline. The ``MarkdownChunker`` takes as input a ``MarkdownDoc`` object, our parser has to output the markdown content in that format.

```python
# Import components
from typing import Any
import json
from IPython.display import Markdown
from chunknorris.parsers import AbstractParser # <-- our custom parser must inherit from this
from chunknorris.parsers.markdown.components import MarkdownDoc # <-- object ot be fed in chunker
```
============ chunk 2 ============
# Implement a custom Parser (NotebookParser)

## Starting point

We start by importing the ``AbstractParser``. Every parser in chunknorris must inherit from it. This class only need you to implement two method, which will enable your parser to fit well with the ``chunknorris``' pipelines :

- chunk_string(string : str) to parse a string.
- chunk_file(filepath : str) to parse a file given a filepath.

Both must return a ``MarkdownDoc`` object.

```python
# Base of our class
class NotebookParser(AbstractParser): # inherit from abstract parser

    def parse_file(self, filepath: str) -> MarkdownDoc:
        pass

    def parse_string(self, string: str) -> MarkdownDoc:
        pass # We have to fill this
```
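If you want to persist the chunks yourself, a simple option is to dump their text to disk. This is a manual sketch (the BasePipeline also provides saving helpers, not covered here); since printing a chunk yields its text, str(chunk) is enough for a quick export:
# Manual export of the chunks' text (sketch only)
import json

with open("notebook_chunks.json", "w", encoding="utf8") as file:
    json.dump([str(chunk) for chunk in chunks], file, ensure_ascii=False, indent=2)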
Conclusion¶
There you go! Note that chunknorris will always try to preserve the integrity of code blocks.
One last tip: if you wish to customize the behavior of one specific parser (HTMLParser for example), you might want to inherit directly from that parser instead of starting from scratch with AbstractParser, as sketched below.
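Purely as an illustration (the pre-cleaning step is made up for the example), such a subclass could look like this:
# Illustrative only: reuse HTMLParser's parsing logic, but pre-clean the HTML first
from chunknorris.parsers import HTMLParser

class CleanedHTMLParser(HTMLParser):

    def parse_string(self, string: str) -> MarkdownDoc:
        cleaned = string.replace("\u00a0", " ") # example pre-processing: replace non-breaking spaces
        return super().parse_string(cleaned) # delegate the actual parsing to HTMLParser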