Implement a custom Parser (NotebookParser)¶
This tutorial is designed to provide you with additional tools for utilizing chunknorris in your specific applications. All components, including the Parser, Chunker, and Pipelines, can be tailored to meet your requirements.
In this tutorial, we will focus on how to implement a custom parser.
⚠️ Important note: Since this tutorial was written, the JupyterNotebookParser has been implemented. It's a more robust implementation than what's presented here, so if your aim is to parse Jupyter notebooks, it's advisable to use the JupyterNotebookParser.
Goal¶
In this tutorial, let's consider you want to implement a custom Notebook parser.
As we still want to leverage the ability of chunknorris to chunk efficiently, we must implement a parser that can be plugged into the MarkdownChunker through a pipeline. The MarkdownChunker takes a MarkdownDoc object as input, so our parser has to output the markdown content in that format.
# Import components
from typing import Any
import json
from IPython.display import Markdown
from chunknorris.parsers import AbstractParser # <-- our custom parser must inherit from this
from chunknorris.core.components import MarkdownDoc # <-- object to be fed to the chunker
Starting point¶
We start by importing the AbstractParser. Every parser in chunknorris must inherit from it. This class only needs you to implement two methods, which will enable your parser to fit well with chunknorris' pipelines:
- parse_string(string: str) to parse a string.
- parse_file(filepath: str) to parse a file given a filepath.
Both must return a MarkdownDoc object.
# Base of our class
class NotebookParser(AbstractParser): # inherit from abstract parser

    def parse_file(self, filepath: str) -> MarkdownDoc:
        pass

    def parse_string(self, string: str) -> MarkdownDoc:
        pass # We have to fill this
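To make the interface concrete, here is a minimal toy sketch (not part of the tutorial's parser) that satisfies it. It assumes its input is already markdown and simply wraps it in a MarkdownDoc using MarkdownDoc.from_string, the same helper our NotebookParser will rely on below:
# Toy sketch: a parser that assumes its input is already markdown
class PassthroughParser(AbstractParser):

    def parse_string(self, string: str) -> MarkdownDoc:
        return MarkdownDoc.from_string(string) # build the object expected by the MarkdownChunker

    def parse_file(self, filepath: str) -> MarkdownDoc:
        with open(filepath, "r", encoding="utf8") as file:
            return self.parse_string(file.read())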
Add functionality¶
Let's add some functionality to read and parse the file!
We will implement 2 methods:
- read_file() to read the file
- parse_notebook_content() that parses the "markdown" and "code" cells of the notebook.
Much more parsing work could be done, but we will limit ourselves to this for the tutorial. Let's have a look at our NotebookParser class now:
class NotebookParser(AbstractParser): # inherit from abstract parser

    def __init__(self, include_code_cells_outputs: bool = False) -> None:
        self.include_code_cells_outputs = include_code_cells_outputs

    def parse_file(self, filepath: str) -> MarkdownDoc:
        """Parses a notebook .ipynb file."""
        file_content = self.read_file(filepath)
        md_string = self.parse_notebook_content(file_content)
        return MarkdownDoc.from_string(md_string) # we don't return the markdown string directly, but build a MarkdownDoc with it

    def parse_string(self, string: str) -> MarkdownDoc:
        raise NotImplementedError # We won't implement this as it is unlikely that the notebook content will be passed as a string.

    @staticmethod
    def read_file(filepath: str) -> dict[str, Any]:
        """Reads a .ipynb file and returns its
        content as a json dict.

        Args:
            filepath (str): path to the file

        Returns:
            dict[str, Any]: the json content of the ipynb file
        """
        if not filepath.endswith(".ipynb"):
            raise ValueError("Only .ipynb files can be passed to NotebookParser.")
        with open(filepath, "r", encoding="utf8") as file:
            content = json.load(file)

        return content

    def parse_notebook_content(self, notebook_content: dict[str, Any]) -> str:
        """Parses the notebook's cells into a markdown string.

        Args:
            notebook_content (dict[str, Any]): the content of the notebook, as a json dict.
                It should be a dict of structure:
                {'cells': [{
                    'cell_type': 'markdown',
                    'metadata': {},
                    'source': <list of lines>
                    }...]

        Returns:
            str: the markdown string parsed from the notebook content
        """
        # language declared at notebook level, used as a fallback for code cells
        kernel_language = notebook_content.get("metadata", {}).get("kernelspec", {}).get("language", "")
        md_string = ""
        for cell in notebook_content["cells"]:
            match cell["cell_type"]:
                case "markdown" | "raw":
                    md_string += "".join(cell["source"]) + "\n\n"
                case "code":
                    language = cell.get("metadata", {}).get("kernelspec", {}).get("language", kernel_language)
                    md_string += "```" + language + "\n" + "".join(cell["source"]) + "\n```\n\n"
                    if self.include_code_cells_outputs:
                        # 'outputs' is a list of output dicts: keep only their plain-text data
                        for output in cell.get("outputs", []):
                            md_string += "".join(output.get("data", {}).get("text/plain", "")) + "\n\n"
                case _:
                    pass

        return md_string
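Before wiring the parser into a pipeline, we can sanity-check parse_notebook_content on a small hand-built notebook dict (the dict below is purely illustrative):
# Illustrative sanity check on a minimal, hand-built notebook dict
toy_notebook = {
    "metadata": {"kernelspec": {"language": "python"}},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# A title\n", "\n", "Some text."]},
        {"cell_type": "code", "metadata": {}, "source": ["print('hello')"], "outputs": []},
    ],
}
print(NotebookParser().parse_notebook_content(toy_notebook)) # prints the markdown equivalent of the two cells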
Use our parser to get chunks¶
Now that the parser is ready, let's use it!
path_to_notebook = "./custom_parser.ipynb" # as an example we will use... this notebook !
notebook_parser = NotebookParser(include_code_cells_outputs=False)
md_doc = notebook_parser.parse_file(path_to_notebook)
# Before feeding the parsed result to the chunker, **let's have a look** at the markdown it outputs.
Markdown(md_doc.to_string()[:1400] + " [...]") # only print out the first 1400 characters
Implement a custom Parser (NotebookParser)¶
This tutorial is designed to provide you with additional tools for utilizing chunknorris
in your specific applications. All components, including the Parser, Chunker, and Pipelines, can be tailored to meet your requirements.
In this tutorial, we will focus on how to implement a custom parser.
⚠️ Important note: Following the implementation of this tutorial, the JupyterNotebookParser
has been implemented. It's a more robust implementation than what's presented here, so if your aim is to parse jupyter notebooks, it's advisable to use the JupyterNotebookParser
.
Goal¶
In this tutorial let's consider you want to implement a custom Notebook parser.
As we still want to leverage the ability of chunknorris
to chunk efficiently, we must implement a parser that can be plugged into the MarkdownChunker
through a pipeline. The MarkdownChunker
takes as input a MarkdownDoc
object, our parser has to output the markdown content in that format.
# Import components
from typing import Any
import json
from IPython.display import Markdown
from chunknorris.parsers import AbstractParser # <-- our custom parser must inherit from this
from chunknorris.parsers.markdown.components import MarkdownDoc # <-- object ot be fed in chunker
Starting point¶
We start by importing the AbstractParser
[...]
That parsed result looks great! Now let's chunk it!
You can directly feed the MarkdownDoc to the MarkdownChunker.chunk() method, as sketched just below. But I would suggest using the BasePipeline to do this, as it enables extra functionality such as saving the chunks.
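For reference, the direct route would look like this (a sketch based on the statement above):
# Direct route: feed the MarkdownDoc straight to the chunker, without a pipeline
from chunknorris.chunkers import MarkdownChunker

chunks_direct = MarkdownChunker().chunk(md_doc)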
from chunknorris.chunkers import MarkdownChunker
from chunknorris.pipelines import BasePipeline

pipe = BasePipeline(
    parser=NotebookParser(),
    chunker=MarkdownChunker(max_chunk_word_count=100)
)

chunks = pipe.chunk_file(path_to_notebook)
print(f"Got {len(chunks)} chunks !")
for i, chunk in enumerate(chunks[:3]):
    print(f"============ chunk {i} ============")
    print(chunk)
2024-12-20 10:11:ChunkNorris:INFO:Function "chunk" took 0.0014 seconds
Got 6 chunks !
============ chunk 0 ============
# Implement a custom Parser (NotebookParser)

This tutorial is designed to provide you with additional tools for utilizing ``chunknorris`` in your specific applications. All components, including the Parser, Chunker, and Pipelines, can be tailored to meet your requirements.

In this tutorial, we will focus on how to implement a **custom parser**.

⚠️ **Important note**: Following the implementation of this tutorial, the ``JupyterNotebookParser`` has been implemented. It's a more robust implementation than what's presented here, so **if your aim is to parse jupyter notebooks, it's advisable to use the ``JupyterNotebookParser``**.
============ chunk 1 ============
# Implement a custom Parser (NotebookParser)

## Goal

In this tutorial let's consider you want to implement a **custom Notebook parser**. As we still want to leverage the ability of ``chunknorris`` to chunk efficiently, we must implement a parser that can be plugged into the ``MarkdownChunker`` through a pipeline. The ``MarkdownChunker`` takes as input a ``MarkdownDoc`` object, our parser has to output the markdown content in that format.

```python
# Import components
from typing import Any
import json
from IPython.display import Markdown
from chunknorris.parsers import AbstractParser # <-- our custom parser must inherit from this
from chunknorris.parsers.markdown.components import MarkdownDoc # <-- object ot be fed in chunker
```
============ chunk 2 ============
# Implement a custom Parser (NotebookParser)

## Starting point

We start by importing the ``AbstractParser``. Every parser in chunknorris must inherit from it. This class only need you to implement two method, which will enable your parser to fit well with the ``chunknorris``' pipelines :

- chunk_string(string : str) to parse a string.
- chunk_file(filepath : str) to parse a file given a filepath.

Both must return a ``MarkdownDoc`` object.

```python
# Base of our class
class NotebookParser(AbstractParser): # inherit from abstract parser

    def parse_file(self, filepath: str) -> MarkdownDoc:
        pass

    def parse_string(self, string: str) -> MarkdownDoc:
        pass # We have to fill this
```
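If you want to persist the chunks yourself, a simple option is to dump their text to disk. This is a manual sketch (the BasePipeline also provides saving helpers, not covered here); since printing a chunk yields its text, str(chunk) is enough for a quick export:
# Manual export of the chunks' text (sketch only)
import json

with open("notebook_chunks.json", "w", encoding="utf8") as file:
    json.dump([str(chunk) for chunk in chunks], file, ensure_ascii=False, indent=2)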
Conclusion¶
There you go! Note that chunknorris will always try to preserve the integrity of code blocks.
One last tip: if you wish to customize the behavior of one specific parser (HTMLParser for example), you might want to inherit directly from that parser instead of starting from scratch with AbstractParser, as sketched below.
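Purely as an illustration (the pre-cleaning step is made up for the example), such a subclass could look like this:
# Illustrative only: reuse HTMLParser's parsing logic, but pre-clean the HTML first
from chunknorris.parsers import HTMLParser

class CleanedHTMLParser(HTMLParser):

    def parse_string(self, string: str) -> MarkdownDoc:
        cleaned = string.replace("\u00a0", " ") # example pre-processing: replace non-breaking spaces
        return super().parse_string(cleaned) # delegate the actual parsing to HTMLParser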