Welcome to ChunkNorris' documentation!
What is chunknorris?
In a nutshell, chunknorris is a Python package that aims to drastically improve the chunking of documents from various sources (HTML, PDFs, Markdown, ...) while keeping the use of computational resources to a minimum. Try it out!
Why should I use it?
In the context of Retrieval Augmented Generation (RAG), an optimized chunking strategy leads to:
- Better relevancy of chunks and thus easier identification of useful chunks through more expressive embeddings.
- Fewer hallucinations from generation models, since the prompt contains less superfluous information
- Fewer errors caused by chunks exceeding the API's token limits
- Reduced cost, as prompts can be shorter
As of today, many packages exist for parsing documents. However, the vast majority of them:
- have high computational requirements
- do not provide chunks out of the box; they only parse the documents, leaving the user to build the chunking implementation on top.