Chunkers Overview
Overview of the different chunkers available in Chonkie
Chonkie provides multiple chunking strategies to handle different text-processing needs. Every chunker follows the same core principles outlined in the concepts page; a short usage sketch follows the list below.
TokenChunker
Splits text into fixed-size token chunks. Best for maintaining consistent chunk sizes and working with token-based models.
SentenceChunker
Splits text at sentence boundaries. Perfect for maintaining semantic completeness at the sentence level.
RecursiveChunker
Recursively splits documents into progressively smaller chunks. Best for long documents with well-defined structure.
SemanticChunker
Groups content based on semantic similarity. Best for preserving context and topical coherence.
SDPMChunker
Chunks text using the Semantic Double-Pass Merging (SDPM) algorithm. Best for maintaining topical coherence when text has frequent breaks.
LateChunker
Chunks text using the Late Chunking algorithm. Best for higher recall in RAG applications.
CodeChunker
Splits code based on its structure using ASTs. Ideal for chunking source code files.
NeuralChunker
Uses a fine-tuned BERT model to split text based on semantic shifts. Great for topic-coherent chunks.
SlumberChunker
Agentic chunking that uses generative models (LLMs) via the Genie interface for S-tier chunk quality.
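As a quick illustration, here is a minimal sketch of constructing two of the chunkers above. The parameter names and the embedding model shown are examples only and may differ between Chonkie versions; see each chunker's page for the exact signature.

```python
from chonkie import RecursiveChunker, SemanticChunker

# Structure-aware chunking for long, well-organized documents.
recursive_chunker = RecursiveChunker(chunk_size=512)

# Similarity-based grouping for topical coherence; requires an embedding model
# (the model name below is only an example).
semantic_chunker = SemanticChunker(
    embedding_model="minishlab/potion-base-8M",
    threshold=0.5,
    chunk_size=512,
)
```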
Availability
Different chunkers are available depending on your installation:
Chunker | Default | embeddings | 'all' | Chonkie Cloud |
---|---|---|---|---|
TokenChunker | ✅ | ✅ | ✅ | ✅ |
SentenceChunker | ✅ | ✅ | ✅ | ✅ |
RecursiveChunker | ✅ | ✅ | ✅ | ✅ |
CodeChunker | ❌ | ❌ | ✅ | ✅ |
SemanticChunker | ❌ | ✅ | ✅ | ✅ |
SDPMChunker | ❌ | ✅ | ✅ | ✅ |
LateChunker | ❌ | ✅ | ✅ | ✅ |
NeuralChunker | ❌ | ❌ | ✅ | ✅ |
SlumberChunker | ❌ | ❌ | ✅ | ✅ |
Common Interface
All chunkers share a consistent interface:
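As a rough sketch (assuming the TokenChunker and the chunk fields shown below, which may vary by version), every chunker can be called directly on a string or through its chunk method and returns a list of chunk objects carrying the chunk text plus metadata:

```python
from chonkie import TokenChunker

chunker = TokenChunker(chunk_size=256, chunk_overlap=32)

text = "Some long document text that needs to be split into pieces..."

# Chunkers are callable; chunker(text) behaves like chunker.chunk(text).
chunks = chunker(text)

for chunk in chunks:
    # Each chunk carries its text along with positional and size metadata.
    print(chunk.text, chunk.start_index, chunk.end_index, chunk.token_count)

# Chunk several documents in one call.
batch_chunks = chunker.chunk_batch(["first document...", "second document..."])
```

The same call pattern applies to the other chunkers; only the constructor arguments change.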