LateChunker
Split text into chunks based on a late-bound token count
The LateChunker implements the late chunking strategy described in the Late Chunking paper. It builds on top of the RecursiveChunker and uses document-level embeddings to create more semantically rich chunk representations.
Instead of generating embeddings for each chunk independently, the LateChunker first encodes the entire text in a single pass, producing token-level embeddings for the whole document. It then splits the text using recursive rules and derives each chunk’s embedding by averaging the token embeddings that fall within that chunk. This allows each chunk to carry broader contextual information, which can improve retrieval performance in RAG systems.
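The pooling step can be illustrated with a minimal, library-free sketch. It assumes the document has already been encoded into one embedding per token and that each chunk is known as a half-open span of token positions. The names `mean_pool` and `late_chunk_embeddings` and the toy vectors are hypothetical and only illustrate the idea; they are not the LateChunker's internals.

```python
from typing import List, Tuple

Vector = List[float]


def mean_pool(vectors: List[Vector]) -> Vector:
    """Average a non-empty list of equal-length vectors, dimension by dimension."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def late_chunk_embeddings(
    token_embeddings: List[Vector],
    chunk_spans: List[Tuple[int, int]],
) -> List[Vector]:
    """Pool document-level token embeddings over each chunk's token span.

    Each span is a half-open (start, end) range of token positions.
    """
    return [mean_pool(token_embeddings[start:end]) for start, end in chunk_spans]


# Toy stand-in for one forward pass over the whole document:
# four tokens, each with a two-dimensional embedding.
token_embeddings = [[1.0, 0.0], [3.0, 2.0], [0.0, 4.0], [2.0, 6.0]]

# Token spans for two chunks, as a recursive splitter might produce them.
chunk_spans = [(0, 2), (2, 4)]

chunk_vectors = late_chunk_embeddings(token_embeddings, chunk_spans)
print(chunk_vectors)
```

Because every token embedding was computed with the full document in view, each pooled chunk vector reflects context from outside its own span.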
API Reference
To use the LateChunker via the API, check out the API reference documentation.
Installation
LateChunker requires the sentence-transformers library to be installed, and currently only supports SentenceTransformer models.
You can install it with:
The LateChunker uses RecursiveRules to determine how to chunk the text. The rules are a list of RecursiveLevel objects, which define the delimiters and whitespace rules for each level of the recursive tree. Find more information about the rules in the Additional Information section.
Initialization
You can also initialize the LateChunker using a recipe. Recipes are pre-defined rules for common chunking tasks. All available recipes are listed on our Hugging Face Hub.
Parameters
SentenceTransformer model to use for embedding
Maximum number of tokens per chunk
Rules to use for chunking
Minimum number of characters per sentence
Usage
Single Text Chunking
Batch Chunking
Return Type
LateChunker returns LateChunk objects with optimized storage using slots:
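As a rough illustration of what slot-based storage means in Python, the sketch below defines a hypothetical `LateChunkSketch` class. The field names (`text`, `start_index`, `end_index`, `token_count`, `embedding`) are assumptions made for this example and may not match the real LateChunk definition.

```python
from typing import List, Optional


class LateChunkSketch:
    """Hypothetical stand-in for a chunk record; field names are assumptions."""

    # Declaring __slots__ removes the per-instance __dict__, which reduces
    # memory use when many chunk objects are created.
    __slots__ = ("text", "start_index", "end_index", "token_count", "embedding")

    def __init__(
        self,
        text: str,
        start_index: int,
        end_index: int,
        token_count: int,
        embedding: Optional[List[float]] = None,
    ) -> None:
        self.text = text
        self.start_index = start_index
        self.end_index = end_index
        self.token_count = token_count
        self.embedding = embedding


chunk = LateChunkSketch("Hello world", 0, 11, 2, [0.1, 0.2])

# Attributes outside __slots__ cannot be added to an instance.
try:
    chunk.extra = "not allowed"
    rejected_extra_attribute = False
except AttributeError:
    rejected_extra_attribute = True

print(chunk.text, chunk.token_count, rejected_extra_attribute)
```

The trade-off is that slotted objects cannot grow new attributes at runtime, which is usually acceptable for fixed-shape records such as chunks.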
Additional Information
LateChunker uses the RecursiveRules class to determine the chunking rules. The rules are a list of RecursiveLevel objects, which define the delimiters and whitespace rules for each level of the recursive tree.
You can pass in custom rules to the LateChunker, or use the default ones. Default rules are designed to be a good starting point for most documents, but you can customize them to your needs.
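To make the idea of a recursive tree of rules concrete, here is a simplified sketch of level-by-level splitting. It is not the library's implementation: each level is modeled as a plain list of delimiter strings, tokens are approximated by whitespace-separated words, and small pieces are not merged back together. The function names are hypothetical.

```python
from typing import List


def count_tokens(text: str) -> int:
    """Crude token count: whitespace-separated words (an assumption of this sketch)."""
    return len(text.split())


def split_keep_delimiter(text: str, delimiters: List[str]) -> List[str]:
    """Split after each delimiter, keeping the delimiter attached to the preceding piece."""
    pieces = [text]
    for delimiter in delimiters:
        next_pieces = []
        for piece in pieces:
            parts = piece.split(delimiter)
            # Re-attach the delimiter to every part except the last one.
            next_pieces.extend(part + delimiter for part in parts[:-1])
            next_pieces.append(parts[-1])
        pieces = [p for p in next_pieces if p]
    return pieces


def recursive_chunk(text: str, levels: List[List[str]], chunk_size: int) -> List[str]:
    """Split with the first level; recurse into deeper levels for oversized pieces."""
    if count_tokens(text) <= chunk_size or not levels:
        return [text]
    chunks: List[str] = []
    for piece in split_keep_delimiter(text, levels[0]):
        chunks.extend(recursive_chunk(piece, levels[1:], chunk_size))
    return chunks


# Level 1 splits on paragraph breaks, level 2 on sentence boundaries.
levels = [["\n\n"], [". "]]
text = "One two three. Four five six.\n\nSeven eight."

chunks = recursive_chunk(text, levels, chunk_size=4)
print(chunks)
```

Keeping delimiters attached to the preceding piece means the chunks concatenate back to the original text, which makes character offsets easy to track.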
RecursiveLevel expects the list of custom delimiters to not include whitespace. If whitespace as a delimiter is required, you can set the whitespace parameter in the RecursiveLevel class to True. Note that if whitespace = True, you cannot pass a list of custom delimiters.
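The two constraints above can be sketched as a small validation routine. The `LevelSketch` dataclass below is a hypothetical model written for illustration, not the library's RecursiveLevel; it also assumes that "whitespace" in the delimiter list means a bare space character.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class LevelSketch:
    """Hypothetical model of one recursion level; not the library's RecursiveLevel."""

    delimiters: Optional[List[str]] = None
    whitespace: bool = False

    def __post_init__(self) -> None:
        # Whitespace splitting and custom delimiters are mutually exclusive.
        if self.whitespace and self.delimiters:
            raise ValueError("Use either whitespace=True or custom delimiters, not both.")
        # Assumption: a bare space is the whitespace delimiter that is rejected.
        if self.delimiters and " " in self.delimiters:
            raise ValueError("Delimiters must not be whitespace; set whitespace=True instead.")


def is_valid(**kwargs) -> bool:
    """Return True when the given arguments describe an acceptable level."""
    try:
        LevelSketch(**kwargs)
        return True
    except ValueError:
        return False


print(is_valid(delimiters=["\n\n", ". "]))            # custom delimiters only
print(is_valid(whitespace=True))                      # whitespace splitting only
print(is_valid(delimiters=["."], whitespace=True))    # both at once
print(is_valid(delimiters=[" "]))                     # whitespace passed as a delimiter
```

Validating at construction time surfaces a conflicting configuration immediately, rather than during chunking.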