Chonkie Documentation

The LateChunker implements the late chunking strategy described in the Late Chunking paper. It builds on top of the RecursiveChunker and uses document-level embeddings to create more semantically rich chunk representations. Instead of generating embeddings for each chunk independently, the LateChunker first encodes the entire text into a single embedding. It then splits the text using recursive rules and derives each chunk’s embedding by averaging relevant parts of the full document embedding. This allows each chunk to carry broader contextual information, improving retrieval performance in RAG systems.

API Reference

To use the LateChunker via the API, check out the API reference documentation.

Installation

LateChunker requires the sentence-transformers library to be installed, and currently only supports SentenceTransformer models. You can install it with: The LateChunker uses RecursiveRules to determine how to chunk the text. The rules are a list of RecursiveLevel objects, which define the delimiters and whitespace rules for each level of the recursive tree. Find more information about the rules in the Additional Information section.

pip install "chonkie[st]"

For installation instructions, see the Installation Guide.

Initialization

from chonkie import LateChunker

chunker = LateChunker(
    embedding_model="all-MiniLM-L6-v2",
    chunk_size=2048,
    rules=RecursiveRules(),
    min_characters_per_chunk=24,
)

You can also initialize the LateChunker using a recipe. Recipes are pre-defined rules for common chunking tasks. Find all available recipes on our Hugging Face Hub here.

from chonkie import LateChunker

# Initialize the late chunker to chunk Markdown
chunker = LateChunker.from_recipe("markdown", lang="en")

# Initialize the late chunker to chunk Hindi texts
chunker = LateChunker.from_recipe(lang="hi")

Parameters

embedding_model

str

default:"all-MiniLM-L6-v2"

SentenceTransformer model to use for embedding

chunk_size

int

default:"2048"

Maximum number of tokens per chunk

rules

RecursiveRules

default:"RecursiveRules()"

Rules to use for chunking

min_characters_per_chunk

int

default:"24"

Minimum number of characters per sentence

Usage

Single Text Chunking

text = """First paragraph about a specific topic.
Second paragraph continuing the same topic.
Third paragraph switching to a different topic.
Fourth paragraph expanding on the new topic."""

chunks = chunker(text)

for chunk in chunks:
    print(f"Chunk text: {chunk.text}")
    print(f"Token count: {chunk.token_count}")
    print(f"Number of sentences: {len(chunk.sentences)}")

Batch Chunking

texts = [
    "First document about topic A...",
    "Second document about topic B..."
]

batch_chunks = chunker(texts)

for chunk in batch_chunks:
    print(f"Chunk text: {chunk.text}")
    print(f"Token count: {chunk.token_count}")
    print(f"Number of sentences: {len(chunk.sentences)}")

Return Type

LateChunker returns LateChunk objects with optimized storage using slots:

@dataclass
class LateChunk(SentenceChunk):
    text: str  # The text of the chunk.
    start_index: int  # The start index of the chunk.
    end_index: int  # The end index of the chunk.
    token_count: int  # The number of tokens in the chunk.
    start_token: int  # The start token of the chunk.
    end_token: int  # The end token of the chunk.
    sentences: list[LateSentence]  # The sentences in the chunk.
    embedding: Optional[np.ndarray]  # The embedding of the chunk.

Additional Information

LateChunker uses the RecursiveRules class to determine the chunking rules. The rules are a list of RecursiveLevel objects, which define the delimiters and whitespace rules for each level of the recursive tree.

@dataclass
class RecursiveRules:
    rules: List[RecursiveLevel]

@dataclass
class RecursiveLevel:
    delimiters: Union[None, str, List[str]]
    whitespace: bool = False
    include_delim: Optional[Literal["prev", "next"]]  # Whether to include the delimiter at all, or in the previous chunk, or the next chunk.

You can pass in custom rules to the LateChunker, or use the default ones. Default rules are designed to be a good starting point for most documents, but you can customize them to your needs.

RecursiveLevel expects the list of custom delimiters to not include whitespace. If whitespace as a delimiter is required, you can set the whitespace parameter in the RecursiveLevel class to True. Note that if whitespace = True, you cannot pass a list of custom delimiters.

Getting Started

Chunkers

Embeddings

Refinery

Handshakes

Porters

Utils

Experimental

Changelog

Late Chunker

API Reference

Installation

Initialization

Parameters

Usage

Single Text Chunking

Batch Chunking

Return Type

Additional Information

Getting Started

Chunkers

Embeddings

Refinery

Handshakes

Porters

Utils

Experimental

Changelog

​API Reference

​Installation

​Initialization

​Parameters

​Usage

​Single Text Chunking

​Batch Chunking

​Return Type

​Additional Information

API Reference

Installation

Initialization

Parameters

Usage

Single Text Chunking

Batch Chunking

Return Type

Additional Information