The NeuralChunker leverages the power of deep learning! It uses a fine-tuned BERT model specifically trained to identify semantic shifts within text, allowing it to split documents at points where the topic or context changes significantly. This produces highly coherent chunks ideal for RAG.
API Reference
To use the NeuralChunker via the API, check out the API reference documentation.
Installation
NeuralChunker requires specific dependencies for its deep learning model. You can install it with:
pip install "chonkie[neural]"
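To verify the optional dependencies are in place, a quick import check is enough. Note this is only a rough sanity test; depending on how chonkie handles optional imports, missing dependencies may only surface at initialization time:

from chonkie import NeuralChunker  # raises ImportError if chonkie itself is missing

print(NeuralChunker)  # some optional deps may only be checked when the chunker is created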
Initialization
from chonkie import NeuralChunker

# Basic initialization with default parameters
chunker = NeuralChunker(
    model="mirth/chonky_modernbert_base_1",  # Default model
    device_map="cpu",                        # Device to run the model on ('cpu', 'cuda', etc.)
    min_characters_per_chunk=10,             # Minimum characters for a chunk
    return_type="chunks",                    # Output type
)

# Specify a different model or device
chunker = NeuralChunker(
    model="path/to/your/model",
    device_map="cuda:0",  # Use GPU if available
)
Parameters
model
str
default:"mirth/chonky_modernbert_base_1"
The identifier or path to the fine-tuned BERT model used for detecting semantic shifts.

device_map
str
default:"cpu"
The device to run inference on (e.g., "cpu", "cuda", "mps"). Chonkie will try to auto-detect the best available device if not specified.

min_characters_per_chunk
int
default:10
The minimum number of characters required for a text segment to be considered a valid chunk.

return_type
Literal['chunks', 'texts']
default:"chunks"
Whether to return chunks as Chunk objects or plain text strings.
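Putting the parameters together, here is a minimal sketch that picks a device explicitly and requests plain strings. The torch-based device check is an assumption (most BERT-style checkpoints run on PyTorch) and is not part of Chonkie's API:

import torch  # assumption: PyTorch is available alongside the neural extra
from chonkie import NeuralChunker

# Choose a device by hand instead of relying on auto-detection.
device = "cuda:0" if torch.cuda.is_available() else "cpu"

chunker = NeuralChunker(
    model="mirth/chonky_modernbert_base_1",
    device_map=device,
    min_characters_per_chunk=10,
    return_type="texts",  # return plain strings instead of Chunk objects
)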
Usage
Single Text Chunking
text = """Topic one starts here and continues for a bit.
Suddenly, the context shifts to topic two, which is quite different.
Topic two carries on, discussing various aspects. Then topic one briefly returns.
Finally, we conclude with topic three."""
chunks = chunker.chunk(text)
for chunk in chunks:
    print(f"Chunk text: {chunk.text}")
    print(f"Token count: {chunk.token_count}")  # Note: token_count might be added post-hoc or not available depending on implementation
    print(f"Start index: {chunk.start_index}")
    print(f"End index: {chunk.end_index}")
Batch Chunking
texts = [
    "Document 1 discussing AI ethics. Then shifts to model training techniques.",
    "Document 2 about pygmy hippos. Their habitat and diet. Then conservation efforts."
]
batch_chunks = chunker.chunk_batch(texts)
for doc_chunks in batch_chunks:
    for chunk in doc_chunks:
        print(f"Chunk: {chunk.text}")
Using as a Callable
# Single text
chunks = chunker("Text discussing topic A... then topic B...")
# Multiple texts
batch_chunks = chunker(["Text 1...", "Text 2..."])
Return Type
NeuralChunker returns chunks as Chunk objects.
from dataclasses import dataclass
from typing import Optional

# Definition similar to TokenChunker's return type
@dataclass
class Context:
    text: str
    token_count: int
    start_index: Optional[int] = None
    end_index: Optional[int] = None

@dataclass
class Chunk:
    text: str                          # The chunk text
    start_index: int                   # Starting position in original text
    end_index: int                     # Ending position in original text
    token_count: int                   # Number of tokens in chunk (calculated based on the tokenizer used)
    context: Optional[Context] = None  # Contextual information (if any)
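For a RAG pipeline, you typically only need the chunk text. A minimal sketch of handing chunks to an embedding step, where embed_fn is a hypothetical placeholder rather than part of Chonkie:

# Collect plain strings for embedding or indexing downstream.
chunk_texts = [chunk.text for chunk in chunks]

# embed_fn is a hypothetical stand-in for your embedding model, not a Chonkie API.
# vectors = embed_fn(chunk_texts)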