The NeuralChunker leverages the power of deep learning! It uses a fine-tuned BERT-style model (ModernBERT by default) trained specifically to identify semantic shifts within text, allowing it to split documents at the points where the topic or context changes significantly. This produces highly coherent chunks that are ideal for RAG pipelines.

API Reference

For full details on the NeuralChunker class and its methods, see the API reference documentation.

Installation

NeuralChunker requires specific dependencies for its deep learning model. You can install it with:

pip install "chonkie[neural]"
For general installation instructions, see the Installation Guide.

Initialization

from chonkie import NeuralChunker

# Basic initialization with default parameters
chunker = NeuralChunker(
    model="mirth/chonky_modernbert_base_1", # Default model
    device="cpu",                            # Device to run the model on ('cpu', 'cuda', etc.)
    min_characters_per_chunk=10,             # Minimum characters for a chunk
    return_type="chunks"                     # Output type
)

# Specify a different model or device
chunker = NeuralChunker(
    model="path/to/your/model",
    device="cuda:0" # Use GPU if available
)
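
If you prefer to pick the device explicitly rather than rely on auto-detection, here is a minimal sketch, assuming torch is installed alongside the [neural] extra:

import torch

from chonkie import NeuralChunker

# Prefer CUDA, then Apple's MPS backend, and fall back to CPU.
if torch.cuda.is_available():
    device = "cuda"
elif torch.backends.mps.is_available():
    device = "mps"
else:
    device = "cpu"

chunker = NeuralChunker(device=device)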

Parameters

model
str
default: "mirth/chonky_modernbert_base_1"

The identifier or local path of the fine-tuned model used to detect semantic shifts.

device
str
default: "cpu"

The device to run inference on (e.g. "cpu", "cuda", "mps"). If none is specified, Chonkie will try to auto-detect the best available device.

min_characters_per_chunk
int
default: 10

The minimum number of characters a text segment must contain to be emitted as a valid chunk.

return_type
Literal['chunks', 'texts']
default: "chunks"

Whether to return chunks as Chunk objects or as plain text strings (see the example below).
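
For example, a minimal sketch combining these options (the threshold value here is illustrative):

from chonkie import NeuralChunker

# Require at least 50 characters per chunk and return plain strings
# instead of Chunk objects.
chunker = NeuralChunker(
    min_characters_per_chunk=50,
    return_type="texts",
)

segments = chunker.chunk("A long document whose topics shift over time...")
print(segments)  # a list of strings, one per detected segment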

Usage

Single Text Chunking

text = """Topic one starts here and continues for a bit.
Suddenly, the context shifts to topic two, which is quite different.
Topic two carries on, discussing various aspects. Then topic one briefly returns.
Finally, we conclude with topic three."""

chunks = chunker.chunk(text)

for chunk in chunks:
    print(f"Chunk text: {chunk.text}")
    print(f"Token count: {chunk.token_count}") # Note: token_count might be added post-hoc or not available depending on implementation
    print(f"Start index: {chunk.start_index}")
    print(f"End index: {chunk.end_index}")

Batch Chunking

texts = [
    "Document 1 discussing AI ethics. Then shifts to model training techniques.",
    "Document 2 about pygmy hippos. Their habitat and diet. Then conservation efforts."
]
batch_chunks = chunker.chunk_batch(texts)

for doc_chunks in batch_chunks:
    for chunk in doc_chunks:
        print(f"Chunk: {chunk.text}")

Using as a Callable

# Single text
chunks = chunker("Text discussing topic A... then topic B...")

# Multiple texts
batch_chunks = chunker(["Text 1...", "Text 2..."])

Return Type

With the default return_type="chunks", NeuralChunker returns Chunk objects:

from dataclasses import dataclass
from typing import Optional

# Definition similar to TokenChunker's return type
@dataclass
class Context:
    text: str
    token_count: int
    start_index: Optional[int] = None
    end_index: Optional[int] = None

@dataclass
class Chunk:
    text: str           # The chunk text
    start_index: int    # Starting position in original text
    end_index: int      # Ending position in original text
    token_count: int    # Number of tokens in chunk (calculated based on the tokenizer used)
    context: Optional[Context] = None # Contextual information (if any)
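
Because Chunk is a dataclass, it converts cleanly into plain dicts for downstream storage or indexing. A minimal sketch:

from dataclasses import asdict

# Convert chunks to plain dicts, e.g. for JSON serialization or for
# storing alongside embeddings in a vector database.
records = [asdict(chunk) for chunk in chunker.chunk(text)]
print(records[0]["text"], records[0]["start_index"])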