The NeuralChunker leverages the power of deep learning! It uses a fine-tuned BERT model specifically trained to identify semantic shifts within text, allowing it to split documents at points where the topic or context changes significantly. This produces highly coherent chunks ideal for RAG.
API Reference
To use the NeuralChunker via the API, check out the API reference documentation.
Installation
NeuralChunker requires specific dependencies for its deep learning model. You can install it with:
pip install "chonkie[neural]"
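To verify the optional dependencies are in place, a quick import check is enough. Note this is only a rough sanity test; depending on how chonkie handles optional imports, missing dependencies may only surface at initialization time:

from chonkie import NeuralChunker  # raises ImportError if chonkie itself is missing

print(NeuralChunker)  # some optional deps may only be checked when the chunker is created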
Initialization
from chonkie import NeuralChunker

# Basic initialization with default parameters
chunker = NeuralChunker(
    model="mirth/chonky_modernbert_base_1",  # Default model
    device_map="cpu",                        # Device to run the model on ('cpu', 'cuda', etc.)
    min_characters_per_chunk=10,             # Minimum characters for a chunk
    return_type="chunks",                    # Output type
)

# Specify a different model or device
chunker = NeuralChunker(
    model="path/to/your/model",
    device_map="cuda:0",  # Use GPU if available
)
Parameters
model
str
default:"mirth/chonky_modernbert_base_1"
The identifier or path to the fine-tuned BERT model used for detecting semantic shifts.

device_map
str
default:"cpu"
The device to run inference on (e.g., "cpu", "cuda", "mps"). Chonkie will try to auto-detect the best available device if not specified.

min_characters_per_chunk
int
default:10
The minimum number of characters required for a text segment to be considered a valid chunk.

return_type
Literal['chunks', 'texts']
default:"chunks"
Whether to return chunks as Chunk objects or plain text strings.
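Putting the parameters together, here is a minimal sketch that picks a device explicitly and requests plain strings. The torch-based device check is an assumption (most BERT-style checkpoints run on PyTorch) and is not part of Chonkie's API:

import torch  # assumption: PyTorch is available alongside the neural extra
from chonkie import NeuralChunker

# Choose a device by hand instead of relying on auto-detection.
device = "cuda:0" if torch.cuda.is_available() else "cpu"

chunker = NeuralChunker(
    model="mirth/chonky_modernbert_base_1",
    device_map=device,
    min_characters_per_chunk=10,
    return_type="texts",  # return plain strings instead of Chunk objects
)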
Usage
Single Text Chunking
text = """Topic one starts here and continues for a bit.
Suddenly, the context shifts to topic two, which is quite different.
Topic two carries on, discussing various aspects. Then topic one briefly returns.
Finally, we conclude with topic three."""
chunks = chunker.chunk(text)
for chunk in chunks:
    print(f"Chunk text: {chunk.text}")
    print(f"Token count: {chunk.token_count}")  # Note: token_count might be added post-hoc or not available depending on implementation
    print(f"Start index: {chunk.start_index}")
    print(f"End index: {chunk.end_index}")
Batch Chunking
texts = [
    "Document 1 discussing AI ethics. Then shifts to model training techniques.",
    "Document 2 about pygmy hippos. Their habitat and diet. Then conservation efforts."
]
batch_chunks = chunker.chunk_batch(texts)
for doc_chunks in batch_chunks:
    for chunk in doc_chunks:
        print(f"Chunk: {chunk.text}")
Using as a Callable
# Single text
chunks = chunker("Text discussing topic A... then topic B...")
# Multiple texts
batch_chunks = chunker(["Text 1...", "Text 2..."])
Return Type
NeuralChunker returns chunks as Chunk objects.
from dataclasses import dataclass
from typing import Optional

# Definition similar to TokenChunker's return type
@dataclass
class Context:
    text: str
    token_count: int
    start_index: Optional[int] = None
    end_index: Optional[int] = None

@dataclass
class Chunk:
    text: str                          # The chunk text
    start_index: int                   # Starting position in original text
    end_index: int                     # Ending position in original text
    token_count: int                   # Number of tokens in chunk (calculated based on the tokenizer used)
    context: Optional[Context] = None  # Contextual information (if any)
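For a RAG pipeline, you typically only need the chunk text. A minimal sketch of handing chunks to an embedding step, where embed_fn is a hypothetical placeholder rather than part of Chonkie:

# Collect plain strings for embedding or indexing downstream.
chunk_texts = [chunk.text for chunk in chunks]

# embed_fn is a hypothetical stand-in for your embedding model, not a Chonkie API.
# vectors = embed_fn(chunk_texts)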