Meet the SlumberChunker, Chonkie's first agentic chunker! This isn't your average chunker; it uses the reasoning power of large language models (LLMs) to understand your text deeply and create truly S-tier chunks.

API Reference

To use the SlumberChunker via the API, check out the API reference documentation.

Introducing Genie! 🧞

The magic behind SlumberChunker is Genie, Chonkie's new interface for integrating generative models and APIs such as Gemini, OpenAI, and Anthropic. Genie allows SlumberChunker to intelligently analyze text structure, identify optimal split points, and even summarize or rephrase content for the best possible chunk quality.
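
Because every Genie just maps a prompt to a model response, you can also plug in your own backend. Below is a minimal sketch of a custom Genie, assuming BaseGenie exposes a generate() method; check the Genie reference for the exact abstract interface.

from chonkie.genie import BaseGenie

class EchoGenie(BaseGenie):
    """Toy Genie for illustration: a real subclass would call an
    LLM API (Gemini, OpenAI, Anthropic, ...) inside generate()."""

    def generate(self, prompt: str) -> str:
        # Forward `prompt` to your model of choice and return its reply.
        return "..."  # placeholder response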

Requires [genie] Install

To unleash the power of SlumberChunker and Genie, you need the [genie] optional install. This includes the necessary libraries to connect to various generative model APIs.

pip install "chonkie[genie]"

Installation

As mentioned, SlumberChunker requires the [genie] optional install:

pip install "chonkie[genie]"

For general installation instructions, see the Installation Guide.

Initialization

from chonkie import SlumberChunker
from chonkie.genie import AutoGenie # Example: Using AutoGenie to load a model

# Initialize Genie (e.g., configure for Gemini)
# Make sure you have the necessary API keys set as environment variables!
genie = AutoGenie.get_genie("gemini/gemini-1.5-flash")

# Basic initialization
chunker = SlumberChunker(
    genie=genie,
    tokenizer_or_token_counter="gpt2",
    chunk_size=1024,
    candidate_size=128, # How many tokens Genie looks at for potential splits
    min_characters_per_chunk=24,
    verbose=True # See what Genie is thinking!
)

# You can also rely on the default Genie setup if one is configured globally
# chunker = SlumberChunker()  # Uses the default Genie if available
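
If you'd rather not go through AutoGenie, you can construct a provider-specific Genie directly. A quick sketch, assuming GeminiGenie accepts a model name and picks up the API key from the environment (the model identifier is illustrative):

from chonkie.genie import GeminiGenie

# Assumed constructor: pass a model name; the API key is typically
# read from an environment variable (e.g. GEMINI_API_KEY).
gemini = GeminiGenie("gemini-1.5-flash")

chunker = SlumberChunker(genie=gemini, chunk_size=1024)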

Parameters

genie
Optional[BaseGenie]
default:"None"

An instance of a Genie interface (e.g., GeminiGenie, OpenAIGenie). If None, SlumberChunker attempts to load a default Genie configuration; a working Genie is required for chunking.

tokenizer_or_token_counter
Union[str, Callable, Any]
default:"gpt2"

Tokenizer or token counting function used for initial splitting and size estimation. Any callable that maps a string to a token count works; see the sketch after this parameter list.

chunk_size
int
default:"1024"

The target maximum number of tokens per chunk. Genie will try to adhere to this.

rules
RecursiveRules
default:"RecursiveRules()"

Initial recursive rules used to generate candidate split points before Genie refines them. See RecursiveChunker for details, and the sketch after this parameter list for a custom configuration.

candidate_size
int
default:"128"

The number of tokens around a potential split point that Genie examines to make its decision.

min_characters_per_chunk
int
default:"24"

Minimum number of characters required for a chunk to be considered valid.

return_type
Literal['chunks', 'texts']
default:"chunks"

Whether to return chunks as Chunk objects or plain text strings.

verbose
bool
default:"True"

If True, prints detailed information about Genie’s decision-making process during chunking. Useful for debugging!
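
Both tokenizer_or_token_counter and rules accept custom values. A quick sketch, assuming RecursiveRules and RecursiveLevel behave as they do in RecursiveChunker (the delimiter sets are illustrative):

from chonkie import SlumberChunker, RecursiveRules, RecursiveLevel

# Any callable that maps a string to an int works as a token counter.
def word_counter(text: str) -> int:
    return len(text.split())

# Propose candidate splits at paragraph boundaries first,
# then at sentence boundaries.
rules = RecursiveRules(levels=[
    RecursiveLevel(delimiters=["\n\n"]),
    RecursiveLevel(delimiters=[". ", "! ", "? "]),
])

chunker = SlumberChunker(
    genie=genie,  # reuse the Genie from the initialization example
    tokenizer_or_token_counter=word_counter,
    rules=rules,
    chunk_size=512,
)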

Usage

Single Text Chunking

text = """Complex document with interwoven ideas. Section 1 introduces concept A.
Section 2 discusses concept B, but references A frequently.
Section 3 concludes by merging A and B. Traditional chunkers might struggle here."""

# Assuming 'chunker' is initialized as shown above
chunks = chunker.chunk(text)

for chunk in chunks:
    print(f"Chunk text: {chunk.text}")
    print(f"Token count: {chunk.token_count}")
    print(f"Start index: {chunk.start_index}")
    print(f"End index: {chunk.end_index}")
    # Chunks may carry additional metadata from Genie

Batch Chunking

texts = [
    "First document requiring nuanced splitting...",
    "Second document where agentic understanding helps..."
]
batch_chunks = chunker.chunk_batch(texts)  # Note: batch processing can be slow, since each document triggers LLM calls

for doc_chunks in batch_chunks:
    for chunk in doc_chunks:
        print(f"Chunk: {chunk.text}")

Using as a Callable

# Single text
chunks = chunker("Let Genie decide the best way to CHONK this...")

# Multiple texts
batch_chunks = chunker(["Text 1...", "Text 2..."])

Return Type

SlumberChunker returns chunks as Chunk objects, potentially with extra metadata attached depending on the configuration and Genie’s output.

from dataclasses import dataclass
from typing import Optional

# Definition similar to TokenChunker's return type
@dataclass
class Context:
    text: str
    token_count: int
    start_index: Optional[int] = None
    end_index: Optional[int] = None

@dataclass
class Chunk:
    text: str           # The chunk text
    start_index: int    # Starting position in original text
    end_index: int      # Ending position in original text
    token_count: int    # Number of tokens in chunk
    context: Optional[Context] = None # Contextual information if any
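
If you only need the raw strings, set return_type="texts" to skip the Chunk wrapper entirely. Reusing the genie from the initialization example:

# With return_type="texts", chunk() returns plain strings.
text_chunker = SlumberChunker(genie=genie, return_type="texts")
texts = text_chunker.chunk("Some long document...")
# texts is a list[str] rather than a list of Chunk objects.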