Deprecated as of v1.2.0

The SDPM (Semantic Double-Pass Merging) functionality has been integrated into the main SemanticChunker.

Recommended Migration:
# Old way (deprecated)
from chonkie.legacy import SDPMChunker
chunker = SDPMChunker(skip_window=1)

# New way (recommended)
from chonkie import SemanticChunker
chunker = SemanticChunker(skip_window=1)
The new SemanticChunker provides all SDPM capabilities plus additional improvements like Savitzky-Golay filtering for better boundary detection.
The SDPMChunker extends semantic chunking by using a double-pass merging approach. It first groups content by semantic similarity, then merges similar groups within a skip window, allowing it to connect related content that may not be consecutive in the text.
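The double-pass idea described above can be sketched in a few lines. Note that the grouping rule, centroid comparison, and plain cosine similarity below are illustrative simplifications, not Chonkie's actual implementation:

```python
# Illustrative sketch of double-pass merging (not Chonkie's internal code).
# Pass 1 groups consecutive items whose similarity clears a threshold;
# pass 2 merges a group with later groups within `skip_window`, connecting
# related content that is not consecutive in the text.

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

def double_pass_merge(embeddings, threshold=0.5, skip_window=1):
    if not embeddings:
        return []

    # Pass 1: group consecutive items by similarity to the previous item.
    groups = [[0]]
    for i in range(1, len(embeddings)):
        if cosine(embeddings[i - 1], embeddings[i]) >= threshold:
            groups[-1].append(i)
        else:
            groups.append([i])

    def centroid(group):
        return [sum(embeddings[i][d] for i in group) / len(group)
                for d in range(len(embeddings[0]))]

    # Pass 2: consider the adjacent group plus up to `skip_window`
    # groups beyond it, merging when centroids are similar enough.
    merged, used = [], set()
    for gi, group in enumerate(groups):
        if gi in used:
            continue
        current = list(group)
        for gj in range(gi + 1, min(gi + 2 + skip_window, len(groups))):
            if gj in used:
                continue
            if cosine(centroid(current), centroid(groups[gj])) >= threshold:
                current.extend(groups[gj])
                used.add(gj)
        merged.append(current)
    return merged
```

With `skip_window=1`, two similar passages separated by one dissimilar group end up in the same chunk, which is the behavior SDPM adds over single-pass semantic grouping.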

Why Use the New SemanticChunker Instead?

The new SemanticChunker includes all SDPM functionality plus:
  • Better performance: Optimized C extensions for faster processing
  • Smoother boundaries: Savitzky-Golay filtering for noise reduction
  • Cleaner API: Simplified parameter names and improved defaults
  • Active development: Ongoing improvements and bug fixes
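The smoothing improvement can be illustrated with the classic 5-point quadratic Savitzky-Golay kernel (coefficients [-3, 12, 17, 12, -3] / 35). This is a generic sketch of why filtering similarity scores helps boundary detection; it does not reproduce Chonkie's internal filter settings:

```python
# Sketch: smoothing a noisy similarity curve with the 5-point quadratic
# Savitzky-Golay kernel before thresholding. Isolated single-point noise
# is damped, while sustained similarity drops still register as boundaries.

KERNEL = [-3, 12, 17, 12, -3]
NORM = 35

def savgol5(values):
    """Apply the 5-point filter; the two edge values on each side are left as-is."""
    out = list(values)
    for i in range(2, len(values) - 2):
        window = values[i - 2:i + 3]
        out[i] = sum(c * v for c, v in zip(KERNEL, window)) / NORM
    return out

def boundaries(similarities, threshold=0.5):
    """Indices where the smoothed similarity drops below the threshold."""
    smooth = savgol5(similarities)
    return [i for i, s in enumerate(smooth) if s < threshold]
```

A quadratic filter reproduces any quadratic signal exactly, so genuine trends pass through unchanged while high-frequency noise is attenuated.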

Legacy Installation

If you need to use the legacy version for compatibility:
pip install "chonkie[semantic]"
Then import from the legacy module:
from chonkie.legacy import SDPMChunker

Legacy Usage

This documentation is preserved for users who need to maintain existing code using SDPMChunker. For new projects, please use the main SemanticChunker.

Basic Initialization

from chonkie.legacy import SDPMChunker

# Legacy initialization
chunker = SDPMChunker(
    embedding_model="minishlab/potion-base-8M",  # model identifier or embedding instance
    threshold=0.5,                               # similarity threshold (0-1) or "auto"
    chunk_size=2048,                             # maximum tokens per chunk
    min_sentences=1,                             # minimum sentences per chunk
    skip_window=1                                # number of chunks to skip when merging
)

Legacy Parameters

The legacy SDPMChunker uses these parameters (many now renamed in the new SemanticChunker):
  • embedding_model: Model identifier or embedding instance
  • mode: "cumulative" or "window" (removed in new version)
  • threshold: Similarity threshold (0-1) or "auto"
  • chunk_size: Maximum tokens per chunk
  • similarity_window: Sentences for threshold calculation
  • min_sentences: Minimum sentences per chunk (now min_sentences_per_chunk)
  • min_chunk_size: Minimum tokens per chunk (removed in new version)
  • min_characters_per_sentence: Minimum characters per sentence
  • threshold_step: Step size for threshold calculation (removed in new version)
  • skip_window: Number of chunks to skip when merging
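The renames and removals above can be applied mechanically. The helper below is a hypothetical convenience, not part of the chonkie API; it only covers the mappings listed in this section:

```python
# Hypothetical helper translating legacy SDPMChunker kwargs into
# SemanticChunker kwargs, based on the parameter list above.
# Removed parameters are reported back so callers can review them.

RENAMED = {"min_sentences": "min_sentences_per_chunk"}
REMOVED = {"mode", "min_chunk_size", "threshold_step"}

def migrate_kwargs(legacy):
    """Return (new_kwargs, dropped_names) for a dict of legacy kwargs."""
    new, dropped = {}, []
    for key, value in legacy.items():
        if key in REMOVED:
            dropped.append(key)
        else:
            new[RENAMED.get(key, key)] = value
    return new, dropped
```

Parameters that were removed (such as `mode`) have no new-API equivalent, so dropping them silently would hide behavior changes; returning them lets the caller decide.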

Example Migration

Old Code (Legacy)

from chonkie.legacy import SDPMChunker

chunker = SDPMChunker(
    embedding_model="minishlab/potion-base-8M",
    mode="window",
    threshold="auto",
    chunk_size=512,
    min_sentences=1,
    min_chunk_size=2,
    skip_window=1
)

chunks = chunker.chunk(text)
for chunk in chunks:
    print(f"Sentences: {len(chunk.sentences)}")

New Code (Recommended)

from chonkie import SemanticChunker

chunker = SemanticChunker(
    embedding_model="minishlab/potion-base-8M",
    threshold=0.7,  # Explicit threshold instead of "auto"
    chunk_size=512,
    min_sentences_per_chunk=1,  # Renamed parameter
    skip_window=1  # Same functionality
)

chunks = chunker.chunk(text)
for chunk in chunks:
    print(f"Token count: {chunk.token_count}")

Return Type Changes

Legacy Return Type

The legacy SDPMChunker returns SemanticChunk objects with sentence details:
@dataclass
class SemanticChunk:
    text: str
    start_index: int
    end_index: int
    token_count: int
    sentences: List[SemanticSentence]  # Detailed sentence information

New Return Type

The new SemanticChunker returns simpler Chunk objects:
@dataclass
class Chunk:
    text: str
    start_index: int
    end_index: int
    token_count: int
    # No sentence details - cleaner and more efficient
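Code that reads per-sentence details therefore needs updating when migrating. The sketch below uses simplified dataclass stand-ins mirroring the fields shown above (they are not imported from chonkie) to show one defensive pattern that works with both return types:

```python
# Stand-in dataclasses mirroring the legacy and new return types above,
# plus a summary function that falls back to token_count when the
# `sentences` attribute is absent.

from dataclasses import dataclass, field
from typing import List

@dataclass
class SemanticSentence:  # simplified stand-in
    text: str

@dataclass
class LegacySemanticChunk:
    text: str
    start_index: int
    end_index: int
    token_count: int
    sentences: List[SemanticSentence] = field(default_factory=list)

@dataclass
class Chunk:
    text: str
    start_index: int
    end_index: int
    token_count: int

def describe(chunk):
    """Portable summary: prefer sentence details, else fall back to tokens."""
    sentences = getattr(chunk, "sentences", None)
    if sentences is not None:
        return f"{len(sentences)} sentences"
    return f"{chunk.token_count} tokens"
```

During a gradual migration, `getattr` with a default avoids branching on chunk class names.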

Full Legacy Documentation

For users who must use the legacy version, the complete original functionality remains available:
from chonkie.legacy import SDPMChunker

# All original parameters still work
chunker = SDPMChunker(
    embedding_model="minishlab/potion-base-8M",
    mode="window",
    threshold="auto",
    chunk_size=2048,
    similarity_window=1,
    min_sentences=1,
    min_chunk_size=2,
    min_characters_per_sentence=12,
    threshold_step=0.01,
    delim=[". ", "! ", "? ", "\n"],
    include_delim="prev",
    skip_window=1
)

# Original methods preserved
chunks = chunker.chunk(text)
batch_chunks = chunker.chunk_batch(texts)

Support

While the legacy SDPMChunker remains available for backward compatibility, it is no longer actively developed. Please consider migrating to the new SemanticChunker for:
  • Better performance
  • Active bug fixes
  • New features
  • Ongoing support
For migration assistance, see the SemanticChunker documentation or open an issue on our GitHub repository.