Deprecated as of v1.2.0

The SDPM (Semantic Double-Pass Merging) functionality has been integrated into the main SemanticChunker.

Recommended Migration:
# Old way (deprecated)
from chonkie.legacy import SDPMChunker
chunker = SDPMChunker(skip_window=1)

# New way (recommended)
from chonkie import SemanticChunker
chunker = SemanticChunker(skip_window=1)
The new SemanticChunker provides all SDPM capabilities plus additional improvements like Savitzky-Golay filtering for better boundary detection.
The SDPMChunker extends semantic chunking by using a double-pass merging approach. It first groups content by semantic similarity, then merges similar groups within a skip window, allowing it to connect related content that may not be consecutive in the text.
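The double-pass idea described above can be sketched in a few lines. Note that the grouping rule, centroid comparison, and plain cosine similarity below are illustrative simplifications, not Chonkie's actual implementation:

```python
# Illustrative sketch of double-pass merging (not Chonkie's internal code).
# Pass 1 groups consecutive items whose similarity clears a threshold;
# pass 2 merges a group with later groups within `skip_window`, connecting
# related content that is not consecutive in the text.

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

def double_pass_merge(embeddings, threshold=0.5, skip_window=1):
    if not embeddings:
        return []

    # Pass 1: group consecutive items by similarity to the previous item.
    groups = [[0]]
    for i in range(1, len(embeddings)):
        if cosine(embeddings[i - 1], embeddings[i]) >= threshold:
            groups[-1].append(i)
        else:
            groups.append([i])

    def centroid(group):
        return [sum(embeddings[i][d] for i in group) / len(group)
                for d in range(len(embeddings[0]))]

    # Pass 2: consider the adjacent group plus up to `skip_window`
    # groups beyond it, merging when centroids are similar enough.
    merged, used = [], set()
    for gi, group in enumerate(groups):
        if gi in used:
            continue
        current = list(group)
        for gj in range(gi + 1, min(gi + 2 + skip_window, len(groups))):
            if gj in used:
                continue
            if cosine(centroid(current), centroid(groups[gj])) >= threshold:
                current.extend(groups[gj])
                used.add(gj)
        merged.append(current)
    return merged
```

With `skip_window=1`, two similar passages separated by one dissimilar group end up in the same chunk, which is the behavior SDPM adds over single-pass semantic grouping.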

Why Use the New SemanticChunker Instead?

The new SemanticChunker includes all SDPM functionality plus:
  • Better performance: Optimized C extensions for faster processing
  • Smoother boundaries: Savitzky-Golay filtering for noise reduction
  • Cleaner API: Simplified parameter names and improved defaults
  • Active development: Ongoing improvements and bug fixes
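The smoothing improvement can be illustrated with the classic 5-point quadratic Savitzky-Golay kernel (coefficients [-3, 12, 17, 12, -3] / 35). This is a generic sketch of why filtering similarity scores helps boundary detection; it does not reproduce Chonkie's internal filter settings:

```python
# Sketch: smoothing a noisy similarity curve with the 5-point quadratic
# Savitzky-Golay kernel before thresholding. Isolated single-point noise
# is damped, while sustained similarity drops still register as boundaries.

KERNEL = [-3, 12, 17, 12, -3]
NORM = 35

def savgol5(values):
    """Apply the 5-point filter; the two edge values on each side are left as-is."""
    out = list(values)
    for i in range(2, len(values) - 2):
        window = values[i - 2:i + 3]
        out[i] = sum(c * v for c, v in zip(KERNEL, window)) / NORM
    return out

def boundaries(similarities, threshold=0.5):
    """Indices where the smoothed similarity drops below the threshold."""
    smooth = savgol5(similarities)
    return [i for i, s in enumerate(smooth) if s < threshold]
```

A quadratic filter reproduces any quadratic signal exactly, so genuine trends pass through unchanged while high-frequency noise is attenuated.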

Legacy Installation

If you need to use the legacy version for compatibility:
pip install "chonkie[semantic]"
Then import from the legacy module:
from chonkie.legacy import SDPMChunker

Legacy Usage

This documentation is preserved for users who need to maintain existing code using SDPMChunker. For new projects, please use the main SemanticChunker.

Basic Initialization

from chonkie.legacy import SDPMChunker

# Legacy initialization
chunker = SDPMChunker(
    embedding_model="minishlab/potion-base-8M",  # model identifier or embedding instance
    threshold=0.5,                               # similarity threshold (0-1) or "auto"
    chunk_size=2048,                             # maximum tokens per chunk
    min_sentences=1,                             # minimum sentences per chunk
    skip_window=1                                # number of chunks to skip when merging
)

Legacy Parameters

The legacy SDPMChunker uses these parameters (many now renamed in the new SemanticChunker):
  • embedding_model: Model identifier or embedding instance
  • mode: "cumulative" or "window" (removed in new version)
  • threshold: Similarity threshold (0-1) or "auto"
  • chunk_size: Maximum tokens per chunk
  • similarity_window: Sentences for threshold calculation
  • min_sentences: Minimum sentences per chunk (now min_sentences_per_chunk)
  • min_chunk_size: Minimum tokens per chunk (removed in new version)
  • min_characters_per_sentence: Minimum characters per sentence
  • threshold_step: Step size for threshold calculation (removed in new version)
  • skip_window: Number of chunks to skip when merging
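The renames and removals above can be applied mechanically. The helper below is a hypothetical convenience, not part of the chonkie API; it only covers the mappings listed in this section:

```python
# Hypothetical helper translating legacy SDPMChunker kwargs into
# SemanticChunker kwargs, based on the parameter list above.
# Removed parameters are reported back so callers can review them.

RENAMED = {"min_sentences": "min_sentences_per_chunk"}
REMOVED = {"mode", "min_chunk_size", "threshold_step"}

def migrate_kwargs(legacy):
    """Return (new_kwargs, dropped_names) for a dict of legacy kwargs."""
    new, dropped = {}, []
    for key, value in legacy.items():
        if key in REMOVED:
            dropped.append(key)
        else:
            new[RENAMED.get(key, key)] = value
    return new, dropped
```

Parameters that were removed (such as `mode`) have no new-API equivalent, so dropping them silently would hide behavior changes; returning them lets the caller decide.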

Example Migration

Old Code (Legacy)

from chonkie.legacy import SDPMChunker

chunker = SDPMChunker(
    embedding_model="minishlab/potion-base-8M",
    mode="window",
    threshold="auto",
    chunk_size=512,
    min_sentences=1,
    min_chunk_size=2,
    skip_window=1
)

chunks = chunker.chunk(text)
for chunk in chunks:
    print(f"Sentences: {len(chunk.sentences)}")

New Code (Recommended)

from chonkie import SemanticChunker

chunker = SemanticChunker(
    embedding_model="minishlab/potion-base-8M",
    threshold=0.7,  # Explicit threshold instead of "auto"
    chunk_size=512,
    min_sentences_per_chunk=1,  # Renamed parameter
    skip_window=1  # Same functionality
)

chunks = chunker.chunk(text)
for chunk in chunks:
    print(f"Token count: {chunk.token_count}")

Return Type Changes

Legacy Return Type

The legacy SDPMChunker returns SemanticChunk objects with sentence details:
@dataclass
class SemanticChunk:
    text: str
    start_index: int
    end_index: int
    token_count: int
    sentences: List[SemanticSentence]  # Detailed sentence information

New Return Type

The new SemanticChunker returns simpler Chunk objects:
@dataclass
class Chunk:
    text: str
    start_index: int
    end_index: int
    token_count: int
    # No sentence details - cleaner and more efficient
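Code that reads per-sentence details therefore needs updating when migrating. The sketch below uses simplified dataclass stand-ins mirroring the fields shown above (they are not imported from chonkie) to show one defensive pattern that works with both return types:

```python
# Stand-in dataclasses mirroring the legacy and new return types above,
# plus a summary function that falls back to token_count when the
# `sentences` attribute is absent.

from dataclasses import dataclass, field
from typing import List

@dataclass
class SemanticSentence:  # simplified stand-in
    text: str

@dataclass
class LegacySemanticChunk:
    text: str
    start_index: int
    end_index: int
    token_count: int
    sentences: List[SemanticSentence] = field(default_factory=list)

@dataclass
class Chunk:
    text: str
    start_index: int
    end_index: int
    token_count: int

def describe(chunk):
    """Portable summary: prefer sentence details, else fall back to tokens."""
    sentences = getattr(chunk, "sentences", None)
    if sentences is not None:
        return f"{len(sentences)} sentences"
    return f"{chunk.token_count} tokens"
```

During a gradual migration, `getattr` with a default avoids branching on chunk class names.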

Full Legacy Documentation

For users who must use the legacy version, the complete original functionality remains available:
from chonkie.legacy import SDPMChunker

# All original parameters still work
chunker = SDPMChunker(
    embedding_model="minishlab/potion-base-8M",
    mode="window",
    threshold="auto",
    chunk_size=2048,
    similarity_window=1,
    min_sentences=1,
    min_chunk_size=2,
    min_characters_per_sentence=12,
    threshold_step=0.01,
    delim=[". ", "! ", "? ", "\n"],
    include_delim="prev",
    skip_window=1
)

# Original methods preserved
chunks = chunker.chunk(text)
batch_chunks = chunker.chunk_batch(texts)

Support

While the legacy SDPMChunker remains available for backward compatibility, it is no longer actively developed. Please consider migrating to the new SemanticChunker for:
  • Better performance
  • Active bug fixes
  • New features
  • Ongoing support
For migration assistance, see the SemanticChunker documentation or open an issue on our GitHub repository.