The SemanticChunker splits text into chunks based on semantic similarity, ensuring that related content stays together in the same chunk. It includes advanced features such as Savitzky-Golay filtering for smoother boundary detection and skip-window merging for connecting related content that may not be consecutive. This chunker is inspired by the work of Greg Kamradt.

API Reference

To use the SemanticChunker via the API, check out the API reference documentation.

Installation

SemanticChunker requires additional dependencies for semantic capabilities. You can install it with:
pip install "chonkie[semantic]"
For installation instructions, see the Installation Guide.

Initialization

from chonkie import SemanticChunker

# Basic initialization with default parameters
chunker = SemanticChunker(
    embedding_model="minishlab/potion-base-8M",  # Default model
    threshold=0.8,                               # Similarity threshold (0-1)
    chunk_size=2048,                             # Maximum tokens per chunk
    similarity_window=3,                         # Window for similarity calculation
    skip_window=0                                # Skip-and-merge window (0=disabled)
)

# With skip-and-merge enabled (similar to legacy SDPM behavior)
chunker = SemanticChunker(
    embedding_model="minishlab/potion-base-8M",
    threshold=0.7,
    chunk_size=2048,
    skip_window=1  # Enable merging of similar non-consecutive groups
)

Parameters

embedding_model (Union[str, BaseEmbeddings], default: "minishlab/potion-base-8M")
  Model identifier or an embedding model instance.

threshold (float, default: 0.8)
  Similarity threshold for grouping sentences (0-1). Lower values create larger groups.

chunk_size (int, default: 2048)
  Maximum tokens per chunk.

similarity_window (int, default: 3)
  Number of sentences to consider for similarity calculation.

min_sentences_per_chunk (int, default: 1)
  Minimum number of sentences per chunk.

min_characters_per_sentence (int, default: 24)
  Minimum number of characters per sentence.

skip_window (int, default: 0)
  Number of groups to skip when looking for similar content to merge.
  • 0 (default): no skip-and-merge; uses standard semantic grouping
  • 1 or higher: enables merging of semantically similar groups within the skip window
  This feature allows the chunker to connect related content that may not be consecutive in the text.

filter_window (int, default: 5)
  Window length for the Savitzky-Golay filter used in boundary detection.

filter_polyorder (int, default: 3)
  Polynomial order for the Savitzky-Golay filter.

filter_tolerance (float, default: 0.2)
  Tolerance for Savitzky-Golay boundary detection.

delim (Union[str, List[str]], default: [". ", "! ", "? ", "\n"])
  Delimiters to split sentences on.

include_delim (Optional[Literal["prev", "next"]], default: "prev")
  Whether to include delimiters in the chunk text, attached to the previous or next sentence.

Basic Usage

from chonkie import SemanticChunker

# Initialize with semantic similarity grouping
chunker = SemanticChunker(
    embedding_model="minishlab/potion-base-8M",
    threshold=0.7,  # Similarity threshold
    chunk_size=512
)

text = """Your document text with multiple topics and themes..."""
chunks = chunker.chunk(text)

# Process chunks
for chunk in chunks:
    print(f"Chunk: {chunk.text[:50]}...")
    print(f"Tokens: {chunk.token_count}")

Examples

Advanced Features

Savitzky-Golay Filtering

The SemanticChunker uses Savitzky-Golay filtering for smoother boundary detection in similarity curves. This reduces noise in the semantic similarity signal and provides more stable chunk boundaries.
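The idea can be sketched in a few lines. This is not Chonkie's internal code, just an illustration of the technique: the weights below are the standard Savitzky-Golay smoothing coefficients for window length 5, and the boundary rule (local minima dipping a tolerance below the mean) is a simplified stand-in for the chunker's actual logic.

```python
# Standard Savitzky-Golay smoothing weights for window_length=5 (polyorder 2/3).
SG_WEIGHTS = [-3 / 35, 12 / 35, 17 / 35, 12 / 35, -3 / 35]

def savgol_smooth(values):
    """Smooth a 1-D list of similarity scores with fixed window-5 SG weights."""
    half = len(SG_WEIGHTS) // 2
    smoothed = []
    for i in range(len(values)):
        acc = 0.0
        for j, w in enumerate(SG_WEIGHTS):
            # Clamp indices at the edges (simple boundary handling).
            k = min(max(i + j - half, 0), len(values) - 1)
            acc += w * values[k]
        smoothed.append(acc)
    return smoothed

def find_boundaries(similarities, tolerance=0.2):
    """Mark boundaries at local minima that dip `tolerance` below the mean."""
    smoothed = savgol_smooth(similarities)
    mean = sum(smoothed) / len(smoothed)
    return [
        i for i in range(1, len(smoothed) - 1)
        if smoothed[i] < smoothed[i - 1]
        and smoothed[i] < smoothed[i + 1]
        and smoothed[i] < mean - tolerance
    ]

# A noisy similarity curve with one clear dip between two topics:
sims = [0.9, 0.88, 0.91, 0.3, 0.89, 0.92, 0.9]
print(find_boundaries(sims))  # → [3]
```

Smoothing prevents a single noisy low-similarity score from spawning a spurious boundary: only dips that survive the filter are treated as topic changes.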

Skip-Window Merging

When skip_window > 0, the chunker can merge semantically similar groups that are not consecutive. This is useful for:
  • Documents with alternating topics
  • Content with recurring themes
  • Technical documents with distributed related sections
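A minimal sketch of the merging idea (again, illustrative rather than Chonkie's actual implementation): with skip_window=1, a group may be merged with a similar group up to one position further ahead, pulling in whatever sits between them so text order is preserved.

```python
def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(x * x for x in b) ** 0.5
    return dot / (na * nb)

def skip_merge(groups, embeddings, threshold, skip_window):
    """Merge groups whose embeddings are similar within skip_window."""
    if skip_window == 0:
        return [list(g) for g in groups]  # skip-and-merge disabled
    merged, used = [], set()
    for i in range(len(groups)):
        if i in used:
            continue
        current, last = list(groups[i]), i
        # Look ahead: candidates may skip up to `skip_window` groups.
        for j in range(i + 1, min(i + skip_window + 2, len(groups))):
            if j in used:
                continue
            if cosine(embeddings[i], embeddings[j]) >= threshold:
                # Merge everything between last and j to keep text order.
                for k in range(last + 1, j + 1):
                    current.extend(groups[k])
                    used.add(k)
                last = j
        merged.append(current)
    return merged

groups = [["topic A"], ["topic B"], ["topic A again"]]
embs = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
print(skip_merge(groups, embs, 0.8, 1))
# → [['topic A', 'topic B', 'topic A again']]
```

In the example, the two "topic A" groups are similar enough to merge even though an unrelated group sits between them, which is exactly the alternating-topics case skip-window merging targets.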

Supported Embeddings

SemanticChunker supports multiple embedding providers through Chonkie’s embedding system. See the Embeddings Overview for more information.

Return Type

SemanticChunker returns Chunk objects:
@dataclass
class Chunk:
    text: str
    start_index: int
    end_index: int
    token_count: int
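The start and end indices let you map a chunk back to its span in the source text. The snippet below redeclares the dataclass so it is self-contained, and assumes the indices are character offsets into the original string — verify this against your installed version:

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    start_index: int
    end_index: int
    token_count: int

source = "First sentence. Second sentence."
chunk = Chunk(text=source[0:15], start_index=0, end_index=15, token_count=3)

# The indices are character offsets into the original text:
assert source[chunk.start_index:chunk.end_index] == chunk.text
```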

Legacy Versions

The original SemanticChunker (pre-v1.2.0) with different parameter names and behavior is available in the legacy module:
from chonkie.legacy import SemanticChunker as LegacySemanticChunker

# Uses old parameter names and behavior
chunker = LegacySemanticChunker(
    mode="window",
    min_sentences=1,
    min_chunk_size=2,
    threshold_step=0.01
)
For the SDPM (Semantic Double-Pass Merging) functionality, you can either:
  1. Use the new SemanticChunker with skip_window > 0 (recommended)
  2. Use the legacy SDPMChunker (see SDPM Chunker documentation)