Python
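from chonkie.cloud import SDPMChunker

# Replace {api_key} with your API key from the Chonkie Cloud dashboard.
chunker = SDPMChunker(api_key="{api_key}")

# Returns a list of chunks shaped like the example response below.
chunks = chunker(text="YOUR_TEXT")

Example response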
[
  {
    "text": "<string>",
    "start_index": 123,
    "end_index": 123,
    "token_count": 123,
    "sentences": [
      {
        "text": "<string>",
        "start_index": 123,
        "end_index": 123,
        "token_count": 123,
        "embedding": [
          123
        ]
      }
    ]
  }
]

Authorizations

Authorization
string
header
required

Your API Key from the Chonkie Cloud dashboard

Body

multipart/form-data
file
file

The file to chunk.

embedding_model
string
default:minishlab/potion-base-32M

Model identifier or embedding model instance to use for semantic analysis.

threshold
default:auto

Similarity threshold for grouping sentences. Accepts a float in [0, 1] to use directly as the threshold, an integer in (1, 100] to use as a percentile, or 'auto' to calculate the threshold automatically.

Allowed value: "auto"
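
The three accepted forms of threshold can be told apart by type and range. The following is a minimal sketch of that rule, not the service's own validation code; classify_threshold is a hypothetical helper, and rejecting the integers 0 and 1 is an assumption made here because the documented ranges do not assign them to either form.

```python
from typing import Union


def classify_threshold(threshold: Union[str, int, float]) -> str:
    """Return 'auto', 'direct', or 'percentile' for a threshold value.

    Hypothetical helper that mirrors the documented ranges only.
    """
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(threshold, bool):
        raise ValueError("threshold must be 'auto', a float, or an int")
    if isinstance(threshold, str):
        if threshold == "auto":
            return "auto"
        raise ValueError("the only accepted string is 'auto'")
    if isinstance(threshold, float):
        # A float in [0, 1] is used directly as the similarity threshold.
        if 0.0 <= threshold <= 1.0:
            return "direct"
        raise ValueError("float threshold must be in [0, 1]")
    if isinstance(threshold, int):
        # An int in (1, 100] is read as a percentile.
        # Assumption: ints outside that range (including 0 and 1) are rejected.
        if 1 < threshold <= 100:
            return "percentile"
        raise ValueError("int threshold must be in (1, 100]")
    raise ValueError("threshold must be 'auto', a float, or an int")
```

Under this reading, 0.5 is a direct threshold while 95 is a percentile, so the numeric type of the value matters as much as its magnitude.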

mode
string
default:window

Mode for grouping sentences, either 'cumulative' or 'window'.

chunk_size
integer
default:2048

Maximum tokens per chunk.

similarity_window
integer
default:1

Number of preceding sentences to consider for similarity comparison.

min_sentences
integer
default:1

Minimum number of sentences per chunk.

min_characters_per_sentence
integer
default:12

Minimum number of characters per sentence.

threshold_step
number
default:0.01

Step size used when automatically calculating the similarity threshold.

delim
default:[".","!","?","\n"]

Delimiters to split sentences on.

include_delim
enum<string> | null
default:prev

Whether to include delimiters in the chunk text. If set, specifies whether each delimiter is kept with the previous sentence (prev) or the next sentence (next); null omits delimiters.

Available options:
prev,
next
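
To illustrate how delim and include_delim might interact, here is a minimal sketch of a sentence splitter. It is not Chonkie's implementation: split_sentences is a hypothetical function, it handles single-character delimiters only, and it assumes that prev attaches a delimiter to the sentence before it, next attaches it to the sentence after it, and None drops it.

```python
from typing import List, Optional, Sequence


def split_sentences(
    text: str,
    delim: Sequence[str] = (".", "!", "?", "\n"),
    include_delim: Optional[str] = "prev",
) -> List[str]:
    """Split text on single-character delimiters (illustrative only)."""
    sentences: List[str] = []
    current = ""
    for ch in text:
        if ch not in delim:
            current += ch
            continue
        if include_delim == "prev":
            # Keep the delimiter at the end of the sentence it closes.
            sentences.append(current + ch)
            current = ""
        elif include_delim == "next":
            # Carry the delimiter over to the start of the next sentence.
            sentences.append(current)
            current = ch
        else:
            # include_delim is None: drop the delimiter entirely.
            sentences.append(current)
            current = ""
    if current:
        sentences.append(current)
    # Discard empty pieces, e.g. when the text starts with a delimiter.
    return [s for s in sentences if s]
```

With prev, joining the returned pieces reproduces the input text exactly, which is convenient when character offsets into the original text must stay valid.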

return_type
enum<string>
default:chunks

Return type for the chunking process. If 'chunks', returns a list of SemanticChunk objects. If 'texts', returns a list of strings.

Available options:
texts,
chunks
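
The body parameters and their defaults can be collected in one place before a request is made. The sketch below only merges overrides with the documented defaults and checks the three enumerated fields; build_params, DEFAULTS, and ALLOWED are hypothetical names, and how the values are encoded into the multipart form is left out because this page does not specify it.

```python
import copy
from typing import Any, Dict

# Defaults as listed in the Body section above.
DEFAULTS: Dict[str, Any] = {
    "embedding_model": "minishlab/potion-base-32M",
    "threshold": "auto",
    "mode": "window",
    "chunk_size": 2048,
    "similarity_window": 1,
    "min_sentences": 1,
    "min_characters_per_sentence": 12,
    "threshold_step": 0.01,
    "delim": [".", "!", "?", "\n"],
    "include_delim": "prev",
    "return_type": "chunks",
}

# Enumerated options as listed in the Body section above.
ALLOWED: Dict[str, set] = {
    "mode": {"cumulative", "window"},
    "include_delim": {"prev", "next", None},
    "return_type": {"texts", "chunks"},
}


def build_params(**overrides: Any) -> Dict[str, Any]:
    """Merge overrides into the documented defaults (hypothetical helper)."""
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")
    # Deep copy so the caller cannot mutate the shared default delim list.
    params = copy.deepcopy(DEFAULTS)
    params.update(overrides)
    for name, allowed in ALLOWED.items():
        if params[name] not in allowed:
            raise ValueError(f"{name} must be one of {allowed}")
    return params
```

For example, build_params(mode="cumulative", chunk_size=512) keeps every other documented default unchanged.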

Response

200 - application/json

Successful response: a list of SemanticChunk objects (SDPM uses semantic chunking). Each object describes one chunk and the sentences it contains, with optional embeddings. When return_type is 'texts', the response is a list of strings instead.

text
string

The actual text content of the chunk.

start_index
integer

The starting character index of the chunk within the original input text.

end_index
integer

The ending character index (exclusive) of the chunk within the original input text.

token_count
integer

The number of tokens in this specific chunk, according to the tokenizer used.

sentences
SemanticSentence · object[]

List of SemanticSentence objects contained within this chunk.
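
Because end_index is exclusive, start_index and end_index can be used as a Python slice into the original input. The following sketch checks that each chunk's text matches that slice. It assumes the response has already been parsed into a list of dictionaries with the fields described above; check_chunk_offsets is a hypothetical helper and the sample data is hand-written, not real API output.

```python
from typing import Any, Dict, List


def check_chunk_offsets(original_text: str, chunks: List[Dict[str, Any]]) -> List[int]:
    """Return the positions of chunks whose text differs from its slice."""
    mismatched: List[int] = []
    for position, chunk in enumerate(chunks):
        start, end = chunk["start_index"], chunk["end_index"]
        # end_index is exclusive, so it maps directly onto slice syntax.
        if original_text[start:end] != chunk["text"]:
            mismatched.append(position)
    return mismatched


# Hand-written sample; the token_count values are placeholders.
text = "First sentence. Second sentence."
sample_chunks = [
    {"text": "First sentence.", "start_index": 0, "end_index": 15,
     "token_count": 3, "sentences": []},
    {"text": " Second sentence.", "start_index": 15, "end_index": 32,
     "token_count": 3, "sentences": []},
]

mismatches = check_chunk_offsets(text, sample_chunks)
```

An empty result means every chunk lines up with the original text, so the offsets can be used to highlight or cite the source passage.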