POST /v1/chunk/sentence
curl --request POST \
  --url https://api.chonkie.ai/v1/chunk/sentence \
  --header 'Authorization: Bearer <token>' \
  --header 'Content-Type: multipart/form-data' \
  --form 'file=@<file_path>' \
  --form tokenizer_or_token_counter=gpt2 \
  --form chunk_size=512 \
  --form chunk_overlap=0 \
  --form min_sentences_per_chunk=1 \
  --form min_characters_per_sentence=12 \
  --form approximate=false \
  --form 'delim=<string>' \
  --form include_delim=prev \
  --form return_type=chunks
[
  {
    "text": "<string>",
    "start_index": 123,
    "end_index": 123,
    "token_count": 123,
    "sentences": [
      {
        "text": "<string>",
        "start_index": 123,
        "end_index": 123,
        "token_count": 123
      }
    ]
  }
]
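
The same request can be sketched in Python with the requests library. This is a minimal, illustrative example, not an official client: the file path, variable names, and the printing at the end are placeholders, and the form fields simply mirror those documented under Body below.

# Minimal sketch: POST a file to the sentence chunking endpoint with requests.
# <token> and document.txt are placeholders; form fields match the Body section.
import requests

API_KEY = "<token>"  # your Chonkie Cloud API key

with open("document.txt", "rb") as f:  # placeholder input file
    response = requests.post(
        "https://api.chonkie.ai/v1/chunk/sentence",
        headers={"Authorization": f"Bearer {API_KEY}"},
        files={"file": f},
        data={
            "tokenizer_or_token_counter": "gpt2",
            "chunk_size": 512,
            "chunk_overlap": 0,
            "min_sentences_per_chunk": 1,
            "min_characters_per_sentence": 12,
            "include_delim": "prev",
            "return_type": "chunks",
        },
    )

response.raise_for_status()
for chunk in response.json():  # list of SentenceChunk objects, as shown above
    print(chunk["token_count"], chunk["text"][:60])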

Authorizations

Authorization
string
header
required

Your API key from the Chonkie Cloud dashboard.

Body

multipart/form-data
file
file

The file to chunk.

tokenizer_or_token_counter
string
default:gpt2

Tokenizer or token counter to use, given as a string identifier (e.g. gpt2).

chunk_size
integer
default:512

Maximum number of tokens per chunk.

chunk_overlap
integer
default:0

Number of overlapping tokens between chunks.

min_sentences_per_chunk
integer
default:1

Minimum number of sentences to include in each chunk.

min_characters_per_sentence
integer
default:12

Minimum number of characters per sentence.

approximate
boolean
default:false

Use approximate token counting for faster processing (deprecated).

delim
string[]
default:[".","!","?","\n"]

Delimiters to split sentences on.

include_delim
enum<string> | null
default:prev

Whether to include the splitting delimiter in the chunk text and, if so, whether it attaches to the previous sentence (prev) or the next sentence (next).

Available options:
prev,
next

return_type
enum<string>
default:chunks

Whether to return chunks as text strings or as SentenceChunk objects.

Available options:
texts,
chunks
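
To make the body parameters above more concrete, the following is a rough local sketch of how they interact conceptually. It is not the service's implementation: the whitespace token counter stands in for the configured tokenizer, and chunk_overlap and approximate counting are not modeled.

# Conceptual sketch only: illustrates the roles of delim,
# min_characters_per_sentence, chunk_size, and min_sentences_per_chunk.
import re

def sketch_sentence_chunks(text, delim=(".", "!", "?", "\n"),
                           min_characters_per_sentence=12,
                           chunk_size=512, min_sentences_per_chunk=1,
                           count_tokens=lambda s: len(s.split())):
    # Split on the delimiters, keeping each delimiter with the preceding
    # sentence (the include_delim="prev" behavior).
    pattern = "(" + "|".join(re.escape(d) for d in delim) + ")"
    parts = re.split(pattern, text)
    sentences, buf = [], ""
    for part in parts:
        buf += part
        # Sentences shorter than min_characters_per_sentence are merged
        # with the following text instead of being emitted on their own.
        if part in delim and len(buf.strip()) >= min_characters_per_sentence:
            sentences.append(buf)
            buf = ""
    if buf.strip():
        sentences.append(buf)

    # Greedily group sentences until a chunk would exceed chunk_size tokens,
    # keeping at least min_sentences_per_chunk sentences in each chunk.
    chunks, current = [], []
    for sentence in sentences:
        candidate = current + [sentence]
        if (current and len(current) >= min_sentences_per_chunk
                and count_tokens("".join(candidate)) > chunk_size):
            chunks.append("".join(current))
            current = [sentence]
        else:
            current = candidate
    if current:
        chunks.append("".join(current))
    return chunks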

Response

200 - application/json
Successful Response: A list of `SentenceChunk` objects.
text
string

The actual text content of the chunk.

start_index
integer

The starting character index of the chunk within the original input text.

end_index
integer

The ending character index (exclusive) of the chunk within the original input text.

token_count
integer

The number of tokens in this specific chunk, according to the tokenizer used.

sentences
object[]

List of Sentence objects contained within this chunk. Each Sentence represents a single sentence with its own metadata (text, start_index, end_index, token_count), as shown in the response example above.
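
For typed handling of the response, the schema above can be mirrored with small containers. This is an illustrative sketch, not part of an official client; the class names and parsing helper are assumptions, while the field names come from the documented schema.

# Sketch of typed containers for the documented SentenceChunk response.
from dataclasses import dataclass
from typing import List

@dataclass
class Sentence:
    text: str
    start_index: int
    end_index: int
    token_count: int

@dataclass
class SentenceChunk:
    text: str
    start_index: int
    end_index: int
    token_count: int
    sentences: List[Sentence]

def parse_chunks(payload: list) -> List[SentenceChunk]:
    # payload is the JSON array returned by /v1/chunk/sentence.
    return [
        SentenceChunk(
            text=item["text"],
            start_index=item["start_index"],
            end_index=item["end_index"],
            token_count=item["token_count"],
            sentences=[Sentence(**s) for s in item.get("sentences", [])],
        )
        for item in payload
    ]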