Chonkie allows you to use your own embeddings handler by creating a child class of the BaseEmbeddings
class, and implementing the necessary methods. It’s quite simple!
Example
First, we create a child class of the BaseEmbeddings
class, and implement the necessary methods.
from chonkie.embeddings import BaseEmbeddings
class CustomEmbeddings(BaseEmbeddings):
@property
def dimension(self) -> int:
...
def embed(self, text: str) -> "np.ndarray":
...
def embed_batch(self, texts: List[str]) -> List["np.ndarray"]:
...
def count_tokens(self, text: str) -> int:
...
def count_tokens_batch(self, texts: List[str]) -> List[int]:
...
def get_tokenizer_or_token_counter(self):
...
@classmethod
def is_available(cls) -> bool:
...
def __repr__(self) -> str:
...
At this point, we have a custom embeddings handler, we can use it like this:
embeddings = CustomEmbeddings()
But let’s say we want to use this together with the AutoEmbeddings
class, for the sake of convenience. We can do this by registering it with the EmbeddingsRegistry
.
from chonkie.embeddings import EmbeddingsRegistry
# Register with the embeddings registry
EmbeddingsRegistry.register(
"custom",
CustomEmbeddings,
pattern=r"^custom/|^model-name",
valid_types=["CustomEmbeddings"]
)
Now we can use our custom embeddings handler with the AutoEmbeddings
class.
embeddings = AutoEmbeddings.get_embeddings("custom/my-custom-embeddings")
Finally, we can use our custom embeddings handler in the same way we would use any other embeddings handler.
chunker = SemanticChunker(embeddings=embeddings, similarity_threshold=0.7)
chunks = chunker(text)