Generate Embeddings

Generate numerical representations of text, also known as embeddings, that can be used for tasks such as text classification, similarity analysis, and clustering. Embeddings capture the meaning and context of text in a form that machine learning algorithms can process. They are commonly stored in vector databases to enable efficient similarity search.
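As an illustration of similarity analysis, two embeddings are often compared with cosine similarity. The sketch below is a minimal, standard-library example; the vectors are made-up three-dimensional values, not real model output, and the function name is chosen here for illustration only.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy vectors invented for illustration; real embeddings have many more dimensions.
query = [0.1, 0.3, -0.2]
same_direction = [0.2, 0.6, -0.4]   # a scaled copy of query
opposite = [-0.1, -0.3, 0.2]        # query with every sign flipped

similar_score = cosine_similarity(query, same_direction)
opposite_score = cosine_similarity(query, opposite)
```

A score near 1 means the two vectors point in nearly the same direction, and a score near -1 means they point in opposite directions.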

Parameters

  • Source Column: The column name containing the text you want to embed. Defaults to content.
  • Destination Column: The column name that holds the embeddings. Defaults to embeddings.
  • LLM: The large language model used to generate embeddings. Defaults to text-embedding-ada-002.
  • Max Token Length: The maximum number of tokens the LLM can embed. If the text in a cell exceeds this limit, it is truncated to fit. Defaults to 8191.
  • Credential ID: The OpenAI connector from your Mantium account.
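The truncation behavior described for Max Token Length can be sketched as follows. This is only an approximation: it assumes tokens are whitespace-separated words, whereas the model's actual tokenizer splits text differently, so real token counts will not match. The function name is hypothetical.

```python
def truncate_to_max_tokens(text, max_tokens=8191):
    """Keep at most max_tokens tokens of text.

    ASSUMPTION: a "token" here is a whitespace-separated word. The real
    tokenizer used by the embedding model produces different counts.
    """
    if max_tokens < 0:
        raise ValueError("max_tokens must be non-negative")
    tokens = text.split()
    if len(tokens) <= max_tokens:
        # Short enough already: return the text untouched.
        return text
    # Too long: keep only the leading tokens.
    return " ".join(tokens[:max_tokens])


shortened = truncate_to_max_tokens("one two three four five", max_tokens=3)
unchanged = truncate_to_max_tokens("one two", max_tokens=3)
```

Because truncation drops everything past the limit, the tail of a very long cell does not contribute to its embedding.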

Usage

To use the Generate Embeddings transformation, you need a valid API key configured in Mantium for the third-party service (e.g., OpenAI) you want to use. If you don't have one, see the guide here.

To use this Mantium Enrichment, follow these steps:

  1. Configure the Source Column parameter by selecting the column containing the text data you want to generate embeddings for.
  2. Configure the Destination Column parameter by specifying a name for the new column that will hold the embeddings.
  3. Configure the Embedding Model parameter (listed as LLM under Parameters) by selecting the model used to generate the embeddings.
  4. Optionally, configure the Max Token Length parameter by specifying the maximum number of tokens allowed per text input.
  5. Configure the Credential ID parameter by selecting the appropriate credential from the list of available credentials in your Mantium account.
  6. Run the transformation by clicking the Save and Run Transforms button. The resulting dataset will have a new column with the specified name containing the embeddings for the text data in the source column.

Example 1: Generating Text Embeddings for Product Reviews

Suppose you have a dataset of product reviews and you want to generate embeddings for the text in the "review" column. You can use the Generate Embeddings transformation to convert the text data into numerical embeddings.

Sample Dataset:

Review ID | Review
1 | Excellent product, really happy with my purchase.
2 | Good quality but shipping took longer than expected.

Config:

Source Column: review
Destination Column: embedding
Embedding Model: Text Embedding Ada 002
Max Token Length: 8191
Credential ID: OpenAI

Expected Result Dataset:

Review ID | Review | Embedding
1 | Excellent product, really happy with my purchase. | [0.134, 0.256, ..., -0.321]
2 | Good quality but shipping took longer than expected. | [0.097, -0.216, ..., 0.201]

In this example, a new column called "embedding" is created, containing the embeddings for the text in the "review" column. The embeddings are generated using OpenAI's text-embedding-ada-002 model.
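The shape of this result can be mimicked in a few lines of Python. The sketch below is not Mantium's implementation and does not call OpenAI: `fake_embed` is a stand-in that derives a short, deterministic vector from a hash so the example runs offline, and `add_embeddings` is a hypothetical helper that mirrors the Source Column and Destination Column parameters.

```python
import hashlib


def fake_embed(text, dims=4):
    """Stand-in embedder (NOT a real model): hash the text into dims floats."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    # Map each of the first `dims` bytes (0..255) to a float in [-1, 1].
    return [(byte / 127.5) - 1.0 for byte in digest[:dims]]


def add_embeddings(rows, source_column="content",
                   destination_column="embeddings", embed=fake_embed):
    """Return copies of rows with destination_column set to the embedding
    of the text found in source_column. The input rows are not modified."""
    return [
        {**row, destination_column: embed(row[source_column])}
        for row in rows
    ]


reviews = [
    {"review_id": 1, "review": "Excellent product, really happy with my purchase."},
    {"review_id": 2, "review": "Good quality but shipping took longer than expected."},
]

result = add_embeddings(reviews, source_column="review",
                        destination_column="embedding")
```

Each row in `result` keeps its original columns and gains an "embedding" list. With a real embedding model the vectors would be far longer and would reflect the meaning of the text, which this hash-based stand-in does not.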