Count Elements

Count individual elements within text, such as words, characters or tokens. This can be particularly helpful for tasks such as checking the length of an article or essay, analyzing the readability of a document, or simply getting a quick overview of the amount of text that needs to be translated, edited or summarized.

Parameters

  • Source Column: The name of the column containing the text to count. This field is required and defaults to content.
  • Destination Column: The name of the new column that will hold the count result. This field is required and defaults to count.
  • Count Type: The type of element to count (e.g., words, characters, sentences). This field is required and defaults to words.
  • LLM: The large language model whose tokenizer is used for counting tokens. Required only when Count Type is token. Defaults to no-selection.

Usage

To use the Count Elements transformation in Mantium, follow these steps:

  1. Configure the Source Column parameter by selecting the column containing the text you want to analyze.
  2. Configure the Destination Column parameter by specifying the new name for the column that will hold the element count result.
  3. Configure the Count Type parameter by selecting the type of element you want to count.
  4. If Count Type is token, configure the LLM parameter by selecting the tokenizer model to use for token counting. Otherwise, this parameter can be left unset.
  5. Run the transformation by clicking the Save and Run Transforms button. The resulting dataset will have a new column with the specified name containing the element count for each data point in the source column.
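
For readers who think in code, the steps above can be approximated by the following minimal Python sketch. It is not Mantium's implementation: the function name `count_elements`, the whitespace-based word splitting, and the `tokenizer` callable are assumptions made for illustration, and only the words, characters, and token count types are covered.

```python
from typing import Callable, Dict, List, Optional, Sequence

Row = Dict[str, object]


def count_elements(
    rows: List[Row],
    source_column: str = "content",      # default taken from the Parameters section
    destination_column: str = "count",   # default taken from the Parameters section
    count_type: str = "words",           # default taken from the Parameters section
    tokenizer: Optional[Callable[[str], Sequence]] = None,  # hypothetical stand-in for the LLM parameter
) -> List[Row]:
    """Return copies of the rows with a new column holding the element count."""
    if count_type == "words":
        # Assumption: words are separated by whitespace.
        counter = lambda text: len(text.split())
    elif count_type == "characters":
        # Assumption: every character counts, including spaces and punctuation.
        counter = len
    elif count_type == "token":
        if tokenizer is None:
            raise ValueError("a tokenizer is required when count_type is 'token'")
        counter = lambda text: len(tokenizer(text))
    else:
        raise ValueError(f"unsupported count_type in this sketch: {count_type!r}")

    # Build new rows so the input dataset is left unchanged.
    return [
        {**row, destination_column: counter(str(row[source_column]))}
        for row in rows
    ]
```

The sketch mirrors the rule from the Parameters section that a tokenizer is only needed when counting tokens; the other count types ignore it.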

Example 1 - Word Count

Suppose you have a dataset containing article titles and you want to count the number of words in each title.

Sample Dataset:

Article ID | Title
1 | The Future of Artificial Intelligence
2 | Climate Change and Its Impact on Ecosystems

Configuration:

Source Column: Title
Destination Column: Word Count
Count Type: words

Expected Result Dataset:

Article ID | Title | Word Count
1 | The Future of Artificial Intelligence | 5
2 | Climate Change and Its Impact on Ecosystems | 7
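
The expected word counts can be checked by hand with a short Python sketch. It assumes that words are separated by whitespace, which matches the results in this example but is not confirmed to be the exact rule the transformation applies; the helper name `word_count` is hypothetical.

```python
# Sample dataset from this example, keyed by Article ID.
titles = {
    1: "The Future of Artificial Intelligence",
    2: "Climate Change and Its Impact on Ecosystems",
}


def word_count(text: str) -> int:
    """Count whitespace-separated words (an assumed definition of 'word')."""
    return len(text.split())


word_counts = {article_id: word_count(title) for article_id, title in titles.items()}
print(word_counts)  # {1: 5, 2: 7}
```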

Example 2 - Character Count

Suppose you have a dataset containing customer feedback, and you want to count the number of characters in each feedback entry.

Sample Dataset:

Feedback ID | Feedback
1 | Great product, fast shipping.
2 | Had some issues, but customer support was helpful.

Configuration:

Source Column: Feedback
Destination Column: Character Count
Count Type: characters

Expected Result Dataset:

Feedback ID | Feedback | Character Count
1 | Great product, fast shipping. | 29
2 | Had some issues, but customer support was helpful. | 50
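
The counts above are consistent with counting every character, including spaces and punctuation. The Python sketch below reproduces them under that assumption, using the built-in `len`, which counts Unicode code points; whether the transformation treats every kind of character the same way is not stated here.

```python
# Sample dataset from this example, keyed by Feedback ID.
feedback = {
    1: "Great product, fast shipping.",
    2: "Had some issues, but customer support was helpful.",
}

# len() counts every code point: letters, spaces, and punctuation alike.
char_counts = {feedback_id: len(text) for feedback_id, text in feedback.items()}
print(char_counts)  # {1: 29, 2: 50}
```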

Example 3 - Token Count

Suppose you have a dataset containing customer feedback in different languages, and you want to count the number of tokens in each feedback entry using the cl100k_base tokenizer.

Sample Dataset:

Feedback ID | Feedback
1 | Great product, fast shipping.
2 | Très bon produit, livraison rapide.

Configuration:

Source Column: Feedback
Destination Column: Token Count
Count Type: token
Tokenizer: cl100k_base

Expected Result Dataset:

Feedback ID | Feedback | Token Count
1 | Great product, fast shipping. | 6
2 | Très bon produit, livraison rapide. | 9

In this example, we use the cl100k_base tokenizer, which handles text in different languages. It splits text into subword units and treats punctuation as separate tokens, so token counts are typically higher than word counts: the first entry has 4 words but 6 tokens, and the second has 5 words but 9 tokens.
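
To illustrate why token counts differ from word counts, the Python sketch below counts tokens with a pluggable tokenizer. The default `simple_tokenize` is a deliberately simplified stand-in that only separates words from punctuation; it is not cl100k_base and does not split words into subword units, so its count for the French entry is lower than the 9 reported above. Both function names are hypothetical.

```python
import re
from typing import Callable, List, Sequence


def simple_tokenize(text: str) -> List[str]:
    """Stand-in tokenizer: each word and each punctuation mark is one token."""
    return re.findall(r"\w+|[^\w\s]", text)


def token_count(text: str, tokenize: Callable[[str], Sequence] = simple_tokenize) -> int:
    """Count tokens using whichever tokenizer callable is supplied."""
    return len(tokenize(text))


feedback = {
    1: "Great product, fast shipping.",
    2: "Très bon produit, livraison rapide.",
}

token_counts = {feedback_id: token_count(text) for feedback_id, text in feedback.items()}
print(token_counts)  # {1: 6, 2: 7} with the stand-in tokenizer

# To count with the real cl100k_base encoding, and assuming the third-party
# tiktoken package is installed, its encoder could be passed in instead:
#   import tiktoken
#   enc = tiktoken.get_encoding("cl100k_base")
#   token_count(feedback[2], enc.encode)
```

Passing the tokenizer as an argument keeps the counting logic independent of any one model, which mirrors how the LLM parameter selects the tokenizer in the transformation.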