Split Text

Some Enrichments require input text in smaller chunks. Use the Split Text transformation to break longer text into parts of a size you control.

Parameters

The Split Text transformation has several parameters:

  • Source Column: The column name containing the text you want to split into smaller parts. Defaults to content.
  • Destination Column: The column name that will hold the segmented text. Defaults to segmented_text.
  • Split By: Specify the unit used to split the text or code. Defaults to word.
  • Split Length: The maximum number of units (words, sentences, or passages, depending on Split By) in each split. Defaults to 6100.
  • Split Overlap: The number of words, sentences, or passages shared between adjacent splits. Setting this parameter to 0 disables overlap; setting it to a positive number enables the sliding window approach. Defaults to false (overlap disabled).
  • Split Respect Sentence Boundary: Specify whether to split the text at the end of a sentence when Split By is set to word. Defaults to true.
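To make the interaction between Split Length and Split Overlap concrete, here is a minimal sketch of a sliding window over words. It is an illustration only, not Mantium's implementation; the function name `split_words` and its argument names are assumptions chosen to mirror the parameters above.

```python
def split_words(text: str, split_length: int, split_overlap: int = 0) -> list[str]:
    """Split text into chunks of at most split_length words.

    Adjacent chunks share split_overlap words (hypothetical helper,
    not part of any Mantium API).
    """
    if split_length <= 0:
        raise ValueError("split_length must be positive")
    if not 0 <= split_overlap < split_length:
        raise ValueError("split_overlap must be >= 0 and smaller than split_length")

    words = text.split()
    step = split_length - split_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + split_length]))
        if start + split_length >= len(words):
            break  # the window already reached the end of the text
    return chunks


print(split_words("one two three four five six seven", split_length=4, split_overlap=2))
# ['one two three four', 'three four five six', 'five six seven']
```

With an overlap of 0 the window advances by a full Split Length, so chunks do not share any words.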

Usage

To use the Split Text transformation in Mantium, follow these steps:

  1. Configure the Source Column parameter by selecting the column that contains the text data to be split.
  2. Configure the Destination Column parameter by specifying the name of the new column that will be created with the split text data.
  3. Set the additional transformation parameters as needed (e.g. Split By, Split Length, etc.) according to your requirements.
  4. Select whether to split the text at the end of a sentence when Split By is set to word.
  5. Run the transformation by clicking the Save and Run Transforms button. The resulting dataset will have the specified text data split into smaller chunks.
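The effect on the dataset can be sketched in a few lines of Python: each input row produces one output row per chunk, and the other columns (such as an ID) are repeated. This is a sketch under assumptions, not Mantium's code: `explode_rows` and `three_word_chunks` are hypothetical helpers, rows are modeled as plain dictionaries, and dropping the source column from the output is assumed from the transformed tables in the examples.

```python
def explode_rows(rows, source_column, destination_column, splitter):
    """Return one output row per chunk, copying every column except the source."""
    out = []
    for row in rows:
        for chunk in splitter(row[source_column]):
            new_row = {k: v for k, v in row.items() if k != source_column}
            new_row[destination_column] = chunk
            out.append(new_row)
    return out


def three_word_chunks(text):
    """Stand-in splitter for the demo: fixed chunks of three words."""
    words = text.split()
    return [" ".join(words[i:i + 3]) for i in range(0, len(words), 3)]


rows = [{"review_id": 1, "review": "great product would buy again"}]
for row in explode_rows(rows, "review", "segmented_text", three_word_chunks):
    print(row)
# {'review_id': 1, 'segmented_text': 'great product would'}
# {'review_id': 1, 'segmented_text': 'buy again'}
```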

Example 1: Splitting Product Reviews into Smaller Chunks

Imagine you have a dataset containing product reviews, and you want to process these reviews using an AI model that has a maximum input length of 50 words. You can use the Split Text transformation to split each review into smaller chunks, ensuring each chunk is no more than 50 words.

The video below demonstrates how to use the Split Text transformation in just a few seconds. If you prefer text, please continue reading.

Original Dataset

review_id | review
1 | I absolutely love this product! It has changed my life for the better. I have been using it for over a year now, and I can't imagine going back to the way things were before. The design is sleek and modern, and the functionality is top-notch. I highly recommend this product to anyone looking for a significant improvement in their daily routine.
2 | The product is good, but the customer service is lacking. I had some issues with my order, and it took a long time for the company to respond to my inquiries. In the end, everything was resolved, but it was a frustrating experience. The product itself works well, and I am satisfied with its performance. However, the customer service leaves something to be desired.

Transform Config:

Source Column: review
Destination Column: segmented_text
Split By: word
Split Length: 50
Split Overlap: 0
Split Respect Sentence Boundary: true

Transformation - Split Text

Transformed Dataset

review_id | segmented_text
1 | I absolutely love this product! It has changed my life for the better. I have been using it for over a year now, and I can't imagine going back to the way things were before. The design is sleek and modern, and the functionality is top-notch.
1 | I highly recommend this product to anyone looking for a significant improvement in their daily routine.
2 | The product is good, but the customer service is lacking. I had some issues with my order, and it took a long time for the company to respond to my inquiries.
2 | In the end, everything was resolved, but it was a frustrating experience. The product itself works well, and I am satisfied with its performance. However, the customer service leaves something to be desired.
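One way to picture word-based splitting that respects sentence boundaries is to pack whole sentences into a chunk until the next sentence would exceed the word limit. The sketch below assumes a simple regex sentence splitter and a greedy packing strategy; both are assumptions, the function name is hypothetical, and the result may not reproduce the rows above exactly, since Mantium's actual grouping logic is not documented here.

```python
import re


def split_words_respecting_sentences(text: str, split_length: int) -> list[str]:
    """Greedily pack whole sentences into chunks of at most split_length words."""
    # Naive sentence splitter: break after ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

    chunks, current, count = [], [], 0
    for sentence in sentences:
        n_words = len(sentence.split())
        # Start a new chunk if adding this sentence would exceed the limit.
        if current and count + n_words > split_length:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n_words
    if current:
        chunks.append(" ".join(current))
    return chunks


print(split_words_respecting_sentences(
    "One two three. Four five. Six seven eight nine.", split_length=5))
# ['One two three. Four five.', 'Six seven eight nine.']
```

Note that in this sketch a single sentence longer than the limit is kept whole rather than cut, so a chunk can exceed the limit in that case.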

Example 2: Splitting Text by Sentence and Removing Substrings

Suppose you are working with a dataset containing news articles, and you want to analyze the content using an AI model that has a maximum input length of 500 characters. You want to split the text by sentence, remove any URLs, and ensure that no sentence is split in the middle.

Original Dataset

article_id | article
1 | The new technology has been gaining traction among consumers, with many praising its innovative features and ease of use. For more information on this exciting development, visit https://www.example.com/newtech. Experts predict that it will continue to grow in popularity, potentially disrupting traditional markets and paving the way for further advancements in the field.
2 | In a recent study, researchers discovered that certain lifestyle changes can have a significant impact on overall health and well-being. The full report can be found at http://www.example.com/healthstudy. Participants who adopted these changes reported improvements in mood, energy levels, and sleep quality, suggesting that even small adjustments can lead to meaningful results.

Transform Config:

Source Column: article
Destination Column: split_article
Clean White Space: true
Clean Header Footer: false
Clean Empty Lines: true
Remove Substrings: ["http://", "https://"]
Split By: sentence
Split Length: 500
Split Overlap: 0
Split Respect Sentence Boundary: true

Transformed Dataset

article_id | split_article
1 | The new technology has been gaining traction among consumers, with many praising its innovative features and ease of use.
1 | For more information on this exciting development, visit .
1 | Experts predict that it will continue to grow in popularity, potentially disrupting traditional markets and paving the way for further advancements in the field.
2 | In a recent study, researchers discovered that certain lifestyle changes can have a significant impact on overall health and well-being.
2 | The full report can be found at .
2 | Participants who adopted these changes reported improvements in mood, energy levels, and sleep quality, suggesting that even small adjustments can lead to meaningful results.
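The clean-then-split flow in this example can be approximated as follows. This is a hedged sketch, not Mantium's implementation: removing only the literal substrings "http://" and "https://" would leave the rest of each address in place, so the sketch assumes a regular expression that strips the whole URL (keeping trailing punctuation), which is what the table above shows. The names `URL_PATTERN`, `remove_urls`, and `split_sentences` are hypothetical.

```python
import re

# Match a URL but stop before trailing punctuation such as the sentence's period.
URL_PATTERN = re.compile(r"https?://\S*[^\s.,;:!?)]")


def remove_urls(text: str) -> str:
    """Delete every http(s) URL from the text."""
    return URL_PATTERN.sub("", text)


def split_sentences(text: str) -> list[str]:
    """Naive sentence splitter: break after ., ! or ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


article = "See https://www.example.com/a. It works!"
print(split_sentences(remove_urls(article)))
# ['See .', 'It works!']
```

A regex-based sentence splitter like this one will mis-split on abbreviations such as "e.g."; a dedicated sentence tokenizer is a better choice for production data.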

Example 3: Splitting Text by Passage with Overlap

You have a dataset containing long passages of text, and you want to process them using an AI model that has a maximum input length of 2000 characters. You want to split the text by passage, ensuring there is an overlap of 50 characters between adjacent chunks.

Original Dataset

passage_id | passage
1 | In the heart of the ancient forest, a mysterious creature roamed the land, known only by the whispers of the wind. The villagers who lived nearby were both fascinated and fearful of this enigmatic being. They spoke of its enormous size, its deep, rumbling growl, and the way it seemed to disappear into the shadows at the first sign of daylight. However, none had ever seen it up close, for it was said that to meet its gaze would mean certain doom. For generations, the villagers had lived in a delicate balance, staying out of the creature's way and respecting its territory. But as the years went by, the boundaries between the village and the forest began to blur. People ventured further into the woods, driven by curiosity and a desire for adventure. The elders warned against such foolishness, but their words fell on deaf ears. And so, the fateful day arrived when the paths of the villagers and the creature finally crossed. The confrontation was swift and brutal, leaving the village reeling in the aftermath. It was a harsh reminder that the ancient forest was not to be trifled with, and that some mysteries were best left unsolved.

Transform Config:

Source Column: passage
Destination Column: split_passage
Clean White Space: true
Clean Header Footer: true
Clean Empty Lines: true
Remove Substrings: None
Split By: passage
Split Length: 2000
Split Overlap: 50
Split Respect Sentence Boundary: false

Transformed Dataset

passage_id | split_passage
1 | In the heart of the ancient forest, a mysterious creature roamed the land, known only by the whispers of the wind. The villagers who lived nearby were both fascinated and fearful of this enigmatic being. They spoke of its enormous size, its deep, rumbling growl, and the way it seemed to disappear into the shadows at the first sign of daylight. However, none had ever seen it up close, for it was said that to meet its gaze would mean certain doom. For generations, the villagers had lived in a delicate balance, staying out of the creature's way and respecting its territory. But as the years went by, the boundaries between the village and the forest began to blur. People ventured further into the woods, driven by curiosity and a desire for adventure. The elders warned against such foolishness, but their words fell on deaf ears. And so, the fateful day arrived when the paths of the villagers and the creature finally crossed. The confrontation was swift and brutal, leaving the
1 | e, and the way it seemed to disappear into the shadows at the first sign of daylight. However, none had ever seen it up close, for it was said that to meet its gaze would mean certain doom. For generations, the villagers had lived in a delicate balance, staying out of the creature's way and respecting its territory. But as the years went by, the boundaries between the village and the forest began to blur. People ventured further into the woods, driven by curiosity and a desire for adventure. The elders warned against such foolishness, but their words fell on deaf ears. And so, the fateful day arrived when the paths of the villagers and the creature finally crossed. The confrontation was swift and brutal, leaving the village reeling in the aftermath. It was a harsh reminder that the ancient forest was not to be trifled with, and that some mysteries were best left unsolved.
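When you enable Split Overlap, it can be useful to check how much text adjacent chunks actually share. The helper below is a small sketch for that purpose: it returns the longest run of characters that ends one chunk and begins the next. The function name `shared_overlap` is hypothetical and not part of Mantium.

```python
def shared_overlap(left: str, right: str) -> str:
    """Return the longest suffix of `left` that is also a prefix of `right`."""
    max_len = min(len(left), len(right))
    # Try the longest candidate first and shrink until one matches.
    for size in range(max_len, 0, -1):
        if left.endswith(right[:size]):
            return right[:size]
    return ""


overlap = shared_overlap("the quick brown fox", "brown fox jumps")
print(repr(overlap), len(overlap))
# 'brown fox' 9
```

Running this over consecutive rows of a transformed dataset gives a quick sanity check that the overlap you configured is the overlap you got.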