Clean Text

The Clean Text transformation cleans the text in a specific column of a dataset. It is useful when you want to standardize or preprocess text data before using it for analysis or machine learning.

Parameters

  • Target Column: The column name that has the cleaning transformation applied to it. This is a required field.
  • Source Column: The output column name for the cleaned text. Defaults to "cleaned_text".
  • Fix Unicode: Repair Unicode errors in the text, such as mis-encoded characters. This is an optional field.
  • To Ascii: Convert non-ASCII characters to ASCII equivalents. This is an optional field.
  • To Lowercase: Convert all text to lowercase. This is an optional field.
  • Normalize Whitespace: Remove any extra whitespace between words. This is an optional field.
  • Strip Lines: Remove any leading or trailing whitespace from each line. This is an optional field.
  • Keep Two Line Breaks: Keep at most two line breaks. This is an optional field.
  • Remove Line Breaks: Remove all line breaks. This is an optional field.
  • Remove URLs: Remove any URLs from the text. This is an optional field.
  • Remove Emails: Remove any email addresses from the text. This is an optional field.
  • Remove Phone Numbers: Remove any phone numbers from the text. This is an optional field.
  • Remove Numbers: Remove any numbers from the text. This is an optional field.
  • Remove Digits: Remove any digits from the text. This is an optional field.
  • Remove Currency Symbols: Remove any currency symbols from the text. This is an optional field.
  • Remove Punctuation: Remove any punctuation from the text. This is an optional field.
  • Remove Emojis: Remove any emojis from the text. This is an optional field.
  • Replace With URL: Define what to replace URLs with. Default is <URL>.
  • Replace With Email: Define what to replace email addresses with. Default is <EMAIL>.
  • Replace With Phone Number: Define what to replace phone numbers with. Default is <PHONE>.
  • Replace With Digit: Define what to replace digits with. This is an optional field.
  • Replace With Currency Symbol: Define what to replace currency symbols with. Default is <CUR>.
  • Replace With Punctuation: Define what to replace punctuation with. This is an optional field.
  • Language: Define the language of the text. Default is English.
  • Clean Wiki Text: Boolean field to clean wiki text. This is an optional field.
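
The whitespace and line-break options above can interact, so a small illustration may help. The following is a minimal Python sketch using only the standard library; the function names are hypothetical and mirror the parameter names for readability. It is an assumption about how such options typically behave, not Mantium's actual implementation.

```python
import re

def normalize_whitespace(text: str) -> str:
    # Collapse runs of spaces and tabs into one space (line breaks untouched).
    return re.sub(r"[ \t]+", " ", text)

def strip_lines(text: str) -> str:
    # Trim leading and trailing whitespace on every line.
    return "\n".join(line.strip() for line in text.split("\n"))

def keep_two_line_breaks(text: str) -> str:
    # Collapse three or more consecutive line breaks down to two.
    return re.sub(r"\n{3,}", "\n\n", text)

def remove_line_breaks(text: str) -> str:
    # Replace each run of line breaks with a single space.
    return re.sub(r"\n+", " ", text)

sample = "  Hello   world  \n\n\n\n  second   line  "
step = strip_lines(normalize_whitespace(sample))

print(repr(keep_two_line_breaks(step)))  # 'Hello world\n\nsecond line'
print(repr(remove_line_breaks(step)))    # 'Hello world second line'
```

In this sketch, Keep Two Line Breaks preserves paragraph boundaries while Remove Line Breaks flattens the text to a single line, so the two would normally not be enabled together.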

Usage

  1. Select the transformation from the list of available transformations in the Mantium user interface.
  2. Configure the Target Column parameter by selecting the column containing the text you want to clean.
  3. Configure the Source Column parameter by specifying the name of the new column that will hold the cleaned text.
  4. Configure the parameters by selecting different cleaning methods or defining replacements as per the parameters above.
  5. Optional: Preview the transform to see how the cleaned text will look based on the parameters selected.
  6. Run the transformation by clicking the Run button. The resulting dataset will have a new column, named as specified in the Source Column parameter, which will contain the cleaned text for each row of the target column.

Example 1: Clean Social Media Comments

Suppose you have a dataset of social media comments and you want to clean the text in the "comment" column. You can use the Clean Text transformation to standardize the text before using it for sentiment analysis.

ID | Comment
1 | Great product! Check out my website: http://www.example.com/ 😊
2 | I love this! Email me at [email protected] for more info.

To do this, you would configure the transformation as follows:

transform:
  name: Clean Text
  parameters:
    target_column: comment
    source_column: cleaned_comment
    to_lowercase: true
    normalize_whitespace: true
    remove_urls: true
    remove_emails: true
    remove_emojis: true
    replace_with_url: '<URL>'
    replace_with_email: '<EMAIL>'
    language: english

The resulting dataset would look like this:

ID | Comment | Cleaned_Comment
1 | Great product! Check out my website: http://www.example.com/ 😊 | great product! check out my website:
2 | I love this! Email me at [email protected] for more info. | i love this! email me at for more info.
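
To make the effect of these options concrete, here is a rough Python approximation of this example using only the standard library. Everything in it is an assumption for illustration: the regular expressions, the clean_comment helper, and its url_token and email_token arguments are hypothetical and are not Mantium's implementation. The replacement tokens are left empty so that the output matches the cleaned column shown above, and because the email address in the sample data is obscured, a made-up address stands in for it.

```python
import re

# Deliberately simple patterns; real-world URL and email matching is more involved.
URL_RE = re.compile(r"https?://\S+")
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
# Rough emoji ranges only; full emoji coverage is much wider.
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")

def clean_comment(text: str, url_token: str = "", email_token: str = "") -> str:
    text = URL_RE.sub(url_token, text)      # remove_urls
    text = EMAIL_RE.sub(email_token, text)  # remove_emails
    text = EMOJI_RE.sub("", text)           # remove_emojis
    text = text.lower()                     # to_lowercase
    return re.sub(r"\s+", " ", text).strip()  # normalize_whitespace

print(clean_comment("Great product! Check out my website: http://www.example.com/ 😊"))
# great product! check out my website:

# jane@example.com is a hypothetical stand-in for the obscured address.
print(clean_comment("I love this! Email me at jane@example.com for more info."))
# i love this! email me at for more info.
```

Passing a non-empty url_token or email_token to this helper would leave a placeholder in the text instead of deleting the match.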

Example 2: Clean Movie Reviews

Suppose you have a dataset of movie reviews and you want to clean the text in the "review" column. You can use the Clean Text transformation to standardize the text before using it for text classification.

ID | Review
1 | I loved this movie! Best one I've seen in a while. (5/5)
2 | Terrible movie. I walked out after 30 minutes. (1/5)

To do this, you would configure the transformation as follows:

transform:
  name: Clean Text
  parameters:
    target_column: review
    source_column: cleaned_review
    to_lowercase: true
    normalize_whitespace: true
    remove_punctuation: true
    remove_digits: true
    replace_with_punctuation: ''
    language: english

The resulting dataset would look like this:

ID | Review | Cleaned_Review
1 | I loved this movie! Best one I've seen in a while. (5/5) | i loved this movie best one ive seen in a while
2 | Terrible movie. I walked out after 30 minutes. (1/5) | terrible movie i walked out after minutes
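
The same result can be approximated in a few lines of standard-library Python, which may help when checking what a given combination of options will do. This is a sketch under stated assumptions: clean_review is a hypothetical helper, punctuation is taken to mean the ASCII characters in string.punctuation, and both punctuation and digits are replaced with an empty string. It is not Mantium's implementation.

```python
import re
import string

# Translation table that deletes every ASCII punctuation character.
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def clean_review(text: str) -> str:
    text = text.translate(PUNCT_TABLE)   # remove_punctuation, replaced with ''
    text = re.sub(r"\d", "", text)       # remove_digits
    text = text.lower()                  # to_lowercase
    return re.sub(r"\s+", " ", text).strip()  # normalize_whitespace

print(clean_review("I loved this movie! Best one I've seen in a while. (5/5)"))
# i loved this movie best one ive seen in a while

print(clean_review("Terrible movie. I walked out after 30 minutes. (1/5)"))
# terrible movie i walked out after minutes
```

Note that removing digits also removes the ratings and the "30" in the second review, which changes the meaning of the sentence; whether that is acceptable depends on the downstream task.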