Normalize Text

Normalize text so that it is easier to read and analyze in later processing steps. With this transformation, you can normalize whitespace, fix Unicode characters, convert text to lowercase, and more.

Parameters

  • Source Column: The column name containing the text you want to normalize. Defaults to content.
  • Destination Column: The column name that holds the normalized text. Defaults to normalized_text.
  • To Lowercase: Convert to lowercase. Defaults to true.
  • Normalize Whitespace: Remove extra whitespace between words. Defaults to true.
  • Strip Lines: Remove leading or trailing whitespace from each line. Defaults to true.
  • Keep Two Line Breaks: Keep at most two consecutive line breaks. Defaults to false.
  • Remove Line Breaks: Remove line breaks. Defaults to false.
  • Clean White Space: Strip whitespace before and after each line. Defaults to false.
  • Clean Empty Lines: Where more than two empty lines occur in a row, remove the extra ones. Defaults to false.
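
As a rough illustration of what several of these options do, here is a minimal Python sketch. It is not the transformation's actual implementation: the function name normalize_text, the order in which the options are applied, and the exact handling of line breaks are assumptions made for this example.

```python
import re


def normalize_text(
    text: str,
    to_lowercase: bool = True,
    normalize_whitespace: bool = True,
    strip_lines: bool = True,
    keep_two_line_breaks: bool = False,
    remove_line_breaks: bool = False,
) -> str:
    """Hypothetical stand-in for the options above; not the product's code."""
    if to_lowercase:
        text = text.lower()
    if strip_lines:
        # Drop leading and trailing whitespace on every line.
        text = "\n".join(line.strip() for line in text.split("\n"))
    if normalize_whitespace:
        # Collapse runs of spaces and tabs between words into one space.
        text = re.sub(r"[ \t]+", " ", text)
    if keep_two_line_breaks:
        # Assumption: cap any run of line breaks at two.
        text = re.sub(r"\n{3,}", "\n\n", text)
    if remove_line_breaks:
        # Assumption: replace each run of line breaks with a single space.
        text = re.sub(r"\n+", " ", text).strip()
    return text
```

With the defaults, normalize_text("  The   Service\twas GREAT  ") returns "the service was great".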

Usage

  1. Configure the Source Column parameter by selecting the column containing the text data you want to normalize.
  2. Configure the Destination Column parameter by specifying the new name for the column that will hold the normalized text.
  3. Select the normalization options you want to apply.
  4. Optional: Preview the transform to see how the normalized text will look based on the parameters selected.
  5. Run the transformation by clicking the Save and Run Transforms button. The resulting dataset has a new column, named after the Destination Column parameter, that contains the normalized text.
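
The effect of these steps on the dataset can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper add_normalized_column and the row-as-dictionary representation are hypothetical, and str.lower stands in for whichever normalization options are selected. The default column names match the parameter defaults listed above.

```python
def add_normalized_column(
    rows,
    source_column="content",
    destination_column="normalized_text",
    normalize=str.lower,
):
    """Return new rows with destination_column added; the source column is kept.

    Hypothetical helper for illustration only, not the product's API.
    """
    return [
        {**row, destination_column: normalize(row[source_column])}
        for row in rows
    ]


rows = [{"id": 1, "content": "Hello  WORLD"}]
result = add_normalized_column(rows)
```

The original rows are left untouched, and each row in result carries both the content column and the new normalized_text column.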

Example 1: Normalize Customer Reviews

Suppose you have a dataset of customer reviews and you want to normalize the text in the "review" column. You can use the Normalize Text transformation to standardize the text before using it for sentiment analysis.

ID | Review
1  | The service was EXCELLENT!!! 👍👍👍 I'll definitely be back.
2  | Worst experience ever. 😡 The staff was rude, and the food was terrible.

To do this, you would configure the transformation as follows:

transform:
  name: Normalize Text
  parameters:
    source_column: review
    destination_column: normalized_review
    to_lowercase: true
    normalize_whitespace: true
    strip_lines: true
    remove_line_breaks: true
    language: en

The resulting dataset would look like this:

ID | Review | Normalized_Review
1  | The service was EXCELLENT!!! 👍👍👍 I'll definitely be back. | the service was excellent!!! ill definitely be back
2  | Worst experience ever. 😡 The staff was rude, and the food was terrible. | worst experience ever. the staff was rude, and the food was terrible

Example 2: Normalize Product Titles

Suppose you have a dataset of product titles and you want to normalize the text in the "title" column. You can use the Normalize Text transformation to standardize the text before using it for product categorization.

ID | Title
1  | NEW ARRIVAL! 🎉🎉🎉 Women's Fashionable Handbag - Perfect for any occasion!
2  | BUY 2 GET 1 FREE! Men's Stylish Shoes 👞👞👞 - Limited Time Offer!

To do this, you would configure the transformation as follows:

transform:
  name: Normalize Text
  parameters:
    source_column: title
    destination_column: normalized_title
    fix_unicode: true
    to_ascii: true
    to_lowercase: true
    normalize_whitespace: true
    strip_lines: true
    language: en

The resulting dataset would look like this:

ID | Title | Normalized_Title
1  | NEW ARRIVAL! 🎉🎉🎉 Women's Fashionable Handbag - Perfect for any occasion! | new arrival! womens fashionable handbag - perfect for any occasion
2  | BUY 2 GET 1 FREE! Men's Stylish Shoes 👞👞👞 - Limited Time Offer! | buy 2 get 1 free! mens stylish shoes - limited time offer
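
To give a feel for why emoji disappear once text is converted to ASCII, here is a small Python sketch of one common transliteration approach. It assumes that an option like to_ascii drops any character with no ASCII equivalent; the function below is a hypothetical illustration built on the standard unicodedata module, not the transformation's own code, and it is not guaranteed to reproduce the table above exactly.

```python
import unicodedata


def to_ascii(text: str) -> str:
    """Illustrative ASCII conversion; an assumption, not the product's code."""
    # NFKD splits accented letters into a base letter plus a combining mark.
    decomposed = unicodedata.normalize("NFKD", text)
    # Encoding with errors="ignore" then drops every non-ASCII character,
    # including the combining marks and any emoji.
    return decomposed.encode("ascii", "ignore").decode("ascii")


converted = to_ascii("Café 🎉 handbag")
```

Here converted is "Cafe  handbag": the accent and the emoji are gone, and the two spaces left around the removed emoji are the kind of gap that whitespace normalization would then collapse.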