Clean Wikipedia Text

Cleans Wikipedia text by collapsing runs of multiple newlines, removing extremely short lines, adding paragraph breaks, and removing empty paragraphs.

Parameters

Source Column: The column name containing the Wikipedia text you want to clean. Defaults to content.
Destination Column: The column name that holds the clean Wikipedia text. Defaults to clean_Wikipedia_text.

Usage

To use the Clean Wikipedia Text transformation, follow these steps:

  1. Specify the Source Column parameter with the name of the column that contains the Wikipedia text you want to clean.
  2. Specify the Destination Column parameter with the name of the column that will hold the clean Wikipedia text.
  3. Run the transformation by clicking the Save and Run Transforms button.

Example 1: Cleaning Wikipedia Text with Multiple New Lines and Short Lines

Suppose you have a dataset of Wikipedia text with multiple new lines, extremely short lines, and empty paragraphs. You want to clean the text in the "Wiki_Text" column.

ID | Wiki_Text
1 | Albert Einstein was a theoretical physicist.\n\n\nHe developed the theory of relativity.\n\n\n\n\n
2 | Isaac Newton was an English mathematician.\n\n\n\n\nHe is known for his laws of motion.\nA.\nB.

Parameters (YAML):

transform:
  name: Clean Wikipedia Text
  parameters:
    source_column: Wiki_Text
    destination_column: clean_Wikipedia_text

Expected Result Dataset:

ID | Wiki_Text | clean_Wikipedia_text
1 | Albert Einstein was a theoretical physicist.\n\n\nHe developed the theory of relativity.\n\n\n\n\n | Albert Einstein was a theoretical physicist.\n\nHe developed the theory of relativity.
2 | Isaac Newton was an English mathematician.\n\n\n\n\nHe is known for his laws of motion.\nA.\nB. | Isaac Newton was an English mathematician.\n\nHe is known for his laws of motion.
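
The cleaning behavior in this example can be approximated with a few lines of Python. This is only a sketch, not the transform's actual implementation: the function name clean_wikipedia_text and the 3-character threshold for "extremely short" lines are assumptions chosen so that "A." and "B." are dropped, since the page does not state the real cutoff.

```python
import re

# Assumed threshold: the page does not define "extremely short", so lines
# with fewer than 3 characters are dropped here (enough to remove "A." and "B.").
MIN_LINE_CHARS = 3


def clean_wikipedia_text(text: str, min_line_chars: int = MIN_LINE_CHARS) -> str:
    """Hypothetical re-implementation of the described cleaning steps."""
    # Collapse runs of newlines into a single newline.
    text = re.sub(r"\n{2,}", "\n", text)
    # Trim each line and drop the ones that are empty or extremely short.
    lines = [line.strip() for line in text.split("\n")]
    kept = [line for line in lines if len(line) >= min_line_chars]
    # Join the surviving lines with one blank line between them, so each
    # paragraph break is a double newline and no empty paragraphs remain.
    return "\n\n".join(kept)


einstein = (
    "Albert Einstein was a theoretical physicist.\n\n\n"
    "He developed the theory of relativity.\n\n\n\n\n"
)
newton = (
    "Isaac Newton was an English mathematician.\n\n\n\n\n"
    "He is known for his laws of motion.\nA.\nB."
)

print(repr(clean_wikipedia_text(einstein)))
print(repr(clean_wikipedia_text(newton)))
```

Under these assumptions, both rows come out as two paragraphs separated by a single blank line, matching the expected result above. A different threshold would change which short lines survive.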

Example 2: Cleaning Wikipedia Text with Empty Paragraphs

Suppose you have a dataset of Wikipedia text with empty paragraphs. You want to clean the text in the "Wiki_Text" column.

ID | Wiki_Text
1 | Marie Curie was a physicist and chemist.\n\n\n\n\nShe conducted research on radioactivity.\n\n\n\n
2 | Charles Darwin was a naturalist and biologist.\n\n\n\n\nHe is known for his theory of evolution.

Parameters (YAML):

transform:
  name: Clean Wikipedia Text
  parameters:
    source_column: Wiki_Text
    destination_column: clean_Wikipedia_text

Expected Result Dataset:

ID | Wiki_Text | clean_Wikipedia_text
1 | Marie Curie was a physicist and chemist.\n\n\n\n\nShe conducted research on radioactivity.\n\n\n\n | Marie Curie was a physicist and chemist.\n\nShe conducted research on radioactivity.
2 | Charles Darwin was a naturalist and biologist.\n\n\n\n\nHe is known for his theory of evolution. | Charles Darwin was a naturalist and biologist.\n\nHe is known for his theory of evolution.
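
How the Source Column and Destination Column parameters relate to the result dataset can be sketched as a plain column mapping over rows. The names apply_transform and keep_non_empty_lines are hypothetical, and the stand-in cleaner only removes blank lines (it does not drop short lines); only the two column defaults are taken from the Parameters section.

```python
from typing import Callable, Dict, List

Row = Dict[str, str]


def keep_non_empty_lines(text: str) -> str:
    """Stand-in cleaner (an assumption): keep non-blank lines as paragraphs."""
    return "\n\n".join(line.strip() for line in text.split("\n") if line.strip())


def apply_transform(
    rows: List[Row],
    clean: Callable[[str], str] = keep_non_empty_lines,
    source_column: str = "content",  # documented default
    destination_column: str = "clean_Wikipedia_text",  # documented default
) -> List[Row]:
    """Return copies of the rows with cleaned text added in destination_column."""
    return [{**row, destination_column: clean(row[source_column])} for row in rows]


rows = [
    {
        "ID": "1",
        "Wiki_Text": "Marie Curie was a physicist and chemist.\n\n\n\n\n"
        "She conducted research on radioactivity.\n\n\n\n",
    },
    {
        "ID": "2",
        "Wiki_Text": "Charles Darwin was a naturalist and biologist.\n\n\n\n\n"
        "He is known for his theory of evolution.",
    },
]

# Mirrors the YAML parameters of this example: source_column is set explicitly,
# destination_column is left at its default.
result = apply_transform(rows, source_column="Wiki_Text")
for row in result:
    print(row["ID"], repr(row["clean_Wikipedia_text"]))
```

The sketch leaves the source column untouched and adds the cleaned text as a new column, which is the shape shown in the expected result dataset.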
