Clean Wikipedia Text
Cleans Wikipedia text by collapsing runs of consecutive new lines, removing extremely short lines, adding paragraph breaks, and removing empty paragraphs.
Parameters
Source Column: The name of the column containing the Wikipedia text you want to clean. Defaults to content.
Destination Column: The name of the column that will hold the cleaned Wikipedia text. Defaults to clean_Wikipedia_text.
Usage
To use the Clean Wikipedia Text transformation, follow these steps:
- Specify the Source Column parameter with the name of the column that contains the Wikipedia text you want to clean.
- Specify the Destination Column parameter with the name of the column that will hold the clean Wikipedia text.
- Run the transformation by clicking the Save and Run Transforms button.
Example 1: Cleaning Wikipedia Text with Multiple New Lines and Short Lines
Suppose you have a dataset of Wikipedia text with multiple new lines, extremely short lines, and empty paragraphs. You want to clean the text in the "Wiki_Text" column.
ID | Wiki_Text |
---|---|
1 | Albert Einstein was a theoretical physicist.\n\n\nHe developed the theory of relativity.\n\n\n\n\n |
2 | Isaac Newton was an English mathematician.\n\n\n\n\nHe is known for his laws of motion.\nA.\nB. |
Parameters (YAML):
transform:
  name: Clean Wikipedia Text
  parameters:
    source_column: Wiki_Text
    destination_column: clean_Wikipedia_text
Expected Result Dataset:
ID | Wiki_Text | clean_Wikipedia_text |
---|---|---|
1 | Albert Einstein was a theoretical physicist.\n\n\nHe developed the theory of relativity.\n\n\n\n\n | Albert Einstein was a theoretical physicist.\n\nHe developed the theory of relativity. |
2 | Isaac Newton was an English mathematician.\n\n\n\n\nHe is known for his laws of motion.\nA.\nB. | Isaac Newton was an English mathematician.\n\nHe is known for his laws of motion. |
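The result above can be approximated with a few lines of Python. This is only a sketch of the behavior described on this page, not the transformation's actual implementation: the function names `clean_wikipedia_text` and `apply_to_rows` are hypothetical, and the `MIN_LINE_LENGTH` cutoff of 5 characters is an assumption, since this page does not state what counts as an "extremely short" line. The sketch also assumes that every surviving line becomes its own paragraph, which is consistent with the example rows.

```python
from typing import Dict, List

# Assumed cutoff: the page does not say what counts as "extremely short".
MIN_LINE_LENGTH = 5


def clean_wikipedia_text(text: str, min_line_length: int = MIN_LINE_LENGTH) -> str:
    """Drop blank and very short lines, then separate the rest with one blank line."""
    # Splitting on "\n" turns runs of new lines into empty strings,
    # which the length filter below removes along with short lines.
    lines = [line.strip() for line in text.split("\n")]
    kept = [line for line in lines if len(line) >= min_line_length]
    # Each surviving line is treated as its own paragraph (an assumption).
    return "\n\n".join(kept)


def apply_to_rows(
    rows: List[Dict],
    source_column: str = "content",
    destination_column: str = "clean_Wikipedia_text",
) -> List[Dict]:
    """Return copies of the rows with the cleaned text added in destination_column."""
    return [
        {**row, destination_column: clean_wikipedia_text(row[source_column])}
        for row in rows
    ]


rows = [
    {"ID": 1, "Wiki_Text": "Albert Einstein was a theoretical physicist.\n\n\n"
                           "He developed the theory of relativity.\n\n\n\n\n"},
    {"ID": 2, "Wiki_Text": "Isaac Newton was an English mathematician.\n\n\n\n\n"
                           "He is known for his laws of motion.\nA.\nB."},
]

cleaned = apply_to_rows(rows, source_column="Wiki_Text")
for row in cleaned:
    print(repr(row["clean_Wikipedia_text"]))
```

The defaults of `apply_to_rows` mirror the documented parameter defaults (content and clean_Wikipedia_text). The source column is left unchanged and the cleaned text is written to a new key, matching the layout of the expected result dataset.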
Example 2: Cleaning Wikipedia Text with Empty Paragraphs
Suppose you have a dataset of Wikipedia text with empty paragraphs. You want to clean the text in the "Wiki_Text" column.
ID | Wiki_Text |
---|---|
1 | Marie Curie was a physicist and chemist.\n\n\n\n\nShe conducted research on radioactivity.\n\n\n\n |
2 | Charles Darwin was a naturalist and biologist.\n\n\n\n\nHe is known for his theory of evolution. |
Parameters (YAML):
transform:
  name: Clean Wikipedia Text
  parameters:
    source_column: Wiki_Text
    destination_column: clean_Wikipedia_text
Expected Result Dataset:
ID | Wiki_Text | clean_Wikipedia_text |
---|---|---|
1 | Marie Curie was a physicist and chemist.\n\n\n\n\nShe conducted research on radioactivity.\n\n\n\n | Marie Curie was a physicist and chemist.\n\nShe conducted research on radioactivity. |
2 | Charles Darwin was a naturalist and biologist.\n\n\n\n\nHe is known for his theory of evolution. | Charles Darwin was a naturalist and biologist.\n\nHe is known for his theory of evolution. |