Text Preprocessing in NLP

Introduction

Natural language processing (NLP) is a branch of linguistics, computer science, and artificial intelligence concerned with how computers process and understand human language, such as speech and text.

Text preprocessing is the step in an NLP pipeline where raw text data is cleaned and prepared for model building.

Text data can come from various sources such as websites, spoken language, and voice recognition systems. It is essential to clean and prepare text data for NLP-specific tasks by applying various techniques to the corpus of text, with the goal of building better and often more efficient NLP models. It is worth noting that text preprocessing is not directly transferable from one NLP task to another, and most preprocessing methods are domain-specific.

In this tutorial, you will learn common text preprocessing methods such as Noise Removal, Tokenization, Stop Words, Stemming, and Lemmatization. This tutorial introduces the concepts with examples, and later tutorials will build on the understanding you derive here. The reason for covering these concepts is to understand how modern NLP works from first principles.

Noise Removal

Text data sometimes contains unwanted information from its source. For example, text data scraped from a website will contain HTML tags. You might also want to remove punctuation, whitespace, numbers, special characters, etc., from the text data. Removing noise from text data is an important preprocessing step, and it is domain-specific.

The definition of noise depends on the specific NLP task you're working on. In one context, punctuation is noise, but, in another, it's essential information. For example, if you're estimating a Topic Model, you almost always want to remove punctuation. However, if you're training a Transformer, you almost always want to leave punctuation in your data.

Code Example

Here, we will explain how Python's regular expression module can remove unwanted information from the data.

We will assume that you have prior knowledge of how regular expressions work. It's okay if you don't; we will provide the basic understanding needed to follow the example below.

For this example, we will use the re module and its .sub method to clean the example text data.

How does the .sub method work?

The .sub method takes three arguments as shown in the example syntax below.

re.sub(r'pattern', 'replacement_string', 'input_string')

The first argument is the pattern you want the regular expression to search for, written with a preceding r to indicate a raw string so that backslashes are not treated as escape characters. The next argument is the replacement string, which replaces every substring that matches the pattern in the input string. Finally, the last argument is the input string itself.

Example - Cleaning web-scraped text data

header_text = '<h1>Mantium provides safety and security for AI Developers</h1>'

Using the .sub method, let's remove the HTML header tags from the web-scraped text string.

import re  # import the regular expression module

header_text = '<h1>Mantium provides safety and security for AI Developers</h1>'

# the pattern <.?h1> matches both the opening <h1> and closing </h1> tags,
# and each match is replaced with an empty string
clean_text = re.sub(r'<.?h1>', '', header_text)

print(clean_text)
# Mantium provides safety and security for AI Developers

The clean version of header_text without the HTML tags was generated using a pattern that matches the opening and closing header tags, which were replaced with the empty string given as the second argument. The above is an example of noise removal.

Noise removal is generally the first step in text preprocessing, as it can affect other steps such as stemming, which we will look at shortly.
In summary, noise removal involves removing characters, digits, and strings that can affect the results of your text analysis.
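
Beyond HTML tags, the same re.sub approach can strip punctuation, digits, and extra whitespace. The snippet below is a minimal sketch using a made-up example string; whether each of these counts as noise depends on your task.

import re

raw_text = 'Order #42 shipped!!   Contact support@example.com, ASAP...'

# remove digits
no_digits = re.sub(r'\d+', '', raw_text)

# remove punctuation and special characters, keeping only letters and whitespace
no_punct = re.sub(r'[^a-zA-Z\s]', '', no_digits)

# collapse repeated whitespace into single spaces and trim the ends
clean = re.sub(r'\s+', ' ', no_punct).strip()

print(clean)
# Order shipped Contact supportexamplecom ASAP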

Tokenization

For a machine to model natural language, it needs to transform text into a set of basic elements, usually called Tokens. The method of breaking text data into these smaller, basic elements is called Tokenization.
It is important to understand that these basic elements are the minimal meaningful units of information, such as words, punctuation marks, symbols, and model-specific special tokens, and not only individual words. Tokens can also be subword elements; in this case, complex words are decomposed into meaningful subwords. A good example is breaking the word Tokenization into Token and ization.
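
As a quick illustration of subword tokenization, the sketch below assumes the Hugging Face transformers library is installed and uses bert-base-uncased as an example model; the exact split depends on the model's learned vocabulary.

from transformers import AutoTokenizer

# load a pretrained subword tokenizer (downloads the vocabulary on first use)
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

# complex or rare words are decomposed into smaller, meaningful pieces
print(tokenizer.tokenize('Tokenization'))
# e.g. ['token', '##ization']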

Consider this sentence: "This is an example."

By performing tokenization on this sentence, we extract the tokens "This", "is", "an", "example" and ".".
The simplest technique is White Space Tokenization, where tokenization is performed by splitting the input whenever a white space is seen. Note that splitting purely on white space would keep the full stop attached, producing "example." as one token; most word-level tokenizers therefore also separate punctuation.
Although this method is fast, its drawback is that it only works for languages where white space separates words, such as English. There are other important tools for tokenization, such as spaCy tokenizers, tokenization with NLTK (Natural Language Toolkit), Hugging Face tokenizers, etc.
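
To see the difference, here is a minimal sketch of white space tokenization using Python's built-in split method; note how the punctuation stays attached to the last word.

text = "This is an example."

# splitting on white space keeps the full stop attached to "example"
print(text.split())
# ['This', 'is', 'an', 'example.']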

Code example of Tokenization using the NLTK library

import nltk
from nltk.tokenize import word_tokenize

nltk.download('punkt')  # download the Punkt tokenizer models if not already available

text = "This is an example."
tokenized = word_tokenize(text)
print(tokenized)

# ['This', 'is', 'an', 'example', '.']

Tokenizers are tools that break text into discrete units, which are referred to as tokens. This process is called tokenization. While it is very common for a tokenizer to use spaces to tokenize a string (e.g. break it into individual words), there are no strict rules about how tokenization is carried out. For example, some tokenizers break words into sub-components rather than simply focusing on white space.

After you tokenize a document or collection of documents, you can identify the set of unique tokens that occur in your data. This set of unique tokens is referred to as your vocabulary. As such, your tokenizer directly impacts the content of your vocabulary. If your tokens are individual words, your vocabulary will contain individual words; however, if your tokens are sub-word strings or characters, then your vocabulary will consist of those components instead of words. Once your vocabulary is defined, you can map each unique token in your vocabulary to a unique numerical value. This is how words are converted to numbers in NLP.
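
To make that last point concrete, here is a minimal sketch that builds a vocabulary from the tokens produced in the NLTK example above and maps each unique token to an integer id; the sorting step is only there to give a stable ordering.

tokens = ['This', 'is', 'an', 'example', '.']

# the vocabulary is the set of unique tokens (sorted for a stable ordering)
vocab = sorted(set(tokens))

# map each unique token to a unique numerical value
token_to_id = {token: idx for idx, token in enumerate(vocab)}

print(token_to_id)
# {'.': 0, 'This': 1, 'an': 2, 'example': 3, 'is': 4}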

Stop Words

There are common words that we don't need for text analysis; they appear frequently and can have negative effects on our models, such as increasing their computational cost or adding noise. Examples are a, the, and, etc. These words are known as stop words, as they don't add much additional information.
In some cases, we don't want to remove these words. An example is when working with transformer-based models, where the relationships between these words matter. When using pre-trained models, it's best to feed them data that's comparable to what they were trained on.

Example

Remove Stop words from the following text - Mantium is a platform that helps NLP developers build productively with Large Language Models

Step 1 - Import the stopwords corpus from the NLTK library, and create a set of English stop words.

import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')  # download the stop word lists if not already available
stop_words = set(stopwords.words('english'))

Step 2 - Use the split method to tokenize the text string, then use a list comprehension to remove the tokens that are present in the stop word set.

text = 'Mantium is a platform that helps NLP developers build productively with Large Language Models'

text = text.lower()
text = text.split()
text_no_stop = [word for word in text if word not in stop_words]

Results

  1. Print the final tokens without stop words
print(text_no_stop)
['mantium', 'platform', 'helps', 'nlp', 'developers', 'build', 'productively', 'large', 'language', 'models']
  2. Print the text data with stop words, and without stop words.

Text with stop words

# Text with stop words

print(' '.join(text))
# mantium is a platform that helps nlp developers build productively with large language models

Text without stop words

# Text without stop words
print(' '.join(text_no_stop))
# mantium platform helps nlp developers build productively large language models

By comparing both results, you will notice that we've removed words like is, a, that, and with, and we are left only with the words that are more informative.

Stemming & Lemmatization

In natural language, words are modified in order to convey meaning. In English, for instance, verbs are modified from their base form to reflect different tenses: generate, generated, generating. In this example, the base form is generate.

Stemming is the process of reducing words to their base form, or word stem. Stemming is a heuristic procedure that chops characters off the end of a word until the stem is reached, in the hope of getting it right most of the time. The problem with this is that English has many exceptions, where a more complex approach is required to get words into their base forms.

Stemming is similar to lemmatization but simpler, because lemmatization involves a more rigorous approach that uses a vocabulary and morphological analysis of words. Lemmatization returns a word to its base or canonical form, per the dictionary. Examples are slept to sleep and espousing to espouse. Compared to stemming, it requires knowledge of each word's part of speech, and it is more computationally expensive.

Stemming algorithms are typically rule-based; the most common and effective one is Porter's algorithm, developed by Martin Porter in 1980. The algorithm uses five phases of word reduction, each with its own set of mapping rules applied sequentially. Martin Porter also developed a stemming package known as Snowball, which offers a slight improvement over the original Porter stemmer in terms of logic and speed.

Perform stemming on these tokens - run, ran, running, many, easily

Step 1 - Here, we are going to import the PorterStemmer module from the Natural Language Toolkit library, and also the Pandas library to create a DataFrame for displaying our final results.

from nltk.stem import PorterStemmer
import pandas as pd

Step 2 - Here, you will create an instance of the PorterStemmer object.

stemmer = PorterStemmer()

Step 3 - Create a list of tokens

words = ['run', 'ran', 'running', 'many', 'easily']

Step 4 - To stem each word in the list above, we will use a list comprehension to apply the stemmer to each word.

stemmed_words = [stemmer.stem(token) for token in words]

Step 5 - Create a Pandas DataFrame to compare the unstemmed words to the stemmed words.

stemdf = pd.DataFrame({'Raw_words': words, 'Stemmed_Word': stemmed_words})
print(stemdf)

Raw_words Stemmed_Word
0       run          run
1       ran          ran
2   running          run
3      many         mani
4    easily       easili

You will notice that the stemmed version of many is mani and easily is easili, so we can say that stemming bluntly changes or removes word affixes (prefixes and suffixes).
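
If you would like to try the Snowball stemmer mentioned earlier, here is a minimal sketch; it is exposed in NLTK as SnowballStemmer and takes the language as an argument. For this particular word list the output happens to match the Porter stemmer.

from nltk.stem import SnowballStemmer

# Snowball (sometimes called Porter2) requires the language to be specified
snowball = SnowballStemmer('english')

words = ['run', 'ran', 'running', 'many', 'easily']
print([snowball.stem(token) for token in words])
# e.g. ['run', 'ran', 'run', 'mani', 'easili']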

In the next example, we will see how lemmatization works, which involves reducing words to their root (dictionary) forms.

There are many methods of lemmatization, and Python has packages that support these methods.

These include the WordNet lemmatizer, the WordNet lemmatizer with an appropriate part-of-speech tag, spaCy lemmatization, etc.

For this example, we will use the WordNet lemmatizer with an appropriate part-of-speech tag, e.g. verb, which we specify using the part-of-speech argument.

# Import the lemmatization module
import nltk
from nltk.stem import WordNetLemmatizer
import pandas as pd

nltk.download('wordnet')  # download the WordNet data used by the lemmatizer

words = ['running', 'ran', 'espousing', 'many', 'easily', 'was']

lemmatizer = WordNetLemmatizer()

# lemmatize each word, passing the part of speech ('v' for verb)
lemmatized_words = [lemmatizer.lemmatize(word=word, pos='v') for word in words]

lemdf = pd.DataFrame({'Raw_words': words, 'Lemmatized_Word': lemmatized_words})
print(lemdf)

# Result
Raw_words Lemmatized_Word
0    running             run
1        ran             run
2  espousing         espouse
3       many            many
4     easily          easily
5        was              be

From the results above, you will notice that the words were converted to their base forms. In comparison with stemming, the base forms (lemmas) returned by lemmatization are actual dictionary words.

Conclusion

The aim of this tutorial is to introduce text preprocessing concepts in natural language processing and how they can affect text analysis and modelling. We hope that you have built some intuition about how text data is processed, which will help in understanding modern NLP applications.