Text Processing in NLP
Text preprocessing is essential before applying any Natural Language Processing (NLP) models, as it ensures the input data is in a format that models can interpret and learn from. Here’s a detailed explanation of key preprocessing steps:
1. Tokenization
- Purpose: Break down the input text into smaller units, such as words or sentences, which are easier to analyze.
How It Works:
- If you're tokenizing by word, the sentence "Cats are running" becomes `["Cats", "are", "running"]`.
- If tokenizing by sentence, a paragraph is split into individual sentences.
- Example:
  - Input: "I love NLP! It's exciting."
  - Word tokenization: `["I", "love", "NLP", "!", "It", "'s", "exciting", "."]`
  - Sentence tokenization: `["I love NLP!", "It's exciting."]`
```python
import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize, sent_tokenize

text = "Cats are running. NLP is amazing!"

# Word tokenization
word_tokens = word_tokenize(text)
print("Word Tokenization:", word_tokens)

# Sentence tokenization
sentence_tokens = sent_tokenize(text)
print("Sentence Tokenization:", sentence_tokens)
```
2. Stop Words Removal
- Purpose: Eliminates common words like “is,” “the,” “a,” and “and,” which carry little meaning on their own and may clutter the analysis.
How It Works:
- After tokenization, you filter out predefined stop words. The list of stop words may vary based on the context or language model used.
- Example:
  - Input: `["Cats", "are", "running"]`
  - Output after removing stop words ("are" is a stop word): `["Cats", "running"]`
```python
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

# Get the English stop word list
stop_words = stopwords.words('english')

# Filter stop words out of the word tokens from the tokenization step
filtered_sentence = []
for word in word_tokens:
    if word.lower() not in stop_words:
        filtered_sentence.append(word)

print("After Stop Words Removal:", filtered_sentence)
```
3. Stemming/Lemmatization
Stemming:
- Purpose: Reduces words to their base or root form by chopping off suffixes. It may not always result in a meaningful word, but it simplifies the text.
- Example:
  - Input: `["Cats", "running"]`
  - Stemming: `["cat", "run"]` (the result is not always a real word; for example, the Porter stemmer reduces "studies" to "studi").
- Common Algorithms: Porter, Snowball, and Lancaster stemmers, compared in the sketch after the code below.
```python
# Stemming using NLTK
from nltk.stem import PorterStemmer

ps = PorterStemmer()
stemmed_words = [ps.stem(word) for word in filtered_sentence]
print("After Stemming:", stemmed_words)
```
Lemmatization:
- Purpose: Similar to stemming but more sophisticated, it reduces words to their base or dictionary form (lemma), taking into account the meaning of the word.
- Example:
  - Input: `["running"]`
  - Lemmatization: `["run"]` (it recognizes that "running" is the present participle of "run").
How It Works:
- Lemmatization requires context, so it often relies on part-of-speech (POS) tagging to pick the right lemma. For instance, "better" can be lemmatized to "good" (as an adjective), which stemming can't do. A POS-aware sketch follows the NLTK example below.
- Input: `["Cats", "are", "running"]`
- Output after lemmatization: `["cat", "be", "run"]`
```python
# Lemmatization using NLTK
import nltk
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('punkt')
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# Initialize the lemmatizer
lemmatizer = WordNetLemmatizer()

# Example sentence
sentence = "Cats are running. NLP is amazing!"

# Tokenize the sentence into words
tokens = word_tokenize(sentence)

# Lemmatize each word
# Note: without a POS tag, lemmatize() treats every word as a noun,
# so verbs such as "running" are left unchanged here
lemmatized_words = [lemmatizer.lemmatize(word) for word in tokens]
print(lemmatized_words)
```
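To get the POS-sensitive behavior described above with NLTK, the tokens can first be tagged and the Penn Treebank tags mapped to WordNet's POS constants. A minimal sketch, assuming the `averaged_perceptron_tagger` resource is available; `wordnet_pos` is a helper defined here, not an NLTK function:

```python
import nltk
nltk.download('averaged_perceptron_tagger')
from nltk import pos_tag
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

def wordnet_pos(treebank_tag):
    # Map a Penn Treebank tag to the WordNet POS constant lemmatize() expects
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    if treebank_tag.startswith('V'):
        return wordnet.VERB
    if treebank_tag.startswith('R'):
        return wordnet.ADV
    return wordnet.NOUN  # default to noun

tagged = pos_tag(["Cats", "are", "running"])
lemmas = [lemmatizer.lemmatize(w.lower(), wordnet_pos(t)) for w, t in tagged]
print(lemmas)  # expected: ['cat', 'be', 'run']

print(lemmatizer.lemmatize("better", wordnet.ADJ))  # 'good'
```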
```python
# Lemmatization using spaCy
import spacy

# Load the English model
nlp = spacy.load('en_core_web_sm')

# Example sentence
sentence = "Cats are running. NLP is amazing"

# Process the sentence
doc = nlp(sentence)

# Extract and print the lemma of each token
lemmatized_words = [token.lemma_ for token in doc]
print(lemmatized_words)
```
Example Walkthrough:
Let's say the input is: "Cats are running".
- Tokenization: `["Cats", "are", "running"]`
- Stop Words Removal: `["Cats", "running"]`
- Lemmatization: `["cat", "run"]`
```python
import spacy
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

# Load the spaCy model for tokenization and lemmatization
nlp = spacy.load("en_core_web_sm")
stop_words = set(stopwords.words('english'))

def preprocess(text):
    # Tokenize and lemmatize using spaCy
    doc = nlp(text)
    # Remove stop words and keep only alphabetic tokens
    processed_text = []
    for word in doc:
        if word.text.lower() not in stop_words and word.is_alpha:
            processed_text.append(word.lemma_)
    return processed_text

text = "Cats are running. NLP is amazing!"
preprocessed_text = preprocess(text)
print("Preprocessed Text:", preprocessed_text)
```
These preprocessing steps make the text ready for analysis or for feeding into an NLP model such as a transformer or other neural network.