Text Processing in NLP

Gauri Guglani
4 min read · Oct 4, 2024


Text preprocessing is essential before applying any Natural Language Processing (NLP) models, as it ensures the input data is in a format that models can interpret and learn from. Here’s a detailed explanation of key preprocessing steps:

1. Tokenization

  • Purpose: Break down the input text into smaller units, such as words or sentences, which are easier to analyze.

How It Works:

  • If you’re tokenizing by word, the sentence “Cats are running” becomes ["Cats", "are", "running"].
  • If tokenizing by sentence, a paragraph will be split into individual sentences.
  • Example:
  • Input: “I love NLP! It’s exciting.”
  • Word tokenization: ["I", "love", "NLP", "!", "It", "'s", "exciting", "."] (NLTK splits the contraction "It's" into "It" and "'s"; see the quick check after the code below)
  • Sentence tokenization: ["I love NLP!", "It's exciting."]
import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize, sent_tokenize

text = "Cats are running. NLP is amazing!"

# Word Tokenization
word_tokens = word_tokenize(text)
print("Word Tokenization:", word_tokens)

# Sentence Tokenization
sentence_tokens = sent_tokenize(text)
print("Sentence Tokenization:", sentence_tokens)
Expected output:

Word Tokenization: ['Cats', 'are', 'running', '.', 'NLP', 'is', 'amazing', '!']
Sentence Tokenization: ['Cats are running.', 'NLP is amazing!']
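To double-check the contraction example above, the same tokenizer can be run on that sentence (the comment shows the expected result):

from nltk.tokenize import word_tokenize

# NLTK's word tokenizer splits contractions like "It's" into two tokens
print(word_tokenize("I love NLP! It's exciting."))
# ['I', 'love', 'NLP', '!', 'It', "'s", 'exciting', '.']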

2. Stop Words Removal

  • Purpose: Eliminates common words like “is,” “the,” “a,” “and,” which don’t carry significant meaning and may clutter the analysis.

How It Works:

  • After tokenization, you filter out predefined stop words. The list of stop words may vary with the language and the task at hand (a customization sketch follows the code below).
  • Example:
  • Input: ["Cats", "are", "running"]
  • Output after removing stop words ("are" is a stop word): ["Cats", "running"]
nltk.download('stopwords')
from nltk.corpus import stopwords

# Get English stopwords as a set for fast membership checks
stop_words = set(stopwords.words('english'))

# Filter out stopwords from word tokens
filtered_sentence = []
for word in word_tokens:
    if word.lower() not in stop_words:
        filtered_sentence.append(word)

print("After Stop Words Removal:", filtered_sentence)
Expected output:

After Stop Words Removal: ['Cats', 'running', '.', 'NLP', 'amazing', '!']
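Since the stop-word list can vary with the task, a common pattern is to start from NLTK's default list and adjust it. A minimal sketch, where the added and removed words are purely illustrative:

from nltk.corpus import stopwords

# Start from NLTK's English list (downloaded above) and customize it
custom_stop_words = set(stopwords.words('english'))
custom_stop_words.add('nlp')       # illustrative: treat a domain-specific word as noise
custom_stop_words.discard('not')   # keep negations, often important for sentiment tasks

tokens = ["NLP", "is", "not", "hard"]
filtered = [w for w in tokens if w.lower() not in custom_stop_words]
print(filtered)  # ['not', 'hard']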

3. Stemming/Lemmatization

Stemming:

  • Purpose: Reduces words to their base or root form by chopping off suffixes. It may not always result in a meaningful word, but it simplifies the text.
  • Example:
  • Input: ["Cats", "running"]
  • Stemming: ["cat", "run"] (stemmers typically lowercase their output, and the result is not always a real word; "amazing", for example, becomes "amaz").
  • Common Algorithms: Porter, Snowball, and Lancaster stemmers (compared in a short sketch after the code below).
# Stemming using NLTK
from nltk.stem import PorterStemmer

ps = PorterStemmer()
stemmed_words = [ps.stem(word) for word in filtered_sentence]
print("After Stemming:", stemmed_words)
Expected output:

After Stemming: ['cat', 'run', '.', 'nlp', 'amaz', '!']
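The three stemmers mentioned above differ mainly in how aggressively they truncate; Lancaster is generally the harshest of the three. A quick side-by-side comparison:

from nltk.stem import PorterStemmer, SnowballStemmer, LancasterStemmer

porter = PorterStemmer()
snowball = SnowballStemmer('english')
lancaster = LancasterStemmer()

# Print each stemmer's result for the same words
for word in ["running", "generously", "organization"]:
    print(word, "->", porter.stem(word), "|", snowball.stem(word), "|", lancaster.stem(word))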

Lemmatization:

  • Purpose: Similar to stemming but more sophisticated, it reduces words to their base or dictionary form (lemma), taking into account the meaning of the word.
  • Example:
  • Input: ["running"]
  • Lemmatization: ["run"] (takes into account that "running" is the present participle of "run").

How It Works: Lemmatization requires context, so it often relies on part-of-speech (POS) tagging to pick the right lemma. For instance, “better” can be lemmatized to “good” when tagged as an adjective, which is something stemming can’t do (a POS-aware sketch follows the NLTK block below).

  • Input: ["Cats", "are", "running"]
  • Output after lemmatization (as spaCy produces it): ["cat", "be", "run"]
# Lemmatization using NLTK
import nltk
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('punkt')

from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# Initialize lemmatizer
lemmatizer = WordNetLemmatizer()

# Example sentence
sentence = "Cats are running. NLP is amazing!"

# Tokenize the sentence into words
tokens = word_tokenize(sentence)

# Lemmatize each word (with no POS tag, lemmatize() defaults to nouns,
# so verb forms like "running" and "are" come back unchanged)
lemmatized_words = [lemmatizer.lemmatize(word) for word in tokens]

# Display lemmatized words
print(lemmatized_words)
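One caveat about the block above: with no POS tag, WordNetLemmatizer treats every token as a noun, which is why "running" and "are" come back unchanged. Passing the part of speech explicitly fixes this; a minimal sketch:

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

# The default POS is 'n' (noun), so verb forms pass through untouched
print(lemmatizer.lemmatize("running"))           # running
print(lemmatizer.lemmatize("running", pos="v"))  # run
print(lemmatizer.lemmatize("better", pos="a"))   # good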


# Lemmatization using spaCy
import spacy

# Load the English model
nlp = spacy.load('en_core_web_sm')

# Example sentence
sentence = "Cats are running. NLP is amazing"

# Process the sentence
doc = nlp(sentence)

# Extract and print lemmas for each token
lemmatized_words = [token.lemma_ for token in doc]
print(lemmatized_words)
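spaCy also ships its own stop-word list, exposed per token as is_stop, so stop-word removal and lemmatization can be done in a single pass without NLTK. A small sketch of that alternative:

import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp("Cats are running. NLP is amazing!")

# Keep only alphabetic, non-stop-word tokens and take their lemmas
lemmas = [token.lemma_ for token in doc if not token.is_stop and token.is_alpha]
print(lemmas)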

Example Walkthrough:

Let’s say the input is: “Cats are running”.

  1. Tokenization: ["Cats", "are", "running"]
  2. Stop Words Removal: ["Cats", "running"]
  3. Lemmatization: ["cat", "run"]
import spacy
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

# Load spaCy model for lemmatization
nlp = spacy.load("en_core_web_sm")
stop_words = set(stopwords.words('english'))

def preprocess(text):
    # Tokenization and lemmatization using spaCy
    doc = nlp(text)
    # Remove stop words, drop punctuation, and keep each token's lemma
    processed_text = []
    for word in doc:
        if word.text.lower() not in stop_words and word.is_alpha:
            processed_text.append(word.lemma_)
    return processed_text

text = "Cats are running. NLP is amazing!"
preprocessed_text = preprocess(text)
print("Preprocessed Text:", preprocessed_text)
The full text-processing pipeline in one block

These preprocessing steps make the text ready for analysis or feeding into an NLP model like a transformer or neural network.
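As one illustration of that hand-off (scikit-learn's CountVectorizer is my choice here, not something the steps above prescribe), the cleaned tokens can be joined back into strings and turned into a bag-of-words matrix:

from sklearn.feature_extraction.text import CountVectorizer

# Preprocessed documents, e.g. the output of preprocess() joined back into strings
docs = ["cat run", "nlp amazing"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())  # learned vocabulary
print(X.toarray())                         # token counts per document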
