Text Processing in NLP
Text preprocessing is essential before applying any Natural Language Processing (NLP) models, as it ensures the input data is in a format that models can interpret and learn from. Here’s a detailed explanation of key preprocessing steps:
1. Tokenization
- Purpose: Break down the input text into smaller units, such as words or sentences, which are easier to analyze.
How It Works:
- If you're tokenizing by word, the sentence "Cats are running" becomes `["Cats", "are", "running"]`.
- If tokenizing by sentence, a paragraph is split into individual sentences.
- Example:
  - Input: "I love NLP! It's exciting."
  - Word tokenization: `["I", "love", "NLP", "!", "It", "'s", "exciting", "."]`
  - Sentence tokenization: `["I love NLP!", "It's exciting."]`
```python
import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize, sent_tokenize

text = "Cats are running. NLP is amazing!"

# Word tokenization
word_tokens = word_tokenize(text)
print("Word Tokenization:", word_tokens)

# Sentence tokenization
sentence_tokens = sent_tokenize(text)
print("Sentence Tokenization:", sentence_tokens)
```
2. Stop Words Removal
- Purpose: Eliminates common words like “is,” “the,” “a,” and “and,” which carry little meaning on their own and may clutter the analysis.
How It Works:
- After tokenization, you filter out predefined stop words. The list of stop words may vary based on the context or language model used.
- Example:
  - Input: `["Cats", "are", "running"]`
  - Output after removing stop words ("are" is a stop word): `["Cats", "running"]`
```python
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

# Get the English stop word list
stop_words = stopwords.words('english')

# Filter stop words out of the word tokens from the tokenization step
filtered_sentence = []
for word in word_tokens:
    if word.lower() not in stop_words:
        filtered_sentence.append(word)

print("After Stop Words Removal:", filtered_sentence)
```
3. Stemming/Lemmatization
Stemming:
- Purpose: Reduces words to their base or root form by chopping off suffixes. It may not always result in a meaningful word, but it simplifies the text.
- Example:
  - Input: `["Cats", "running"]`
  - Stemming: `["cat", "run"]` (the result is not always a real word; for example, the Porter stemmer reduces "studies" to "studi").
- Common Algorithms: Porter, Snowball, and Lancaster stemmers, compared in the sketch after the code below.
```python
# Stemming using NLTK
from nltk.stem import PorterStemmer

ps = PorterStemmer()
stemmed_words = [ps.stem(word) for word in filtered_sentence]
print("After Stemming:", stemmed_words)
```
Lemmatization:
- Purpose: Similar to stemming but more sophisticated, it reduces words to their base or dictionary form (lemma), taking into account the meaning of the word.
- Example:
  - Input: `["running"]`
  - Lemmatization: `["run"]` (it recognizes that "running" is the present participle of "run").
How It Works:
- Lemmatization requires context, so it often relies on part-of-speech (POS) tagging to pick the right lemma. For instance, "better" can be lemmatized to "good" (as an adjective), which stemming can't do. A POS-aware sketch follows the NLTK example below.
- Input: `["Cats", "are", "running"]`
- Output after lemmatization: `["cat", "be", "run"]`
```python
# Lemmatization using NLTK
import nltk
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('punkt')
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# Initialize the lemmatizer
lemmatizer = WordNetLemmatizer()

# Example sentence
sentence = "Cats are running. NLP is amazing!"

# Tokenize the sentence into words
tokens = word_tokenize(sentence)

# Lemmatize each word
# Note: without a POS tag, lemmatize() treats every word as a noun,
# so verbs such as "running" are left unchanged here
lemmatized_words = [lemmatizer.lemmatize(word) for word in tokens]
print(lemmatized_words)
```
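To get the POS-sensitive behavior described above with NLTK, the tokens can first be tagged and the Penn Treebank tags mapped to WordNet's POS constants. A minimal sketch, assuming the `averaged_perceptron_tagger` resource is available; `wordnet_pos` is a helper defined here, not an NLTK function:

```python
import nltk
nltk.download('averaged_perceptron_tagger')
from nltk import pos_tag
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

def wordnet_pos(treebank_tag):
    # Map a Penn Treebank tag to the WordNet POS constant lemmatize() expects
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    if treebank_tag.startswith('V'):
        return wordnet.VERB
    if treebank_tag.startswith('R'):
        return wordnet.ADV
    return wordnet.NOUN  # default to noun

tagged = pos_tag(["Cats", "are", "running"])
lemmas = [lemmatizer.lemmatize(w.lower(), wordnet_pos(t)) for w, t in tagged]
print(lemmas)  # expected: ['cat', 'be', 'run']

print(lemmatizer.lemmatize("better", wordnet.ADJ))  # 'good'
```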
```python
# Lemmatization using spaCy
import spacy

# Load the English model
nlp = spacy.load('en_core_web_sm')

# Example sentence
sentence = "Cats are running. NLP is amazing"

# Process the sentence
doc = nlp(sentence)

# Extract and print the lemma of each token
lemmatized_words = [token.lemma_ for token in doc]
print(lemmatized_words)
```
Example Walkthrough:
Let's say the input is: "Cats are running".
- Tokenization: `["Cats", "are", "running"]`
- Stop Words Removal: `["Cats", "running"]`
- Lemmatization: `["cat", "run"]`
```python
import spacy
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

# Load the spaCy model for tokenization and lemmatization
nlp = spacy.load("en_core_web_sm")
stop_words = set(stopwords.words('english'))

def preprocess(text):
    # Tokenize and lemmatize using spaCy
    doc = nlp(text)
    # Remove stop words and keep only alphabetic tokens
    processed_text = []
    for word in doc:
        if word.text.lower() not in stop_words and word.is_alpha:
            processed_text.append(word.lemma_)
    return processed_text

text = "Cats are running. NLP is amazing!"
preprocessed_text = preprocess(text)
print("Preprocessed Text:", preprocessed_text)
```
These preprocessing steps make the text ready for analysis or for feeding into an NLP model such as a transformer or other neural network.