Python - Stemming and Lemmatization

In Natural Language Processing we often come across situations where two or more words share a common root. For example, the three words agreed, agreeing and agreeable have the same root word, agree. A search involving any of these words should treat them as the same word, namely the root word. So it becomes essential to link all such words to their root. The NLTK library has methods to do this linking and return the root word.
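The idea of linking related words to one key can be sketched without any library. The snippet below is only an illustration: naive_stem, group_by_stem and the SUFFIXES tuple are hypothetical names, and the suffix-stripping rule is deliberately crude. It is not the Porter algorithm, whose rules are far more careful and whose stems will differ.

```python
from collections import defaultdict

# Hypothetical, very incomplete suffix list used only for this illustration.
SUFFIXES = ("able", "ing", "ed", "s")


def naive_stem(word):
    """Strip one known suffix, then any trailing 'e' (toy rule, not Porter)."""
    word = word.lower()
    for suffix in SUFFIXES:
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[: -len(suffix)]
            break
    return word.rstrip("e")


def group_by_stem(words):
    """Map each toy stem to the list of words that reduce to it."""
    groups = defaultdict(list)
    for w in words:
        groups[naive_stem(w)].append(w)
    return dict(groups)


groups = group_by_stem(["agreed", "agreeing", "agreeable", "agree", "rooms"])
print(groups)
# {'agr': ['agreed', 'agreeing', 'agreeable', 'agree'], 'room': ['rooms']}
```

A search index built on such keys would return all four agree variants for a query on any one of them. The key itself ("agr") is not a real word, which is typical of stemming.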

The program below uses the Porter stemming algorithm.

import nltk
from nltk.stem.porter import PorterStemmer

# word_tokenize needs the Punkt tokenizer data. If it is missing, run
# nltk.download('punkt') once (recent NLTK releases ask for 'punkt_tab').

porter_stemmer = PorterStemmer()

word_data = "It originated from the idea that there are readers who prefer learning new skills from the comforts of their drawing rooms"

# First, split the sentence into word tokens
nltk_tokens = nltk.word_tokenize(word_data)

# Next, find the stem of each token
for w in nltk_tokens:
    print("Actual: %s Stem: %s" % (w, porter_stemmer.stem(w)))

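When we execute the above code, it produces the following result. (Recent NLTK releases lowercase each stem by default, so the first line may show "it" rather than "It".)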

Actual: It Stem: It
Actual: originated Stem: origin
Actual: from Stem: from
Actual: the Stem: the
Actual: idea Stem: idea
Actual: that Stem: that
Actual: there Stem: there
Actual: are Stem: are
Actual: readers Stem: reader
Actual: who Stem: who
Actual: prefer Stem: prefer
Actual: learning Stem: learn
Actual: new Stem: new
Actual: skills Stem: skill
Actual: from Stem: from
Actual: the Stem: the
Actual: comforts Stem: comfort
Actual: of Stem: of
Actual: their Stem: their
Actual: drawing Stem: draw
Actual: rooms Stem: room

Lemmatization is similar to stemming, but it brings context to the words. Instead of stripping suffixes by rule, it goes a step further and maps each word to its dictionary form (its lemma), using a vocabulary and the word's part of speech. For example, cars is linked to car, and the adjective better is linked to good, which no suffix rule could produce. In the program below we use the WordNet lexical database for lemmatization.

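import nltk
from nltk.stem import WordNetLemmatizer

# The lemmatizer needs the WordNet corpus. If it is missing, run
# nltk.download('wordnet') once.

wordnet_lemmatizer = WordNetLemmatizer()

word_data = "It originated from the idea that there are readers who prefer learning new skills from the comforts of their drawing rooms"

# Split the sentence into word tokens
nltk_tokens = nltk.word_tokenize(word_data)

# Look up the lemma of each token (lemmatize() assumes a noun by default)
for w in nltk_tokens:
    print("Actual: %s Lemma: %s" % (w, wordnet_lemmatizer.lemmatize(w)))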

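When we execute the above code, it produces the following result. Because lemmatize() treats every word as a noun unless a part of speech is passed, verb forms such as originated, learning and drawing come back unchanged; calling lemmatize(w, pos='v') would reduce them to originate, learn and draw.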

Actual: It Lemma: It
Actual: originated Lemma: originated
Actual: from Lemma: from
Actual: the Lemma: the
Actual: idea Lemma: idea
Actual: that Lemma: that
Actual: there Lemma: there
Actual: are Lemma: are
Actual: readers Lemma: reader
Actual: who Lemma: who
Actual: prefer Lemma: prefer
Actual: learning Lemma: learning
Actual: new Lemma: new
Actual: skills Lemma: skill
Actual: from Lemma: from
Actual: the Lemma: the
Actual: comforts Lemma: comfort
Actual: of Lemma: of
Actual: their Lemma: their
Actual: drawing Lemma: drawing
Actual: rooms Lemma: room
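
The role of the part of speech in a lemma lookup can be illustrated without WordNet. The sketch below is a toy: TOY_LEMMAS and toy_lemmatize are hypothetical names, and the small hand-written table merely stands in for the lexical database that WordNetLemmatizer consults. It only shows that the same surface form can map to different lemmas depending on the part of speech supplied.

```python
# Toy lookup table standing in for a lexical database such as WordNet.
# Keys are (word, part_of_speech); entries are hand-written for illustration.
TOY_LEMMAS = {
    ("learning", "v"): "learn",
    ("drawing", "v"): "draw",
    ("originated", "v"): "originate",
    ("readers", "n"): "reader",
    ("rooms", "n"): "room",
    ("better", "a"): "good",
}


def toy_lemmatize(word, pos="n"):
    """Return the lemma for (word, pos), or the word unchanged if unknown."""
    return TOY_LEMMAS.get((word.lower(), pos), word)


samples = [("learning", "n"), ("learning", "v"), ("rooms", "n"), ("better", "a")]
for word, pos in samples:
    print("%s (%s) -> %s" % (word, pos, toy_lemmatize(word, pos)))
# learning (n) -> learning
# learning (v) -> learn
# rooms (n) -> room
# better (a) -> good
```

In practice the part of speech would usually come from a tagger rather than being typed in by hand; the table lookup here is only meant to show why that extra argument changes the result.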
