12.2.10 Normalization: Stemming and Lemmatization

  • Stemming removes a prefix or suffix from a word, leaving only a stem, which may or may not be a real word
  • Lemmatization is similar, but it factors in the word’s part of speech and meaning, and it results in a real word
  • Both normalize words for analysis
    • Before calculating statistics on words in a body of text, you might convert all words to lowercase so that capitalized and lowercase words are not treated differently.
  • You might want to use a word’s root to represent the word’s many forms.
    • E.g., treat "program" and "programs" as "program"
In [1]:
from textblob import Word
In [2]:
word = Word('varieties')
In [3]:
word.stem()  # the stem need not be a real word
Out[3]:
'varieti'
In [4]:
word.lemmatize()  # the lemma is a real word
Out[4]:
'variety'
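
To make the normalization idea concrete without any library, here is a minimal sketch that lowercases words and strips one plural suffix before counting them. The names `SUFFIXES` and `toy_stem` are hypothetical and exist only for this illustration. The rules are a deliberately naive stand-in, not the algorithm behind `stem()` or `lemmatize()`.

```python
from collections import Counter

# Hypothetical toy suffix rules, checked longest first.
# This is an illustration only, not a real stemming algorithm.
SUFFIXES = ('ies', 'es', 's')

def toy_stem(word):
    """Lowercase a word and strip at most one plural-style suffix."""
    word = word.lower()
    for suffix in SUFFIXES:
        # Keep at least three characters so short words like 'is' survive
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

text = 'Program programs PROGRAMS variety varieties'

# Count the normalized forms rather than the raw words
counts = Counter(toy_stem(w) for w in text.split())
print(counts)
```

Here the three forms of "program" collapse into a single entry, but "varieties" becomes "variet", which is not a real word and does not match "variety". Cases like this are why lemmatization, which results in a real word, can be the better choice.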

©1992–2020 by Pearson Education, Inc. All Rights Reserved. This content is based on Chapter 5 of the book Intro to Python for Computer Science and Data Science: Learning to Program with AI, Big Data and the Cloud.