12.2 TextBlob

  • https://textblob.readthedocs.io/en/latest/
  • Object-oriented NLP text-processing library that is built on the NLTK and pattern NLP libraries
  • Some of the NLP tasks TextBlob can perform include:
    • Tokenization—splitting text into pieces called tokens, which are meaningful units, such as words and numbers
    • Parts-of-speech (POS) tagging—identifying each word’s part of speech, such as noun, verb, adjective, etc.
    • Noun phrase extraction—locating groups of words that represent nouns, such as “red brick factory.”
      • The phrase “red brick factory” illustrates why natural language is such a difficult subject. Is a “red brick factory” a factory that makes red bricks? Is it a red factory that makes bricks of any color? Is it a factory built of red bricks that makes products of any type? In today’s music world, it could even be the name of a rock band or the name of a game on your smartphone.
    • Sentiment analysis—determining whether text has positive, neutral or negative sentiment.
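    • Inter-language translation and language detection powered by Google Translate. (Later TextBlob releases deprecated and then removed these two features, so they may be unavailable in a current installation.)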

    • Inflection—pluralizing and singularizing words. There are other aspects of inflection that are not part of TextBlob.
    • Spell checking and spelling correction.
    • Stemming—reducing words to their stems by removing prefixes or suffixes. For example, the stem of “varieties” is “varieti.”
    • Lemmatization—like stemming, but produces real words based on the original words’ context. For example, the lemmatized form of “varieties” is “variety.”
    • Word frequencies—determining how often each word appears in a corpus.
    • WordNet integration for finding word definitions, synonyms and antonyms.
    • Stop word elimination—removing common words, such as a, an, the, I, we, you and more to analyze the important words in a corpus.
    • n-grams—producing sets of consecutive words in a corpus for use in identifying words that frequently appear adjacent to one another.
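A few of the tasks listed above (tokenization, stop word elimination, word frequencies and n-grams) can be sketched in plain Python to show what each one produces. This is only an illustration using the standard library, not how TextBlob implements them; the function names and the tiny `STOP_WORDS` set are assumptions made up for this example.

```python
import re
from collections import Counter

# Tiny hand-picked stop-word set, for illustration only.
# Real NLP libraries ship much longer lists.
STOP_WORDS = {"a", "an", "the", "i", "we", "you", "is", "of", "and", "to", "in"}


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def remove_stop_words(tokens):
    """Keep only the tokens that are not in STOP_WORDS."""
    return [token for token in tokens if token not in STOP_WORDS]


def word_frequencies(tokens):
    """Count how often each token appears."""
    return Counter(tokens)


def ngrams(tokens, n=3):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


text = "The red brick factory is a factory of red bricks."
tokens = tokenize(text)
content_words = remove_stop_words(tokens)

print(tokens)
print(content_words)
print(word_frequencies(content_words).most_common(2))
print(ngrams(content_words, n=2))
```

Note that this sketch counts "brick" and "bricks" as different words; stemming or lemmatization, as described above, is what would let them be treated as the same word.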

Installing the TextBlob Module

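  • To install TextBlob from the conda-forge channel, execute the following command in a terminal (on Windows, the Anaconda Prompt):
    conda install -c conda-forge textblob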
  • Once installation completes, execute the following command to download the NLTK corpora used by TextBlob:
    ipython -m textblob.download_corpora
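After the two commands above finish, one way to confirm that the packages are visible to your Python environment is a quick import check. The sketch below uses only the standard library; `is_installed` is a hypothetical helper name, and the check covers the packages only, not whether the NLTK corpora were downloaded.

```python
import importlib.util


def is_installed(module_name):
    """Return True if the named top-level module can be found for import."""
    return importlib.util.find_spec(module_name) is not None


# TextBlob is built on NLTK, so both packages should be found.
for name in ("textblob", "nltk"):
    status = "found" if is_installed(name) else "NOT found"
    print(f"{name}: {status}")
```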

Project Gutenberg

  • The free e-books at Project Gutenberg are a great source of text for analysis:

    https://www.gutenberg.org

  • Over 57,000 e-books in various formats, including plain text files
  • Out of copyright in the United States
  • Check Project Gutenberg’s Terms of Use and the books’ copyright status in other countries before using them outside the United States
  • Some examples use the plain-text e-book file for Shakespeare’s Romeo and Juliet: https://www.gutenberg.org/ebooks/1513
  • Project Gutenberg does not allow programmatic access to its e-books; you’re required to copy the books for that purpose:

    https://www.gutenberg.org/wiki/Gutenberg:Information_About_Robot_Access_to_our_Pages

  • To download a book:
    • Right-click the Plain Text UTF-8 link on the book’s web page
    • Select Save Link As… (Chrome/Firefox), Download Linked File As… (Safari) or Save target as (Microsoft Edge)
  • Save Romeo and Juliet as RomeoAndJuliet.txt in the ch12 examples folder
  • For analysis purposes, we removed the Project Gutenberg text before "THE TRAGEDY OF ROMEO AND JULIET", as well as the Project Gutenberg information at the end of the file, starting with: "End of the Project Gutenberg EBook of Romeo and Juliet, by William Shakespeare"
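
The trimming described in the last bullet can also be done in code rather than by hand. The following is a minimal sketch, assuming the two marker strings quoted above appear verbatim in your downloaded file (the wording in current Project Gutenberg files may differ, so check your copy); `strip_gutenberg_boilerplate` is a hypothetical helper name, not part of any library.

```python
# Marker strings taken from the description above; verify them against your file.
START_MARKER = "THE TRAGEDY OF ROMEO AND JULIET"
END_MARKER = "End of the Project Gutenberg EBook of Romeo and Juliet"


def strip_gutenberg_boilerplate(text, start_marker=START_MARKER,
                                end_marker=END_MARKER):
    """Return the text from the first start_marker up to, but not
    including, the first end_marker that follows it.

    If a marker is missing, that side of the text is left untrimmed.
    """
    start = text.find(start_marker)
    if start == -1:
        start = 0
    end = text.find(end_marker, start)
    if end == -1:
        end = len(text)
    return text[start:end].strip()


# Small in-memory stand-in for the downloaded file.
sample = (
    "Project Gutenberg header information\n"
    "THE TRAGEDY OF ROMEO AND JULIET\n"
    "\n"
    "Two households, both alike in dignity\n"
    "End of the Project Gutenberg EBook of Romeo and Juliet, "
    "by William Shakespeare\n"
    "license text\n"
)
cleaned = strip_gutenberg_boilerplate(sample)
print(cleaned)
```

To apply this to the real file, you could read it with `pathlib.Path("RomeoAndJuliet.txt").read_text(encoding="utf-8")`, pass the result to the function, and write the cleaned text back out.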

©1992–2020 by Pearson Education, Inc. All Rights Reserved. This content is based on Chapter 12 of the book Intro to Python for Computer Science and Data Science: Learning to Program with AI, Big Data and the Cloud.

DISCLAIMER: The authors and publisher of this book have used their best efforts in preparing the book. These efforts include the development, research, and testing of the theories and programs to determine their effectiveness. The authors and publisher make no warranty of any kind, expressed or implied, with regard to these programs or to the documentation contained in these books. The authors and publisher shall not be liable in any event for incidental or consequential damages in connection with, or arising out of, the furnishing, performance, or use of these programs.