12.9 Natural Language Datasets

  • Wikipedia—some or all of Wikipedia (https://meta.wikimedia.org/wiki/Datasets)
  • IMDB (Internet Movie Database)—various movie and TV datasets are available.
  • UCI's text datasets—many datasets, including the Spambase dataset.
  • Project Gutenberg—50,000+ free e-books that are out-of-copyright in the U.S.
  • Jeopardy! dataset—200,000+ questions from the Jeopardy! TV show. A milestone in AI occurred in 2011 when IBM Watson famously beat two of the world’s best Jeopardy! players.
  • Natural language processing datasets
  • NLTK data
  • Sentiment labeled sentences data set (from sources including IMDB.com, amazon.com, yelp.com)
  • Registry of Open Data on AWS—a searchable directory of datasets hosted on Amazon Web Services.
  • Amazon Customer Reviews Dataset—130+ million product reviews.
  • Pitt.edu corpora—MPQA Opinion Corpus, Product Debate Data, Political Debate Data, goodFor/badFor Corpus, Arguing Corpus
  • and many more!
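
Many of these datasets are distributed as plain-text files that Python's standard library can read directly. The following is a minimal sketch, assuming a file in which each line holds a sentence, a tab character and a 0 or 1 sentiment label (the layout commonly used for sentiment-labeled sentence files). The sample text and the function name load_labeled_sentences are hypothetical and are used only for illustration; check the documentation of the dataset you download for its actual layout.

```python
import csv
import io

# Hypothetical sample in the assumed "sentence<TAB>label" layout
# (1 = positive, 0 = negative); not taken from any real dataset.
SAMPLE = (
    "Great phone, works as expected.\t1\n"
    "The plot was dull and predictable.\t0\n"
)

def load_labeled_sentences(stream):
    """Return a list of (sentence, label) tuples from a tab-separated stream."""
    # QUOTE_NONE keeps quotation marks inside sentences as ordinary characters.
    reader = csv.reader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    # Skip blank or malformed lines that do not have exactly two fields.
    return [(row[0], int(row[1])) for row in reader if len(row) == 2]

# io.StringIO stands in for an opened file so the sketch runs without a download.
records = load_labeled_sentences(io.StringIO(SAMPLE))
positive = [text for text, label in records if label == 1]
print(len(records), len(positive))
```

To read a downloaded file instead, pass the object returned by open(path, encoding="utf-8", newline="") in place of the io.StringIO object.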

©1992–2020 by Pearson Education, Inc. All Rights Reserved. This content is based on Chapter 5 of the book Intro to Python for Computer Science and Data Science: Learning to Program with AI, Big Data and the Cloud.
