NOTE: Before running this notebook, place a copy of your downloaded RomeoAndJuliet.txt file in the same folder as this notebook.

12.3 Visualizing Word Frequencies with Bar Charts and Word Clouds

Visualizations can enhance your corpus analyses:
- A bar chart quantitatively visualizes the top 20 words in Romeo and Juliet as bars representing each word and its frequency.
- A word cloud qualitatively visualizes more frequently occurring words in larger fonts and less frequently occurring words in smaller fonts.

12.3.1 Visualizing Word Frequencies with Pandas

Visualize Romeo and Juliet’s top 20 words that are not stop words, using features from TextBlob, NLTK and pandas.
Pandas visualization capabilities are based on Matplotlib, so launch IPython with the following command for this session:
```
ipython --matplotlib
```
Or, in Jupyter, enable inline Matplotlib plots with the following magic command:

%matplotlib inline

Loading the Data

from pathlib import Path

from textblob import TextBlob

blob = TextBlob(Path('RomeoAndJuliet.txt').read_text())

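Load the NLTK English stop words (if the stopwords corpus is not yet installed, run nltk.download('stopwords') once first):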

from nltk.corpus import stopwords

stop_words = stopwords.words('english')

Getting the Word Frequencies

Get the word-frequency tuples from the blob's word_counts dictionary:

items = blob.word_counts.items()
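To illustrate the shape of these tuples, here is a minimal sketch that uses collections.Counter on a short sample string. Counter is only a stand-in for blob.word_counts, and the sample text and the regular-expression tokenizer are assumptions for illustration; TextBlob's own tokenization differs.

```python
import re
from collections import Counter

# Hypothetical sample text; the notebook uses the full play instead.
sample = 'Romeo, Romeo! Wherefore art thou Romeo? Deny thy father and refuse thy name.'

# Lowercase the text and extract runs of letters (a simplified tokenizer).
words = re.findall(r'[a-z]+', sample.lower())

# Counter maps each word to its number of occurrences.
sample_counts = Counter(words)

# items() yields (word, count) tuples in first-seen order.
sample_items = list(sample_counts.items())
print(sample_items[:3])  # [('romeo', 3), ('wherefore', 1), ('art', 1)]
```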

Eliminating the Stop Words

The expression item[0] gets the word from each tuple so we can check whether it’s in stop_words:

items = [item for item in items if item[0] not in stop_words]
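Because stop_words is a list, each not in test scans it from the start. A possible refinement, sketched below, converts the list to a set first. The names sample_stop_words, sample_items and stop_set are hypothetical stand-ins, not part of the notebook.

```python
# Hypothetical stand-ins for the notebook's stop_words list and items.
sample_stop_words = ['and', 'the', 'thy', 'a']
sample_items = [('romeo', 3), ('and', 5), ('thy', 2), ('love', 4)]

# A set gives constant-time membership tests on average.
stop_set = set(sample_stop_words)

# Keep only the tuples whose word (element 0) is not a stop word.
filtered = [item for item in sample_items if item[0] not in stop_set]
print(filtered)  # [('romeo', 3), ('love', 4)]
```

The result is the same as with the list; only the lookup cost changes, which may matter for larger corpora.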

Sorting the Words by Frequency

Sort the tuples in items in descending order by frequency.
To specify the tuple element to sort by, use the itemgetter function from the Python Standard Library’s operator module:

from operator import itemgetter

sorted_items = sorted(items, key=itemgetter(1), reverse=True)
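A small sketch of how itemgetter(1) behaves as a sort key, using hypothetical sample tuples. It also shows that an equivalent lambda gives the same ordering, and that tuples with equal counts keep their original relative order because sorted is stable.

```python
from operator import itemgetter

# Hypothetical (word, count) tuples for illustration.
sample_items = [('nurse', 2), ('romeo', 5), ('love', 2), ('juliet', 4)]

# itemgetter(1) returns element 1 (the count) of each tuple.
by_count = sorted(sample_items, key=itemgetter(1), reverse=True)
print(by_count)
# [('romeo', 5), ('juliet', 4), ('nurse', 2), ('love', 2)]

# An equivalent key function written as a lambda.
by_count_lambda = sorted(sample_items, key=lambda item: item[1], reverse=True)
print(by_count == by_count_lambda)  # True
```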

Getting the Top 20 Words

TextBlob tokenization splits all contractions at their apostrophes and counts the total number of apostrophes as one of the “words.”
Romeo and Juliet has many contractions.
- If you display sorted_items[0], you’ll see that the apostrophe is the most frequently occurring “word,” with 867 occurrences.
- (In some locales this does not happen and element 0 is indeed 'romeo'.)
- We ignore element 0, so the slice starts at index 1. If element 0 is 'romeo' on your system, slice sorted_items[0:20] instead.

top20 = sorted_items[1:21]
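Since element 0 may or may not be the apostrophe token, one alternative to a fixed slice is to drop non-alphabetic tokens before taking the top entries. This is only a sketch under assumptions: the tuples below are hypothetical sample data, and the exact form of the apostrophe token in your session may differ.

```python
# Hypothetical sorted (token, count) tuples; "'" stands in for the apostrophe token.
sample_sorted = [("'", 867), ('romeo', 315), ('thou', 278), ('juliet', 190)]

# Keep only tokens made up entirely of letters, then take the first n.
n = 3
top_n = [item for item in sample_sorted if item[0].isalpha()][:n]
print(top_n)  # [('romeo', 315), ('thou', 278), ('juliet', 190)]
```

With this approach the result does not depend on whether the apostrophe token is present, though it would also drop any other token containing non-letter characters.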

Convert top20 to a DataFrame

import pandas as pd

df = pd.DataFrame(top20, columns=['word', 'count'])

df

Visualizing the DataFrame

The bar method of the DataFrame’s plot property creates and displays a Matplotlib bar chart:

axes = df.plot.bar(x='word', y='count', legend=False)

import matplotlib.pyplot as plt

plt.gcf().tight_layout()

©1992–2020 by Pearson Education, Inc. All Rights Reserved. This content is based on Chapter 5 of the book Intro to Python for Computer Science and Data Science: Learning to Program with AI, Big Data and the Cloud.


	word	count
0	romeo	315
1	thou	278
2	juliet	190
3	thy	170
4	capulet	163
5	nurse	149
6	love	148
7	thee	138
8	lady	117
9	shall	110
10	friar	105
11	come	94
12	mercutio	88
13	lawrence	82
14	good	80
15	benvolio	79
16	tybalt	79
17	enter	75
18	go	75
19	night	73