• Instructor Notes:
    • Before running this notebook's code, add your own keys into the keys.py file
    • The app is set up to collect 10,000 tweets. You can reduce that number so the notebook executes during class time, but the map colors better when there are significant differences in tweet counts per senator, so you might want to execute this the night before class.
    • We've provided the fully executed notebook, which has a map generated on August 25, 2019.
    • Every time you execute this, first go to mongodb.com and empty your database if you're using their free Atlas cluster, as it provides only 512MB of storage.
    • We updated senators.csv as of 2019.
    • We modified class TweetListener to clear its output every time a new tweet arrives, so you won't see 10,000 tweets appear inline in the notebook.
In [1]:
# enable high-res images in notebook 
%config InlineBackend.figure_format = 'retina'
%matplotlib inline

17.4 Case Study: A MongoDB JSON Document Database

  • Store and search JSON for 10,000 streamed tweets about 100 U.S. senators
  • Summarize top 10 by tweet count
  • Display interactive map containing tweet count summaries
  • Collecting 10,000 tweets can take substantial time
  • Possible enhancement — Use sentiment analysis to count positive, negative and neutral tweets mentioning each senator’s handle

Free Cloud-Based MongoDB Atlas Cluster

Python Libraries Required for Interacting with MongoDB

conda install -c conda-forge pymongo
conda install -c conda-forge dnspython
  • pymongo library — interact with MongoDB databases from Python
  • dnspython library — used as part of connecting to a MongoDB Atlas Cluster

keys.py

  • keys.py must contain
    • your Twitter credentials
    • your OpenMapQuest key
    • your MongoDB connection string
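A minimal sketch of what keys.py might contain. Every value below is a placeholder; the variable names match those the notebook's code cells reference (keys.consumer_key, keys.mapquest_key, keys.mongo_connection_string, and so on):

```python
# keys.py -- placeholder credentials; replace each value with your own
consumer_key = 'YourConsumerKey'
consumer_secret = 'YourConsumerSecret'
access_token = 'YourAccessToken'
access_token_secret = 'YourAccessTokenSecret'
mapquest_key = 'YourOpenMapQuestKey'
mongo_connection_string = 'YourMongoDBConnectionString'
```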

17.4.1 Creating the MongoDB Atlas Cluster

  • To sign up for a free account, go to the following address, enter your email address and click Get started free

    https://mongodb.com

  • On the next page, enter your name and create a password, then read their terms of service.
  • If you agree, click Get started free on this page and you’ll be taken to the screen for setting up your cluster.
  • Click Build my first cluster to get started.
  • They walk you through the getting started steps with popup bubbles that describe and point you to each task you need to complete.
  • They provide default settings for their free Atlas cluster (M0 as they refer to it), so just give your cluster a name in the Cluster Name section, then click Create Cluster.

17.4.1 Creating the MongoDB Atlas Cluster (cont.)

  • They’ll take you to the Clusters page and begin creating your new cluster, which takes several minutes.
  • Next, a Connect to Atlas popup tutorial will appear, showing a checklist of additional steps required to get you up and running:
    • Create your first database user—This enables you to log into your cluster.
    • Whitelist your IP address—This is a security measure which ensures that only IP addresses you verify are allowed to interact with your cluster. To connect to this cluster from multiple locations (school, home, work, etc.), you’ll need to whitelist each IP address from which you intend to connect.
    • Connect to your cluster—In this step, you’ll locate your cluster’s connection string, which will enable your Python code to connect to the server.

Creating Your First Database User

  • In the popup tutorial window, click Create your first database user to continue the tutorial, then follow the on-screen prompts to view the cluster’s Security tab and click + ADD NEW USER.
  • In the Add New User dialog, create a username and password.
  • Write these down—you’ll need them momentarily.
  • Click Add User to return to the Connect to Atlas popup tutorial.

Whitelist Your IP Address

  • In the popup tutorial window, click Whitelist your IP address to continue the tutorial, then follow the on-screen prompts to view the cluster’s IP Whitelist and click + ADD IP ADDRESS.
  • In the Add Whitelist Entry dialog, you can either add your computer’s current IP address or allow access from anywhere. They do not recommend the latter for production databases, but it's fine for learning purposes.
  • Click ALLOW ACCESS FROM ANYWHERE then click Confirm to return to the Connect to Atlas popup tutorial.

Connect to Your Cluster

  • In the popup tutorial window, click Connect to your cluster to continue the tutorial, then follow the on-screen prompts to view the cluster’s Connect to YourClusterName dialog.
  • Connecting to a MongoDB Atlas database from Python requires a connection string.
  • To get your connection string, click Connect Your Application, then click Short SRV connection string.
  • Your connection string will appear below Copy the SRV address.
  • Click COPY to copy the string.
  • Paste this string into the keys.py file as mongo_connection_string’s value.
  • Replace "<PASSWORD>" in the connection string with your password, and replace the database name "test" with "senators", which will be the database name in this example.
  • At the bottom of the Connect to YourClusterName dialog, click Close.
  • You’re now ready to interact with your Atlas cluster.
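As an illustration of the edits described above, here is a hypothetical SRV connection string (the user name and cluster host are invented) with the password placeholder and the default database name replaced:

```python
# connection string as copied from Atlas (user and host are hypothetical)
original = ('mongodb+srv://myuser:<PASSWORD>'
            '@yourcluster-abcde.mongodb.net/test?retryWrites=true')

# substitute your actual password, then change the database name to 'senators'
edited = original.replace('<PASSWORD>', 'MySecretPassword')
edited = edited.replace('/test?', '/senators?')
```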

17.4.2 Streaming Tweets into MongoDB

Use Tweepy to Authenticate with Twitter and Get the API Object

In [2]:
import tweepy, keys
In [3]:
auth = tweepy.OAuthHandler(
    keys.consumer_key, keys.consumer_secret)
auth.set_access_token(keys.access_token, 
    keys.access_token_secret)
In [4]:
api = tweepy.API(auth, wait_on_rate_limit=True, 
                 wait_on_rate_limit_notify=True)               

Loading the Senators’ Data

  • senators.csv (provided in notebook's folder) contains each senator's
    • two-letter state code
    • name
    • party
    • Twitter handle
    • Twitter ID
  • Twitter handle and ID used to track tweets to, from and mentioning each senator
  • When following users via numeric Twitter IDs, must submit IDs as strings
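A toy illustration of that conversion, using two of the IDs from senators.csv (pandas infers int64 for the numeric column, but Tweepy's follow parameter expects strings):

```python
import pandas as pd

# pandas infers int64 for a numeric column, but Tweepy expects string IDs
df = pd.DataFrame({'TwitterID': [21111098, 2891210047]})
df['TwitterID'] = df['TwitterID'].astype(str)
print(df['TwitterID'].tolist())  # ['21111098', '2891210047']
```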

Loading the Senators’ Data (cont.)

In [5]:
import pandas as pd
In [6]:
senators_df = pd.read_csv('senators.csv')
In [7]:
senators_df['TwitterID'] = senators_df['TwitterID'].astype(str)
In [8]:
senators_df.head()
Out[8]:
State Name Party TwitterHandle TwitterID
0 AL Richard Shelby R SenShelby 21111098
1 AL Doug Jones D SenDougJones 941080085121175552
2 AK Lisa Murkowski R lisamurkowski 18061669
3 AK Dan Sullivan R SenDanSullivan 2891210047
4 AZ Martha McSally R SenMcSallyAZ 2964949642

Configuring the MongoClient

In [9]:
from pymongo import MongoClient
In [10]:
atlas_client = MongoClient(keys.mongo_connection_string)
  • Get pymongo Database object representing the senators database
  • Creates the database if it does not exist
  • Will be used to store the collection of tweet JSON documents
In [11]:
db = atlas_client.senators 

Setting up Tweet Stream

  • TweetListener uses the db object representing the senators database to store tweets
    • Depending on the rate at which people are tweeting about the senators, it may take minutes to hours to get 10,000 tweets
In [12]:
from tweetlistener import TweetListener
In [13]:
tweet_limit = 10000  
In [14]:
twitter_stream = tweepy.Stream(api.auth, 
    TweetListener(api, db, tweet_limit)) 

Starting the Live Tweet Stream

  • Currently, can track up to 400 keywords and follow up to 5,000 Twitter IDs at a time
    • track senators’ Twitter handles as keywords
    • follow their IDs
  • Together, this will get tweets from, to and about each senator
In [15]:
twitter_stream.filter(track=senators_df.TwitterHandle.tolist(),
    follow=senators_df.TwitterID.tolist())
    Screen name: Kathy Buckley
     Created at: Mon Jul 29 15:28:42 +0000 2019
Tweets received: 10000

Class TweetListener

  • For this example, we slightly modified class TweetListener from the “Data Mining Twitter” chapter.
  • Much of the Twitter and Tweepy code shown below is identical to the code you saw previously, so we’ll focus on only the new concepts here:
# tweetlistener.py
"""TweetListener downloads tweets and stores them in MongoDB."""
import json
import tweepy
from IPython.display import clear_output

class TweetListener(tweepy.StreamListener):
    """Handles incoming Tweet stream."""
    def __init__(self, api, database, limit=10000):
        """Create instance variables for tracking number of tweets."""
        self.db = database
        self.tweet_count = 0
        self.TWEET_LIMIT = limit  # 10,000 by default
        super().__init__(api)  # call superclass's init

    def on_connect(self):
        """Called when your connection attempt is successful, enabling 
        you to perform appropriate application tasks at that point."""
        print('Successfully connected to Twitter\n')
    def on_data(self, data):
        """Called when Twitter pushes a new tweet to you."""
        self.tweet_count += 1  # track number of tweets processed
        json_data = json.loads(data)  # convert string to JSON
        self.db.tweets.insert_one(json_data)  # store in tweets collection
        clear_output()  # ADDED: show one tweet at a time in Jupyter Notebook
        print(f'    Screen name: {json_data["user"]["name"]}') 
        print(f'     Created at: {json_data["created_at"]}')         
        print(f'Tweets received: {self.tweet_count}')         

        # if TWEET_LIMIT is reached, return False to terminate streaming
        return self.tweet_count < self.TWEET_LIMIT

    def on_error(self, status):
        print(status)
        return True

Class TweetListener (cont.)

  • Previously, TweetListener overrode method on_status to receive Tweepy Status objects representing tweets
  • Here, we override the on_data method instead
  • Rather than Status objects, on_data receives each tweet object’s raw JSON
  • We convert the JSON string received by on_data into a Python JSON object
  • Each MongoDB database contains one or more Collections of documents
  • The following expression accesses the Database object db’s tweets Collection, creating it if it does not already exist
    self.db.tweets
    
  • We use the tweets Collection’s insert_one method to store the JSON object in the tweets collection
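A minimal illustration of the conversion step, using an abbreviated, invented tweet JSON string:

```python
import json

# raw JSON text as Twitter would push it to on_data (contents abbreviated)
data = ('{"created_at": "Mon Jul 29 15:28:42 +0000 2019", '
        '"user": {"name": "Kathy Buckley"}}')

json_data = json.loads(data)  # convert the JSON string to a Python dict
# json_data is what self.db.tweets.insert_one stores as one document
print(json_data['user']['name'])  # Kathy Buckley
```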

Counting Tweets for Each Senator

  • MongoDB text search requires a text index specifying document field(s) to search
  • A text index is defined as a tuple containing field name to search and index type ('text')
  • Wildcard field name ($**) indexes all text fields for a full-text search
In [16]:
db.tweets.create_index([('$**', 'text')])
Out[16]:
'$**_text'

Counting Tweets for Each Senator (cont.)

  • Use tweets Collection’s count_documents method and full-text search to count the total number of documents in the collection that contain the specified text
    • For each Twitter handle in the senators_df.TwitterHandle column
    • {"$text": {"$search": senator}} indicates that we’re using the text index to search for the value of senator
In [17]:
tweet_counts = []
In [18]:
for senator in senators_df.TwitterHandle:
    tweet_counts.append(db.tweets.count_documents(
        {"$text": {"$search": senator}}))

Show Tweet Counts for Each Senator

  • Create a copy of DataFrame senators_df, adding a new Tweets column containing tweet_counts
  • Display the top-10 senators by tweet count
In [19]:
tweet_counts_df = senators_df.assign(Tweets=tweet_counts)  
In [20]:
tweet_counts_df.sort_values(by='Tweets', ascending=False).head(10)
Out[20]:
State Name Party TwitterHandle TwitterID Tweets
84 TX John Cornyn R JohnCornyn 13218102 1632
33 KY Rand Paul R RandPaul 216881337 1509
12 CT Christopher Murphy D ChrisMurphyCT 150078976 1334
32 KY Mitch McConnell R SenateMajLdr 1249982359 1302
62 NY Chuck Schumer D SenSchumer 17494010 1078
17 FL Marco Rubio R marcorubio 15745368 440
16 FL Rick Scott R SenRickScott 131546062 411
78 SC Lindsey Graham R LindseyGrahamSC 432895323 372
58 NJ Cory Booker D CoryBooker 15808765 363
9 CA Kamala Harris D SenKamalaHarris 803694179079458816 336

Get the State Locations for Plotting Markers

  • Get each state’s latitude and longitude coordinates for plotting on a map
  • state_codes.py contains a dictionary that maps two-letter state codes to full state names
    • Used with geopy to look up the location of each state
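The dictionary's shape might look like this excerpt (only a few of the 50 entries shown):

```python
# state_codes.py -- maps two-letter state codes to full state names (excerpt)
state_codes = {
    'AK': 'Alaska',
    'AL': 'Alabama',
    'AZ': 'Arizona',
    'WY': 'Wyoming',
    # ...entries for the remaining states
}
```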
In [21]:
from geopy import OpenMapQuest
In [22]:
import time
In [23]:
from state_codes import state_codes
  • Get the geocoder object to translate location names into Location objects
In [24]:
geo = OpenMapQuest(api_key=keys.mapquest_key) 

Get the State Locations for Plotting Markers (cont.)

  • Get and sort the unique state names
In [25]:
states = tweet_counts_df.State.unique()  # get unique state names
In [26]:
states.sort() 

Get the State Locations for Plotting Markers (cont.)

  • Look up each state’s location
  • Call geocode with state name followed by ', USA'
    • Ensures that we get United States locations
In [27]:
locations = []
In [28]:
from IPython.display import clear_output

for state in states:
    processed = False
    delay = .1 
    while not processed:
        try: 
            locations.append(geo.geocode(state_codes[state] + ', USA'))
            clear_output()  # clear cell's current output before showing next one
            print(locations[-1])  
            processed = True
        except Exception:  # timed out, so wait before trying again
            print('OpenMapQuest service timed out. Waiting.')
            time.sleep(delay)
            delay += .1
Wyoming, United States of America

Grouping the Tweet Counts by State

  • Tweet total for each state's two senators is used to color the map
    • Darker colors represent higher tweet counts
  • DataFrame method groupby to group the senators by state
    • as_index=False—state codes should be a column in the resulting DataFrame, rather than its row index
  • GroupBy object's sum method totals the numeric data by state
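The grouping can be demonstrated on a toy DataFrame (the per-senator tweet splits are invented; the state totals match the output shown below):

```python
import pandas as pd

# toy data: two senators per state; the per-senator splits are invented
df = pd.DataFrame({'State': ['AK', 'AK', 'AL', 'AL'],
                   'Tweets': [10, 18, 7, 8]})

# as_index=False keeps State as a regular column in the result
by_state = df.groupby('State', as_index=False).sum()
print(by_state.values.tolist())  # [['AK', 28], ['AL', 15]]
```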
In [29]:
tweets_counts_by_state = tweet_counts_df.groupby(
    'State', as_index=False).sum()
In [30]:
tweets_counts_by_state.head()
Out[30]:
State Tweets
0 AK 28
1 AL 15
2 AR 13
3 AZ 31
4 CA 397

Creating the Map

In [31]:
import folium
In [32]:
usmap = folium.Map(location=[39.8283, -98.5795], 
                   zoom_start=4, detect_retina=True,
                   tiles='Stamen Toner')

Creating a Choropleth to Color the Map

In [33]:
choropleth = folium.Choropleth(
    geo_data='us-states.json',
    name='choropleth',
    data=tweets_counts_by_state,
    columns=['State', 'Tweets'],
    key_on='feature.id',
    fill_color='YlOrRd',
    fill_opacity=0.7,
    line_opacity=0.2,
    legend_name='Tweets by State'
).add_to(usmap)

layer = folium.LayerControl().add_to(usmap)

Creating a Choropleth to Color the Map (cont.)

  • geo_data='us-states.json'—This is the file containing the GeoJSON that specifies the shapes to color.
  • name='choropleth'—Folium displays the Choropleth as a layer over the map. This is that layer's name, which appears in the map’s layer controls; these controls enable you to hide and show layers and appear when you click the layers icon on the map.
  • data=tweets_counts_by_state—This is a pandas DataFrame (or Series) containing the values that determine the Choropleth colors

Creating a Choropleth to Color the Map (cont.)

  • columns=['State', 'Tweets']—When the data is a DataFrame, this is a list of two columns representing the keys and the corresponding values used to color the Choropleth.
  • key_on='feature.id'—This is a variable in the GeoJSON file to which the Choropleth binds the values in the columns argument
  • fill_color='YlOrRd'—This is a color map specifying the colors to use to fill in the states. Folium provides 12 colormaps: 'BuGn', 'BuPu', 'GnBu', 'OrRd', 'PuBu', 'PuBuGn', 'PuRd', 'RdPu', 'YlGn', 'YlGnBu', 'YlOrBr' and 'YlOrRd'. You should experiment with these to find the most effective and eye-pleasing ones for your application(s).

Creating a Choropleth to Color the Map (cont.)

  • fill_opacity=0.7—A value from 0.0 (transparent) to 1.0 (opaque) specifying the opacity of the fill colors displayed in the states.
  • line_opacity=0.2—A value from 0.0 (transparent) to 1.0 (opaque) specifying the opacity of the lines used to delineate the states.
  • legend_name='Tweets by State'—At the top of the map, the Choropleth displays a color bar (the legend) indicating the value range represented by the colors. This legend_name text appears below the color bar to indicate what the colors represent.
  • Complete list of Choropleth keyword arguments

Creating the Map Markers for Each State

  • Sort senators in descending order by tweet count
  • groupby maintains original row order in each group
  • index — used to look up each state’s location in locations list
  • name — two-letter state code
  • group — collection of a state's two senators
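The sort-then-group behavior can be seen on a toy DataFrame (names and counts invented): within each group, rows keep the descending-Tweets order established by sort_values.

```python
import pandas as pd

# toy data with invented Tweets values
df = pd.DataFrame({'State': ['AK', 'AL', 'AK', 'AL'],
                   'Name': ['A', 'B', 'C', 'D'],
                   'Tweets': [5, 9, 12, 3]})

sorted_df = df.sort_values(by='Tweets', ascending=False)

# groupby preserves the sorted row order within each group
for name, group in sorted_df.groupby('State'):
    print(name, group['Name'].tolist())
# AK ['C', 'A']
# AL ['B', 'D']
```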
In [34]:
sorted_df = tweet_counts_df.sort_values(by='Tweets', ascending=False)

for index, (name, group) in enumerate(sorted_df.groupby('State')):
    strings = [state_codes[name]]  # used to assemble popup text

    for s in group.itertuples():
        strings.append(f'{s.Name} ({s.Party}); Tweets: {s.Tweets}')
        
    text = '<br>'.join(strings)  
    popup = folium.Popup(text, max_width=200)
    marker = folium.Marker(
        (locations[index].latitude, locations[index].longitude), 
        popup=popup)
    marker.add_to(usmap) 

Saving and Displaying the Map

  • Instructor note: You also can evaluate the usmap object in a code cell to display the map in a notebook.
In [35]:
usmap.save('SenatorsTweets.html')
In [36]:
from IPython.display import IFrame
IFrame(src="./SenatorsTweets.html", width=800, height=450)
Out[36]:

©1992–2020 by Pearson Education, Inc. All Rights Reserved. This content is based on Chapter 5 of the book Intro to Python for Computer Science and Data Science: Learning to Program with AI, Big Data and the Cloud.

DISCLAIMER: The authors and publisher of this book have used their best efforts in preparing the book. These efforts include the development, research, and testing of the theories and programs to determine their effectiveness. The authors and publisher make no warranty of any kind, expressed or implied, with regard to these programs or to the documentation contained in these books. The authors and publisher shall not be liable in any event for incidental or consequential damages in connection with, or arising out of, the furnishing, performance, or use of these programs.