• Instructor Notes:
    • Before running this notebook's code, add your own keys into the keys.py file
    • The app is set up to collect 10,000 tweets. You can reduce that number so the notebook executes during class time, but the map colors better when there are significant differences in tweet counts per senator, so you might want to execute this the night before class.
    • We've provided the fully executed notebook, which has a map generated on August 25, 2019.
    • Every time you execute this, first go to mongodb.com and empty your database if you're using their free Atlas cluster, as it provides only 512MB of storage.
    • We updated senators.csv as of 2019.
    • We modified class TweetListener to clear its output every time a new tweet arrives, so you won't see 10,000 tweets appear inline in the notebook.
In [1]:
# enable high-res images in notebook 
%config InlineBackend.figure_format = 'retina'
%matplotlib inline

17.4 Case Study: A MongoDB JSON Document Database

  • Store and search JSON for 10,000 streamed tweets about 100 U.S. senators
  • Summarize top 10 by tweet count
  • Display interactive map containing tweet count summaries
  • Collecting 10,000 tweets can take substantial time
  • Possible enhancement — Use sentiment analysis to count positive, negative and neutral tweets mentioning each senator’s handle

Free Cloud-Based MongoDB Atlas Cluster

Python Libraries Required for Interacting with MongoDB

conda install -c conda-forge pymongo
conda install -c conda-forge dnspython
  • pymongo library — interact with MongoDB databases from Python
  • dnspython library — used as part of connecting to a MongoDB Atlas Cluster

keys.py

  • keys.py must contain
    • your Twitter credentials
    • your OpenMapQuest key
    • your MongoDB connection string
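A minimal sketch of what keys.py might contain. Every value below is a placeholder; the variable names match those the notebook's code cells reference (keys.consumer_key, keys.mapquest_key, keys.mongo_connection_string, and so on):

```python
# keys.py -- placeholder credentials; replace each value with your own
consumer_key = 'YourConsumerKey'
consumer_secret = 'YourConsumerSecret'
access_token = 'YourAccessToken'
access_token_secret = 'YourAccessTokenSecret'
mapquest_key = 'YourOpenMapQuestKey'
mongo_connection_string = 'YourMongoDBConnectionString'
```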

17.4.1 Creating the MongoDB Atlas Cluster

  • To sign up for a free account, go to the following address, enter your email address and click Get started free

    https://mongodb.com

  • On the next page, enter your name and create a password, then read their terms of service.
  • If you agree, click Get started free on this page and you’ll be taken to the screen for setting up your cluster.
  • Click Build my first cluster to get started.
  • They walk you through the getting started steps with popup bubbles that describe and point you to each task you need to complete.
  • They provide default settings for their free Atlas cluster (M0 as they refer to it), so just give your cluster a name in the Cluster Name section, then click Create Cluster.

17.4.1 Creating the MongoDB Atlas Cluster (cont.)

  • They’ll take you to the Clusters page and begin creating your new cluster, which takes several minutes.
  • Next, a Connect to Atlas popup tutorial will appear, showing a checklist of additional steps required to get you up and running:
    • Create your first database user—This enables you to log into your cluster.
    • Whitelist your IP address—This is a security measure which ensures that only IP addresses you verify are allowed to interact with your cluster. To connect to this cluster from multiple locations (school, home, work, etc.), you’ll need to whitelist each IP address from which you intend to connect.
    • Connect to your cluster—In this step, you’ll locate your cluster’s connection string, which will enable your Python code to connect to the server.

Creating Your First Database User

  • In the popup tutorial window, click Create your first database user to continue the tutorial, then follow the on-screen prompts to view the cluster’s Security tab and click + ADD NEW USER.
  • In the Add New User dialog, create a username and password.
  • Write these down—you’ll need them momentarily.
  • Click Add User to return to the Connect to Atlas popup tutorial.

Whitelist Your IP Address

  • In the popup tutorial window, click Whitelist your IP address to continue the tutorial, then follow the on-screen prompts to view the cluster’s IP Whitelist and click + ADD IP ADDRESS.
  • In the Add Whitelist Entry dialog, you can either add your computer’s current IP address or allow access from anywhere. They do not recommend the latter for production databases, but it's fine for learning purposes.
  • Click ALLOW ACCESS FROM ANYWHERE then click Confirm to return to the Connect to Atlas popup tutorial.

Connect to Your Cluster

  • In the popup tutorial window, click Connect to your cluster to continue the tutorial, then follow the on-screen prompts to view the cluster’s Connect to YourClusterName dialog.
  • Connecting to a MongoDB Atlas database from Python requires a connection string.
  • To get your connection string, click Connect Your Application, then click Short SRV connection string.
  • Your connection string will appear below Copy the SRV address.
  • Click COPY to copy the string.
  • Paste this string into the keys.py file as mongo_connection_string’s value.
  • Replace "<PASSWORD>" in the connection string with your password, and replace the database name "test" with "senators", which will be the database name in this example.
  • At the bottom of the Connect to YourClusterName dialog, click Close.
  • You’re now ready to interact with your Atlas cluster.
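As an illustration of the edits described above, here is a hypothetical SRV connection string (the user name and cluster host are invented) with the password placeholder and the default database name replaced:

```python
# connection string as copied from Atlas (user and host are hypothetical)
original = ('mongodb+srv://myuser:<PASSWORD>'
            '@yourcluster-abcde.mongodb.net/test?retryWrites=true')

# substitute your actual password, then change the database name to 'senators'
edited = original.replace('<PASSWORD>', 'MySecretPassword')
edited = edited.replace('/test?', '/senators?')
```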

17.4.2 Streaming Tweets into MongoDB

Use Tweepy to Authenticate with Twitter and Get the API Object

In [2]:
import tweepy, keys
In [3]:
auth = tweepy.OAuthHandler(
    keys.consumer_key, keys.consumer_secret)
auth.set_access_token(keys.access_token, 
    keys.access_token_secret)
In [4]:
api = tweepy.API(auth, wait_on_rate_limit=True, 
                 wait_on_rate_limit_notify=True)               

Loading the Senators’ Data

  • senators.csv (provided in notebook's folder) contains each senator's
    • two-letter state code
    • name
    • party
    • Twitter handle
    • Twitter ID
  • Twitter handle and ID used to track tweets to, from and mentioning each senator
  • When following users via numeric Twitter IDs, must submit IDs as strings
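A toy illustration of that conversion, using two of the IDs from senators.csv (pandas infers int64 for the numeric column, but Tweepy's follow parameter expects strings):

```python
import pandas as pd

# pandas infers int64 for a numeric column, but Tweepy expects string IDs
df = pd.DataFrame({'TwitterID': [21111098, 2891210047]})
df['TwitterID'] = df['TwitterID'].astype(str)
print(df['TwitterID'].tolist())  # ['21111098', '2891210047']
```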

Loading the Senators’ Data (cont.)

In [5]:
import pandas as pd
In [6]:
senators_df = pd.read_csv('senators.csv')
In [7]:
senators_df['TwitterID'] = senators_df['TwitterID'].astype(str)
In [8]:
senators_df.head()
Out[8]:
State Name Party TwitterHandle TwitterID
0 AL Richard Shelby R SenShelby 21111098
1 AL Doug Jones D SenDougJones 941080085121175552
2 AK Lisa Murkowski R lisamurkowski 18061669
3 AK Dan Sullivan R SenDanSullivan 2891210047
4 AZ Martha McSally R SenMcSallyAZ 2964949642

Configuring the MongoClient

In [9]:
from pymongo import MongoClient
In [10]:
atlas_client = MongoClient(keys.mongo_connection_string)
  • Get pymongo Database object representing the senators database
  • Creates the database if it does not exist
  • Will be used to store the collection of tweet JSON documents
In [11]:
db = atlas_client.senators 

Setting up Tweet Stream

  • TweetListener uses the db object representing the senators database to store tweets
    • Depending on the rate at which people are tweeting about the senators, it may take minutes to hours to get 10,000 tweets
In [12]:
from tweetlistener import TweetListener
In [13]:
tweet_limit = 10000  
In [14]:
twitter_stream = tweepy.Stream(api.auth, 
    TweetListener(api, db, tweet_limit)) 

Starting the Live Tweet Stream

  • Currently, can track up to 400 keywords and follow up to 5,000 Twitter IDs at a time
    • track senators’ Twitter handles as keywords
    • follow their IDs
  • Together, this will get tweets from, to and about each senator
In [15]:
twitter_stream.filter(track=senators_df.TwitterHandle.tolist(),
    follow=senators_df.TwitterID.tolist())
    Screen name: Kathy Buckley
     Created at: Mon Jul 29 15:28:42 +0000 2019
Tweets received: 10000

Class TweetListener

  • For this example, we slightly modified class TweetListener from the “Data Mining Twitter” chapter.
  • Much of the Twitter and Tweepy code shown below is identical to the code you saw previously, so we’ll focus on only the new concepts here:
# tweetlistener.py
"""TweetListener downloads tweets and stores them in MongoDB."""
import json
import tweepy
from IPython.display import clear_output

class TweetListener(tweepy.StreamListener):
    """Handles incoming Tweet stream."""
    def __init__(self, api, database, limit=10000):
        """Create instance variables for tracking number of tweets."""
        self.db = database
        self.tweet_count = 0
        self.TWEET_LIMIT = limit  # 10,000 by default
        super().__init__(api)  # call superclass's init

    def on_connect(self):
        """Called when your connection attempt is successful, enabling 
        you to perform appropriate application tasks at that point."""
        print('Successfully connected to Twitter\n')
    def on_data(self, data):
        """Called when Twitter pushes a new tweet to you."""
        self.tweet_count += 1  # track number of tweets processed
        json_data = json.loads(data)  # convert string to JSON
        self.db.tweets.insert_one(json_data)  # store in tweets collection
        clear_output()  # ADDED: show one tweet at a time in Jupyter Notebook
        print(f'    Screen name: {json_data["user"]["name"]}') 
        print(f'     Created at: {json_data["created_at"]}')         
        print(f'Tweets received: {self.tweet_count}')         

        # if TWEET_LIMIT is reached, return False to terminate streaming
        return self.tweet_count < self.TWEET_LIMIT

    def on_error(self, status):
        print(status)
        return True

Class TweetListener (cont.)

  • Previously, TweetListener overrode method on_status to receive Tweepy Status objects representing tweets
  • Here, we override the on_data method instead
  • Rather than Status objects, on_data receives each tweet object’s raw JSON
  • We convert the JSON string received by on_data into a Python JSON object
  • Each MongoDB database contains one or more Collections of documents
  • The following expression accesses the Database object db’s tweets Collection, creating it if it does not already exist
    self.db.tweets
    
  • We use the tweets Collection’s insert_one method to store the JSON object in the tweets collection
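A minimal illustration of the conversion step, using an abbreviated, invented tweet JSON string:

```python
import json

# raw JSON text as Twitter would push it to on_data (contents abbreviated)
data = ('{"created_at": "Mon Jul 29 15:28:42 +0000 2019", '
        '"user": {"name": "Kathy Buckley"}}')

json_data = json.loads(data)  # convert the JSON string to a Python dict
# json_data is what self.db.tweets.insert_one stores as one document
print(json_data['user']['name'])  # Kathy Buckley
```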

Counting Tweets for Each Senator

  • MongoDB text search requires a text index specifying document field(s) to search
  • A text index is defined as a tuple containing field name to search and index type ('text')
  • Wildcard field name ($**) indexes all text fields for a full-text search
In [16]:
db.tweets.create_index([('$**', 'text')])
Out[16]:
'$**_text'

Counting Tweets for Each Senator (cont.)

  • Use tweets Collection’s count_documents method and full-text search to count the total number of documents in the collection that contain the specified text
    • For each Twitter handle in the senators_df.TwitterHandle column
    • {"$text": {"$search": senator}} indicates that we’re using the text index to search for the value of senator
In [17]:
tweet_counts = []
In [18]:
for senator in senators_df.TwitterHandle:
    tweet_counts.append(db.tweets.count_documents(
        {"$text": {"$search": senator}}))

Show Tweet Counts for Each Senator

  • Create a copy of DataFrame senators_df, adding a new Tweets column containing tweet_counts
  • Display the top-10 senators by tweet count
In [19]:
tweet_counts_df = senators_df.assign(Tweets=tweet_counts)  
In [20]:
tweet_counts_df.sort_values(by='Tweets', ascending=False).head(10)
Out[20]:
State Name Party TwitterHandle TwitterID Tweets
84 TX John Cornyn R JohnCornyn 13218102 1632
33 KY Rand Paul R RandPaul 216881337 1509
12 CT Christopher Murphy D ChrisMurphyCT 150078976 1334
32 KY Mitch McConnell R SenateMajLdr 1249982359 1302
62 NY Chuck Schumer D SenSchumer 17494010 1078
17 FL Marco Rubio R marcorubio 15745368 440
16 FL Rick Scott R SenRickScott 131546062 411
78 SC Lindsey Graham R LindseyGrahamSC 432895323 372
58 NJ Cory Booker D CoryBooker 15808765 363
9 CA Kamala Harris D SenKamalaHarris 803694179079458816 336

Get the State Locations for Plotting Markers

  • Get each state’s latitude and longitude coordinates for plotting on a map
  • state_codes.py contains a dictionary that maps two-letter state codes to full state names
    • Used with geopy to look up the location of each state
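The dictionary's shape might look like this excerpt (only a few of the 50 entries shown):

```python
# state_codes.py -- maps two-letter state codes to full state names (excerpt)
state_codes = {
    'AK': 'Alaska',
    'AL': 'Alabama',
    'AZ': 'Arizona',
    'WY': 'Wyoming',
    # ...entries for the remaining states
}
```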
In [21]:
from geopy import OpenMapQuest
In [22]:
import time
In [23]:
from state_codes import state_codes
  • Get the geocoder object to translate location names into Location objects
In [24]:
geo = OpenMapQuest(api_key=keys.mapquest_key) 

Get the State Locations for Plotting Markers (cont.)

  • Get and sort the unique state names
In [25]:
states = tweet_counts_df.State.unique()  # get unique state names
In [26]:
states.sort() 

Get the State Locations for Plotting Markers (cont.)

  • Look up each state’s location
  • Call geocode with state name followed by ', USA'
    • Ensures that we get United States locations
In [27]:
locations = []
In [28]:
from IPython.display import clear_output

for state in states:
    processed = False
    delay = .1 
    while not processed:
        try: 
            locations.append(geo.geocode(state_codes[state] + ', USA'))
            clear_output()  # clear cell's current output before showing next one
            print(locations[-1])  
            processed = True
        except Exception:  # timed out, so wait before trying again
            print('OpenMapQuest service timed out. Waiting.')
            time.sleep(delay)
            delay += .1
Wyoming, United States of America

Grouping the Tweet Counts by State

  • Tweet total for each state's two senators is used to color the map
    • Darker colors represent higher tweet counts
  • DataFrame method groupby to group the senators by state
    • as_index=False—state codes should be a column in the resulting DataFrame, rather than its row index
  • GroupBy object's sum method totals the numeric data by state
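The grouping can be demonstrated on a toy DataFrame (the per-senator tweet splits are invented; the state totals match the output shown below):

```python
import pandas as pd

# toy data: two senators per state; the per-senator splits are invented
df = pd.DataFrame({'State': ['AK', 'AK', 'AL', 'AL'],
                   'Tweets': [10, 18, 7, 8]})

# as_index=False keeps State as a regular column in the result
by_state = df.groupby('State', as_index=False).sum()
print(by_state.values.tolist())  # [['AK', 28], ['AL', 15]]
```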
In [29]:
tweets_counts_by_state = tweet_counts_df.groupby(
    'State', as_index=False).sum()
In [30]:
tweets_counts_by_state.head()
Out[30]:
State Tweets
0 AK 28
1 AL 15
2 AR 13
3 AZ 31
4 CA 397

Creating the Map

In [31]:
import folium
In [32]:
usmap = folium.Map(location=[39.8283, -98.5795], 
                   zoom_start=4, detect_retina=True,
                   tiles='Stamen Toner')

Creating a Choropleth to Color the Map

In [33]:
choropleth = folium.Choropleth(
    geo_data='us-states.json',
    name='choropleth',
    data=tweets_counts_by_state,
    columns=['State', 'Tweets'],
    key_on='feature.id',
    fill_color='YlOrRd',
    fill_opacity=0.7,
    line_opacity=0.2,
    legend_name='Tweets by State'
).add_to(usmap)

layer = folium.LayerControl().add_to(usmap)

Creating a Choropleth to Color the Map (cont.)

  • geo_data='us-states.json'—This is the file containing the GeoJSON that specifies the shapes to color.
  • name='choropleth'—Folium displays the Choropleth as a layer over the map. This is that layer's name, which appears in the map’s layer controls; these controls enable you to hide and show layers and appear when you click the layers icon on the map.
  • data=tweets_counts_by_state—This is a pandas DataFrame (or Series) containing the values that determine the Choropleth colors

Creating a Choropleth to Color the Map (cont.)

  • columns=['State', 'Tweets']—When the data is a DataFrame, this is a list of two columns representing the keys and the corresponding values used to color the Choropleth.
  • key_on='feature.id'—This is a variable in the GeoJSON file to which the Choropleth binds the values in the columns argument
  • fill_color='YlOrRd'—This is a color map specifying the colors to use to fill in the states. Folium provides 12 colormaps: 'BuGn', 'BuPu', 'GnBu', 'OrRd', 'PuBu', 'PuBuGn', 'PuRd', 'RdPu', 'YlGn', 'YlGnBu', 'YlOrBr' and 'YlOrRd'. You should experiment with these to find the most effective and eye-pleasing ones for your application(s).

Creating a Choropleth to Color the Map (cont.)

  • fill_opacity=0.7—A value from 0.0 (transparent) to 1.0 (opaque) specifying the opacity of the fill colors displayed in the states.
  • line_opacity=0.2—A value from 0.0 (transparent) to 1.0 (opaque) specifying the opacity of the lines used to delineate the states.
  • legend_name='Tweets by State'—At the top of the map, the Choropleth displays a color bar (the legend) indicating the value range represented by the colors. This legend_name text appears below the color bar to indicate what the colors represent.
  • Complete list of Choropleth keyword arguments

Creating the Map Markers for Each State

  • Sort senators in descending order by tweet count
  • groupby maintains original row order in each group
  • index — used to look up each state’s location in locations list
  • name — two-letter state code
  • group — collection of a state's two senators
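The sort-then-group behavior can be seen on a toy DataFrame (names and counts invented): within each group, rows keep the descending-Tweets order established by sort_values.

```python
import pandas as pd

# toy data with invented Tweets values
df = pd.DataFrame({'State': ['AK', 'AL', 'AK', 'AL'],
                   'Name': ['A', 'B', 'C', 'D'],
                   'Tweets': [5, 9, 12, 3]})

sorted_df = df.sort_values(by='Tweets', ascending=False)

# groupby preserves the sorted row order within each group
for name, group in sorted_df.groupby('State'):
    print(name, group['Name'].tolist())
# AK ['C', 'A']
# AL ['B', 'D']
```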
In [34]:
sorted_df = tweet_counts_df.sort_values(by='Tweets', ascending=False)

for index, (name, group) in enumerate(sorted_df.groupby('State')):
    strings = [state_codes[name]]  # used to assemble popup text

    for s in group.itertuples():
        strings.append(f'{s.Name} ({s.Party}); Tweets: {s.Tweets}')
        
    text = '<br>'.join(strings)  
    popup = folium.Popup(text, max_width=200)
    marker = folium.Marker(
        (locations[index].latitude, locations[index].longitude), 
        popup=popup)
    marker.add_to(usmap) 

Saving and Displaying the Map

  • Instructor note: You also can evaluate the usmap object in a code cell to display the map in a notebook.
In [35]:
usmap.save('SenatorsTweets.html')
In [36]:
from IPython.display import IFrame
IFrame(src="./SenatorsTweets.html", width=800, height=450)
Out[36]:

©1992–2020 by Pearson Education, Inc. All Rights Reserved. This content is based on Chapter 5 of the book Intro to Python for Computer Science and Data Science: Learning to Program with AI, Big Data and the Cloud.

DISCLAIMER: The authors and publisher of this book have used their best efforts in preparing the book. These efforts include the development, research, and testing of the theories and programs to determine their effectiveness. The authors and publisher make no warranty of any kind, expressed or implied, with regard to these programs or to the documentation contained in these books. The authors and publisher shall not be liable in any event for incidental or consequential damages in connection with, or arising out of, the furnishing, performance, or use of these programs.