# enable high-res images in notebook
%config InlineBackend.figure_format = 'retina'
%matplotlib inline
conda install -c conda-forge pymongo
conda install -c conda-forge dnspython
pymongo library — interact with MongoDB databases from Python
dnspython library — used as part of connecting to a MongoDB Atlas cluster
keys.py — must contain your MongoDB Atlas cluster's connection string as mongo_connection_string's value. Replace "<PASSWORD>" in the connection string with your password, and replace the database name "test" with "senators", which will be the database name in this example.

import tweepy, keys
auth = tweepy.OAuthHandler(
    keys.consumer_key, keys.consumer_secret)
auth.set_access_token(keys.access_token,
    keys.access_token_secret)
api = tweepy.API(auth, wait_on_rate_limit=True,
    wait_on_rate_limit_notify=True)
senators.csv (provided in the notebook's folder) contains each senator's state, name, party, Twitter handle and Twitter ID:

import pandas as pd
senators_df = pd.read_csv('senators.csv')
senators_df['TwitterID'] = senators_df['TwitterID'].astype(str)
senators_df.head()
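The astype(str) call matters because pandas' read_csv infers the TwitterID column as integers, while the streaming API's follow parameter expects string IDs. A minimal sketch with hypothetical rows (the states, names, handles and IDs below are made up, not real senators):

```python
import io
import pandas as pd

# Hypothetical CSV in the same shape as senators.csv (made-up rows)
csv_text = io.StringIO(
    'State,Name,Party,TwitterHandle,TwitterID\n'
    'AA,Alice Example,I,SenAlice,123456789\n'
    'BB,Bob Example,D,SenBob,987654321\n')

df = pd.read_csv(csv_text)
print(df['TwitterID'].dtype)  # int64: pandas inferred integers

df['TwitterID'] = df['TwitterID'].astype(str)
print(df['TwitterID'].dtype)  # object: the IDs are now strings
```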
MongoClient

from pymongo import MongoClient
atlas_client = MongoClient(keys.mongo_connection_string)

Get a pymongo Database object representing the senators database:

db = atlas_client.senators
TweetListener uses the db object representing the senators database to store tweets:

from tweetlistener import TweetListener
tweet_limit = 10000
twitter_stream = tweepy.Stream(api.auth,
    TweetListener(api, db, tweet_limit))

track — senators' Twitter handles as keywords
follow — their Twitter IDs

twitter_stream.filter(track=senators_df.TwitterHandle.tolist(),
    follow=senators_df.TwitterID.tolist())
Class TweetListener

Our TweetListener is a modified version of class TweetListener from the "Data Mining Twitter" chapter.

# tweetlistener.py
"""TweetListener downloads tweets and stores them in MongoDB."""
import json
import tweepy
from IPython.display import clear_output
class TweetListener(tweepy.StreamListener):
    """Handles incoming Tweet stream."""

    def __init__(self, api, database, limit=10000):
        """Create instance variables for tracking number of tweets."""
        self.db = database
        self.tweet_count = 0
        self.TWEET_LIMIT = limit  # 10,000 by default
        super().__init__(api)  # call superclass's init

    def on_connect(self):
        """Called when your connection attempt is successful, enabling
        you to perform appropriate application tasks at that point."""
        print('Successfully connected to Twitter\n')

    def on_data(self, data):
        """Called when Twitter pushes a new tweet to you."""
        self.tweet_count += 1  # track number of tweets processed
        json_data = json.loads(data)  # convert string to JSON
        self.db.tweets.insert_one(json_data)  # store in tweets collection
        clear_output()  # ADDED: show one tweet at a time in Jupyter Notebook
        print(f'    Screen name: {json_data["user"]["name"]}')
        print(f'     Created at: {json_data["created_at"]}')
        print(f'Tweets received: {self.tweet_count}')

        # if TWEET_LIMIT is reached, return False to terminate streaming
        return self.tweet_count < self.TWEET_LIMIT

    def on_error(self, status):
        print(status)
        return True
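The listener's counting and termination logic can be exercised without Twitter or MongoDB. Below is a minimal sketch in which a hypothetical FakeCollection stands in for db.tweets; only the on_data bookkeeping from the class above is reproduced:

```python
import json

class FakeCollection:
    """Hypothetical stand-in for db.tweets; just remembers documents."""
    def __init__(self):
        self.docs = []

    def insert_one(self, doc):
        self.docs.append(doc)

class MiniListener:
    """Same counting/termination logic as TweetListener.on_data,
    minus the tweepy and IPython dependencies."""
    def __init__(self, collection, limit=10000):
        self.collection = collection
        self.tweet_count = 0
        self.TWEET_LIMIT = limit

    def on_data(self, data):
        self.tweet_count += 1
        self.collection.insert_one(json.loads(data))
        return self.tweet_count < self.TWEET_LIMIT  # False ends the stream

col = FakeCollection()
listener = MiniListener(col, limit=2)
print(listener.on_data('{"text": "tweet 1"}'))  # True: keep streaming
print(listener.on_data('{"text": "tweet 2"}'))  # False: limit reached
```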
TweetListener (cont.)

The "Data Mining Twitter" chapter's TweetListener overrode method on_status to receive Tweepy Status objects representing tweets. Here, we override the on_data method instead. Rather than Status objects, on_data receives each tweet object's raw JSON, which json.loads converts into a Python dictionary.

MongoDB databases store collections of documents. Referring to self.db.tweets accesses the Database object db's tweets Collection, creating it if it does not already exist. We then call the tweets Collection's insert_one method to store the JSON object in the tweets collection.

To support full-text search of the stored tweets, create a text index ('text') on the collection; the wildcard '$**' indexes every document field:

db.tweets.create_index([('$**', 'text')])

Use the tweets Collection's count_documents method and full-text search to count the total number of documents in the collection that contain the specified text. The senators_df.TwitterHandle column contains each senator's Twitter handle. The query {"$text": {"$search": senator}} indicates that we're using the text index to search for the value of senator:
tweet_counts = []

for senator in senators_df.TwitterHandle:
    tweet_counts.append(db.tweets.count_documents(
        {"$text": {"$search": senator}}))
Create a new DataFrame by adding to senators_df a Tweets column containing the tweet_counts:
tweet_counts_df = senators_df.assign(Tweets=tweet_counts)
tweet_counts_df.sort_values(by='Tweets', ascending=False).head(10)
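Note that assign returns a new DataFrame rather than modifying senators_df in place. A small sketch on toy data (the names and counts below are hypothetical):

```python
import pandas as pd

# Hypothetical senators and per-senator tweet counts
df = pd.DataFrame({'Name': ['A', 'B', 'C'],
                   'State': ['TX', 'TX', 'VT']})
counts = [5, 12, 7]

# assign returns a NEW DataFrame with the extra Tweets column;
# df itself is unchanged
with_counts = df.assign(Tweets=counts)
top = with_counts.sort_values(by='Tweets', ascending=False)

print('Tweets' in df.columns)  # False: original untouched
print(top.iloc[0]['Name'])     # B: the most-tweeted-about senator
```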
state_codes.py contains a dictionary that maps two-letter state codes to full state names. We use geopy to look up the location of each state:

from geopy import OpenMapQuest
import time
from state_codes import state_codes

Create the geocoder object that translates location names into Location objects:

geo = OpenMapQuest(api_key=keys.mapquest_key)
states = tweet_counts_df.State.unique()  # get unique state names
states.sort()

Call geocode with each state's name followed by ', USA':
locations = []
from IPython.display import clear_output

for state in states:
    processed = False
    delay = .1
    while not processed:
        try:
            locations.append(geo.geocode(state_codes[state] + ', USA'))
            clear_output()  # clear cell's current output before showing next one
            print(locations[-1])
            processed = True
        except:  # timed out, so wait before trying again
            print('OpenMapQuest service timed out. Waiting.')
            time.sleep(delay)
            delay += .1
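The retry pattern above (try, catch a timeout, sleep, then retry with a longer delay) can be sketched without calling OpenMapQuest, using a hypothetical geocode_stub that "times out" twice before succeeding:

```python
calls = {'n': 0}

def geocode_stub(query):
    """Hypothetical stand-in for geo.geocode: fails twice, then succeeds."""
    calls['n'] += 1
    if calls['n'] <= 2:
        raise TimeoutError('service timed out')
    return f'Location({query})'

delay = .1
processed = False
result = None
while not processed:
    try:
        result = geocode_stub('Vermont, USA')
        processed = True
    except TimeoutError:  # wait, then retry with a longer delay
        # time.sleep(delay) omitted so the demo runs instantly
        delay += .1

print(result)      # Location(Vermont, USA)
print(calls['n'])  # 3: two timeouts, then success
```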
Use DataFrame method groupby to group the senators by state. The keyword argument as_index=False indicates that the state codes should be a column in the resulting DataFrame, rather than the indices for its rows. The GroupBy object's sum method totals the numeric data by state:

tweets_counts_by_state = tweet_counts_df.groupby(
    'State', as_index=False).sum()
tweets_counts_by_state.head()
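A tiny demonstration of groupby with as_index=False, using hypothetical counts:

```python
import pandas as pd

# Toy stand-in: two Texas rows and one Vermont row (made-up counts)
df = pd.DataFrame({'State': ['TX', 'TX', 'VT'],
                   'Tweets': [5, 12, 7]})

# as_index=False keeps 'State' as an ordinary column in the result,
# rather than making it the row index
by_state = df.groupby('State', as_index=False).sum()

print(len(by_state))  # 2: one row per state
print(by_state.loc[by_state.State == 'TX', 'Tweets'].iloc[0])  # 17
```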
import folium
usmap = folium.Map(location=[39.8283, -98.5795],
    zoom_start=4, detect_retina=True,
    tiles='Stamen Toner')

choropleth = folium.Choropleth(
    geo_data='us-states.json',
    name='choropleth',
    data=tweets_counts_by_state,
    columns=['State', 'Tweets'],
    key_on='feature.id',
    fill_color='YlOrRd',
    fill_opacity=0.7,
    line_opacity=0.2,
    legend_name='Tweets by State'
).add_to(usmap)

layer = folium.LayerControl().add_to(usmap)
Choropleth keyword arguments

geo_data='us-states.json' — the file containing the GeoJSON that specifies the shapes to color.
name='choropleth' — Folium displays the Choropleth as a layer over the map. This is the name for that layer in the map's layer controls, which enable you to hide and show the layers. These controls appear when you click the layers icon on the map.
data=tweets_counts_by_state — a pandas DataFrame (or Series) containing the values that determine the Choropleth colors.
columns=['State', 'Tweets'] — when the data is a DataFrame, this is a list of two columns representing the keys and the corresponding values used to color the Choropleth.
key_on='feature.id' — the variable in the GeoJSON file to which the Choropleth binds the values in the columns argument.
fill_color='YlOrRd' — a colormap specifying the colors to use to fill in the states. Folium provides 12 colormaps: 'BuGn', 'BuPu', 'GnBu', 'OrRd', 'PuBu', 'PuBuGn', 'PuRd', 'RdPu', 'YlGn', 'YlGnBu', 'YlOrBr' and 'YlOrRd'. You should experiment with these to find the most effective and eye-pleasing ones for your application(s).
fill_opacity=0.7 — a value from 0.0 (transparent) to 1.0 (opaque) specifying the transparency of the fill colors displayed in the states.
line_opacity=0.2 — a value from 0.0 (transparent) to 1.0 (opaque) specifying the transparency of the lines used to delineate the states.
legend_name='Tweets by State' — at the top of the map, the Choropleth displays a color bar (the legend) indicating the value range represented by the colors. This legend_name text appears below the color bar to indicate what the colors represent.

Create a map marker for each state's senators:

groupby maintains the original row order in each group
index — used to look up each state's location in the locations list
group — the collection of a state's two senators
— collection of a state's two senatorssorted_df = tweet_counts_df.sort_values(by='Tweets', ascending=False)
for index, (name, group) in enumerate(sorted_df.groupby('State')):
strings = [state_codes[name]] # used to assemble popup text
for s in group.itertuples():
strings.append(f'{s.Name} ({s.Party}); Tweets: {s.Tweets}')
text = '<br>'.join(strings)
popup = folium.Popup(text, max_width=200)
marker = folium.Marker(
(locations[index].latitude, locations[index].longitude),
popup=popup)
marker.add_to(usmap)
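The popup-assembly portion of the loop above can be run without folium or geocoded locations. A sketch with hypothetical senator rows:

```python
import pandas as pd

# Toy rows in the shape the marker loop expects (made-up senators)
df = pd.DataFrame({
    'State': ['VT', 'VT'],
    'Name': ['Senator A', 'Senator B'],
    'Party': ['I', 'D'],
    'Tweets': [7, 5]})

state_names = {'VT': 'Vermont'}  # stand-in for the state_codes dictionary

for name, group in df.groupby('State'):
    strings = [state_names[name]]  # first line of the popup text
    for s in group.itertuples():
        strings.append(f'{s.Name} ({s.Party}); Tweets: {s.Tweets}')
    text = '<br>'.join(strings)    # folium.Popup renders this as HTML

print(text)
# Vermont<br>Senator A (I); Tweets: 7<br>Senator B (D); Tweets: 5
```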
Evaluate the usmap object in a code cell to display the map in the notebook. You also can save the map to an HTML file and display it in an IFrame:

usmap.save('SenatorsTweets.html')
from IPython.display import IFrame
IFrame(src="./SenatorsTweets.html", width=800, height=450)
©1992–2020 by Pearson Education, Inc. All Rights Reserved. This content is based on Chapter 5 of the book Intro to Python for Computer Science and Data Science: Learning to Program with AI, Big Data and the Cloud.
DISCLAIMER: The authors and publisher of this book have used their best efforts in preparing the book. These efforts include the development, research, and testing of the theories and programs to determine their effectiveness. The authors and publisher make no warranty of any kind, expressed or implied, with regard to these programs or to the documentation contained in these books. The authors and publisher shall not be liable in any event for incidental or consequential damages in connection with, or arising out of, the furnishing, performance, or use of these programs.