Instructor Notes:
This example uses the jupyter/pyspark-notebook Docker stack, which is configured with Spark and PySpark.

# enable high-res images in notebook
%config InlineBackend.figure_format = 'retina'
DataFrames (similar to pandas DataFrames) enable you to manipulate RDDs as a collection of named columns. You can use DataFrames with Spark SQL to query distributed data.
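The example in these notes works directly with RDDs, but as a minimal sketch of the DataFrame/Spark SQL idea, consider the following; the SparkSession, the table name word_counts and the sample data are illustrative assumptions, not part of the original example:

from pyspark.sql import SparkSession

# hypothetical SparkSession and sample data for illustration only
spark = SparkSession.builder.appName('DataFrameSketch')\
                    .master('local[*]').getOrCreate()
df = spark.createDataFrame([('romeo', 298), ('juliet', 178)],
                           ['word', 'total'])
df.createOrReplaceTempView('word_counts')  # make df queryable via Spark SQL
spark.sql('SELECT word, total FROM word_counts ORDER BY total DESC').show()
spark.stop()  # release the session when done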
On Windows, you must allow the "Docker for Windows.exe" installer to make changes to your system to complete the installation process. To do so, click Yes when Windows asks if you want to allow the installer to make changes to your system.

We use the jupyter/pyspark-notebook Docker stack, which is preconfigured with everything you need to create and test Apache Spark apps on your computer. To run the stack, execute the following command (on Windows, replace each \ with ^):

docker run -p 8888:8888 -p 4040:4040 -it --user root \
    -v fullPathToTheFolderYouWantToUse:/home/jovyan/work \
    jupyter/pyspark-notebook:14fdfbf9cfc1 start.sh jupyter lab
In the preceding command, jupyter/pyspark-notebook:14fdfbf9cfc1 names the image to run. The tag ":14fdfbf9cfc1" indicates the specific jupyter/pyspark-notebook container to download. At the time of this writing, 14fdfbf9cfc1 was the newest version of the container. If you omit ":14fdfbf9cfc1" from the command, Docker will download the latest version of the container, which might contain different software versions and might not be compatible with the code you’re trying to execute.

Once the stack is running, you can list the running containers by executing:

docker ps
CONTAINER ID IMAGE COMMAND
CREATED STATUS PORTS
NAMES
f54f62b7e6d5 jupyter/pyspark-notebook:14fdfbf9cfc1 "tini -g --
/bin/bash" 2 minutes ago Up 2 minutes 0.0.0.0:8888->8888/tcp
friendly_pascal
The value under NAMES in the third line of the output is the name that Docker randomly assigned to the running container (friendly_pascal here); the name on your system will differ. To access the container's command line so you can install additional Python libraries, execute the following command, replacing container_name with your container's name:

docker exec -it container_name /bin/bash

Then install NLTK and TextBlob:

conda install -c conda-forge nltk textblob
Each time you execute docker run, Docker gives you a new container instance that does not contain any libraries you installed previously. For this reason, stop and restart your existing container rather than creating a new one:

docker stop your_container_name
docker restart your_container_name
Next, download the NLTK stop words and load them into a list:

import nltk
nltk.download('stopwords')   # download the stop words lists
from nltk.corpus import stopwords
stop_words = stopwords.words('english')   # load the English stop words
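As a quick sanity check, you can peek at the loaded list; the exact count and contents depend on your NLTK version, so the output shown here is illustrative:

print(len(stop_words))    # 179 in a recent NLTK version (may vary)
print(stop_words[:7])     # e.g., ['i', 'me', 'my', 'myself', 'we', 'our', 'ours']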
A SparkContext object gives you access to Spark’s capabilities. Some Spark environments create the SparkContext for you, but the Jupyter Docker stack does not, so you must create it yourself. To configure the SparkContext, first create a SparkConf object. Its setMaster method specifies the Spark cluster’s URL. The URL local[*] indicates that Spark is executing on your local computer, and the * says to use the same number of threads as there are cores on the computer:

from pyspark import SparkConf
configuration = SparkConf().setAppName('RomeoAndJulietCounter')\
                           .setMaster('local[*]')

from pyspark import SparkContext
sc = SparkContext(conf=configuration)
You interact with the SparkContext using functional-style programming applied to an RDD. The following creates an RDD representing all the words in Romeo and Juliet:

from textblob.utils import strip_punc
tokenized = sc.textFile('RomeoAndJuliet.txt')\
              .flatMap(lambda line: line.lower().split())\
              .map(lambda word: strip_punc(word, all=True))
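It may help to see why the code uses flatMap for splitting lines but map for stripping punctuation; this small sketch (with made-up sample lines, using the sc created above) shows the difference:

# flatMap flattens the per-line word lists into a single RDD of words;
# map would keep one list per line
sample = sc.parallelize(['But soft', 'what light'])  # hypothetical lines
print(sample.flatMap(lambda line: line.lower().split()).collect())
# ['but', 'soft', 'what', 'light']
print(sample.map(lambda line: line.lower().split()).collect())
# [['but', 'soft'], ['what', 'light']]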
Next, create a new RDD with no stop words remaining:

filtered = tokenized.filter(lambda word: word not in stop_words)
Next, map each word to a tuple containing the word and the count 1. Then, reduceByKey with the operator module’s add function as its argument adds the counts for tuples that contain the same key (word):

from operator import add
word_counts = filtered.map(lambda word: (word, 1)).reduceByKey(add)
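If it helps, here is reduceByKey(add) applied to a tiny hypothetical RDD of (word, 1) tuples, using the add function imported above:

pairs = sc.parallelize([('love', 1), ('night', 1), ('love', 1)])
print(pairs.reduceByKey(add).collect())
# [('love', 2), ('night', 1)] -- pair order may vary across runs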
Next, keep only the words with 60 or more occurrences:

filtered_counts = word_counts.filter(lambda item: item[1] >= 60)
When you call an RDD’s collect method, Spark gathers the distributed results back into the client so you can work with them locally. Here we sort the collected (word, count) tuples in descending order by count:

from operator import itemgetter
sorted_items = sorted(filtered_counts.collect(),
                      key=itemgetter(1), reverse=True)
Finally, determine the length of the longest word, then display each word right-aligned in a field of that width with its count:

max_len = max([len(word) for word, count in sorted_items])
for word, count in sorted_items:
    print(f'{word:>{max_len}}: {count}')
# terminate current SparkContext so we can create another for next example
sc.stop()
To see which Python libraries are already installed on your HDInsight cluster, use ssh to log into your cluster (as shown earlier) and execute the command:

/usr/bin/anaconda/envs/py35/bin/conda list

If you need additional libraries, you can install them with an HDInsight script action; for the script’s name use libraries, and for the Bash script URI use:

http://deitel.com/bookresources/IntroToPython/install_libraries.sh
Uploading RomeoAndJuliet.txt to the HDInsight Cluster

Use scp to upload RomeoAndJuliet.txt to the cluster:

scp RomeoAndJuliet.txt sshuser@YourClusterName-ssh.azurehdinsight.net:
Uploading RomeoAndJuliet.txt to the HDInsight Cluster (cont.)

Next, use ssh to log into your cluster and access its command line:

ssh sshuser@YourClusterName-ssh.azurehdinsight.net
Before you can use the RomeoAndJuliet.txt file in Spark, copy the file into the cluster’s Hadoop file system by executing the following command:

hadoop fs -copyFromLocal RomeoAndJuliet.txt /example/data/RomeoAndJuliet.txt

We use the /example/data folder that Microsoft includes for use with HDInsight tutorials.

Next, log into the cluster’s Jupyter notebooks as the user admin. You’ll see PySpark and Scala subfolders. We’ll upload the RomeoAndJulietCounter.ipynb notebook and modify it to work with Azure: navigate to the ch17 example folder’s SparkWordCount folder, select RomeoAndJulietCounter.ipynb and click Open.

In the notebook, modify the cell containing nltk.download('stopwords') as follows to store the stop words in the current folder ('.'):

nltk.download('stopwords', download_dir='.')
When you execute the notebook’s first cell, Starting Spark application appears below the cell while HDInsight sets up a SparkContext object named sc for you. After the import statement, add the following statement to tell NLTK to search for its data in the current folder:

nltk.data.path.append('.')
Because HDInsight creates the SparkContext object for you, the third and fourth cells of the original notebook are not needed, so you can delete them.

Next, modify the cell that loads RomeoAndJuliet.txt so it refers to the file’s location in the underlying Hadoop file system: replace the string 'RomeoAndJuliet.txt' with the string 'wasb:///example/data/RomeoAndJuliet.txt'. The notation wasb:/// indicates that RomeoAndJuliet.txt is stored in a Windows Azure Storage Blob (WASB), Azure’s interface to the HDFS file system.
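After that replacement, the notebook’s tokenization cell would read as follows (a sketch based on the earlier code):

tokenized = sc.textFile('wasb:///example/data/RomeoAndJuliet.txt')\
              .flatMap(lambda line: line.lower().split())\
              .map(lambda word: strip_punc(word, all=True))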
Because the cluster uses Python 3.5 (note the py35 environment in the conda path shown earlier), which does not support f-strings, replace the f-string in the final print statement with an equivalent call to the string method format:

print('{:>{width}}: {}'.format(word, count, width=max_len))
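For reference, the nested {width} placeholder right-aligns the first argument in a field of width characters; a quick hypothetical example:

print('{:>{width}}: {}'.format('love', 150, width=8))  # displays '    love: 150'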
For information on managing (and deleting) your Azure resource groups through the Azure portal, see:

https://docs.microsoft.com/en-us/azure/azure-resource-manager/resource-group-portal
©1992–2020 by Pearson Education, Inc. All Rights Reserved. This content is based on Chapter 5 of the book Intro to Python for Computer Science and Data Science: Learning to Program with AI, Big Data and the Cloud.
DISCLAIMER: The authors and publisher of this book have used their best efforts in preparing the book. These efforts include the development, research, and testing of the theories and programs to determine their effectiveness. The authors and publisher make no warranty of any kind, expressed or implied, with regard to these programs or to the documentation contained in these books. The authors and publisher shall not be liable in any event for incidental or consequential damages in connection with, or arising out of, the furnishing, performance, or use of these programs.