Instructor Notes:
These examples use the jupyter/pyspark-notebook Docker stack, which is configured with Spark and PySpark.

# enable high-res images in notebook
%config InlineBackend.figure_format = 'retina'
Spark DataFrames (similar to pandas DataFrames) enable you to manipulate RDDs as a collection of named columns. You can use DataFrames with Spark SQL to query distributed data.

On Windows, the "Docker for Windows.exe" installer needs to make changes to your system to complete the installation process. To allow this, click Yes when Windows asks if you want to allow the installer to make changes to your system.

We use the jupyter/pyspark-notebook Docker stack, which is preconfigured with everything you need to create and test Apache Spark apps on your computer. Run the jupyter/pyspark-notebook Docker stack with:

docker run -p 8888:8888 -p 4040:4040 -it --user root \
    -v fullPathToTheFolderYouWantToUse:/home/jovyan/work \
    jupyter/pyspark-notebook:14fdfbf9cfc1 start.sh jupyter lab
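For concreteness, a filled-in version of the command might look like the following; the host folder path below is purely hypothetical, so substitute the full path of the folder you want to map into the container:

# hypothetical host folder -- replace with your own full path
docker run -p 8888:8888 -p 4040:4040 -it --user root \
    -v /Users/yourname/SparkWordCount:/home/jovyan/work \
    jupyter/pyspark-notebook:14fdfbf9cfc1 start.sh jupyter lab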
jupyter/pyspark-notebook:14fdfbf9cfc1":14fdfbf9cfc1" indicates the specific jupyter/pyspark-notebook container to download.14fdfbf9cfc1 was the newest version of the container.":14fdfbf9cfc1" in the command, Docker will download the latest version of the container, which might contain different software versions and might not be compatible with the code you’re trying to execute.docker psCONTAINER ID        IMAGE                                   COMMAND  
           CREATED             STATUS            PORTS             
  NAMES
f54f62b7e6d5        jupyter/pyspark-notebook:14fdfbf9cfc1   "tini -g -- 
/bin/bash"  2 minutes ago      Up 2 minutes      0.0.0.0:8888->8888/tcp
  friendly_pascalNAMES in the third line is the name that Docker randomly assigned to the running container—friendly_pascal—the name on your system will differdocker exec -it container_name /bin/bashconda install -c conda-forge nltk textblobdocker run, Docker gives you a new instance that does not contain any libraries you installed previously.docker stop your_container_name
docker restart your_container_name
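For example, using the randomly assigned container name from the docker ps output above (yours will differ), the container-management commands would be:

docker exec -it friendly_pascal /bin/bash
docker stop friendly_pascal
docker restart friendly_pascal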
Next, in the notebook, download the NLTK stop words and load them:

import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
stop_words = stopwords.words('english')
A SparkContext object gives you access to Spark's capabilities. Some Spark environments create a SparkContext for you, but the Jupyter Docker stack does not, so you must create it yourself. First, specify the configuration options with a SparkConf object. Its setMaster method specifies the Spark cluster's URL: 'local[*]' means Spark is executing on your local computer, and the asterisk means Spark should use the same number of threads as there are cores on the computer:

from pyspark import SparkConf
configuration = SparkConf().setAppName('RomeoAndJulietCounter')\
                           .setMaster('local[*]')
from pyspark import SparkContext
sc = SparkContext(conf=configuration)
Next, use the SparkContext and functional-style programming to create an RDD representing all words in Romeo and Juliet:

from textblob.utils import strip_punc
tokenized = sc.textFile('RomeoAndJuliet.txt')\
              .flatMap(lambda line: line.lower().split())\
              .map(lambda word: strip_punc(word, all=True))
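As a quick local (non-Spark) illustration of what the two lambdas above do to one line of text, here is a sketch using a made-up sample string:

from textblob.utils import strip_punc

sample = 'But, soft! What light through yonder window breaks?'
words = [strip_punc(word, all=True) for word in sample.lower().split()]
# words now holds the lowercase words with all punctuation removed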
Next, create an RDD with no stop words remaining:

filtered = tokenized.filter(lambda word: word not in stop_words)
Next, map each word to a tuple containing the word and the count 1. Calling reduceByKey with the operator module's add function as an argument adds the counts for tuples that contain the same key (word):

from operator import add
word_counts = filtered.map(lambda word: (word, 1)).reduceByKey(add)
# keep only the words that appear at least 60 times
filtered_counts = word_counts.filter(lambda item: item[1] >= 60)
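To see how the map/reduceByKey pattern above aggregates the counts, here is a minimal sketch that assumes the SparkContext sc and the add function imported above:

pairs = sc.parallelize(['to', 'be', 'or', 'not', 'to', 'be'])
counts = pairs.map(lambda word: (word, 1)).reduceByKey(add)
# counts.collect() returns tuples such as ('to', 2), ('be', 2), ('or', 1), ('not', 1)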
Call the RDD's collect method to obtain the results; Spark returns them as a list of tuples, which we sort in descending order by count:

from operator import itemgetter
sorted_items = sorted(filtered_counts.collect(),
                      key=itemgetter(1), reverse=True)
max_len = max([len(word) for word, count in sorted_items])
for word, count in sorted_items:
    print(f'{word:>{max_len}}: {count}')
# terminate current SparkContext so we can create another for next example
sc.stop()  
On the HDInsight cluster, use ssh to log into your cluster (as shown earlier) and execute the following command to see which Python libraries are already installed:

/usr/bin/anaconda/envs/py35/bin/conda list
If your notebooks require additional libraries, you can install them with an HDInsight script action; for the Bash script URI, use:

http://deitel.com/bookresources/IntroToPython/install_libraries.sh

Uploading RomeoAndJuliet.txt to the HDInsight Cluster

Use scp to upload RomeoAndJuliet.txt to the cluster:

scp RomeoAndJuliet.txt sshuser@YourClusterName-ssh.azurehdinsight.net:
Next, use ssh to log into your cluster and access its command line:

ssh sshuser@YourClusterName-ssh.azurehdinsight.net
To access the RomeoAndJuliet.txt file in Spark, copy the file into the cluster's Hadoop file system by executing the following command:

hadoop fs -copyFromLocal RomeoAndJuliet.txt /example/data/RomeoAndJuliet.txt
This command copies the file into the /example/data folder that Microsoft includes for use with HDInsight tutorials.

Next, log into Jupyter on your cluster as the user admin (the cluster's Jupyter home page contains PySpark and Scala subfolders). Upload the RomeoAndJulietCounter.ipynb notebook and modify it to work with Azure: navigate to the ch17 example folder's SparkWordCount folder, select RomeoAndJulietCounter.ipynb and click Open.

In the notebook, modify the nltk.download('stopwords') call as follows to store the stop words in the current folder ('.'):

nltk.download('stopwords', download_dir='.')
When you run the notebook's first cell, "Starting Spark application" appears below the cell while HDInsight sets up a SparkContext object named sc for you. After the import statement, add the following statement to tell NLTK to search for its data in the current folder:

nltk.data.path.append('.')
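Taken together, the modified stop-words cell on Azure would look roughly like this (a sketch combining the changes above):

import nltk
nltk.data.path.append('.')  # tell NLTK to look for its data in the current folder
nltk.download('stopwords', download_dir='.')  # store the stop words in the current folder
from nltk.corpus import stopwords
stop_words = stopwords.words('english')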
Because HDInsight creates the SparkContext object for you, the third and fourth cells of the original notebook are not needed, so you can delete them. We already added RomeoAndJuliet.txt to the underlying Hadoop file system, so in the cell that loads the file, replace the string 'RomeoAndJuliet.txt' with the string 'wasb:///example/data/RomeoAndJuliet.txt'.
The notation wasb:/// indicates that RomeoAndJuliet.txt is stored in a Windows Azure Storage Blob (WASB), Azure's interface to the HDFS file system. Finally, the cluster's Python environment is 3.5 (note the py35 in the conda path shown earlier), which does not support f-strings, so replace the notebook's f-string with the following older-style format string:

print('{:>{width}}: {}'.format(word, count, width=max_len))
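With the path change applied, the cell that loads and tokenizes the play would then read as follows (a sketch of the modified cell, assuming the same strip_punc import as before):

tokenized = sc.textFile('wasb:///example/data/RomeoAndJuliet.txt')\
              .flatMap(lambda line: line.lower().split())\
              .map(lambda word: strip_punc(word, all=True))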
For information on managing and deleting your Azure resource groups, see:
https://docs.microsoft.com/en-us/azure/azure-resource-manager/resource-group-portal
©1992–2020 by Pearson Education, Inc. All Rights Reserved. This content is based on Chapter 17 of the book Intro to Python for Computer Science and Data Science: Learning to Program with AI, Big Data and the Cloud.
DISCLAIMER: The authors and publisher of this book have used their best efforts in preparing the book. These efforts include the development, research, and testing of the theories and programs to determine their effectiveness. The authors and publisher make no warranty of any kind, expressed or implied, with regard to these programs or to the documentation contained in these books. The authors and publisher shall not be liable in any event for incidental or consequential damages in connection with, or arising out of, the furnishing, performance, or use of these programs.