https://bigdata-madesimple.com/research-papers-that-changed-the-world-of-big-data/

MapReduce and YARN

- Flume (https://flume.apache.org)—A service for collecting and storing (in HDFS and other storage) streaming event data, like high-volume server logs, IoT messages and more.
- HBase (https://hbase.apache.org)—A NoSQL database for big data with "billions of rows by millions of columns—atop clusters of commodity hardware." (We used the word "by" to replace "X" in the original quote.)
- Hive (https://hive.apache.org)—Uses SQL to interact with data in data warehouses. A data warehouse aggregates data of various types from various sources. Common operations include extracting data, transforming it and loading it (known as ETL) into another database, typically so you can analyze it and create reports from it.
- Impala (https://impala.apache.org)—A database for real-time SQL-based queries across distributed data stored in Hadoop HDFS or HBase.
- Kafka (https://kafka.apache.org)—Real-time messaging, stream processing and storage, typically used to transform and process high-volume streaming data, such as website activity and streaming IoT data.
- Pig (https://pig.apache.org)—A scripting platform that converts data-analysis tasks from a scripting language called Pig Latin into MapReduce tasks.
- Sqoop (https://sqoop.apache.org)—A tool for moving structured, semi-structured and unstructured data between databases.
- Storm (https://storm.apache.org)—A real-time stream-processing system for tasks such as data analytics, machine learning, ETL and more.
- ZooKeeper (https://zookeeper.apache.org)—A service for managing cluster configurations and coordination between clusters.

In this example, we'll use Hadoop MapReduce to process the RomeoAndJuliet.txt file (from the "Natural Language Processing" chapter), then summarize how many words of each length there are. For more information, see:
https://docs.microsoft.com/en-us/azure/azure-resource-manager/resource-group-portal
- For Azure-related documentation and videos, visit:
- https://docs.microsoft.com/en-us/azure/—the Azure documentation.
- https://channel9.msdn.com/—Microsoft’s Channel 9 video network.
- https://www.youtube.com/user/windowsazure—Microsoft’s Azure channel on YouTube.
In Hadoop streaming, the mapper writes each result to standard output as a key–value pair in which the key and the value are separated by a tab:

key\tvalue
In length_mapper.py, the first line (#!) tells Hadoop to execute the script with Python 3:

```python
#!/usr/bin/env python3
# length_mapper.py
"""Maps lines of text to key-value pairs of word lengths and 1."""
import sys

def tokenize_input():  # generator function
    """Split each line of standard input into a list of strings."""
    for line in sys.stdin:
        yield line.split()

# read each line from the standard input and for every word
# produce a key-value pair containing the word's length, a tab and 1
for line in tokenize_input():
    for word in line:
        print(str(len(word)) + '\t1')
```
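You can sanity-check the mapper's logic locally before deploying it to the cluster. The sketch below mirrors length_mapper.py's transformation but reads from an in-memory list rather than sys.stdin; the sample lines are our own, not taken from RomeoAndJuliet.txt:

```python
# local sanity check mirroring length_mapper.py's logic
# (sample lines are hypothetical, not from RomeoAndJuliet.txt)
sample_lines = ['But soft what light', 'through yonder window breaks']

def map_words(lines):
    """Yield 'length\t1' pairs, like length_mapper.py prints."""
    for line in lines:
        for word in line.split():
            yield str(len(word)) + '\t1'

pairs = list(map_words(sample_lines))
print(pairs[:4])  # ['3\t1', '4\t1', '4\t1', '5\t1']
```

Each emitted pair uses the word's length as the key and 1 as the value, so the reducer can simply total the 1s for each length.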
In length_reducer.py, the function tokenize_input is a generator function that reads and splits the key–value pairs produced by the mapper. The groupby function (itertools module) groups the inputs by their keys (the word lengths):

```python
#!/usr/bin/env python3
# length_reducer.py
"""Counts the number of words with each length."""
import sys
from itertools import groupby
from operator import itemgetter

def tokenize_input():
    """Split each line of standard input into a key and a value."""
    for line in sys.stdin:
        yield line.strip().split('\t')

# produce key-value pairs of word lengths and counts separated by tabs
for word_length, group in groupby(tokenize_input(), itemgetter(0)):
    try:
        total = sum(int(count) for word_length, count in group)
        print(word_length + '\t' + str(total))
    except ValueError:
        pass  # ignore word if its count was not an integer
```
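Note that groupby only groups adjacent items with equal keys, so the reducer depends on Hadoop sorting the mapper's output by key between the map and reduce phases. You can simulate that whole map–shuffle–reduce cycle locally; this sketch (with a made-up sample line) mirrors the logic of both scripts:

```python
from itertools import groupby
from operator import itemgetter

lines = ['to be or not to be']  # hypothetical sample input

# map phase: emit (word_length, 1) pairs, like length_mapper.py
pairs = [(str(len(word)), 1) for line in lines for word in line.split()]

# shuffle phase: Hadoop sorts the mapper's output by key
pairs.sort(key=itemgetter(0))

# reduce phase: total the counts per key, like length_reducer.py
totals = {key: sum(count for _, count in group)
          for key, group in groupby(pairs, itemgetter(0))}
print(totals)  # {'2': 5, '3': 1}
```

Without the sort, groupby would produce a separate group each time a key reappeared, and the reducer would emit multiple partial counts for the same word length.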
Next, use scp to upload the two scripts and the RomeoAndJuliet.txt file to the cluster. The following command assumes you're in the ch16 examples folder, so be sure to copy your RomeoAndJuliet.txt file to this folder first:

```
scp length_mapper.py length_reducer.py RomeoAndJuliet.txt sshuser@YourClusterName-ssh.azurehdinsight.net:
```
Before Hadoop can read RomeoAndJuliet.txt and supply the lines of text to your mapper, you must first copy the file into Hadoop's file system. To do so, use ssh to log into your cluster and access its command line:

```
ssh sshuser@YourClusterName-ssh.azurehdinsight.net
```
Then copy the file into the HDFS folder /example/data that the cluster provides for use with Microsoft's Azure Hadoop tutorials:

```
hadoop fs -copyFromLocal RomeoAndJuliet.txt /example/data/RomeoAndJuliet.txt
```
Run the MapReduce job for RomeoAndJuliet.txt on your cluster by executing the following command in the cluster's command line. We also provide this command in the file yarn.txt located with this example; it's shown here split across several lines for readability:

```
yarn jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-streaming.jar
   -D mapred.output.key.comparator.class=
      org.apache.hadoop.mapred.lib.KeyFieldBasedComparator
   -D mapred.text.key.comparator.options=-n
   -files length_mapper.py,length_reducer.py
   -mapper length_mapper.py
   -reducer length_reducer.py
   -input /example/data/RomeoAndJuliet.txt
   -output /example/wordlengthsoutput
```
In the preceding command:

- The yarn command invokes Hadoop's YARN ("yet another resource negotiator") tool to manage and coordinate access to the Hadoop resources the MapReduce task uses.
- hadoop-streaming.jar contains the Java-based Hadoop streaming utility that allows you to use Python to implement the mapper and reducer.
- The two -D options set Hadoop properties that enable it to sort the final key–value pairs by key numerically (KeyFieldBasedComparator with the comparator option -n) rather than alphabetically.
- -files—A comma-separated list of scripts that Hadoop copies to every node in the cluster so they can execute locally on each node.
- -mapper—The mapper's script file.
- -reducer—The reducer's script file.
- -input—The file or directory of files to supply as the mapper's input.
- -output—The HDFS directory where the final results will be stored.

In the job's output below, we replaced some lines with ... to save space. Several items are worth noting:

- "Total input paths to process : 1"—the single source of input in this example is RomeoAndJuliet.txt.
- "number of splits:2"—two splits in this example, based on the number of worker nodes in our HDInsight cluster.
- File System Counters, showing numbers of bytes read and written.
- Job Counters, showing the numbers of mapping and reduction tasks used.
- Map-Reduce Framework, showing stats about the steps performed.

```
packageJobJar: [] [/usr/hdp/2.6.5.3004-13/hadoop-mapreduce/hadoop-streaming-2.7.3.2.6.5.3004-13.jar] /tmp/streamjob2764990629848702405.jar tmpDir=null
...
18/12/05 16:46:25 INFO mapred.FileInputFormat: Total input paths to process : 1
18/12/05 16:46:26 INFO mapreduce.JobSubmitter: number of splits:2
...
18/12/05 16:46:26 INFO mapreduce.Job: The url to track the job: http://hn0-paulte.y3nghy5db2kehav5m0opqrjxcb.cx.internal.cloudapp.net:8088/proxy/application_1543953844228_0025/
...
18/12/05 16:46:35 INFO mapreduce.Job:  map 0% reduce 0%
18/12/05 16:46:43 INFO mapreduce.Job:  map 50% reduce 0%
18/12/05 16:46:44 INFO mapreduce.Job:  map 100% reduce 0%
18/12/05 16:46:48 INFO mapreduce.Job:  map 100% reduce 100%
18/12/05 16:46:50 INFO mapreduce.Job: Job job_1543953844228_0025 completed successfully
18/12/05 16:46:50 INFO mapreduce.Job: Counters: 49
        File System Counters
                FILE: Number of bytes read=156411
                FILE: Number of bytes written=813764
        ...
        Job Counters
                Launched map tasks=2
                Launched reduce tasks=1
        ...
        Map-Reduce Framework
                Map input records=5260
                Map output records=25956
                Map output bytes=104493
                Map output materialized bytes=156417
                Input split bytes=346
                Combine input records=0
                Combine output records=0
                Reduce input groups=19
                Reduce shuffle bytes=156417
                Reduce input records=25956
                Reduce output records=19
                Spilled Records=51912
                Shuffled Maps =2
                Failed Shuffles=0
                Merged Map outputs=2
                GC time elapsed (ms)=193
                CPU time spent (ms)=4440
                Physical memory (bytes) snapshot=1942798336
                Virtual memory (bytes) snapshot=8463282176
                Total committed heap usage (bytes)=3177185280
        ...
18/12/05 16:46:50 INFO streaming.StreamJob: Output directory: /example/wordlengthsoutput
```
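The -n comparator option in the yarn command matters because Hadoop streaming keys are strings: sorted alphabetically, the key "10" would come before "2". This small illustration (with our own sample keys) contrasts the two orderings:

```python
# word-length keys are strings in Hadoop streaming (sample keys are our own)
keys = ['1', '2', '10', '11', '3']

alphabetic = sorted(keys)           # string ordering: '10' sorts before '2'
numeric = sorted(keys, key=int)     # what the -n option requests

print(alphabetic)  # ['1', '10', '11', '2', '3']
print(numeric)     # ['1', '2', '3', '10', '11']
```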
To view the word-length counts, display the contents of the job's output file:

```
hdfs dfs -text /example/wordlengthsoutput/part-00000
```

```
18/12/05 16:47:19 INFO lzo.GPLNativeCodeLoader: Loaded native gpl library
18/12/05 16:47:19 INFO lzo.LzoCodec: Successfully loaded & initialized native-lzo library [hadoop-lzo rev b5efb3e531bc1558201462b8ab15bb412ffa6b89]
1       4699
2       3869
3       5651
4       3668
5       2719
6       1624
7       1140
8       1062
9       855
10      317
11      189
12      95
13      35
14      13
15      9
16      6
17      3
18      1
23      1
```
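As a sanity check, these results are consistent with the job counters shown earlier: there are 19 output lines (matching Reduce output records=19), and the counts sum to 25,956 (matching Map output records=25956, the total number of words the mapper emitted). The following snippet verifies both totals from the data above:

```python
# word-length counts transcribed from the job's output above
counts = {1: 4699, 2: 3869, 3: 5651, 4: 3668, 5: 2719, 6: 1624,
          7: 1140, 8: 1062, 9: 855, 10: 317, 11: 189, 12: 95,
          13: 35, 14: 13, 15: 9, 16: 6, 17: 3, 18: 1, 23: 1}

print(len(counts))           # 19 -> matches Reduce output records=19
print(sum(counts.values()))  # 25956 -> matches Map output records=25956
```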
©1992–2020 by Pearson Education, Inc. All Rights Reserved. This content is based on Chapter 5 of the book Intro to Python for Computer Science and Data Science: Learning to Program with AI, Big Data and the Cloud.
DISCLAIMER: The authors and publisher of this book have used their best efforts in preparing the book. These efforts include the development, research, and testing of the theories and programs to determine their effectiveness. The authors and publisher make no warranty of any kind, expressed or implied, with regard to these programs or to the documentation contained in these books. The authors and publisher shall not be liable in any event for incidental or consequential damages in connection with, or arising out of, the furnishing, performance, or use of these programs.