17.1 Introduction

  • Discuss popular hardware and software infrastructure for working with big data
  • Develop complete applications on several desktop and cloud-based big-data platforms

Databases

  • Critical big-data infrastructure for storing and manipulating massive amounts of data
  • Critical for securely and confidentially maintaining that data, especially in the context of ever-stricter privacy laws such as
    • HIPAA (Health Insurance Portability and Accountability Act) in the United States
    • GDPR (General Data Protection Regulation) for the European Union

Databases (cont.)

  • Use Structured Query Language (SQL) to manipulate structured data in relational databases
  • Most data produced today is unstructured data
    • Facebook posts
    • Twitter tweets
  • Or semi-structured data
    • JSON documents (like tweets + their metadata)
    • XML documents
  • Relational databases are not geared to the unstructured and semi-structured data in big-data applications
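To make the structured/semi-structured distinction concrete, here is a minimal sketch using only Python's standard library: `sqlite3` for relational (SQL) data and `json` for a tweet-like semi-structured document. The table, column and field names are illustrative, not from the chapter's case studies.

```python
import json
import sqlite3

# Structured data: a relational table with a fixed schema, queried via SQL
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)')
conn.execute("INSERT INTO users (id, name) VALUES (1, 'Ada'), (2, 'Alan')")
rows = conn.execute('SELECT name FROM users WHERE id = 1').fetchall()
print(rows)  # [('Ada',)]

# Semi-structured data: a JSON document whose nested fields can vary per record
tweet = json.loads('{"text": "Hello, big data!", '
                   '"user": {"screen_name": "ada"}, "retweet_count": 0}')
print(tweet['user']['screen_name'])  # ada
```

Note that the JSON document needs no predeclared schema: adding or omitting a field requires no `ALTER TABLE`, which is precisely why relational databases fit such data poorly.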

Databases (cont.)

  • New kinds of databases created to handle big data efficiently
  • Four major types of NoSQL databases
    • key–value
    • document
    • columnar
    • graph databases
  • Also overview NewSQL databases
    • These blend the benefits of relational and NoSQL databases
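Of the four NoSQL types, the key–value model is the simplest: values are stored and retrieved only by a unique key, with no schema and no cross-field SQL querying. This toy in-memory class (purely illustrative, not a real NoSQL client such as Redis's) shows the access pattern such stores share:

```python
class KeyValueStore:
    """Toy in-memory key-value "store" illustrating the NoSQL key-value
    model: lookup, insert and delete happen only by key."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)

    def delete(self, key):
        self._data.pop(key, None)

store = KeyValueStore()
store.put('user:1001', {'name': 'Ada', 'followers': 42})
print(store.get('user:1001')['name'])  # Ada
```

Real key–value databases add persistence, replication and distribution across cluster nodes, but the programming model remains essentially this dict-like interface.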

Apache Hadoop

  • Much of today’s data is so large that it cannot fit on one system
  • As big data grew, we needed distributed data storage and parallel processing capabilities to process the data more efficiently
  • Led to technologies like Apache Hadoop for distributed data processing with massive parallelism among clusters of computers
    • Intricate parallelization details are handled for you automatically and correctly
  • Introduce Hadoop, its architecture and how it’s used in big-data applications
  • Guide you through configuring a multi-node Hadoop cluster using the Microsoft Azure HDInsight cloud service, then use it to execute a Hadoop MapReduce job implemented in Python
  • HDInsight is not free
    • Microsoft gives you a new-account credit that should enable you to run the chapter’s code examples without incurring additional charges

Apache Spark

  • As big-data processing needs grow, the technology community continually looks for ways to increase performance
  • Hadoop breaks tasks into pieces that do lots of disk I/O across many computers
  • Spark performs certain big-data tasks in memory for better performance
  • Discuss Apache Spark, its architecture and how it’s used in high-performance, real-time big-data applications
  • Implement a Spark application using functional-style filter/map/reduce programming capabilities
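Spark's core transformations mirror Python's own functional-style tools, so a rough pure-Python analogy (this is not PySpark) conveys the programming model: filter out the unwanted elements, map each survivor to a new value, then reduce the results to a single answer. In Spark, the same three calls would be methods on a distributed RDD partitioned across the cluster.

```python
from functools import reduce

numbers = range(1, 11)

# filter: keep only the even values
evens = filter(lambda x: x % 2 == 0, numbers)

# map: square each remaining value
squares = map(lambda x: x ** 2, evens)

# reduce: combine the squares into a single sum
total = reduce(lambda x, y: x + y, squares)
print(total)  # 220
```

Because the reduction function is associative, Spark can compute partial sums on each cluster node in parallel and combine them, which is why this functional style scales so naturally.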

Apache Spark (cont.)

  • First, you’ll build this example using a Jupyter Docker stack that runs locally on your desktop computer; then you’ll implement it on a cloud-based, multi-node Microsoft Azure HDInsight Spark cluster
  • Introduce Spark streaming for processing streaming data in mini-batches
    • Gathers data for a short time interval, then gives you that batch of data to process
    • Implement a Spark streaming application that processes tweets
    • Use Spark SQL to query data stored in a Spark DataFrame, which, unlike pandas DataFrames, may contain data distributed over many computers in a cluster
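The mini-batch idea can be sketched in plain Python (this is a simulation of the concept, not the Spark streaming API): gather items from a stream until a batch is full, then hand each complete batch to your processing code. Here we batch by count rather than by time interval to keep the example deterministic.

```python
from itertools import islice

def mini_batches(stream, batch_size):
    """Group a (possibly endless) stream into fixed-size mini-batches,
    simulating how Spark streaming gathers data for a short interval
    before giving you each batch of data to process."""
    it = iter(stream)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch

# Simulated stream of tweet-like hashtags
stream = ['#python', '#bigdata', '#spark', '#python', '#iot']
results = [len(batch) for batch in mini_batches(stream, batch_size=2)]
print(results)  # [2, 2, 1]
```

Spark streaming's batches are bounded by a time interval instead of a count, so in the chapter's tweet-processing application each batch holds however many tweets arrived during that interval.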

Internet of Things

  • Conclude with an introduction to the Internet of Things (IoT)
    • Billions of devices that are continuously producing data worldwide
  • Introduce the publish/subscribe model that IoT and other applications use to connect data users with data providers
  • Without writing any code, you’ll build a web-based dashboard using Freeboard.io and a sample live stream from the PubNub messaging service
  • You’ll simulate an Internet-connected thermostat which publishes messages to the free Dweet.io messaging service using the Python module Dweepy, then create a dashboard visualization of the data with Freeboard.io
  • You’ll build a Python client that subscribes to a sample live stream from the PubNub service and dynamically visualizes the stream with Seaborn and a Matplotlib FuncAnimation
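The publish/subscribe model decouples data providers from data users: publishers send messages to a named channel (or topic), and a broker delivers each message to every subscriber of that channel, so neither side needs to know about the other. This minimal in-memory broker is purely illustrative; services like PubNub and Dweet.io provide the same pattern over the Internet.

```python
from collections import defaultdict

class Broker:
    """Tiny in-memory publish/subscribe broker: publishers and
    subscribers share only a channel name, never direct references
    to each other."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, channel, callback):
        """Register a callback to receive every message on channel."""
        self._subscribers[channel].append(callback)

    def publish(self, channel, message):
        """Deliver message to all of channel's subscribers."""
        for callback in self._subscribers[channel]:
            callback(message)

broker = Broker()
received = []
broker.subscribe('thermostat', received.append)       # a data user
broker.publish('thermostat', {'temperature': 68.5})   # a data provider
print(received)  # [{'temperature': 68.5}]
```

In the chapter's IoT examples, the simulated thermostat plays the publisher role and the Freeboard.io dashboard and the Python/Seaborn client play the subscriber role.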

Experience Cloud and Desktop Big-Data Software

  • Cloud vendors focus on service-oriented architecture (SOA) technology in which they provide “as-a-Service” capabilities that applications connect to and use in the cloud.
  • Common “as-a-Service” offerings from cloud vendors include (note that several share the same acronym):
    • Big data as a Service (BDaaS)
    • Hadoop as a Service (HaaS)
    • Hardware as a Service (HaaS)
    • Infrastructure as a Service (IaaS)
    • Platform as a Service (PaaS)
    • Software as a Service (SaaS)
    • Storage as a Service (SaaS)
    • Spark as a Service (SaaS)
  • Many more “as-a-Service” acronyms are in use

Experience Cloud and Desktop Big-Data Software (cont.)

  • Hands-on experience in this chapter with several cloud-based tools
    • A free MongoDB Atlas cloud-based cluster
    • A multi-node Hadoop cluster running on Microsoft’s Azure HDInsight cloud-based service—for this you’ll use the credit that comes with a new Azure account
    • A free single-node Spark “cluster” running on your desktop computer, using a Jupyter Docker-stack container
    • A multi-node Spark cluster, also running on Microsoft’s Azure HDInsight—for this you’ll continue using your Azure new-account credit
  • Many other options exist, including cloud-based services from Amazon Web Services, Google Cloud and IBM Watson, and the free desktop versions of the Hortonworks and Cloudera platforms (there are also cloud-based paid versions of these)
  • Also could try a single-node Spark cluster running on the free cloud-based Databricks Community Edition
    • Spark’s creators founded Databricks

Experience Cloud and Desktop Big-Data Software (cont.)

  • Always check the latest terms and conditions of each service you use
    • Some require you to enable credit-card billing to use their clusters
    • Caution: Once you allocate Microsoft Azure HDInsight clusters (or other vendors’ clusters), they incur costs. When you complete the case studies using services such as Microsoft Azure, be sure to delete your cluster(s) and their other resources (like storage). This will help extend the life of your Azure new-account credit.
  • If you have questions, the best sources for help are the vendor’s support capabilities and forums
    • Also, check sites such as stackoverflow.com—other people may have asked questions about similar problems and received answers from the developer community

Data’s Meaning

  • The previous data-science case studies all focused on AI
  • Here, we focus on the big-data infrastructure that supports AI solutions
  • As the data used with these technologies continues growing exponentially, we want to learn from that data and do so at blazing speed
  • We’ll accomplish these goals with a combination of sophisticated algorithms, hardware, software and networking designs
  • With more data, and especially with big data, machine learning can be even more effective

Big-Data Sources

  • The following articles and sites provide links to hundreds of free big-data sources:
    • “Awesome-Public-Datasets,” GitHub.com, https://github.com/caesar0301/awesome-public-datasets
    • “AWS Public Datasets,” https://aws.amazon.com/public-datasets/
    • “Big Data And AI: 30 Amazing (And Free) Public Data Sources For 2018,” B. Marr, https://www.forbes.com/sites/bernardmarr/2018/02/26/big-data-and-ai-30-amazing-and-free-public-data-sources-for-2018/
    • “Datasets for Data Mining and Data Science,” http://www.kdnuggets.com/datasets/index.html
    • “Exploring Open Data Sets,” https://datascience.berkeley.edu/open-data-sets/
    • “Free Big Data Sources,” Datamics, http://datamics.com/free-big-data-sources/
    • Hadoop Illuminated, Chapter 16, “Publicly Available Big Data Sets,” http://hadoopilluminated.com/hadoop_illuminated/Public_Bigdata_Sets.html
    • “List of Public Data Sources Fit for Machine Learning,” https://blog.bigml.com/list-of-public-data-sources-fit-for-machine-learning/
    • “Open Data,” Wikipedia, https://en.wikipedia.org/wiki/Open_data
    • “Open Data 500 Companies,” http://www.opendata500.com/us/list/
    • “Other Interesting Resources/Big Data and Analytics Educational Resources and Research,” B. Marr, http://computing.derby.ac.uk/bigdatares/?page_id=223
    • “6 Amazing Sources of Practice Data Sets,” https://www.jigsawacademy.com/6-amazing-sources-of-practice-data-sets/
    • “20 Big Data Repositories You Should Check Out,” M. Krivanek, http://www.datasciencecentral.com/profiles/blogs/20-free-big-data-sources-everyone-should-check-out
    • “70+ Websites to Get Large Data Repositories for Free,” http://bigdata-madesimple.com/70-websites-to-get-large-data-repositories-for-free/
    • “Ten Sources of Free Big Data on Internet,” A. Brown, https://www.linkedin.com/pulse/ten-sources-free-big-data-internet-alan-brown
    • “Top 20 Open Data Sources,” https://www.linkedin.com/pulse/top-20-open-data-sources-zygimantas-jacikevicius
    • “We’re Setting Data, Code and APIs Free,” NASA, https://open.nasa.gov/open-data/
    • “Where Can I Find Large Datasets Open to the Public?” Quora, https://www.quora.com/Where-can-I-find-large-datasets-open-to-the-public

©1992–2020 by Pearson Education, Inc. All Rights Reserved. This content is based on Chapter 5 of the book Intro to Python for Computer Science and Data Science: Learning to Program with AI, Big Data and the Cloud.

DISCLAIMER: The authors and publisher of this book have used their best efforts in preparing the book. These efforts include the development, research, and testing of the theories and programs to determine their effectiveness. The authors and publisher make no warranty of any kind, expressed or implied, with regard to these programs or to the documentation contained in these books. The authors and publisher shall not be liable in any event for incidental or consequential damages in connection with, or arising out of, the furnishing, performance, or use of these programs.