17.3 NoSQL and NewSQL Big-Data Databases: A Brief Tour

For decades, relational database management systems (RDBMSs) have been the standard in data processing.
They require structured data that fits into neat rectangular tables.
As the size of the data and the number of tables and relationships increase, relational databases become more difficult to manipulate efficiently.
In big data, NoSQL and NewSQL databases have emerged to deal with the kinds of data storage and processing demands that traditional relational databases cannot meet.
Big data requires massive databases, often spread across data centers worldwide in huge clusters of commodity computers.
According to statista.com, there are currently over 8 million data centers worldwide.

NoSQL originally meant what its name implies.
With the growing importance of SQL in big data, such as SQL on Hadoop and Spark SQL, NoSQL is now said to stand for “Not Only SQL.”
NoSQL databases are meant for:
- Unstructured data, like photos, videos and the natural language found in e-mails, text messages and social-media posts.
- Semi-structured data, like JSON and XML documents. Semi-structured data often wraps unstructured data with additional information called metadata.

For example, metadata attached to a video adds structure to the otherwise unstructured video data, making it semi-structured (like the Tweet JSON shown previously).
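As a rough illustration of metadata wrapping unstructured content, the following sketch builds a small JSON document for a video. The field names and the file name are hypothetical and were chosen only for this example; they do not come from any particular service.

```python
import json

# Hypothetical metadata for a video; every field name here is illustrative only.
video_document = {
    'title': 'Sample video',
    'posted_by': 'user123',
    'tags': ['demo', 'nosql'],
    'content_file': 'sample_video.mp4',  # points at the unstructured video bytes
}

# Serialize to JSON text, the kind of semi-structured document a database might store
json_text = json.dumps(video_document, indent=2)

# The metadata can be queried by key even though the video itself is opaque
restored = json.loads(json_text)
print(restored['tags'])  # ['demo', 'nosql']
```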
The next several subsections overview the four NoSQL database categories:
- key–value
- document
- columnar (also called column-based)
- graph

17.3.1 NoSQL Key–Value Databases

Like Python dictionaries, key–value databases store key–value pairs, but they are optimized for distributed systems and big-data processing.
For reliability, they tend to replicate data in multiple cluster nodes.
Some key–value databases, such as Redis, are implemented in memory for performance, and others store data on disk, such as HBase, which runs on top of Hadoop’s HDFS distributed file system.
Other popular key–value databases include Amazon DynamoDB, Google Cloud Datastore and Couchbase.
DynamoDB and Couchbase are multi-model databases that also support documents.
HBase is also a column-oriented database.
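To make the dictionary analogy and the idea of replication concrete, here is a minimal in-memory sketch. The class `ReplicatedKeyValueStore` and its methods are hypothetical names invented for this example, and each "node" is just a Python dictionary; real key–value databases replicate over a network and handle failures far more carefully.

```python
class ReplicatedKeyValueStore:
    """Toy key-value store that copies every write to several in-memory 'nodes'."""

    def __init__(self, node_count=3):
        # Each node is a dictionary standing in for one machine in a cluster
        self.nodes = [{} for _ in range(node_count)]

    def put(self, key, value):
        """Write the pair to every node so that each one holds a replica."""
        for node in self.nodes:
            node[key] = value

    def get(self, key, failed_nodes=()):
        """Return the value from the first node that has not failed."""
        for index, node in enumerate(self.nodes):
            if index not in failed_nodes and key in node:
                return node[key]
        raise KeyError(key)


kv_store = ReplicatedKeyValueStore()
kv_store.put('sensor:42', '{"type": "thermostat", "temp": 21.5}')

# Even if nodes 0 and 1 are unavailable, node 2 still holds a replica
value = kv_store.get('sensor:42', failed_nodes={0, 1})
print(value)
```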

17.3.2 NoSQL Document Databases

A document database stores semi-structured data, such as JSON or XML documents.
In document databases, you typically add indexes for specific attributes, so you can more efficiently locate and manipulate documents.
- Assume you’re storing JSON documents produced by IoT devices and each document contains a type attribute.
- You might add an index for this attribute so you can filter documents based on their types.
- Without indexes, you can still perform that task, but it will be slower because you have to search each document in its entirety to find the attribute.
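The difference between scanning and using an index can be sketched in plain Python. This is only an illustration of the idea, not how any particular document database implements indexes; the sample documents and function names are made up for the example.

```python
from collections import defaultdict

# Hypothetical IoT documents, each with a type attribute
documents = [
    {'id': 1, 'type': 'thermostat', 'reading': 21.5},
    {'id': 2, 'type': 'camera', 'reading': None},
    {'id': 3, 'type': 'thermostat', 'reading': 19.0},
]

def find_by_type_scan(docs, device_type):
    """Without an index: examine every document."""
    return [doc for doc in docs if doc.get('type') == device_type]

# With an index: map each type value to the positions of the matching documents
type_index = defaultdict(list)
for position, doc in enumerate(documents):
    type_index[doc['type']].append(position)

def find_by_type_indexed(docs, index, device_type):
    """Look up matching positions directly instead of scanning."""
    return [docs[position] for position in index.get(device_type, [])]

scan_result = find_by_type_scan(documents, 'thermostat')
indexed_result = find_by_type_indexed(documents, type_index, 'thermostat')
print(scan_result == indexed_result)  # True
```

Both approaches return the same documents. The index trades extra storage and extra work on each write for faster lookups.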

The most popular document database (and the most popular NoSQL database overall) is MongoDB.
- Its name derives from a sequence of letters embedded in the word “humongous.”
We’ll store a large number of tweets in MongoDB for processing.
- Recall that Twitter’s APIs return tweets in JSON format, so they can be stored directly in MongoDB.
- After obtaining the tweets, we’ll summarize them in a pandas DataFrame and on a Folium map.
Other popular document databases include Amazon DynamoDB (also a key–value database), Microsoft Azure Cosmos DB and Apache CouchDB.

17.3.3 NoSQL Columnar Databases

In a relational database, a common query operation is to get a specific column’s value for every row.
Because data is organized into rows, a query that selects a specific column can perform poorly.
The database system must get every matching row, locate the required column and discard the rest of the row’s information.
A columnar database, also called a column-oriented database, is similar to a relational database, but it stores structured data in columns rather than rows.
- https://en.wikipedia.org/wiki/Columnar_database
- https://www.predictiveanalyticstoday.com/top-wide-columnar-store-databases/

Because all of a column’s elements are stored together, selecting all the data for a given column is more efficient.

Consider our authors table in the books database:

      first    last
id                   
1        Paul  Deitel
2      Harvey  Deitel
3       Abbey  Deitel
4         Dan   Quirk
5   Alexander    Wald

In a relational database, all the data for a row is stored together.

If we consider each row as a Python tuple, the rows would be represented as (1, 'Paul', 'Deitel'), (2, 'Harvey', 'Deitel'), etc.
In a columnar database, all the values for a given column would be stored together, as in (1, 2, 3, 4, 5), ('Paul', 'Harvey', 'Abbey', 'Dan', 'Alexander') and ('Deitel', 'Deitel', 'Deitel', 'Quirk', 'Wald').
The elements in each column are maintained in row order, so the value at a given index in each column belongs to the same row.
Popular columnar databases include MariaDB ColumnStore and HBase.
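The row and column layouts described above can be compared directly in Python, using the authors data as tuples. This is a sketch of the storage idea only; actual columnar databases add compression, disk layout and other optimizations not shown here.

```python
# Row-oriented layout: all the data for a row is stored together
rows = [
    (1, 'Paul', 'Deitel'),
    (2, 'Harvey', 'Deitel'),
    (3, 'Abbey', 'Deitel'),
    (4, 'Dan', 'Quirk'),
    (5, 'Alexander', 'Wald'),
]

# Getting one column from rows means visiting every row and discarding the rest
last_names_from_rows = [row[2] for row in rows]

# Column-oriented layout: transpose so each column's values are stored together
ids, first_names, last_names = zip(*rows)

# Selecting a column is now a single lookup
print(last_names)  # ('Deitel', 'Deitel', 'Deitel', 'Quirk', 'Wald')

# Row order is preserved, so index 3 in every column describes the same author
print(ids[3], first_names[3], last_names[3])  # 4 Dan Quirk
```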

17.3.4 NoSQL Graph Databases

A graph models relationships between objects.
The objects are called nodes (or vertices) and the relationships are called edges.
Edges are directional.
For example, an edge representing an airline flight points from the origin city to the destination city, but not the reverse.
A graph database stores nodes, edges and their attributes.
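A small directed graph of flights can be sketched with nested dictionaries. The cities, the `minutes` attribute and its values are invented for illustration, and the helper `has_edge` is a hypothetical name; a graph database would store and query this structure in its own way.

```python
# Each key is an origin city (a node); each inner key is a destination,
# and the inner value holds that edge's attributes (made-up flight times).
flights = {
    'Boston': {'Chicago': {'minutes': 170}},
    'Chicago': {'Denver': {'minutes': 150}},
    'Denver': {},
}

def has_edge(graph, origin, destination):
    """Return True if a directed edge runs from origin to destination."""
    return destination in graph.get(origin, {})

print(has_edge(flights, 'Boston', 'Chicago'))  # True
print(has_edge(flights, 'Chicago', 'Boston'))  # False: the edge is directional
```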

If you use social networks, like Instagram, Snapchat, Twitter and Facebook, consider your social graph, which consists of the people you know (nodes) and the relationships between them (edges).
Every person has their own social graph, and these are interconnected.
The famous “six degrees of separation” problem says that any two people in the world are connected to one another by following a maximum of six edges in the worldwide social graph.
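Counting the edges between two people is a shortest-path question, which a breadth-first search can answer on a small graph. The sketch below assumes friendships are mutual, so each one is listed in both directions; the names and the function `degrees_of_separation` are made up for this example.

```python
from collections import deque

# A tiny, hypothetical social graph; each friendship appears in both directions
friends = {
    'Ana': ['Ben'],
    'Ben': ['Ana', 'Cy'],
    'Cy': ['Ben', 'Dee'],
    'Dee': ['Cy'],
    'Eli': [],
}

def degrees_of_separation(graph, start, goal):
    """Return the fewest edges between start and goal, or None if unconnected."""
    visited = {start}
    queue = deque([(start, 0)])
    while queue:
        person, distance = queue.popleft()
        if person == goal:
            return distance
        for friend in graph.get(person, []):
            if friend not in visited:
                visited.add(friend)
                queue.append((friend, distance + 1))
    return None

print(degrees_of_separation(friends, 'Ana', 'Dee'))  # 3
print(degrees_of_separation(friends, 'Ana', 'Eli'))  # None
```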

Many companies use graph databases to create recommendation engines.
- When you browse a product on Amazon, they use a graph of users and products to show you comparable products people browsed before making a purchase.
- When you browse movies on Netflix, they use a graph of users and movies they liked to suggest movies that might be of interest to you.
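The general idea behind such recommendations can be sketched with a small user-to-product graph. This is a deliberately simplified illustration with invented data and a hypothetical `recommend` function; it is not a description of how Amazon or Netflix actually compute recommendations.

```python
from collections import Counter

# Hypothetical graph: each user (node) has edges to the products they browsed
browsed = {
    'user_a': {'laptop', 'mouse', 'keyboard'},
    'user_b': {'laptop', 'mouse'},
    'user_c': {'laptop', 'monitor'},
    'user_d': {'phone'},
}

def recommend(graph, product, limit=2):
    """Suggest the products most often browsed by users who also browsed product."""
    counts = Counter()
    for products in graph.values():
        if product in products:
            counts.update(products - {product})
    # Sort by descending count, then by name so that ties are deterministic
    ranked = sorted(counts, key=lambda name: (-counts[name], name))
    return ranked[:limit]

recommendations = recommend(browsed, 'laptop')
print(recommendations)  # ['mouse', 'keyboard']
```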
One of the most popular graph databases is Neo4j.
Neo4j publishes many real-world use-cases for graph databases:
- With most of the use-cases, sample graph diagrams produced by Neo4j are shown.
- These visualize the relationships between the graph nodes.
- Check out Neo4j’s free PDF book, Graph Databases.

17.3.5 NewSQL Databases

Key advantages of relational databases include their security and transaction support.
In particular, relational databases typically use ACID (Atomicity, Consistency, Isolation, Durability) transactions:
- Atomicity ensures that the database is modified only if all of a transaction’s steps are successful. If you go to an ATM to withdraw $100, that money is not removed from your account unless you have enough money to cover the withdrawal and there is enough money in the ATM to satisfy your request.
- Consistency ensures that the database state is always valid. In the withdrawal example above, your new account balance after the transaction will reflect precisely what you withdrew from your account (and possibly ATM fees).
- Isolation ensures that concurrent transactions occur as if they were performed sequentially. For example, if two people share a joint bank account and both attempt to withdraw money at the same time from two separate ATMs, one transaction must wait until the other completes.
- Durability ensures that changes to the database survive even hardware failures.
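Atomicity in the ATM example can be demonstrated with Python's built-in `sqlite3` module and an in-memory database. The table layout, the `withdraw` and `balance_of` helpers and the `atm_cash` parameter are assumptions made for this sketch; it shows only the all-or-nothing behavior, not a realistic banking design.

```python
import sqlite3

connection = sqlite3.connect(':memory:')
connection.execute(
    'CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER)')
connection.execute('INSERT INTO accounts VALUES (1, 150)')
connection.commit()

def withdraw(conn, account_id, amount, atm_cash):
    """Withdraw amount, or leave the database unchanged if any step fails."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                'UPDATE accounts SET balance = balance - ? WHERE id = ?',
                (amount, account_id))
            (balance,) = conn.execute(
                'SELECT balance FROM accounts WHERE id = ?',
                (account_id,)).fetchone()
            if balance < 0:
                raise ValueError('insufficient funds')
            if amount > atm_cash:
                raise ValueError('ATM cannot dispense that amount')
        return True
    except ValueError:
        return False

def balance_of(conn, account_id):
    """Return the current balance for the given account."""
    return conn.execute(
        'SELECT balance FROM accounts WHERE id = ?',
        (account_id,)).fetchone()[0]

print(withdraw(connection, 1, 100, atm_cash=500))  # True
print(balance_of(connection, 1))                   # 50
print(withdraw(connection, 1, 100, atm_cash=500))  # False
print(balance_of(connection, 1))                   # 50, the update was rolled back
```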

If you research benefits and disadvantages of NoSQL databases, you’ll see that NoSQL databases generally do not provide ACID support.
The types of applications that use NoSQL databases typically do not require the guarantees that ACID-compliant databases provide.
Many NoSQL databases adhere to the BASE (Basic Availability, Soft-state, Eventual consistency) model, which focuses more on the database’s availability.
Whereas ACID databases guarantee consistency when you write to the database, BASE databases provide consistency at some later point in time.
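Eventual consistency can be pictured with a toy store whose replicas catch up only when they are synchronized. The class `EventuallyConsistentStore` and its methods are hypothetical names for this sketch; real systems propagate writes in the background and must resolve conflicts, which is not modeled here.

```python
class EventuallyConsistentStore:
    """Toy store: writes reach the primary at once and the replicas later."""

    def __init__(self, replica_count=2):
        self.primary = {}
        self.replicas = [{} for _ in range(replica_count)]
        self.pending = []  # writes not yet copied to the replicas

    def write(self, key, value):
        """Accept the write immediately without waiting for the replicas."""
        self.primary[key] = value
        self.pending.append((key, value))

    def read_from_replica(self, index, key):
        """Read from one replica, which may not have caught up yet."""
        return self.replicas[index].get(key)

    def synchronize(self):
        """Apply all pending writes to every replica, in order."""
        for key, value in self.pending:
            for replica in self.replicas:
                replica[key] = value
        self.pending.clear()


base_store = EventuallyConsistentStore()
base_store.write('balance', 50)

stale = base_store.read_from_replica(0, 'balance')
print(stale)  # None: the replica has not yet seen the write

base_store.synchronize()
fresh = base_store.read_from_replica(0, 'balance')
print(fresh)  # 50: the replica is now consistent with the primary
```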
NewSQL databases blend the benefits of both relational and NoSQL databases for big-data processing tasks.
Some popular NewSQL databases include VoltDB, MemSQL, Apache Ignite and Google Spanner.

©1992–2020 by Pearson Education, Inc. All Rights Reserved. This content is based on Chapter 5 of the book Intro to Python for Computer Science and Data Science: Learning to Program with AI, Big Data and the Cloud.

DISCLAIMER: The authors and publisher of this book have used their best efforts in preparing the book. These efforts include the development, research, and testing of the theories and programs to determine their effectiveness. The authors and publisher make no warranty of any kind, expressed or implied, with regard to these programs or to the documentation contained in these books. The authors and publisher shall not be liable in any event for incidental or consequential damages in connection with, or arising out of, the furnishing, performance, or use of these programs.