17.2 Relational Databases and Structured Query Language (SQL)¶

Databases are crucial, especially for big data
Chapter 9 covered sequential text-file processing, working with data from CSV files and working with JSON
- Both useful when most or all of a file’s data is to be processed
In transaction processing we need to locate and, possibly, update an individual data item quickly

17.2 Relational Databases and Structured Query Language (SQL) (cont.)¶

A database is an integrated collection of data
A database management system (DBMS) provides mechanisms for storing and organizing data
- Allows for rapid access and storage of data without concern for the internal representation of databases
Relational database management systems (RDBMSs) store data in tables and define relationships among the tables
Structured Query Language (SQL)—which we pronounce as “see-quel”—is used almost universally with relational database systems to manipulate data and perform queries**, which request information that satisfies given criteria
Popular open-source RDBMSs include SQLite, PostgreSQL, MariaDB and MySQL
- These can be downloaded and used freely by anyone
- All have support for Python
Here we use SQLite, which is bundled with Python
Popular proprietary RDBMSs include Microsoft SQL Server, Oracle, Sybase and IBM Db2

Tables, Rows and Columns¶

Relational database — logical table-based representation of data
Employee table that might be used in a personnel system to store employees’ attributes:

Tables, Rows and Columns (cont.)¶

Table rows each describe a single entity
- Here, each row represents one employee
Columns contain individual attribute values
Primary key—a column (or group of columns) with a value that’s unique for each row, so each row can be identified by its primary key
Examples of primary keys
- Social Security numbers
- employee ID numbers
- part numbers in an inventory system

Selecting Data Subsets¶

Structured Query Language (SQL) queries specify subsets of the data to select from a table
- E.g., select from the Employee table each department and location, and present the data in increasing order by department number

Selected data showing where each department is located, sorted in increasing order by department number

SQLite¶

Open-source database management system included with Python
Most popular database systems have Python support

17.2.1 A books Database¶

books database contains information about several of our books
Will set up the database in SQLite via the Python Standard Library’s sqlite3 module, using a script provided in the ch17 example’s folder’s sql subfolder
Introduce the database’s tables
Use this database to introduce database concepts, such as operations that create, read, update and delete data
- So-called CRUD operations
We’ll use SQL and pandas DataFrames to show you each table’s contents
Then, in the next several sections, we’ll discuss additional SQL features

Creating the books Database¶

Instructor Note: This command was modified for use in this notebook
The following sqlite3 command creates a SQLite database named books.db and executes the books.sql SQL script, which defines how to create the database’s tables and populates them with data:

!sqlite3 books.db < ./sql/books.sql

The notation < indicates that books.sql is input into the sqlite3 command.
When the command completes, the database is ready for use.
Begin a new IPython session.

Connecting to the Database in Python¶

sqlite3 connect function connects to a database and returns a Connection object:

import sqlite3

connection = sqlite3.connect('books.db')

`authors` Table¶

The database has three tables—authors, author_ISBN and titles
The authors table stores the authors and has three columns:
- id
  - Autoincremented unique integer id number
  - primary key
- first—First name (a string)
- last—Last name (a string)

Viewing the `author`s Table’s Contents¶

Use SQL and pandas to display the authors table

import pandas as pd

pd.options.display.max_columns = 10

pd.read_sql('SELECT * FROM authors', connection,
            index_col=['id'])

Viewing the `author`s Table’s Contents (cont.)¶

Pandas function read_sql executes a SQL query and returns a DataFrame containing the query’s results
Arguments are:
- SQL query as a string
- Connection object
- Optionally, index_col keyword argument
  - column to use as the DataFrame’s row indices
  - index_col index values from 0, otherwise
SELECT gets rows and columns from one or more tables
```
SELECT * FROM authors
```
- Asterisk (*) wildcard gets all columns

Viewing the `titles` Table's Contents¶

titles table stores the books and has four columns:
- isbn
  - ISBN string is this table’s primary key
  - ISBNs give every book a unique identification number
- title—Book title string
- edition—Integer edition number
- copyright—Copyright year as a string

Viewing the `titles` Table's Contents (cont.)¶

pd.read_sql('SELECT * FROM titles', connection)

`author_ISBN` Table¶

Associates authors from the authors table with their books in the titles table:
- id—Integer author id
- isbn—String ISBN of a book
id column is a foreign key
- Must match a primary-key column in another table
- Matches authors table’s id primary-key column
isbn column is a foreign key
- Matches the titles table’s isbn primary-key column

`author_ISBN` Table (cont.)¶

A goal when designing a database is to minimize data duplication among the tables
To do this, each table represents a specific entity, and foreign keys help link the data in multiple tables
The primary keys and foreign keys are designated when you create the database tables (in our case, in the books.sql script)
Together the id and isbn columns in this table form a composite primary key
Every row in this table uniquely matches one author to one book’s ISBN

`author_ISBN` Table (cont.)¶

df = pd.read_sql('SELECT * FROM author_ISBN', connection)

df.head()  # view first 5 of this table's many entries

`author_ISBN` Table (cont.)¶

Rule of Referential Integrity Every foreign-key value must appear as the primary-key value in a row of another table so the DBMS can ensure that the foreign-key value is valid
One-to-many relationship between a primary key and a corresponding foreign key
- One author can write many books
- One book can be written by many authors
Foreign key can appear many times in its table but only once as the primary key in another table

Entity-Relationship (ER) Diagram for the `books` Database¶

The names in italic are primary keys
Rule of Entity Integrity
- Every row must have a unique primary-key value

Entity-relationship (ER) diagram for the books database showing the database’s tables and the relationships among them

Entity-Relationship (ER) Diagram (cont.)¶

The lines connecting the tables represent the relationships among the tables.
Consider the line between authors and author_ISBN.
- On the authors end there’s a 1, and on the author_ISBN end there’s an infinity symbol (¥).
- This indicates a one-to-many relationship.
- For each author in the authors table, there can be an arbitrary number of ISBNs for books written by that author in the author_ISBN table
- The line links authors table's id column (the primary key) to the author_ISBN table's id column (a foreign key)

Entity-Relationship (ER) Diagram (cont.)¶

The line between the titles and author_ISBN tables illustrates a one-to-many relationship
- One book can be written by many authors
- The line links the titles table's primary key isbn to the author_ISBN table's foreign key
Entity-relationship diagram illustrates that the sole purpose of the author_ISBN table is to provide a many-to-many relationship between the authors and titles tables
- An author can write many books, and a book can have many authors

SQL Keywords¶

The following subsections continue our SQL presentation in the context of our books database, demonstrating SQL queries and statements using the SQL keywords in the following table.

SQL keyword	Description
`SELECT`	Retrieves data from one or more tables.
`FROM`	Tables involved in the query. Required in every `SELECT`.
`WHERE`	Criteria for selection that determine the rows to be retrieved, deleted or updated. Optional in a SQL statement.
`GROUP BY`	Criteria for grouping rows. Optional in a `SELECT` query.
`ORDER BY`	Criteria for ordering rows. Optional in a `SELECT` query.
`INNER JOIN`	Merge rows from multiple tables.
`INSERT`	Insert rows into a specified table.
`UPDATE`	Update rows in a specified table.
`DELETE`	Delete rows from a specified table.

17.2.2 `SELECT` Queries¶

SELECT statements with the * wildcard character get all the columns from a table
Typically, you need only a subset, especially in big data where you could have dozens, hundreds, thousands or more columns
To retrieve specific columns, specify a comma-separated list of column names

pd.read_sql('SELECT first, last FROM authors', connection)

17.2.3 `WHERE` Clause Specifies Selection Criteria¶

Often select rows that satisfy certain selection criteria (predicates), especially in big data where a database might contain millions or billions of rows

pd.read_sql("""SELECT title, edition, copyright 
               FROM titles 
               WHERE copyright > '2016'""", connection)

Pattern Matching: Zero or More Characters¶

WHERE clause operators: <, >, <=, >=, =, <> (not equal) and LIKE
LIKE is used for pattern matching
percent (%) wildcard character matches zero or more characters

pd.read_sql("""SELECT id, first, last 
               FROM authors 
               WHERE last LIKE 'D%'""", 
            connection, index_col=['id'])

Pattern Matching Any Character¶

Use underscore ( _ )

pd.read_sql("""SELECT id, first, last 
               FROM authors 
               WHERE first LIKE '_b%'""", 
            connection, index_col=['id'])

17.2.4 `ORDER BY` Clause¶

Sorts a query’s results into ascending or descending order (highest to lowest), specified with ASC and DESC, respectively
Default is ASC

pd.read_sql('SELECT title FROM titles ORDER BY title ASC',
            connection)

Sorting By Multiple Columns¶

Specify comma-separated list of column names after the ORDER BY

pd.read_sql("""SELECT id, first, last 
               FROM authors 
               ORDER BY last, first""", 
            connection, index_col=['id'])

Sort order can vary by column

pd.read_sql("""SELECT id, first, last 
               FROM authors 
               ORDER BY last DESC, first ASC""", 
            connection, index_col=['id'])

Combining the `WHERE` and `ORDER` `BY` Clauses¶

pd.read_sql("""SELECT isbn, title, edition, copyright
               FROM titles
               WHERE title LIKE '%How to Program'
               ORDER BY title""", connection)

17.2.5 Merging Data from Multiple Tables: `INNER` `JOIN`¶

List of authors accompanied by the ISBNs for books written by each author
- Because there are many results for this query, we show just the head of the result

pd.read_sql("""SELECT first, last, isbn
               FROM authors
               INNER JOIN author_ISBN
                   ON authors.id = author_ISBN.id
               ORDER BY last, first""", connection).head()

17.2.5 Merging Data from Multiple Tables: `INNER JOIN` (cont.)¶

INNER JOIN’s ON clause uses a primary-key column in one table and a foreign-key column in the other to determine which rows to merge from each table.
This query merges the authors table’s first and last columns with the author_ISBN table’s isbn column and sorts the results in ascending order by last then first.
authors.id in the ON clause is a qualified name
- Required if the columns have the same name in both tables

17.2.6 `INSERT` `INTO` Statement¶

Use a sqlite3 Cursor object to modify the database
Obtain by calling the Connection’s cursor method

cursor = connection.cursor()

The INSERT INTO statement inserts a row into a table

17.2.6 `INSERT` `INTO` Statement¶

Insert a new author named Sue Red into the authors table by calling Cursor method execute

cursor = cursor.execute("""INSERT INTO authors (first, last)
                           VALUES ('Sue', 'Red')""")

List of column names is followed by SQL keyword VALUES and a comma-separated list of values in parentheses
The values provided must match the column names specified both in order and type

17.2.6 `INSERT` `INTO` Statement¶

Because id is autoincremented, for every new row, SQLite assigns a unique id value that is the next value in the autoincremented sequence

pd.read_sql('SELECT id, first, last FROM authors', 
            connection, index_col=['id'])

Note Regarding Strings That Contain Single Quotes¶

SQL delimits strings with single quotes (').
A string containing a single quote, such as O'Malley, must have two single quotes in the position where the single quote appears (e.g., 'O''Malley').

17.2.7 `UPDATE` Statement¶

Modifies existing values

cursor = cursor.execute("""UPDATE authors SET last='Black'
                           WHERE last='Red' AND first='Sue'""")

The UPDATE keyword is followed by the table to update, the keyword SET and a comma-separated list of column_name = value pairs indicating the columns to change and their new values
Change applied to every row if you do not specify a WHERE clause

17.2.7 `UPDATE` Statement (cont.)¶

For statements that modify the database, the Cursor object’s rowcount attribute contains an integer value representing the number of rows that were modified
If this value is 0, no changes were made
Confirm that the UPDATE modified one row:

cursor.rowcount

1

Also can confirm the update by listing the authors table’s contents:

pd.read_sql('SELECT id, first, last FROM authors', 
            connection, index_col=['id'])

17.2.8 `DELETE` `FROM` Statement¶

Removes rows from a table

cursor = cursor.execute('DELETE FROM authors WHERE id=6')

cursor.rowcount

1

The optional WHERE clause determines which rows to delete
If WHERE is omitted, all the table’s rows are deleted
authors table after the DELETE operation

pd.read_sql('SELECT id, first, last FROM authors', 
            connection, index_col=['id'])

Closing the Database¶

When you no longer need access to the database, disconnect by calling the Connection’s close method

connection.close()

SQL in Big Data¶

SQL’s importance is growing in big data.
Later in this chapter, we’ll use Spark SQL to query data in a Spark DataFrame for which the data may be distributed over many computers in a Spark cluster.
As you’ll see, Spark SQL looks much like the SQL presented in this section.

©1992–2020 by Pearson Education, Inc. All Rights Reserved. This content is based on Chapter 5 of the book Intro to Python for Computer Science and Data Science: Learning to Program with AI, Big Data and the Cloud.

DISCLAIMER: The authors and publisher of this book have used their best efforts in preparing the book. These efforts include the development, research, and testing of the theories and programs to determine their effectiveness. The authors and publisher make no warranty of any kind, expressed or implied, with regard to these programs or to the documentation contained in these books. The authors and publisher shall not be liable in any event for incidental or consequential damages in connection with, or arising out of, the furnishing, performance, or use of these programs.

	first	last
id
1	Paul	Deitel
2	Harvey	Deitel
3	Abbey	Deitel
4	Dan	Quirk
5	Alexander	Wald

	isbn	title	edition	copyright
0	0135404673	Intro to Python for CS and DS	1	2020
1	0132151006	Internet & WWW How to Program	5	2012
2	0134743350	Java How to Program	11	2018
3	0133976890	C How to Program	8	2016
4	0133406954	Visual Basic 2012 How to Program	6	2014
5	0134601548	Visual C# How to Program	6	2017
6	0136151574	Visual C++ How to Program	2	2008
7	0134448235	C++ How to Program	10	2017
8	0134444302	Android How to Program	3	2017
9	0134289366	Android 6 for Programmers	3	2016

	id	isbn
0	1	0134289366
1	2	0134289366
2	5	0134289366
3	1	0135404673
4	2	0135404673

	first	last
0	Paul	Deitel
1	Harvey	Deitel
2	Abbey	Deitel
3	Dan	Quirk
4	Alexander	Wald

	title	edition	copyright
0	Intro to Python for CS and DS	1	2020
1	Java How to Program	11	2018
2	Visual C# How to Program	6	2017
3	C++ How to Program	10	2017
4	Android How to Program	3	2017

17.2 Relational Databases and Structured Query Language (SQL)¶

17.2 Relational Databases and Structured Query Language (SQL) (cont.)¶

Tables, Rows and Columns¶

Tables, Rows and Columns (cont.)¶

Selecting Data Subsets¶

SQLite¶

17.2.1 A books Database¶

Creating the books Database¶

Connecting to the Database in Python¶

authors Table¶

Viewing the authors Table’s Contents¶

Viewing the authors Table’s Contents (cont.)¶

Viewing the titles Table's Contents¶

Viewing the titles Table's Contents (cont.)¶

author_ISBN Table¶

author_ISBN Table (cont.)¶

author_ISBN Table (cont.)¶

author_ISBN Table (cont.)¶

Entity-Relationship (ER) Diagram for the books Database¶

Entity-Relationship (ER) Diagram (cont.)¶

Entity-Relationship (ER) Diagram (cont.)¶

SQL Keywords¶

17.2.2 SELECT Queries¶

17.2.3 WHERE Clause Specifies Selection Criteria¶

Pattern Matching: Zero or More Characters¶

Pattern Matching Any Character¶

17.2.4 ORDER BY Clause¶

Sorting By Multiple Columns¶

Combining the WHERE and ORDER BY Clauses¶

17.2.5 Merging Data from Multiple Tables: INNER JOIN¶

17.2.5 Merging Data from Multiple Tables: INNER JOIN (cont.)¶

17.2.6 INSERT INTO Statement¶

17.2.6 INSERT INTO Statement¶

17.2.6 INSERT INTO Statement¶

Note Regarding Strings That Contain Single Quotes¶

17.2.7 UPDATE Statement¶

17.2.7 UPDATE Statement (cont.)¶

17.2.8 DELETE FROM Statement¶

Closing the Database¶

SQL in Big Data¶

`authors` Table¶

Viewing the `author`s Table’s Contents¶

Viewing the `author`s Table’s Contents (cont.)¶

Viewing the `titles` Table's Contents¶

Viewing the `titles` Table's Contents (cont.)¶

`author_ISBN` Table¶

`author_ISBN` Table (cont.)¶

`author_ISBN` Table (cont.)¶

`author_ISBN` Table (cont.)¶

Entity-Relationship (ER) Diagram for the `books` Database¶

17.2.2 `SELECT` Queries¶

17.2.3 `WHERE` Clause Specifies Selection Criteria¶

17.2.4 `ORDER BY` Clause¶

Combining the `WHERE` and `ORDER` `BY` Clauses¶

17.2.5 Merging Data from Multiple Tables: `INNER` `JOIN`¶

17.2.5 Merging Data from Multiple Tables: `INNER JOIN` (cont.)¶

17.2.6 `INSERT` `INTO` Statement¶

17.2.6 `INSERT` `INTO` Statement¶

17.2.6 `INSERT` `INTO` Statement¶

17.2.7 `UPDATE` Statement¶

17.2.7 `UPDATE` Statement (cont.)¶

17.2.8 `DELETE` `FROM` Statement¶