17.2 Relational Databases and Structured Query Language (SQL)

  • Databases are crucial, especially for big data
  • Chapter 9 covered sequential text-file processing, working with data from CSV files and working with JSON
    • Both useful when most or all of a file’s data is to be processed
  • In transaction processing we need to locate and, possibly, update an individual data item quickly

17.2 Relational Databases and Structured Query Language (SQL) (cont.)

  • A database is an integrated collection of data
  • A database management system (DBMS) provides mechanisms for storing and organizing data
    • Allows for rapid access and storage of data without concern for the internal representation of databases
  • Relational database management systems (RDBMSs) store data in tables and define relationships among the tables
  • Structured Query Language (SQL)—which we pronounce as “see-quel”—is used almost universally with relational database systems to manipulate data and perform queries**, which request information that satisfies given criteria
  • Popular open-source RDBMSs include SQLite, PostgreSQL, MariaDB and MySQL
    • These can be downloaded and used freely by anyone
    • All have support for Python
  • Here we use SQLite, which is bundled with Python
  • Popular proprietary RDBMSs include Microsoft SQL Server, Oracle, Sybase and IBM Db2

Tables, Rows and Columns

  • Relational database — logical table-based representation of data
  • Employee table that might be used in a personnel system to store employees’ attributes: A sample Employee table that might be used in a personnel system

Tables, Rows and Columns (cont.)

  • Table rows each describe a single entity
    • Here, each row represents one employee
  • Columns contain individual attribute values
  • Primary key—a column (or group of columns) with a value that’s unique for each row, so each row can be identified by its primary key
  • Examples of primary keys
    • Social Security numbers
    • employee ID numbers
    • part numbers in an inventory system

Selecting Data Subsets

  • Structured Query Language (SQL) queries specify subsets of the data to select from a table
    • E.g., select from the Employee table each department and location, and present the data in increasing order by department number

Selected data showing where each department is located, sorted in increasing order by department number

SQLite

  • Open-source database management system included with Python
  • Most popular database systems have Python support

17.2.1 A books Database

  • books database contains information about several of our books
  • Will set up the database in SQLite via the Python Standard Library’s sqlite3 module, using a script provided in the ch17 example’s folder’s sql subfolder
  • Introduce the database’s tables
  • Use this database to introduce database concepts, such as operations that create, read, update and delete data
    • So-called CRUD operations
  • We’ll use SQL and pandas DataFrames to show you each table’s contents
  • Then, in the next several sections, we’ll discuss additional SQL features

Creating the books Database

  • Instructor Note: This command was modified for use in this notebook
  • The following sqlite3 command creates a SQLite database named books.db and executes the books.sql SQL script, which defines how to create the database’s tables and populates them with data:
In [1]:
!sqlite3 books.db < ./sql/books.sql
  • The notation < indicates that books.sql is input into the sqlite3 command.
  • When the command completes, the database is ready for use.
  • Begin a new IPython session.

Connecting to the Database in Python

  • sqlite3 connect function connects to a database and returns a Connection object:
In [2]:
import sqlite3
In [3]:
connection = sqlite3.connect('books.db')

authors Table

  • The database has three tables—authors, author_ISBN and titles
  • The authors table stores the authors and has three columns:
    • id
      • Autoincremented unique integer id number
      • primary key
    • first—First name (a string)
    • last—Last name (a string)

Viewing the authors Table’s Contents

  • Use SQL and pandas to display the authors table
In [4]:
import pandas as pd
In [5]:
pd.options.display.max_columns = 10
In [6]:
pd.read_sql('SELECT * FROM authors', connection,
            index_col=['id'])
Out[6]:
first last
id
1 Paul Deitel
2 Harvey Deitel
3 Abbey Deitel
4 Dan Quirk
5 Alexander Wald

Viewing the authors Table’s Contents (cont.)

  • Pandas function read_sql executes a SQL query and returns a DataFrame containing the query’s results
  • Arguments are:
    • SQL query as a string
    • Connection object
    • Optionally, index_col keyword argument
      • column to use as the DataFrame’s row indices
      • index_col index values from 0, otherwise
  • SELECT gets rows and columns from one or more tables
  • SELECT * FROM authors
    
    • Asterisk (*) wildcard gets all columns

Viewing the titles Table's Contents

  • titles table stores the books and has four columns:
    • isbn
      • ISBN string is this table’s primary key
      • ISBNs give every book a unique identification number
    • title—Book title string
    • edition—Integer edition number
    • copyright—Copyright year as a string

Viewing the titles Table's Contents (cont.)

In [7]:
pd.read_sql('SELECT * FROM titles', connection)
Out[7]:
isbn title edition copyright
0 0135404673 Intro to Python for CS and DS 1 2020
1 0132151006 Internet & WWW How to Program 5 2012
2 0134743350 Java How to Program 11 2018
3 0133976890 C How to Program 8 2016
4 0133406954 Visual Basic 2012 How to Program 6 2014
5 0134601548 Visual C# How to Program 6 2017
6 0136151574 Visual C++ How to Program 2 2008
7 0134448235 C++ How to Program 10 2017
8 0134444302 Android How to Program 3 2017
9 0134289366 Android 6 for Programmers 3 2016

author_ISBN Table

  • Associates authors from the authors table with their books in the titles table:
    • id—Integer author id
    • isbn—String ISBN of a book
  • id column is a foreign key
    • Must match a primary-key column in another table
    • Matches authors table’s id primary-key column
  • isbn column is a foreign key
    • Matches the titles table’s isbn primary-key column

author_ISBN Table (cont.)

  • A goal when designing a database is to minimize data duplication among the tables
  • To do this, each table represents a specific entity, and foreign keys help link the data in multiple tables
  • The primary keys and foreign keys are designated when you create the database tables (in our case, in the books.sql script)
  • Together the id and isbn columns in this table form a composite primary key
  • Every row in this table uniquely matches one author to one book’s ISBN

author_ISBN Table (cont.)

In [8]:
df = pd.read_sql('SELECT * FROM author_ISBN', connection)
In [9]:
df.head()  # view first 5 of this table's many entries
Out[9]:
id isbn
0 1 0134289366
1 2 0134289366
2 5 0134289366
3 1 0135404673
4 2 0135404673

author_ISBN Table (cont.)

  • Rule of Referential Integrity Every foreign-key value must appear as the primary-key value in a row of another table so the DBMS can ensure that the foreign-key value is valid
  • One-to-many relationship between a primary key and a corresponding foreign key
    • One author can write many books
    • One book can be written by many authors
  • Foreign key can appear many times in its table but only once as the primary key in another table

Entity-Relationship (ER) Diagram for the books Database

  • The names in italic are primary keys
  • Rule of Entity Integrity
    • Every row must have a unique primary-key value

Entity-relationship (ER) diagram for the books database showing the database’s tables and the relationships among them

Entity-Relationship (ER) Diagram (cont.)

  • The lines connecting the tables represent the relationships among the tables.
  • Consider the line between authors and author_ISBN.
    • On the authors end there’s a 1, and on the author_ISBN end there’s an infinity symbol (¥).
    • This indicates a one-to-many relationship.
    • For each author in the authors table, there can be an arbitrary number of ISBNs for books written by that author in the author_ISBN table
    • The line links authors table's id column (the primary key) to the author_ISBN table's id column (a foreign key)

Entity-Relationship (ER) Diagram (cont.)

  • The line between the titles and author_ISBN tables illustrates a one-to-many relationship
    • One book can be written by many authors
    • The line links the titles table's primary key isbn to the author_ISBN table's foreign key
  • Entity-relationship diagram illustrates that the sole purpose of the author_ISBN table is to provide a many-to-many relationship between the authors and titles tables
    • An author can write many books, and a book can have many authors

SQL Keywords

  • The following subsections continue our SQL presentation in the context of our books database, demonstrating SQL queries and statements using the SQL keywords in the following table.
SQL keyword        Description
SELECT Retrieves data from one or more tables.
FROM Tables involved in the query. Required in every SELECT.
WHERE Criteria for selection that determine the rows to be retrieved, deleted or updated. Optional in a SQL statement.
GROUP BY Criteria for grouping rows. Optional in a SELECT query.
ORDER BY Criteria for ordering rows. Optional in a SELECT query.
INNER JOIN Merge rows from multiple tables.
INSERT Insert rows into a specified table.
UPDATE Update rows in a specified table.
DELETE Delete rows from a specified table.

17.2.2 SELECT Queries

  • SELECT statements with the * wildcard character get all the columns from a table
  • Typically, you need only a subset, especially in big data where you could have dozens, hundreds, thousands or more columns
  • To retrieve specific columns, specify a comma-separated list of column names
In [10]:
pd.read_sql('SELECT first, last FROM authors', connection)
Out[10]:
first last
0 Paul Deitel
1 Harvey Deitel
2 Abbey Deitel
3 Dan Quirk
4 Alexander Wald

17.2.3 WHERE Clause Specifies Selection Criteria

  • Often select rows that satisfy certain selection criteria (predicates), especially in big data where a database might contain millions or billions of rows
In [11]:
pd.read_sql("""SELECT title, edition, copyright 
               FROM titles 
               WHERE copyright > '2016'""", connection)
Out[11]:
title edition copyright
0 Intro to Python for CS and DS 1 2020
1 Java How to Program 11 2018
2 Visual C# How to Program 6 2017
3 C++ How to Program 10 2017
4 Android How to Program 3 2017

Pattern Matching: Zero or More Characters

  • WHERE clause operators: <, >, <=, >=, =, <> (not equal) and LIKE
  • LIKE is used for pattern matching
  • percent (%) wildcard character matches zero or more characters
In [12]:
pd.read_sql("""SELECT id, first, last 
               FROM authors 
               WHERE last LIKE 'D%'""", 
            connection, index_col=['id'])
Out[12]:
first last
id
1 Paul Deitel
2 Harvey Deitel
3 Abbey Deitel

Pattern Matching Any Character

  • Use underscore ( _ )
In [13]:
pd.read_sql("""SELECT id, first, last 
               FROM authors 
               WHERE first LIKE '_b%'""", 
            connection, index_col=['id'])
Out[13]:
first last
id
3 Abbey Deitel

17.2.4 ORDER BY Clause

  • Sorts a query’s results into ascending or descending order (highest to lowest), specified with ASC and DESC, respectively
  • Default is ASC
In [14]:
pd.read_sql('SELECT title FROM titles ORDER BY title ASC',
            connection)
Out[14]:
title
0 Android 6 for Programmers
1 Android How to Program
2 C How to Program
3 C++ How to Program
4 Internet & WWW How to Program
5 Intro to Python for CS and DS
6 Java How to Program
7 Visual Basic 2012 How to Program
8 Visual C# How to Program
9 Visual C++ How to Program

Sorting By Multiple Columns

  • Specify comma-separated list of column names after the ORDER BY
In [15]:
pd.read_sql("""SELECT id, first, last 
               FROM authors 
               ORDER BY last, first""", 
            connection, index_col=['id'])
Out[15]:
first last
id
3 Abbey Deitel
2 Harvey Deitel
1 Paul Deitel
4 Dan Quirk
5 Alexander Wald
  • Sort order can vary by column
In [16]:
pd.read_sql("""SELECT id, first, last 
               FROM authors 
               ORDER BY last DESC, first ASC""", 
            connection, index_col=['id'])
Out[16]:
first last
id
5 Alexander Wald
4 Dan Quirk
3 Abbey Deitel
2 Harvey Deitel
1 Paul Deitel

Combining the WHERE and ORDER BY Clauses

In [17]:
pd.read_sql("""SELECT isbn, title, edition, copyright
               FROM titles
               WHERE title LIKE '%How to Program'
               ORDER BY title""", connection)
Out[17]:
isbn title edition copyright
0 0134444302 Android How to Program 3 2017
1 0133976890 C How to Program 8 2016
2 0134448235 C++ How to Program 10 2017
3 0132151006 Internet & WWW How to Program 5 2012
4 0134743350 Java How to Program 11 2018
5 0133406954 Visual Basic 2012 How to Program 6 2014
6 0134601548 Visual C# How to Program 6 2017
7 0136151574 Visual C++ How to Program 2 2008

17.2.5 Merging Data from Multiple Tables: INNER JOIN

  • List of authors accompanied by the ISBNs for books written by each author
    • Because there are many results for this query, we show just the head of the result
In [18]:
pd.read_sql("""SELECT first, last, isbn
               FROM authors
               INNER JOIN author_ISBN
                   ON authors.id = author_ISBN.id
               ORDER BY last, first""", connection).head()
Out[18]:
first last isbn
0 Abbey Deitel 0132151006
1 Abbey Deitel 0133406954
2 Harvey Deitel 0134289366
3 Harvey Deitel 0135404673
4 Harvey Deitel 0132151006

17.2.5 Merging Data from Multiple Tables: INNER JOIN (cont.)

  • INNER JOIN’s ON clause uses a primary-key column in one table and a foreign-key column in the other to determine which rows to merge from each table.
  • This query merges the authors table’s first and last columns with the author_ISBN table’s isbn column and sorts the results in ascending order by last then first.
  • authors.id in the ON clause is a qualified name
    • Required if the columns have the same name in both tables

17.2.6 INSERT INTO Statement

  • Use a sqlite3 Cursor object to modify the database
  • Obtain by calling the Connection’s cursor method
In [19]:
cursor = connection.cursor()
  • The INSERT INTO statement inserts a row into a table

17.2.6 INSERT INTO Statement

  • Insert a new author named Sue Red into the authors table by calling Cursor method execute
In [20]:
cursor = cursor.execute("""INSERT INTO authors (first, last)
                           VALUES ('Sue', 'Red')""")
  • List of column names is followed by SQL keyword VALUES and a comma-separated list of values in parentheses
  • The values provided must match the column names specified both in order and type

17.2.6 INSERT INTO Statement

  • Because id is autoincremented, for every new row, SQLite assigns a unique id value that is the next value in the autoincremented sequence
In [21]:
pd.read_sql('SELECT id, first, last FROM authors', 
            connection, index_col=['id'])
Out[21]:
first last
id
1 Paul Deitel
2 Harvey Deitel
3 Abbey Deitel
4 Dan Quirk
5 Alexander Wald
6 Sue Red

Note Regarding Strings That Contain Single Quotes

  • SQL delimits strings with single quotes (').
  • A string containing a single quote, such as O'Malley, must have two single quotes in the position where the single quote appears (e.g., 'O''Malley').

17.2.7 UPDATE Statement

  • Modifies existing values
In [22]:
cursor = cursor.execute("""UPDATE authors SET last='Black'
                           WHERE last='Red' AND first='Sue'""") 
  • The UPDATE keyword is followed by the table to update, the keyword SET and a comma-separated list of column_name = value pairs indicating the columns to change and their new values
  • Change applied to every row if you do not specify a WHERE clause

17.2.7 UPDATE Statement (cont.)

  • For statements that modify the database, the Cursor object’s rowcount attribute contains an integer value representing the number of rows that were modified
  • If this value is 0, no changes were made
  • Confirm that the UPDATE modified one row:
In [23]:
cursor.rowcount
Out[23]:
1
  • Also can confirm the update by listing the authors table’s contents:
In [24]:
pd.read_sql('SELECT id, first, last FROM authors', 
            connection, index_col=['id'])
Out[24]:
first last
id
1 Paul Deitel
2 Harvey Deitel
3 Abbey Deitel
4 Dan Quirk
5 Alexander Wald
6 Sue Black

17.2.8 DELETE FROM Statement

  • Removes rows from a table
In [25]:
cursor = cursor.execute('DELETE FROM authors WHERE id=6') 
In [26]:
cursor.rowcount
Out[26]:
1
  • The optional WHERE clause determines which rows to delete
  • If WHERE is omitted, all the table’s rows are deleted
  • authors table after the DELETE operation
In [27]:
pd.read_sql('SELECT id, first, last FROM authors', 
            connection, index_col=['id'])
Out[27]:
first last
id
1 Paul Deitel
2 Harvey Deitel
3 Abbey Deitel
4 Dan Quirk
5 Alexander Wald

Closing the Database

  • When you no longer need access to the database, disconnect by calling the Connection’s close method
In [28]:
connection.close()

SQL in Big Data

  • SQL’s importance is growing in big data.
  • Later in this chapter, we’ll use Spark SQL to query data in a Spark DataFrame for which the data may be distributed over many computers in a Spark cluster.
  • As you’ll see, Spark SQL looks much like the SQL presented in this section.

©1992–2020 by Pearson Education, Inc. All Rights Reserved. This content is based on Chapter 5 of the book Intro to Python for Computer Science and Data Science: Learning to Program with AI, Big Data and the Cloud.

DISCLAIMER: The authors and publisher of this book have used their best efforts in preparing the book. These efforts include the development, research, and testing of the theories and programs to determine their effectiveness. The authors and publisher make no warranty of any kind, expressed or implied, with regard to these programs or to the documentation contained in these books. The authors and publisher shall not be liable in any event for incidental or consequential damages in connection with, or arising out of, the furnishing, performance, or use of these programs.