10.16 Intro to Data Science: Time Series and Simple Linear Regression

Time series: Sequences of values (observations) associated with points in time
- daily closing stock prices
- hourly temperature readings
- changing positions of a plane in flight
- annual crop yields
- quarterly company profits
- time-stamped tweets from Twitter users worldwide
We’ll use simple linear regression to make predictions from time series data

Time Series

Univariate time series: One observation per time
Multivariate time series: Two or more observations per time
Two tasks often performed with time series are:
- Time series analysis, which looks at existing time series data for patterns (like seasonality), helping data analysts understand the data
- Time series forecasting, which uses past data to predict the future
We’ll perform time series forecasting

Simple Linear Regression

Given a collection of values representing an independent variable (the month/year combination) and a dependent variable (the average high temperature for that month/year), simple linear regression describes the relationship between these variables with a straight line, known as the regression line

Linear Relationships

Given a Fahrenheit temperature, we can calculate the corresponding Celsius temperature using:
```
c = 5 / 9 * (f - 32)
```
f (the Fahrenheit temperature) is the independent variable
c (the Celsius temperature) is the dependent variable
Each value of c depends on the value of f used in the calculation

Linear Relationships (cont.)

Plotting Fahrenheit temperatures and their corresponding Celsius temperatures produces a straight line

# enable high-res images in notebook 
%config InlineBackend.figure_format = 'retina'
%matplotlib inline
c = lambda f: 5 / 9 * (f - 32)

temps = [(f, c(f)) for f in range(0, 101, 10)]

Place the data in a DataFrame, then use its plot method to display the linear relationship between the temperatures
style keyword argument controls the data’s appearance
- '.-' indicates that each point should appear as a dot, and that lines should connect the dots

import pandas as pd

temps_df = pd.DataFrame(temps, columns=['Fahrenheit', 'Celsius'])

axes = temps_df.plot(x='Fahrenheit', y='Celsius', style='.-')
y_label = axes.set_ylabel('Celsius')

Components of the Simple Linear Regression Equation

The points along any straight line can be calculated with:

\begin{equation} y = m x + b \end{equation}

m is the line’s slope
b is the line’s intercept with the y-axis (at x = 0)
x is the independent variable (the date in this example)
y is the dependent variable (the temperature in this example)
In simple linear regression, y is the predicted value for a given x
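As a small illustration of these components, the Fahrenheit-to-Celsius formula from earlier can be rewritten in the form y = mx + b. The sketch below assumes a helper named `line` (hypothetical, not part of the original example) and derives m and b by expanding c = 5 / 9 * (f - 32).

```python
# Expanding c = 5/9 * (f - 32) gives c = (5/9) * f + (-32 * 5/9)
m = 5 / 9          # slope: Celsius degrees per Fahrenheit degree
b = -32 * 5 / 9    # intercept: the Celsius value at f = 0


def line(x, m, b):
    """Return the y value on the straight line y = m*x + b (hypothetical helper)."""
    return m * x + b


# Evaluate the line at two Fahrenheit temperatures
for f in (50, 212):
    print(f, round(line(f, m, b), 2))
```

Here f plays the role of x and c plays the role of y; in the regression example, the date and the temperature take those roles instead.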

Function `linregress` from SciPy’s `stats` Module

Simple linear regression determines slope (m) and intercept (b) of a straight line that best fits your data
Following diagram shows a few of the time-series data points we’ll process in this section and a corresponding regression line
- We added vertical lines to indicate each data point’s distance from the regression line

A few time series data points and a regression line

Function `linregress` from SciPy’s `stats` Module (cont.)

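Conceptually, the simple linear regression algorithm adjusts the slope and intercept and, for each adjustment, calculates the square of each point’s vertical distance from the line (for a single independent variable, the best values can also be computed directly with a closed-form formula)
“Best fit” occurs when the slope and intercept values minimize the sum of those squared distances
- This is known as an ordinary least squares calculation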
SciPy (Scientific Python) is widely used for engineering, science and math in Python
- linregress function (from the scipy.stats module) performs simple linear regression for you
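To make the "minimize the sum of squared distances" idea concrete, here is a minimal pure-Python sketch. It uses the standard closed-form least-squares formulas rather than iterative adjustment, and it is not how `linregress` is necessarily implemented. The function names `least_squares` and `sum_squared_distances` are hypothetical, and the five data points are just the first five rows of the dataset (1895 through 1899), not the full series.

```python
# Five January values from the first rows of the dataset (1895-1899);
# a toy sample for illustration only
years = [1895, 1896, 1897, 1898, 1899]
highs = [34.2, 34.7, 35.5, 39.6, 36.4]


def least_squares(xs, ys):
    """Return (slope, intercept) that minimize the sum of squared vertical distances."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean


def sum_squared_distances(xs, ys, slope, intercept):
    """Sum of squared vertical distances between the points and the line."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))


slope, intercept = least_squares(years, highs)
best = sum_squared_distances(years, highs, slope, intercept)

# Moving the slope away from the fitted value makes the total larger
worse = sum_squared_distances(years, highs, slope + 0.1, intercept)
print(slope, intercept, best, worse)
```

Any other slope and intercept pair gives a sum of squared distances at least as large as `best`, which is what "best fit" means here.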

Getting Weather Data from NOAA

The National Oceanic and Atmospheric Administration (NOAA) offers public historical data including time series for average high temperatures in specific cities over various time intervals
We obtained the January average high temperatures for New York City from 1895 through 2018 from NOAA’s “Climate at a Glance” time series at:
https://www.ncdc.noaa.gov/cag/
The data is in the file ave_hi_nyc_jan_1895-2018.csv in the ch10 examples folder
Three columns per observation:
- Date: A value of the form 'YYYYMM' (such as '201801'). MM is always 01 because we downloaded data for only January of each year.
- Value: A floating-point Fahrenheit temperature.
- Anomaly: The difference between the value for the given date and the average value for all dates (not used in this example).

Loading the Average High Temperatures into a `DataFrame`

nyc = pd.read_csv('ave_hi_nyc_jan_1895-2018.csv')

Get a sense of the data

nyc.head()

nyc.tail()

Cleaning the Data

For readability, rename the 'Value' column as 'Temperature'

nyc.columns = ['Date', 'Temperature', 'Anomaly']

nyc.head(3)

Seaborn labels the tick marks on the x-axis with Date values
x-axis labels will be more readable if they do not contain 01 (for January), so we’ll remove it from each Date
Check the column’s type:

nyc.Date.dtype

dtype('int64')

Values are integers, so we can divide by 100 to truncate the last two digits
Series method floordiv performs integer division on every element of the Series

nyc.Date = nyc.Date.floordiv(100)

nyc.head(3)
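The two cleaning steps can be tried without the CSV file. The sketch below is a self-contained illustration: it rebuilds only the first three rows shown by `nyc.head(3)` in a hypothetical DataFrame named `sample`, and it renames the column with `rename`, an alternative to reassigning the whole `columns` list that names only the column being changed.

```python
import pandas as pd

# First three rows as displayed by nyc.head(3); the full CSV is not loaded here
sample = pd.DataFrame({'Date': [189501, 189601, 189701],
                       'Value': [34.2, 34.7, 35.5],
                       'Anomaly': [-3.2, -2.7, -1.9]})

# Rename one column by name; the other column names are left untouched
sample = sample.rename(columns={'Value': 'Temperature'})

# Integer-divide every Date by 100 to drop the trailing 01 (189501 becomes 1895)
sample['Date'] = sample['Date'].floordiv(100)

print(sample)
```

Because `rename` maps old names to new ones, it keeps working if the column order changes, whereas assigning a list to `columns` relies on the order.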

Calculating Basic Descriptive Statistics for the Dataset

Call describe on the Temperature column

pd.set_option('display.precision', 2)

nyc.Temperature.describe()

count    124.00
mean      37.60
std        4.54
min       26.10
25%       34.58
50%       37.60
75%       40.60
max       47.60
Name: Temperature, dtype: float64

Forecasting Future January Average High Temperatures

The SciPy (Scientific Python) library is widely used for engineering, science and math in Python
Its stats module provides function linregress, which calculates a regression line’s slope and intercept

from scipy import stats

linear_regression = stats.linregress(x=nyc.Date,
                                     y=nyc.Temperature)

linregress receives two one-dimensional arrays of the same length representing the data points’ x- and y-coordinates
- x and y represent the independent and dependent variables, respectively
Returns a result object whose slope and intercept attributes hold the regression line’s slope and intercept
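Besides the slope and intercept, the object returned by linregress carries a few fit diagnostics. The following sketch illustrates them on a toy sample: only the five 1895 through 1899 values from the first rows of the dataset, stored in a hypothetical variable named `sample_fit`. Its numbers therefore differ from those of the full 124-year regression.

```python
from scipy import stats

# Five January values from the first rows of the dataset (1895-1899);
# a toy sample for illustration, not the full series
years = [1895, 1896, 1897, 1898, 1899]
highs = [34.2, 34.7, 35.5, 39.6, 36.4]

sample_fit = stats.linregress(x=years, y=highs)

# The result exposes the fitted line plus some fit diagnostics
print('slope:    ', sample_fit.slope)
print('intercept:', sample_fit.intercept)
print('r-value:  ', sample_fit.rvalue)   # correlation coefficient
print('p-value:  ', sample_fit.pvalue)   # tests the hypothesis that the slope is zero
print('stderr:   ', sample_fit.stderr)   # standard error of the slope estimate
```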

linear_regression.slope

0.014771361132966163

linear_regression.intercept

8.694993233674289

Use these values with the simple linear regression equation for a straight line to predict the average January temperature in New York City for a given year
In the following calculation, linear_regression.slope is m, 2019 is x (the date value for which you’d like to predict the temperature), and linear_regression.intercept is b:

linear_regression.slope * 2019 + linear_regression.intercept

38.51837136113297

Approximate the average temperature for January of 1890:

linear_regression.slope * 1890 + linear_regression.intercept

36.612865774980335

We had data for 1895–2018
The further you go outside this range, the less reliable the predictions will be
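The prediction calculations above can be wrapped in a small function. This is a sketch only: `predict` and `in_observed_range` are hypothetical helper names, and the slope and intercept are copied from the linregress results shown earlier rather than recomputed from the CSV file.

```python
# Slope and intercept as reported by linregress earlier in this section
SLOPE = 0.014771361132966163
INTERCEPT = 8.694993233674289


def predict(year, slope=SLOPE, intercept=INTERCEPT):
    """Predicted January average high (degrees F) for a year, using y = m*x + b."""
    return slope * year + intercept


def in_observed_range(year, first=1895, last=2018):
    """Return True if year lies inside the span of years the line was fitted to."""
    return first <= year <= last


# Flag predictions that fall outside the 1895-2018 data
for year in (1890, 1950, 2019, 2100):
    note = '' if in_observed_range(year) else ' (extrapolated)'
    print(f'{year}: {predict(year):.2f}{note}')
```

Flagging extrapolated years is one simple way to keep the reliability caveat visible next to each prediction.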

Plotting the Average High Temperatures and a Regression Line

Seaborn’s regplot function plots each data point with the dates on the x-axis and the temperatures on the y-axis
Creates a scatter plot or scattergram representing the Temperatures for the given Dates and adds the regression line
Function regplot’s x and y keyword arguments are one-dimensional arrays of the same length representing the x-y coordinate pairs to plot

import seaborn as sns

sns.set_style('whitegrid')

axes = sns.regplot(x=nyc.Date, y=nyc.Temperature)
axes.set_ylim(10, 70)

(10, 70)

The temperatures span a 21.5-degree range between the minimum of 26.1 and the maximum of 47.6
With default axis scaling, the y-axis covers only that range, so the data appears to be spread significantly above and below the regression line, making it difficult to see the linear relationship
This is a common issue in data analytics visualizations
Seaborn and Matplotlib auto-scale the axes based on the data’s range of values
We set the y-axis range to 10 through 70 degrees with set_ylim to emphasize the linear relationship

Getting Time Series Datasets

Sources of time-series datasets:
https://data.gov/
This is the U.S. government’s open data portal. Searching for “time series” yields over 7200 time-series datasets.
https://www.ncdc.noaa.gov/cag/
The National Oceanic and Atmospheric Administration (NOAA) Climate at a Glance portal provides both global and U.S. weather-related time series.
https://www.esrl.noaa.gov/psd/data/timeseries/
NOAA’s Earth System Research Laboratory (ESRL) portal provides monthly and seasonal climate-related time series.
https://www.quandl.com/search
Quandl provides hundreds of free financial-related time series, as well as fee-based time series.
https://datamarket.com/data/list/?q=provider:tsdl
The Time Series Data Library (TSDL) provides links to hundreds of time series datasets across many industries.
http://archive.ics.uci.edu/ml/datasets.html
The University of California Irvine (UCI) Machine Learning Repository contains dozens of time-series datasets for a variety of topics.
http://inforumweb.umd.edu/econdata/econdata.html
The University of Maryland’s EconData service provides links to thousands of economic time series from various U.S. government agencies.

©1992–2020 by Pearson Education, Inc. All Rights Reserved. This content is based on Chapter 5 of the book Intro to Python for Computer Science and Data Science: Learning to Program with AI, Big Data and the Cloud.

DISCLAIMER: The authors and publisher of this book have used their best efforts in preparing the book. These efforts include the development, research, and testing of the theories and programs to determine their effectiveness. The authors and publisher make no warranty of any kind, expressed or implied, with regard to these programs or to the documentation contained in these books. The authors and publisher shall not be liable in any event for incidental or consequential damages in connection with, or arising out of, the furnishing, performance, or use of these programs.

Output of nyc.head():

     Date  Value  Anomaly
0  189501   34.2     -3.2
1  189601   34.7     -2.7
2  189701   35.5     -1.9
3  189801   39.6      2.2
4  189901   36.4     -1.0

Output of nyc.tail():

       Date  Value  Anomaly
119  201401   35.5     -1.9
120  201501   36.1     -1.3
121  201601   40.8      3.4
122  201701   42.8      5.4
123  201801   38.7      1.3

Output of nyc.head(3) after renaming the columns:

     Date  Temperature  Anomaly
0  189501         34.2     -3.2
1  189601         34.7     -2.7
2  189701         35.5     -1.9

Output of nyc.head(3) after truncating the dates:

   Date  Temperature  Anomaly
0  1895         34.2     -3.2
1  1896         34.7     -2.7
2  1897         35.5     -1.9