15.4 Case Study: Time Series and Simple Linear Regression

  • Simple linear regression is the simplest regression algorithm
  • Given a collection of numeric values representing an independent variable and a dependent variable, simple linear regression describes the relationship between these variables with a straight line, known as the regression line
  • Using a time series of average New York City January high-temperature data for 1895 through 2018, we'll
    • Perform simple linear regression
    • Display a scatter plot with a regression line
    • Use the coefficient and intercept values calculated by the estimator to make predictions
  • Temperature data stored in ave_hi_nyc_jan_1895-2018.csv

Loading the Average High Temperatures into a DataFrame

  • Load the data from ave_hi_nyc_jan_1895-2018.csv, rename the 'Value' column to 'Temperature', remove the 01 (representing January) from the end of each Date value via floor division by 100 (189501 // 100 is 1895), then display a few data samples:

We added %matplotlib inline to enable Matplotlib in this notebook.

In [1]:
%matplotlib inline
import pandas as pd
In [2]:
nyc = pd.read_csv('ave_hi_nyc_jan_1895-2018.csv')
In [3]:
nyc.head(3)
Out[3]:
     Date  Value  Anomaly
0  189501   34.2     -3.2
1  189601   34.7     -2.7
2  189701   35.5     -1.9
In [4]:
nyc.columns = ['Date', 'Temperature', 'Anomaly']
In [5]:
nyc.Date = nyc.Date.floordiv(100)
In [6]:
nyc.head(3)
Out[6]:
   Date  Temperature  Anomaly
0  1895         34.2     -3.2
1  1896         34.7     -2.7
2  1897         35.5     -1.9

Splitting the Data for Training and Testing (1 of 3)

  • We’ll use the LinearRegression estimator from sklearn.linear_model
  • By default, this estimator uses all the numeric features in a dataset to perform multiple linear regression
  • For simple linear regression, we select one feature (the Date here) as the independent variable
    • A column in a DataFrame is a one-dimensional Series
    • Scikit-learn estimators require their training and testing data to be two-dimensional
    • We'll transform the Series of n elements into a two-dimensional array of n rows and one column, as sketched below
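  • A minimal sketch of that transformation with toy values (our example, not the dataset):

import numpy as np

series_values = np.array([1895, 1896, 1897])  # one-dimensional, shape (3,)
X = series_values.reshape(-1, 1)  # two-dimensional, shape (3, 1): three rows, one column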

Splitting the Data for Training and Testing (2 of 3)

  • nyc.Date.values returns a NumPy array containing the Date column's values
  • reshape(-1, 1) tells reshape to infer the number of rows, based on the number of columns (1) and the number of elements (124) in the array
    • Transformed array will have 124 rows and one column
In [7]:
from sklearn.model_selection import train_test_split
In [8]:
X_train, X_test, y_train, y_test = train_test_split(
    nyc.Date.values.reshape(-1, 1), nyc.Temperature.values, random_state=11)

Splitting the Data for Training and Testing (3 of 3)

  • Confirm the 75%–25% train-test split
In [9]:
X_train.shape
Out[9]:
(93, 1)
In [10]:
X_test.shape
Out[10]:
(31, 1)
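
  • By default, train_test_split holds out 25% of the samples for testing; a hedged sketch showing how the proportion could be set explicitly with the test_size keyword argument:

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    nyc.Date.values.reshape(-1, 1), nyc.Temperature.values,
    test_size=0.25, random_state=11)  # test_size=0.25 matches the default split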

Training the Model (1 of 2)

In [11]:
from sklearn.linear_model import LinearRegression
In [12]:
linear_regression = LinearRegression()
In [13]:
linear_regression.fit(X=X_train, y=y_train)
Out[13]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
  • To find the best-fitting regression line for the data, the LinearRegression estimator calculates the slope and intercept that minimize the sum of the squares of the data points' vertical distances from the line (ordinary least squares)
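  • For a single feature, the least-squares slope and intercept have a well-known closed form; this hedged NumPy sketch (our variable names, not the book's) recomputes them directly and should reproduce coef_ and intercept_:

import numpy as np

x = X_train.ravel()  # flatten the (93, 1) array back to one dimension
x_mean, y_mean = x.mean(), y_train.mean()

# slope m = sum((x - xbar) * (y - ybar)) / sum((x - xbar) ** 2)
m = ((x - x_mean) * (y_train - y_mean)).sum() / ((x - x_mean) ** 2).sum()
b = y_mean - m * x_mean  # intercept b = ybar - m * xbar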

Training the Model (2 of 2)

  • We'll soon use the slope and intercept to make predictions with the equation y = mx + b
  • The slope is the estimator's coef_ attribute (m in the equation)
  • The intercept is the estimator's intercept_ attribute (b in the equation)
In [14]:
linear_regression.coef_
Out[14]:
array([0.01939167])
In [15]:
linear_regression.intercept_
Out[15]:
-0.30779820252656265

Testing the Model

  • Test the model using the data in X_test and check some of the predictions
In [16]:
predicted = linear_regression.predict(X_test)
In [17]:
expected = y_test
In [18]:
for p, e in zip(predicted[::5], expected[::5]):  # check every 5th element
    print(f'predicted: {p:.2f}, expected: {e:.2f}')
predicted: 37.86, expected: 31.70
predicted: 38.69, expected: 34.80
predicted: 37.00, expected: 39.40
predicted: 37.25, expected: 45.70
predicted: 38.05, expected: 32.30
predicted: 37.64, expected: 33.80
predicted: 36.94, expected: 39.70
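
  • Rather than eyeballing individual predictions, the model's overall fit can be quantified; a hedged sketch using scikit-learn's metrics module:

from sklearn import metrics

print(metrics.mean_squared_error(expected, predicted))  # mean squared error
print(linear_regression.score(X_test, y_test))  # coefficient of determination (R^2) on the test set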

Predicting Future Temperatures and Estimating Past Temperatures

  • Use the coefficient and intercept values to make predictions
In [19]:
# lambda implements y = mx + b
predict = (lambda x: linear_regression.coef_ * x +
                     linear_regression.intercept_)
In [20]:
predict(2019)
Out[20]:
array([38.84399018])
In [21]:
predict(1890)
Out[21]:
array([36.34246432])
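
  • The estimator's own predict method yields the same values as the lambda if you pass it a two-dimensional array; a hedged equivalence check:

import numpy as np

# one row per value to predict; matches predict(2019) and predict(1890) above
linear_regression.predict(np.array([[2019], [1890]]))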

Visualizing the Dataset with the Regression Line

  • Create a scatter plot with a regression line
  • Cooler temperatures shown in darker colors

  • Instructor Note: All code that modifies a graph must be in the same notebook cell

In [22]:
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

# scatter plot with cooler temperatures shown in darker colors
axes = sns.scatterplot(data=nyc, x='Date', y='Temperature',
    hue='Temperature', palette='winter', legend=False)

axes.set_ylim(10, 70)  # scale the y-axis

# regression line endpoints at the earliest and latest dates
x = np.array([min(nyc.Date.values), max(nyc.Date.values)])
y = predict(x)  # the lambda from In [19] gives the endpoints' y-coordinates

line = plt.plot(x, y)  # draw the regression line through the scatter plot
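
  • Seaborn can also draw the scatter plot and the fit line in one call; a hedged alternative sketch (regplot computes its own least-squares fit, which should agree with the model's):

import seaborn as sns

axes = sns.regplot(data=nyc, x='Date', y='Temperature',
                   line_kws={'color': 'black'})  # fit line drawn in black
axes.set_ylim(10, 70)  # scale the y-axis as before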

Overfitting/Underfitting

  • Common problems that prevent accurate predictions
  • When creating a model, a key goal is making accurate predictions for data the model has not yet seen
  • Underfitting occurs when a model is too simple to capture the relationships in its training data, so its predictions are inaccurate
    • This can happen if you use a linear model, such as simple linear regression, when the problem really requires a non-linear model
  • Overfitting occurs when your model is too complex (see the sketch after this list)
    • The most extreme case would be a model that memorizes its training data
    • New data that matches the training data will produce perfect predictions, but the model will not know what to do with data it has never seen
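  • A hedged sketch of the overfitting extreme using scikit-learn's PolynomialFeatures (our illustration, not the book's): a degree-10 polynomial model is far more flexible than this data warrants, so its training fit typically improves while its test fit degrades:

from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# scale the Date values first so the high polynomial powers stay numerically stable
overfit_model = make_pipeline(
    StandardScaler(), PolynomialFeatures(degree=10), LinearRegression())
overfit_model.fit(X_train, y_train)

print(overfit_model.score(X_train, y_train))  # R^2 on the training data
print(overfit_model.score(X_test, y_test))  # R^2 on unseen test data, typically lower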

©1992–2020 by Pearson Education, Inc. All Rights Reserved. This content is based on Chapter 15 of the book Intro to Python for Computer Science and Data Science: Learning to Program with AI, Big Data and the Cloud.

DISCLAIMER: The authors and publisher of this book have used their best efforts in preparing the book. These efforts include the development, research, and testing of the theories and programs to determine their effectiveness. The authors and publisher make no warranty of any kind, expressed or implied, with regard to these programs or to the documentation contained in these books. The authors and publisher shall not be liable in any event for incidental or consequential damages in connection with, or arising out of, the furnishing, performance, or use of these programs.