15.5 Case Study: Multiple Linear Regression with the California Housing Dataset

  • California Housing dataset bundled with scikit-learn
  • Larger real-world dataset—20,640 samples, each with eight numerical features
    • Pace, R. Kelley and Ronald Barry, Sparse Spatial Autoregressions, Statistics and Probability Letters, 33 (1997) 291-297. Submitted to the StatLib Datasets Archive by Kelley Pace (kpace@unix1.sncc.lsu.edu). [9/Nov/99].
  • Perform multiple linear regression using all eight numerical features
    • Make more sophisticated housing price predictions than if we were to use only a single feature or a subset of the features
  • LinearRegression estimator performs multiple linear regression by default

15.5.1 Loading the Dataset (1 of 3)

  • According to the California Housing Prices dataset’s description in scikit-learn

    "This dataset was derived from the 1990 U.S. census, using one row per census block group.
    "A block group is the smallest geographical unit for which the U.S. Census Bureau publishes sample data (typically has a population of 600 to 3,000 people)."


15.5.1 Loading the Dataset (2 of 3)

  • The dataset has 20,640 samples—one per block group—with eight features each:
    • median income—in tens of thousands, so 8.37 would represent $83,700
    • median house age—in the dataset, the maximum value for this feature is 52
    • average number of rooms
    • average number of bedrooms
    • block population
    • average house occupancy
    • house block latitude
    • house block longitude

15.5.1 Loading the Dataset (3 of 3)

  • Target—median house value in hundreds of thousands, so 3.55 would represent $355,000
    • Maximum for this feature is 5 for $500,000
  • Reasonable to expect that more bedrooms, more rooms or higher income would mean a higher house value
  • Combine all numeric features to make predictions
    • More likely to get accurate predictions than with simple linear regression, as the sketch below illustrates
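
  • A minimal sketch (not from the original slides) of that comparison: fit one model on MedInc alone and one on all eight features, then compare their R² scores on held-out data; it assumes scikit-learn is installed and uses the same random_state as later sections

from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

california = fetch_california_housing()
X_train, X_test, y_train, y_test = train_test_split(
    california.data, california.target, random_state=11)

# MedInc is column 0 of the data array (see feature_names in Section 15.5.1)
single = LinearRegression().fit(X_train[:, :1], y_train)  # one feature
multi = LinearRegression().fit(X_train, y_train)          # all eight features
print('MedInc only :', single.score(X_test[:, :1], y_test))  # R^2, one feature
print('All features:', multi.score(X_test, y_test))          # R^2, all features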

Loading the Data

  • Use sklearn.datasets function fetch_california_housing
  • We added %matplotlib inline to enable Matplotlib in this notebook.
In [1]:
%matplotlib inline
from sklearn.datasets import fetch_california_housing
In [2]:
california = fetch_california_housing()  # Bunch object 
Downloading Cal. housing from https://ndownloader.figshare.com/files/5976036 to C:\Users\sakyokus\scikit_learn_data
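  • fetch_california_housing downloads the dataset the first time you call it and caches it locally (in the scikit_learn_data folder shown above), so subsequent calls load it from disk rather than downloading it again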

Displaying the Dataset’s Description

In [4]:
print(california.DESCR)
.. _california_housing_dataset:

California Housing dataset
--------------------------

**Data Set Characteristics:**

    :Number of Instances: 20640

    :Number of Attributes: 8 numeric, predictive attributes and the target

    :Attribute Information:
        - MedInc        median income in block
        - HouseAge      median house age in block
        - AveRooms      average number of rooms
        - AveBedrms     average number of bedrooms
        - Population    block population
        - AveOccup      average house occupancy
        - Latitude      house block latitude
        - Longitude     house block longitude

    :Missing Attribute Values: None

This dataset was obtained from the StatLib repository.
http://lib.stat.cmu.edu/datasets/

The target variable is the median house value for California districts.

This dataset was derived from the 1990 U.S. census, using one row per census
block group. A block group is the smallest geographical unit for which the U.S.
Census Bureau publishes sample data (a block group typically has a population
of 600 to 3,000 people).

It can be downloaded/loaded using the
:func:`sklearn.datasets.fetch_california_housing` function.

.. topic:: References

    - Pace, R. Kelley and Ronald Barry, Sparse Spatial Autoregressions,
      Statistics and Probability Letters, 33 (1997) 291-297

  • Confirm number of samples/features, number of targets, feature names
In [3]:
california.data.shape
Out[3]:
(20640, 8)
In [4]:
california.target.shape
Out[4]:
(20640,)
In [5]:
california.feature_names
Out[5]:
['MedInc',
 'HouseAge',
 'AveRooms',
 'AveBedrms',
 'Population',
 'AveOccup',
 'Latitude',
 'Longitude']

15.5.2 Exploring the Data with a Pandas DataFrame

In [6]:
import pandas as pd
In [7]:
pd.set_option('display.precision', 4)  # 4-digit precision for floats
In [ ]:
# Used for command line outputs in IPython interactive mode
#pd.set_option('display.max_columns', 9)  # display up to 9 columns in DataFrame outputs

#pd.set_option('display.width', None)  # auto-detect the display width for wrapping
  • The second statement below adds a DataFrame column for the median house values
In [8]:
california_df = pd.DataFrame(california.data, 
                             columns=california.feature_names)
In [9]:
california_df['MedHouseValue'] = pd.Series(california.target)
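
  • The dataset description says there are no missing values; a quick check (a sketch, not in the original slides) confirms this

california_df.isnull().sum()  # number of missing values per column; expect all zeros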

15.5.2 Exploring the Data with a Pandas DataFrame (cont.)

In [12]:
california_df.head()  # peek at first 5 rows
Out[12]:
   MedInc  HouseAge  AveRooms  AveBedrms  Population  AveOccup  Latitude  Longitude  MedHouseValue
0  8.3252      41.0    6.9841     1.0238       322.0    2.5556     37.88    -122.23          4.526
1  8.3014      21.0    6.2381     0.9719      2401.0    2.1098     37.86    -122.22          3.585
2  7.2574      52.0    8.2881     1.0734       496.0    2.8023     37.85    -122.24          3.521
3  5.6431      52.0    5.8174     1.0731       558.0    2.5479     37.85    -122.25          3.413
4  3.8462      52.0    6.2819     1.0811       565.0    2.1815     37.85    -122.25          3.422

15.5.2 Exploring the Data with a Pandas DataFrame (cont.)

  • Calculate DataFrame’s summary statistics
  • Median income and house values are from 1990 and are significantly higher today
  • Output is left-to-right scrollable in Jupyter if it does not fit in your screen width
In [10]:
california_df.describe()
Out[10]:
           MedInc    HouseAge    AveRooms   AveBedrms  Population    AveOccup    Latitude   Longitude  MedHouseValue
count  20640.0000  20640.0000  20640.0000  20640.0000  20640.0000  20640.0000  20640.0000  20640.0000     20640.0000
mean       3.8707     28.6395      5.4290      1.0967   1425.4767      3.0707     35.6319   -119.5697         2.0686
std        1.8998     12.5856      2.4742      0.4739   1132.4621     10.3860      2.1360      2.0035         1.1540
min        0.4999      1.0000      0.8462      0.3333      3.0000      0.6923     32.5400   -124.3500         0.1500
25%        2.5634     18.0000      4.4407      1.0061    787.0000      2.4297     33.9300   -121.8000         1.1960
50%        3.5348     29.0000      5.2291      1.0488   1166.0000      2.8181     34.2600   -118.4900         1.7970
75%        4.7432     37.0000      6.0524      1.0995   1725.0000      3.2823     37.7100   -118.0100         2.6472
max       15.0001     52.0000    141.9091     34.0667  35682.0000   1243.3333     41.9500   -114.3100         5.0000

15.5.3 Visualizing the Features

  • Helpful to visualize the data by plotting the target value against each feature; this shows how median home value relates to each feature
  • To make our visualizations clearer, let’s use DataFrame method sample to randomly select 10% of the 20,640 samples for graphing
In [11]:
sample_df = california_df.sample(frac=0.1, random_state=17)
  • Display scatter plots of several features
  • Each shows feature on x-axis and median home value on y-axis
In [12]:
import matplotlib.pyplot as plt
In [13]:
import seaborn as sns
In [16]:
# sns.set(font_scale=2)
In [14]:
sns.set_style('whitegrid')                                    
In [15]:
for feature in california.feature_names:
    plt.figure(figsize=(8, 4.5))  # 8"-by-4.5" Figure
    sns.scatterplot(data=sample_df, x=feature, 
                    y='MedHouseValue', hue='MedHouseValue', 
                    palette='cool', legend=False)
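
  • If you run this code as a script outside Jupyter (where %matplotlib inline does not apply), call plt.show() after the loop to display the figure windows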

15.5.3 Visualizing the Features (cont.)

  • Some interesting things to notice in these graphs:
    • Latitude and longitude graphs each have two areas of especially significant density—greater Los Angeles and greater San Francisco areas where house prices tend to be higher
    • Each graph shows a horizontal line of dots at the y-axis value 5, which represents the maximum median house value $500,000 listed on the 1990 census form
    • HouseAge graph shows a vertical line of dots at the x-axis value 52
      • Highest home age on the 1990 census form was 52; the sketch below counts the samples at each of these caps
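
  • A quick sketch (not in the original slides, assuming the california_df DataFrame from Section 15.5.2) counting how many samples sit at these clipped maximums

# count the samples at each clipped maximum value
price_cap = california_df['MedHouseValue'].max()
print((california_df['MedHouseValue'] == price_cap).sum())  # at the price cap
age_cap = california_df['HouseAge'].max()
print((california_df['HouseAge'] == age_cap).sum())         # at the age cap (52)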

15.5.4 Splitting the Data for Training and Testing Using train_test_split

In [16]:
from sklearn.model_selection import train_test_split
In [17]:
X_train, X_test, y_train, y_test = train_test_split(
    california.data, california.target, random_state=11)
In [18]:
X_train.shape
Out[18]:
(15480, 8)
In [19]:
X_test.shape
Out[19]:
(5160, 8)

15.5.5 Training the Model

  • LinearRegression tries to use all features in a dataset’s data array
    • Produces an error if any features are categorical
    • Categorical data must be preprocessed into numerical data or excluded
  • Scikit-learn’s bundled datasets are already in the correct format for training
In [20]:
from sklearn.linear_model import LinearRegression
In [21]:
linear_regression = LinearRegression()
In [22]:
linear_regression.fit(X=X_train, y=y_train)
Out[22]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
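  • The parameter list displayed above is from an older scikit-learn version; newer versions display only non-default parameters, and the normalize parameter has since been removed from LinearRegression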

15.5.5 Training the Model (cont.)

  • Separate coefficients for each feature (stored in coef_) and one intercept (stored in intercept_)
    • Positive coefficients — median house value increases as feature value increases
    • Negative coefficients — median house value decreases as feature value increases
    • The coefficients for HouseAge, AveOccup and Population are close to zero, so these features apparently have little to no effect on median house value

15.5.5 Training the Model (cont.)

  • Can use coefficient values in following equation to make predictions:
\begin{equation} y = m_1 x_1 + m_2 x_2 + ... + m_n x_n + b \end{equation}
  • m1, m2, …, mn are the feature coefficients
  • b is the intercept
  • x1, x2, …, xn are feature values (the independent variables)
  • y is the predicted value (the dependent variable)
In [23]:
linear_regression.coef_
Out[23]:
array([ 4.37703022e-01,  9.21683457e-03, -1.07325266e-01,  6.11713307e-01,
       -5.75682201e-06, -3.38456647e-03, -4.19481861e-01, -4.33771335e-01])
In [24]:
for i, name in enumerate(california.feature_names):
    print(f'{name:>10}: {linear_regression.coef_[i]}')  
    MedInc: 0.4377030215382206
  HouseAge: 0.009216834565797713
  AveRooms: -0.10732526637360985
 AveBedrms: 0.611713307391811
Population: -5.756822009298454e-06
  AveOccup: -0.0033845664657163703
  Latitude: -0.419481860964907
 Longitude: -0.4337713349874016
In [25]:
linear_regression.intercept_
Out[25]:
-36.88295065605547
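
  • As a check (a sketch, not in the original slides), apply the equation above manually to one test sample using the coefficients and intercept, and confirm that it matches the estimator's predict method (introduced in the next section); assumes the X_test array from Section 15.5.4

import numpy as np

# y = m1*x1 + m2*x2 + ... + mn*xn + b for the first test sample
manual = np.dot(X_test[0], linear_regression.coef_) + linear_regression.intercept_
print(manual)                                    # manually computed prediction
print(linear_regression.predict(X_test[:1])[0])  # same value via predict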

15.5.6 Testing the Model

  • Use the estimator’s predict method
In [26]:
predicted = linear_regression.predict(X_test)
In [28]:
expected = y_test
In [29]:
predicted[:5]  # first 5 predictions
Out[29]:
array([1.25396876, 2.34693107, 2.03794745, 1.8701254 , 2.53608339])
In [30]:
expected[:5]   # first five targets 
Out[30]:
array([0.762, 1.732, 1.125, 1.37 , 1.856])

15.5.6 Testing the Model with the Estimator’s predict Method (cont.)

  • In classification, predictions were distinct classes that matched existing classes in the dataset
  • In regression, it’s tough to get exact predictions, because you have continuous outputs
    • Every possible value of x1, x2, …, xn in the following calculation predicts a value
\begin{equation} y = m_1 x_1 + m_2 x_2 + ... + m_n x_n + b \end{equation}

15.5.7 Visualizing the Expected vs. Predicted Prices

  • Create a DataFrame containing columns for the expected and predicted values:
In [31]:
df = pd.DataFrame()
In [32]:
df['Expected'] = pd.Series(expected)
In [33]:
df['Predicted'] = pd.Series(predicted)

15.5.7 Visualizing the Expected vs. Predicted Prices (cont.)

  • Plot the data as a scatter plot with the expected (target) prices along the x-axis and the predicted prices along the y-axis:
In [34]:
figure = plt.figure(figsize=(9, 9))
axes = sns.scatterplot(data=df, x='Expected', y='Predicted',
    hue='Predicted', palette='cool', legend=False)
start = min(expected.min(), predicted.min())
end = max(expected.max(), predicted.max())
axes.set_xlim(start, end)
axes.set_ylim(start, end)
line = plt.plot([start, end], [start, end], 'k--')

15.5.7 Visualizing the Expected vs. Predicted Prices (cont.)

  • Set the x- and y-axes’ limits to use the same scale along both axes:
  • Plot a line that represents perfect predictions (this is not a regression line).
    • The call to plot displays a line between the points representing the lower-left corner of the graph (start, start) and the upper-right corner of the graph (end, end).
    • The third argument ('k--') indicates the line’s style.
    • The letter k represents the color black, and the -- indicates that plot should draw a dashed line:
  • If every predicted value were to match the expected value, then all the dots would be plotted along the dashed line.
  • Appears that as the expected median house value increases, more of the predicted values fall below the line.
  • So the model seems to predict lower median house values as the expected median house value increases; the sketch below examines the prediction errors directly
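
  • A sketch (not in the original slides, using the df built above) that summarizes the errors; a negative mean would indicate that the model underpredicts on average

df['Error'] = df['Predicted'] - df['Expected']
print(df['Error'].describe())  # summary statistics of the prediction errors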

15.5.8 Regression Model Metrics

  • Metrics for regression estimators include the coefficient of determination ($R^{2}$ score), which ranges from 0.0 to 1.0
    • 1.0 — estimator perfectly predicts the dependent variable’s value, given the independent variables’ values
    • 0.0 — model cannot make predictions with any accuracy, given the independent variables’ values
  • Calculate with arrays representing the expected and predicted results
In [35]:
from sklearn import metrics
In [36]:
metrics.r2_score(expected, predicted)
Out[36]:
0.6008983115964333
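
  • Another common regression metric is the mean squared error, also provided by sklearn.metrics (an additional check, not in the original slides)

metrics.mean_squared_error(expected, predicted)  # mean of the squared prediction errors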

15.5.9 Choosing the Best Model

In [48]:
from sklearn.linear_model import ElasticNet, Lasso, Ridge
In [49]:
estimators = {
    'LinearRegression': linear_regression,
    'ElasticNet': ElasticNet(),
    'Lasso': Lasso(),
    'Ridge': Ridge()
}

15.5.9 Choosing the Best Model (cont.)

  • Run the estimators using k-fold cross-validation
  • cross_val_score argument scoring='r2' — report $R^{2}$ scores for each fold
    • 1.0 is best, so LinearRegression and Ridge appear to be the best models for this dataset
In [50]:
from sklearn.model_selection import KFold, cross_val_score
In [51]:
for estimator_name, estimator_object in estimators.items():
    kfold = KFold(n_splits=10, random_state=11, shuffle=True)
    scores = cross_val_score(estimator=estimator_object, 
        X=california.data, y=california.target, cv=kfold,
        scoring='r2')
    print(f'{estimator_name:>16}: ' + 
          f'mean of r2 scores={scores.mean():.3f}')
LinearRegression: mean of r2 scores=0.599
      ElasticNet: mean of r2 scores=0.423
           Lasso: mean of r2 scores=0.285
           Ridge: mean of r2 scores=0.599

©1992–2020 by Pearson Education, Inc. All Rights Reserved. This content is based on Chapter 5 of the book Intro to Python for Computer Science and Data Science: Learning to Program with AI, Big Data and the Cloud.

DISCLAIMER: The authors and publisher of this book have used their best efforts in preparing the book. These efforts include the development, research, and testing of the theories and programs to determine their effectiveness. The authors and publisher make no warranty of any kind, expressed or implied, with regard to these programs or to the documentation contained in these books. The authors and publisher shall not be liable in any event for incidental or consequential damages in connection with, or arising out of, the furnishing, performance, or use of these programs.