15.5 Case Study: Multiple Linear Regression with the California Housing Dataset

  • California Housing dataset bundled with scikit-learn
  • Larger real-world dataset—20,640 samples, each with eight numerical features
    • Pace, R. Kelley and Ronald Barry, Sparse Spatial Autoregressions, Statistics and Probability Letters, 33 (1997) 291-297. Submitted to the StatLib Datasets Archive by Kelley Pace (kpace@unix1.sncc.lsu.edu). [9/Nov/99].
  • Perform multiple linear regression using all eight numerical features
    • Make more sophisticated housing price predictions than if we were to use only a single feature or a subset of the features
  • LinearRegression estimator performs multiple linear regression by default

15.5.1 Loading the Dataset (1 of 3)

  • According to the California Housing Prices dataset’s description in scikit-learn

    "This dataset was derived from the 1990 U.S. census, using one row per census block group.
    "A block group is the smallest geographical unit for which the U.S. Census Bureau publishes sample data (typically has a population of 600 to 3,000 people)."


15.5.1 Loading the Dataset (2 of 3)

  • The dataset has 20,640 samples—one per block group—with eight features each:
    • median income—in tens of thousands, so 8.37 would represent $83,700
    • median house age—in the dataset, the maximum value for this feature is 52
    • average number of rooms
    • average number of bedrooms
    • block population
    • average house occupancy
    • house block latitude
    • house block longitude

15.5.1 Loading the Dataset (3 of 3)

  • Target—median house value in hundreds of thousands, so 3.55 would represent $355,000
    • Maximum for this feature is 5 for $500,000
  • Reasonable to expect that more bedrooms, more rooms or higher income would mean a higher house value
  • Combine all numeric features to make predictions
    • More likely to get accurate predictions than with simple linear regression, as the sketch below illustrates
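
  • A minimal sketch (not from the original slides) of that comparison: fit one model on MedInc alone and one on all eight features, then compare their R² scores on held-out data; it assumes scikit-learn is installed and uses the same random_state as later sections

from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

california = fetch_california_housing()
X_train, X_test, y_train, y_test = train_test_split(
    california.data, california.target, random_state=11)

# MedInc is column 0 of the data array (see feature_names in Section 15.5.1)
single = LinearRegression().fit(X_train[:, :1], y_train)  # one feature
multi = LinearRegression().fit(X_train, y_train)          # all eight features
print('MedInc only :', single.score(X_test[:, :1], y_test))  # R^2, one feature
print('All features:', multi.score(X_test, y_test))          # R^2, all features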

Loading the Data

  • Use sklearn.datasets function fetch_california_housing
  • We added %matplotlib inline to enable Matplotlib in this notebook.
In [1]:
%matplotlib inline
from sklearn.datasets import fetch_california_housing
In [2]:
california = fetch_california_housing()  # Bunch object 
Downloading Cal. housing from https://ndownloader.figshare.com/files/5976036 to C:\Users\sakyokus\scikit_learn_data
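  • fetch_california_housing downloads the dataset the first time you call it and caches it locally (in the scikit_learn_data folder shown above), so subsequent calls load it from disk rather than downloading it again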

Displaying the Dataset’s Description

In [4]:
print(california.DESCR)
.. _california_housing_dataset:

California Housing dataset
--------------------------

**Data Set Characteristics:**

    :Number of Instances: 20640

    :Number of Attributes: 8 numeric, predictive attributes and the target

    :Attribute Information:
        - MedInc        median income in block
        - HouseAge      median house age in block
        - AveRooms      average number of rooms
        - AveBedrms     average number of bedrooms
        - Population    block population
        - AveOccup      average house occupancy
        - Latitude      house block latitude
        - Longitude     house block longitude

    :Missing Attribute Values: None

This dataset was obtained from the StatLib repository.
http://lib.stat.cmu.edu/datasets/

The target variable is the median house value for California districts.

This dataset was derived from the 1990 U.S. census, using one row per census
block group. A block group is the smallest geographical unit for which the U.S.
Census Bureau publishes sample data (a block group typically has a population
of 600 to 3,000 people).

It can be downloaded/loaded using the
:func:`sklearn.datasets.fetch_california_housing` function.

.. topic:: References

    - Pace, R. Kelley and Ronald Barry, Sparse Spatial Autoregressions,
      Statistics and Probability Letters, 33 (1997) 291-297

  • Confirm number of samples/features, number of targets, feature names
In [3]:
california.data.shape
Out[3]:
(20640, 8)
In [4]:
california.target.shape
Out[4]:
(20640,)
In [5]:
california.feature_names
Out[5]:
['MedInc',
 'HouseAge',
 'AveRooms',
 'AveBedrms',
 'Population',
 'AveOccup',
 'Latitude',
 'Longitude']

15.5.2 Exploring the Data with a Pandas DataFrame

In [6]:
import pandas as pd
In [7]:
pd.set_option('display.precision', 4)  # 4-digit precision for floats
In [ ]:
# Used for command line outputs in IPython interactive mode
#pd.set_option('display.max_columns', 9)  # display up to 9 columns in DataFrame outputs

#pd.set_option('display.width', None)  # auto-detect the display width for wrapping
  • The second statement below adds a DataFrame column for the median house values
In [8]:
california_df = pd.DataFrame(california.data, 
                             columns=california.feature_names)
In [9]:
california_df['MedHouseValue'] = pd.Series(california.target)
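
  • The dataset description says there are no missing values; a quick check (a sketch, not in the original slides) confirms this

california_df.isnull().sum()  # number of missing values per column; expect all zeros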

15.5.2 Exploring the Data with a Pandas DataFrame (cont.)

In [12]:
california_df.head()  # peek at first 5 rows
Out[12]:
   MedInc  HouseAge  AveRooms  AveBedrms  Population  AveOccup  Latitude  Longitude  MedHouseValue
0  8.3252      41.0    6.9841     1.0238       322.0    2.5556     37.88    -122.23          4.526
1  8.3014      21.0    6.2381     0.9719      2401.0    2.1098     37.86    -122.22          3.585
2  7.2574      52.0    8.2881     1.0734       496.0    2.8023     37.85    -122.24          3.521
3  5.6431      52.0    5.8174     1.0731       558.0    2.5479     37.85    -122.25          3.413
4  3.8462      52.0    6.2819     1.0811       565.0    2.1815     37.85    -122.25          3.422

15.5.2 Exploring the Data with a Pandas DataFrame (cont.)

  • Calculate DataFrame’s summary statistics
  • Median income and house values are from 1990 and are significantly higher today
  • Output is left-to-right scrollable in Jupyter if it does not fit in your screen width
In [10]:
california_df.describe()
Out[10]:
           MedInc    HouseAge    AveRooms   AveBedrms  Population    AveOccup    Latitude   Longitude  MedHouseValue
count  20640.0000  20640.0000  20640.0000  20640.0000  20640.0000  20640.0000  20640.0000  20640.0000     20640.0000
mean       3.8707     28.6395      5.4290      1.0967   1425.4767      3.0707     35.6319   -119.5697         2.0686
std        1.8998     12.5856      2.4742      0.4739   1132.4621     10.3860      2.1360      2.0035         1.1540
min        0.4999      1.0000      0.8462      0.3333      3.0000      0.6923     32.5400   -124.3500         0.1500
25%        2.5634     18.0000      4.4407      1.0061    787.0000      2.4297     33.9300   -121.8000         1.1960
50%        3.5348     29.0000      5.2291      1.0488   1166.0000      2.8181     34.2600   -118.4900         1.7970
75%        4.7432     37.0000      6.0524      1.0995   1725.0000      3.2823     37.7100   -118.0100         2.6472
max       15.0001     52.0000    141.9091     34.0667  35682.0000   1243.3333     41.9500   -114.3100         5.0000

15.5.3 Visualizing the Features

  • Helpful to visualize the data by plotting the target value against each feature; this shows how median home value relates to each feature
  • To make our visualizations clearer, let’s use DataFrame method sample to randomly select 10% of the 20,640 samples for graphing
In [11]:
sample_df = california_df.sample(frac=0.1, random_state=17)
  • Display scatter plots of several features
  • Each shows feature on x-axis and median home value on y-axis
In [12]:
import matplotlib.pyplot as plt
In [13]:
import seaborn as sns
In [16]:
# sns.set(font_scale=2)
In [14]:
sns.set_style('whitegrid')                                    
In [15]:
for feature in california.feature_names:
    plt.figure(figsize=(8, 4.5))  # 8"-by-4.5" Figure
    sns.scatterplot(data=sample_df, x=feature, 
                    y='MedHouseValue', hue='MedHouseValue', 
                    palette='cool', legend=False)
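
  • If you run this code as a script outside Jupyter (where %matplotlib inline does not apply), call plt.show() after the loop to display the figure windows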

15.5.3 Visualizing the Features (cont.)

  • Some interesting things to notice in these graphs:
    • Latitude and longitude graphs each have two areas of especially significant density—greater Los Angeles and greater San Francisco areas where house prices tend to be higher
    • Each graph shows a horizontal line of dots at the y-axis value 5, which represents the maximum median house value $500,000 listed on the 1990 census form
    • HouseAge graph shows a vertical line of dots at the x-axis value 52
      • Highest home age on the 1990 census form was 52; the sketch below counts the samples at each of these caps
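
  • A quick sketch (not in the original slides, assuming the california_df DataFrame from Section 15.5.2) counting how many samples sit at these clipped maximums

# count the samples at each clipped maximum value
price_cap = california_df['MedHouseValue'].max()
print((california_df['MedHouseValue'] == price_cap).sum())  # at the price cap
age_cap = california_df['HouseAge'].max()
print((california_df['HouseAge'] == age_cap).sum())         # at the age cap (52)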

15.5.4 Splitting the Data for Training and Testing Using train_test_split

In [16]:
from sklearn.model_selection import train_test_split
In [17]:
X_train, X_test, y_train, y_test = train_test_split(
    california.data, california.target, random_state=11)
In [18]:
X_train.shape
Out[18]:
(15480, 8)
In [19]:
X_test.shape
Out[19]:
(5160, 8)

15.5.5 Training the Model

  • LinearRegression tries to use all features in a dataset’s data array
    • Produces an error if any features are categorical
    • Categorical data must be preprocessed into numerical data or excluded
  • Scikit-learn’s bundled datasets are already in the correct format for training
In [20]:
from sklearn.linear_model import LinearRegression
In [21]:
linear_regression = LinearRegression()
In [22]:
linear_regression.fit(X=X_train, y=y_train)
Out[22]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
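  • The parameter list displayed above is from an older scikit-learn version; newer versions display only non-default parameters, and the normalize parameter has since been removed from LinearRegression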

15.5.5 Training the Model (cont.)

  • Separate coefficients for each feature (stored in coef_) and one intercept (stored in intercept_)
    • Positive coefficients — median house value increases as feature value increases
    • Negative coefficients — median house value decreases as feature value increases
    • The coefficients for HouseAge, AveOccup and Population are close to zero, so these features apparently have little to no effect on median house value

15.5.5 Training the Model (cont.)

  • Can use coefficient values in following equation to make predictions:
\begin{equation} y = m_1 x_1 + m_2 x_2 + ... + m_n x_n + b \end{equation}
  • m1, m2, …, mn are the feature coefficients
  • b is the intercept
  • x1, x2, …, xn are feature values (the independent variables)
  • y is the predicted value (the dependent variable)
In [23]:
linear_regression.coef_
Out[23]:
array([ 4.37703022e-01,  9.21683457e-03, -1.07325266e-01,  6.11713307e-01,
       -5.75682201e-06, -3.38456647e-03, -4.19481861e-01, -4.33771335e-01])
In [24]:
for i, name in enumerate(california.feature_names):
    print(f'{name:>10}: {linear_regression.coef_[i]}')  
    MedInc: 0.4377030215382206
  HouseAge: 0.009216834565797713
  AveRooms: -0.10732526637360985
 AveBedrms: 0.611713307391811
Population: -5.756822009298454e-06
  AveOccup: -0.0033845664657163703
  Latitude: -0.419481860964907
 Longitude: -0.4337713349874016
In [25]:
linear_regression.intercept_
Out[25]:
-36.88295065605547
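
  • As a check (a sketch, not in the original slides), apply the equation above manually to one test sample using the coefficients and intercept, and confirm that it matches the estimator's predict method (introduced in the next section); assumes the X_test array from Section 15.5.4

import numpy as np

# y = m1*x1 + m2*x2 + ... + mn*xn + b for the first test sample
manual = np.dot(X_test[0], linear_regression.coef_) + linear_regression.intercept_
print(manual)                                    # manually computed prediction
print(linear_regression.predict(X_test[:1])[0])  # same value via predict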

15.5.6 Testing the Model

  • Use the estimator’s predict method
In [26]:
predicted = linear_regression.predict(X_test)
In [28]:
expected = y_test
In [29]:
predicted[:5]  # first 5 predictions
Out[29]:
array([1.25396876, 2.34693107, 2.03794745, 1.8701254 , 2.53608339])
In [30]:
expected[:5]   # first five targets 
Out[30]:
array([0.762, 1.732, 1.125, 1.37 , 1.856])

15.5.6 Testing the Model with the Estimator’s predict Method (cont.)

  • In classification, predictions were distinct classes that matched existing classes in the dataset
  • In regression, it’s tough to get exact predictions, because you have continuous outputs
    • Every possible value of x1, x2, …, xn in the following calculation predicts a value
\begin{equation} y = m_1 x_1 + m_2 x_2 + ... + m_n x_n + b \end{equation}

15.5.7 Visualizing the Expected vs. Predicted Prices

  • Create a DataFrame containing columns for the expected and predicted values:
In [31]:
df = pd.DataFrame()
In [32]:
df['Expected'] = pd.Series(expected)
In [33]:
df['Predicted'] = pd.Series(predicted)

15.5.7 Visualizing the Expected vs. Predicted Prices (cont.)

  • Plot the data as a scatter plot with the expected (target) prices along the x-axis and the predicted prices along the y-axis:
In [34]:
figure = plt.figure(figsize=(9, 9))
axes = sns.scatterplot(data=df, x='Expected', y='Predicted',
    hue='Predicted', palette='cool', legend=False)
start = min(expected.min(), predicted.min())
end = max(expected.max(), predicted.max())
axes.set_xlim(start, end)
axes.set_ylim(start, end)
line = plt.plot([start, end], [start, end], 'k--')

15.5.7 Visualizing the Expected vs. Predicted Prices (cont.)

  • Set the x- and y-axes’ limits to use the same scale along both axes:
  • Plot a line that represents perfect predictions (this is not a regression line).
    • The call to plot displays a line between the points representing the lower-left corner of the graph (start, start) and the upper-right corner of the graph (end, end).
    • The third argument ('k--') indicates the line’s style.
    • The letter k represents the color black, and the -- indicates that plot should draw a dashed line:
  • If every predicted value were to match the expected value, then all the dots would be plotted along the dashed line.
  • Appears that as the expected median house value increases, more of the predicted values fall below the line.
  • So the model seems to predict lower median house values as the expected median house value increases; the sketch below examines the prediction errors directly
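
  • A sketch (not in the original slides, using the df built above) that summarizes the errors; a negative mean would indicate that the model underpredicts on average

df['Error'] = df['Predicted'] - df['Expected']
print(df['Error'].describe())  # summary statistics of the prediction errors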

15.5.8 Regression Model Metrics

  • Metrics for regression estimators include the coefficient of determination ($R^{2}$ score), which ranges from 0.0 to 1.0
    • 1.0 — estimator perfectly predicts the dependent variable’s value, given the independent variables’ values
    • 0.0 — model cannot make predictions with any accuracy, given the independent variables’ values
  • Calculate with arrays representing the expected and predicted results
In [35]:
from sklearn import metrics
In [36]:
metrics.r2_score(expected, predicted)
Out[36]:
0.6008983115964333
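
  • Another common regression metric is the mean squared error, also provided by sklearn.metrics (an additional check, not in the original slides)

metrics.mean_squared_error(expected, predicted)  # mean of the squared prediction errors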

15.5.9 Choosing the Best Model

In [48]:
from sklearn.linear_model import ElasticNet, Lasso, Ridge
In [49]:
estimators = {
    'LinearRegression': linear_regression,
    'ElasticNet': ElasticNet(),
    'Lasso': Lasso(),
    'Ridge': Ridge()
}

15.5.9 Choosing the Best Model (cont.)

  • Run the estimators using k-fold cross-validation
  • cross_val_score argument scoring='r2' — report $R^{2}$ scores for each fold
    • 1.0 is best, so LinearRegression and Ridge appear to be the best models for this dataset
In [50]:
from sklearn.model_selection import KFold, cross_val_score
In [51]:
for estimator_name, estimator_object in estimators.items():
    kfold = KFold(n_splits=10, random_state=11, shuffle=True)
    scores = cross_val_score(estimator=estimator_object, 
        X=california.data, y=california.target, cv=kfold,
        scoring='r2')
    print(f'{estimator_name:>16}: ' + 
          f'mean of r2 scores={scores.mean():.3f}')
LinearRegression: mean of r2 scores=0.599
      ElasticNet: mean of r2 scores=0.423
           Lasso: mean of r2 scores=0.285
           Ridge: mean of r2 scores=0.599

©1992–2020 by Pearson Education, Inc. All Rights Reserved. This content is based on Chapter 5 of the book Intro to Python for Computer Science and Data Science: Learning to Program with AI, Big Data and the Cloud.

DISCLAIMER: The authors and publisher of this book have used their best efforts in preparing the book. These efforts include the development, research, and testing of the theories and programs to determine their effectiveness. The authors and publisher make no warranty of any kind, expressed or implied, with regard to these programs or to the documentation contained in these books. The authors and publisher shall not be liable in any event for incidental or consequential damages in connection with, or arising out of, the furnishing, performance, or use of these programs.