LinearRegression
The LinearRegression estimator performs multiple linear regression by default.

"This dataset was derived from the 1990 U.S. census, using one row per census block group. A block group is the smallest geographical unit for which the U.S. Census Bureau publishes sample data (typically has a population of 600 to 3,000 people)."
Use the sklearn.datasets function fetch_california_housing to load the dataset, and the %matplotlib inline magic to enable Matplotlib in this notebook.

%matplotlib inline
from sklearn.datasets import fetch_california_housing
california = fetch_california_housing() # Bunch object
print(california.DESCR)
california.data.shape
california.target.shape
california.feature_names
import pandas as pd
pd.set_option('display.precision', 4)  # 4-digit precision for floats
# Used for command-line outputs in IPython interactive mode
# pd.set_option('display.max_columns', 9)  # display up to 9 columns in DataFrame outputs
# pd.set_option('display.width', None)  # auto-detect the display width for wrapping
Create a DataFrame from the dataset's data array, then add a column for the median house values:

california_df = pd.DataFrame(california.data,
                             columns=california.feature_names)
california_df['MedHouseValue'] = pd.Series(california.target)
california_df.head() # peek at first 5 rows
DataFrame (cont.)
The DataFrame's summary statistics:

california_df.describe()
Use the DataFrame method sample to randomly select 10% of the 20,640 samples for graphing:

sample_df = california_df.sample(frac=0.1, random_state=17)
import matplotlib.pyplot as plt
import seaborn as sns
# sns.set(font_scale=2)
sns.set_style('whitegrid')
for feature in california.feature_names:
    plt.figure(figsize=(8, 4.5))  # 8"-by-4.5" Figure
    sns.scatterplot(data=sample_df, x=feature,
                    y='MedHouseValue', hue='MedHouseValue',
                    palette='cool', legend=False)
The HouseAge graph shows a vertical line of dots at the x-axis value 52.

train_test_split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    california.data, california.target, random_state=11)
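By default, train_test_split holds out 25% of the samples for testing. A quick back-of-the-envelope check of the resulting split sizes for this dataset's 20,640 samples (assuming the default test_size):

```python
# train_test_split's default test_size is 0.25, so with 20,640 samples:
total = 20_640
test = round(total * 0.25)   # 5,160 samples for testing
train = total - test         # 15,480 samples for training
print(train, test)  # 15480 5160
```

These counts match the shapes reported by X_train.shape and X_test.shape below.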
X_train.shape
X_test.shape
LinearRegression
By default, LinearRegression uses all the features in a dataset's data array.

from sklearn.linear_model import LinearRegression
linear_regression = LinearRegression()
linear_regression.fit(X=X_train, y=y_train)
The estimator stores one coefficient per feature (in coef_) and one intercept (in intercept_):

linear_regression.coef_
for i, name in enumerate(california.feature_names):
    print(f'{name:>10}: {linear_regression.coef_[i]}')
linear_regression.intercept_
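A multiple linear regression prediction is just the dot product of the coefficients with a sample's feature values, plus the intercept. A minimal NumPy sketch with made-up numbers (the coefficients and feature values below are hypothetical, not the fitted values from this model):

```python
import numpy as np

# Hypothetical 3-feature model -- the real values come from
# linear_regression.coef_ and linear_regression.intercept_
coef = np.array([0.4, 0.01, -0.1])
intercept = 2.0
x = np.array([3.0, 25.0, 5.0])  # one sample's feature values

# prediction = intercept + coef[0]*x[0] + coef[1]*x[1] + coef[2]*x[2]
prediction = intercept + np.dot(coef, x)
print(prediction)  # 2.0 + 1.2 + 0.25 - 0.5 = 2.95
```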
Make predictions with the estimator's predict method:

predicted = linear_regression.predict(X_test)
expected = y_test
predicted[:5] # first 5 predictions
expected[:5] # first five targets
The predict Method (cont.)
Create a DataFrame containing columns for the expected and predicted values:

df = pd.DataFrame()
df['Expected'] = pd.Series(expected)
df['Predicted'] = pd.Series(predicted)
figure = plt.figure(figsize=(9, 9))
axes = sns.scatterplot(data=df, x='Expected', y='Predicted',
hue='Predicted', palette='cool', legend=False)
start = min(expected.min(), predicted.min())
end = max(expected.max(), predicted.max())
axes.set_xlim(start, end)
axes.set_ylim(start, end)
line = plt.plot([start, end], [start, end], 'k--')
The call to plot displays a line between the points representing the lower-left corner of the graph (start, start) and the upper-right corner (end, end). The string 'k--' specifies the line's style: k represents the color black, and -- indicates that plot should draw a dashed line.

from sklearn import metrics
metrics.r2_score(expected, predicted)
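For intuition, $R^{2}$ is 1 minus the ratio of the residual sum of squares to the total sum of squares. A hand-rolled sketch on tiny made-up arrays (metrics.r2_score computes the same quantity):

```python
import numpy as np

expected = np.array([1.0, 2.0, 3.0, 4.0])   # made-up targets
predicted = np.array([1.1, 1.9, 3.2, 3.8])  # made-up predictions

# R^2 = 1 - SS_res / SS_tot
ss_res = np.sum((expected - predicted) ** 2)        # residual sum of squares
ss_tot = np.sum((expected - expected.mean()) ** 2)  # total sum of squares
r2 = 1 - ss_res / ss_tot
print(round(r2, 4))  # 0.98
```

A score of 1.0 means the predictions match the targets exactly; predicting the mean for every sample gives 0.0, and worse-than-mean predictions give negative scores.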
Compare LinearRegression with several other linear estimators:

from sklearn.linear_model import ElasticNet, Lasso, Ridge
estimators = {
'LinearRegression': linear_regression,
'ElasticNet': ElasticNet(),
'Lasso': Lasso(),
'Ridge': Ridge()
}
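Ridge differs from plain linear regression by adding an L2 penalty (strength alpha) that shrinks the coefficients toward zero. A minimal NumPy sketch of the closed-form ridge solution on synthetic data — an illustration of the underlying math, not scikit-learn's actual solver:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))            # synthetic features
true_w = np.array([1.0, -2.0, 0.5])      # synthetic "true" coefficients
y = X @ true_w + rng.normal(scale=0.1, size=100)

alpha = 1.0  # regularization strength (Ridge's alpha parameter)
# Closed-form ridge solution: w = (X^T X + alpha*I)^(-1) X^T y
w = np.linalg.solve(X.T @ X + alpha * np.eye(3), X.T @ y)
print(np.round(w, 2))  # close to true_w, slightly shrunk toward 0
```

Larger alpha values shrink the coefficients more; alpha=0 recovers ordinary least squares.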
The cross_val_score argument scoring='r2' reports the $R^{2}$ score for each fold. LinearRegression and Ridge appear to be the best models for this dataset.

from sklearn.model_selection import KFold, cross_val_score
for estimator_name, estimator_object in estimators.items():
    kfold = KFold(n_splits=10, random_state=11, shuffle=True)
    scores = cross_val_score(estimator=estimator_object,
                             X=california.data, y=california.target,
                             cv=kfold, scoring='r2')
    print(f'{estimator_name:>16}: ' +
          f'mean of r2 scores={scores.mean():.3f}')
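Conceptually, KFold partitions the sample indices into n_splits folds, each serving once as the test set while the remaining folds form the training set. A simplified pure-Python sketch of that idea (no shuffling; not scikit-learn's implementation):

```python
def kfold_indices(n_samples, n_splits):
    """Partition range(n_samples) into n_splits folds; yield
    (train_indices, test_indices) pairs, one per fold."""
    fold_sizes = [n_samples // n_splits] * n_splits
    for i in range(n_samples % n_splits):
        fold_sizes[i] += 1  # spread the remainder over the first folds
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n_samples)
                 if i < start or i >= start + size]
        yield train, test
        start += size

# 10 samples, 3 folds -> fold sizes 4, 3, 3
print([len(test) for _, test in kfold_indices(10, 3)])  # [4, 3, 3]
```

Every sample appears in exactly one test fold, so each model is scored on data it never trained on.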
©1992–2020 by Pearson Education, Inc. All Rights Reserved. This content is based on Chapter 5 of the book Intro to Python for Computer Science and Data Science: Learning to Program with AI, Big Data and the Cloud.
DISCLAIMER: The authors and publisher of this book have used their best efforts in preparing the book. These efforts include the development, research, and testing of the theories and programs to determine their effectiveness. The authors and publisher make no warranty of any kind, expressed or implied, with regard to these programs or to the documentation contained in these books. The authors and publisher shall not be liable in any event for incidental or consequential damages in connection with, or arising out of, the furnishing, performance, or use of these programs.