In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV

from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

import warnings
warnings.filterwarnings('ignore')
In [2]:
housing = pd.read_csv('./datasets/housing/housing.csv')
In [3]:
housing.head()
Out[3]:
longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value ocean_proximity
0 -122.23 37.88 41.0 880.0 129.0 322.0 126.0 8.3252 452600.0 NEAR BAY
1 -122.22 37.86 21.0 7099.0 1106.0 2401.0 1138.0 8.3014 358500.0 NEAR BAY
2 -122.24 37.85 52.0 1467.0 190.0 496.0 177.0 7.2574 352100.0 NEAR BAY
3 -122.25 37.85 52.0 1274.0 235.0 558.0 219.0 5.6431 341300.0 NEAR BAY
4 -122.25 37.85 52.0 1627.0 280.0 565.0 259.0 3.8462 342200.0 NEAR BAY
  • The features differ widely in scale.
  • ocean_proximity is a categorical feature.
In [4]:
housing.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 10 columns):
longitude             20640 non-null float64
latitude              20640 non-null float64
housing_median_age    20640 non-null float64
total_rooms           20640 non-null float64
total_bedrooms        20433 non-null float64
population            20640 non-null float64
households            20640 non-null float64
median_income         20640 non-null float64
median_house_value    20640 non-null float64
ocean_proximity       20640 non-null object
dtypes: float64(9), object(1)
memory usage: 1.6+ MB
  • total_bedrooms has some missing values.
In [5]:
housing.describe()
Out[5]:
longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value
count 20640.000000 20640.000000 20640.000000 20640.000000 20433.000000 20640.000000 20640.000000 20640.000000 20640.000000
mean -119.569704 35.631861 28.639486 2635.763081 537.870553 1425.476744 499.539680 3.870671 206855.816909
std 2.003532 2.135952 12.585558 2181.615252 421.385070 1132.462122 382.329753 1.899822 115395.615874
min -124.350000 32.540000 1.000000 2.000000 1.000000 3.000000 1.000000 0.499900 14999.000000
25% -121.800000 33.930000 18.000000 1447.750000 296.000000 787.000000 280.000000 2.563400 119600.000000
50% -118.490000 34.260000 29.000000 2127.000000 435.000000 1166.000000 409.000000 3.534800 179700.000000
75% -118.010000 37.710000 37.000000 3148.000000 647.000000 1725.000000 605.000000 4.743250 264725.000000
max -114.310000 41.950000 52.000000 39320.000000 6445.000000 35682.000000 6082.000000 15.000100 500001.000000
In [6]:
housing.hist(bins=50, figsize=(20,15))
plt.show()
  • households, population, total_bedrooms, and total_rooms all have long right tails; a log transform is worth considering (see the sketch below).
  • housing_median_age and median_house_value appear to be winsorized at the top, i.e., values above a cap were all set to the same maximum value.
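The long right tails can be compressed with a log transform. A minimal sketch (illustrative only, not used in the rest of this notebook): apply np.log1p to the skewed columns and re-plot their histograms.

# Hedged sketch: log1p (log(1 + x), safe for zeros) applied to the long-tailed columns.
skewed_cols = ["total_rooms", "total_bedrooms", "population", "households"]
log_transformed = np.log1p(housing[skewed_cols])
log_transformed.hist(bins=50, figsize=(12, 8))
plt.show()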

Exploratory data analysis

  • Exploratory data analysis (EDA) is an important step in every data project, econometrics projects included.
  • EDA matters even more in prediction problems, because you are not bound by how earlier work processed the data. In an econometrics project, if the prior literature includes a variable, you very likely have to include it too; in a prediction problem that is not required. It is therefore worth exploring the data thoroughly to find the variables, or combinations of variables, that make your model perform better.
In [7]:
numeric_features = ["housing_median_age","total_rooms","total_bedrooms",
                    "population","households","median_income","median_house_value"]
In [8]:
sns.pairplot(housing[numeric_features])
Out[8]:
<seaborn.axisgrid.PairGrid at 0x1a1b3d85d0>
  • median_income and median_house_value show a positive relationship.
  • Because of the winsorization, many points in the scatter plots pile up on the same horizontal line.
  • Consider "per-household" and "per-capita" feature combinations.
In [9]:
housing['rooms_per_household'] = housing['total_rooms'] / housing['households']
housing['bedrooms_per_household'] = housing['total_bedrooms'] / housing['households']
housing['rooms_per_population'] = housing['total_rooms'] / housing['population']
housing['bedrooms_per_population'] = housing['total_bedrooms'] / housing['population']
housing['population_per_household'] = housing['population'] / housing['households']
In [10]:
numeric_features = numeric_features + ['rooms_per_household','bedrooms_per_household',
                                       'rooms_per_population','bedrooms_per_population',
                                       'population_per_household']
                        
In [11]:
numeric_features
Out[11]:
['housing_median_age',
 'total_rooms',
 'total_bedrooms',
 'population',
 'households',
 'median_income',
 'median_house_value',
 'rooms_per_household',
 'bedrooms_per_household',
 'rooms_per_population',
 'bedrooms_per_population',
 'population_per_household']

Check the correlations with median_house_value

In [12]:
housing.corr()['median_house_value'].sort_values(ascending=False)
Out[12]:
median_house_value          1.000000
median_income               0.688075
rooms_per_population        0.209482
rooms_per_household         0.151948
total_rooms                 0.134153
housing_median_age          0.105623
bedrooms_per_population     0.069896
households                  0.065843
total_bedrooms              0.049686
population_per_household   -0.023737
population                 -0.024650
longitude                  -0.045967
bedrooms_per_household     -0.046739
latitude                   -0.144160
Name: median_house_value, dtype: float64
  • median_income is fairly strongly correlated with median_house_value.
  • Latitude also shows a relatively strong correlation with median_house_value.
  • Only numeric variables have been considered so far. Next, look at ocean_proximity, i.e., how distance to the ocean relates to house prices.
In [13]:
housing.groupby('ocean_proximity').median_house_value.describe()
Out[13]:
count mean std min 25% 50% 75% max
ocean_proximity
<1H OCEAN 9136.0 240084.285464 106124.292213 17500.0 164100.0 214850.0 289100.0 500001.0
INLAND 6551.0 124805.392001 70007.908494 14999.0 77500.0 108500.0 148950.0 500001.0
ISLAND 5.0 380440.000000 80559.561816 287500.0 300000.0 414700.0 450000.0 450000.0
NEAR BAY 2290.0 259212.311790 122818.537064 22500.0 162500.0 233800.0 345700.0 500001.0
NEAR OCEAN 2658.0 249433.977427 122477.145927 22500.0 150000.0 229450.0 322750.0 500001.0

Clearly, houses closer to the ocean are worth more.

Train test split

  • A simple strategy: split the data 80/20, training on a random 80% sample and using the remaining 20% to measure how good the model is.
  • More elaborate splits are also possible, e.g., further splitting the training sample into train and validation sets, or sampling according to the distribution of the data (stratified sampling; see the sketch after this list).
  • For simplicity, a plain 80/20 split is used here.
  • You can draw the random sample yourself, or use train_test_split from sklearn.
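A minimal sketch of the stratified variant mentioned above (illustrative only, not used below): bin median_income with pd.cut and pass the bins to the stratify argument of train_test_split, so the income distribution is similar in both sets.

# Hedged sketch: stratified 80/20 split on binned median_income (illustrative only).
income_bins = pd.cut(housing["median_income"],
                     bins=[0.0, 1.5, 3.0, 4.5, 6.0, np.inf],
                     labels=[1, 2, 3, 4, 5])
strat_train, strat_test = train_test_split(housing, test_size=0.2,
                                           stratify=income_bins,
                                           random_state=42)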
In [14]:
train, test = train_test_split(housing, test_size=0.2, random_state=42)
In [15]:
train_X = train.drop('median_house_value', axis=1)
train_y = train['median_house_value'].copy()
test_X = test.drop('median_house_value', axis=1)
test_y = test['median_house_value'].copy()

Data cleaning

Based on the inspection above, there are a few things to take care of (each is handled step by step below; a combined sketch follows this list):

  • missing values
  • the categorical feature ocean_proximity
  • putting features on a common scale
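For reference, here is a hedged sketch of how these three steps could be combined into a single sklearn ColumnTransformer; this notebook does not actually use it and performs each step by hand instead.

# Hedged sketch (not used below): impute + scale the numeric columns and
# one-hot encode ocean_proximity in one transformer.
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

num_cols = [c for c in train_X.columns if c != "ocean_proximity"]
cat_cols = ["ocean_proximity"]

preprocess = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), num_cols),
    ("cat", OneHotEncoder(), cat_cols),
])
# train_X_alt = preprocess.fit_transform(train_X)
# test_X_alt = preprocess.transform(test_X)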

Missing values

Missing values are filled with the column median here.

In [16]:
imputer = SimpleImputer(strategy='median')
train_X_num = train_X.drop('ocean_proximity', axis=1)
train_X_num = imputer.fit_transform(train_X_num)
In [17]:
test_X_num = test_X.drop('ocean_proximity', axis=1)
test_X_num = imputer.transform(test_X_num)

Categorical feature

There are generally two ways to handle a categorical variable (both are sketched briefly below).

  • The first is to assign each category an integer code, e.g., encode '<1H OCEAN' as 0, 'INLAND' as 1, and so on.
  • The second is dummy (one-hot) variables.
  • The drawback of the first approach is that the integer codes carry no real quantitative meaning, but the algorithm will treat them as if they do. For example, '<1H OCEAN' becomes 0 and 'INLAND' becomes 1, yet these numbers imply no actual ordering between the two categories.
  • The drawback of dummy variables is that a large number of columns is created when there are many categories.
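A minimal sketch of both options on a toy column (illustrative only), assuming sklearn's OrdinalEncoder for the first option and the OneHotEncoder imported above for the second:

# Hedged sketch: the two encodings side by side on a tiny example.
from sklearn.preprocessing import OrdinalEncoder

toy = pd.DataFrame({"ocean_proximity": ["<1H OCEAN", "INLAND", "NEAR BAY"]})

# Option 1: one integer per category -- implies an ordering that does not really exist.
OrdinalEncoder().fit_transform(toy)            # e.g. [[0.], [1.], [2.]]

# Option 2: one dummy column per category -- no spurious ordering, but more columns.
OneHotEncoder().fit_transform(toy).toarray()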

Here, ocean_proximity is handled with dummy (one-hot) variables.

In [18]:
train_X_cat = train_X[['ocean_proximity']]
one_hot = OneHotEncoder()
train_X_cat = one_hot.fit_transform(train_X_cat)
In [19]:
# OneHotEncoder returns a sparse matrix by default; convert it to a dense array
train_X_cat = train_X_cat.toarray()
In [20]:
test_X_cat = test_X[['ocean_proximity']]
test_X_cat = one_hot.transform(test_X_cat)
test_X_cat = test_X_cat.toarray()

Feature scaling

For the differences in scale, a StandardScaler is used: each feature has its mean subtracted and the result is divided by that feature's standard deviation, i.e., z = (x - mean) / std.
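A small sanity check of that formula on a toy array (illustrative only; StandardScaler uses the population standard deviation, the same as np.std's default):

# Hedged sketch: standardization by hand vs. StandardScaler on a toy array.
X_toy = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
z_manual = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)
z_sklearn = StandardScaler().fit_transform(X_toy)
assert np.allclose(z_manual, z_sklearn)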

In [21]:
std_scaler = StandardScaler()
train_X_num = std_scaler.fit_transform(train_X_num)
In [22]:
test_X_num = std_scaler.transform(test_X_num)

Combine numerical and categorical columns

In [23]:
train_X_num.shape
Out[23]:
(16512, 13)
In [24]:
train_X_cat.shape
Out[24]:
(16512, 5)
In [25]:
train_X_prepared = np.hstack((train_X_num, train_X_cat))
train_X_prepared.shape
Out[25]:
(16512, 18)
In [26]:
test_X_prepared = np.hstack((test_X_num, test_X_cat))
In [27]:
test_X_num.shape
Out[27]:
(4128, 13)
In [28]:
test_X_cat.shape
Out[28]:
(4128, 5)
In [29]:
test_X_prepared.shape
Out[29]:
(4128, 18)

Fit models

Linear regression

In [30]:
lin_reg = LinearRegression()
lin_reg.fit(train_X_prepared, train_y)
Out[30]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

Check how the model performs on the training set:

In [31]:
lin_reg_predict = lin_reg.predict(train_X_prepared)
lin_reg_rmse = np.sqrt(mean_squared_error(train_y, lin_reg_predict))
In [32]:
lin_reg_rmse
Out[32]:
66880.48117211954

Decision tree

In [33]:
tree_reg = DecisionTreeRegressor()
tree_reg.fit(train_X_prepared, train_y)
Out[33]:
DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
                      max_leaf_nodes=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=1,
                      min_samples_split=2, min_weight_fraction_leaf=0.0,
                      presort=False, random_state=None, splitter='best')
In [34]:
tree_predict = tree_reg.predict(train_X_prepared)
tree_rmse = np.sqrt(mean_squared_error(train_y, tree_predict))
In [35]:
tree_rmse
Out[35]:
0.0

Clearly, the decision tree has overfit badly: a training RMSE of 0 means the tree has essentially memorized the training data. The reason is that sklearn's default DecisionTreeRegressor places no restrictions on most of its parameters, so the tree can keep splitting until every leaf is pure.

Let's see what happens if we put a limit on the depth of the tree:

In [36]:
tree_reg = DecisionTreeRegressor(max_depth=5)
tree_reg.fit(train_X_prepared, train_y)
Out[36]:
DecisionTreeRegressor(criterion='mse', max_depth=5, max_features=None,
                      max_leaf_nodes=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=1,
                      min_samples_split=2, min_weight_fraction_leaf=0.0,
                      presort=False, random_state=None, splitter='best')
In [37]:
tree_predict = tree_reg.predict(train_X_prepared)
tree_rmse = np.sqrt(mean_squared_error(train_y, tree_predict))
print(tree_rmse)
66023.86709260463

How do we choose these hyperparameters? We can split the training data further and use cross-validation.

Decision tree with cross validation

We consider $3 \times 3 = 9$ parameter combinations. For each combination, the training data is split into 5 folds; the model is trained on 4 of them and evaluated on the remaining one, and the 5 scores are averaged. The combination with the best average score is then chosen as the decision tree's parameters.

In [38]:
param_grid = [
    {'max_depth': [5, 10, 20], 'max_features': [2,5,8]}
]
In [39]:
tree_reg = DecisionTreeRegressor()
grid_search = GridSearchCV(tree_reg, param_grid, cv=5,
                           scoring='neg_mean_squared_error',
                           return_train_score=True)
In [40]:
grid_search.fit(train_X_prepared, train_y)
Out[40]:
GridSearchCV(cv=5, error_score='raise-deprecating',
             estimator=DecisionTreeRegressor(criterion='mse', max_depth=None,
                                             max_features=None,
                                             max_leaf_nodes=None,
                                             min_impurity_decrease=0.0,
                                             min_impurity_split=None,
                                             min_samples_leaf=1,
                                             min_samples_split=2,
                                             min_weight_fraction_leaf=0.0,
                                             presort=False, random_state=None,
                                             splitter='best'),
             iid='warn', n_jobs=None,
             param_grid=[{'max_depth': [5, 10, 20], 'max_features': [2, 5, 8]}],
             pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
             scoring='neg_mean_squared_error', verbose=0)
In [41]:
grid_search.best_params_
Out[41]:
{'max_depth': 10, 'max_features': 8}

This tells us that, among the 9 candidate combinations, the decision tree performs best with max_depth set to 10 and max_features set to 8. Enlarging the set of candidate parameters could yield an even better model, but the search time grows accordingly (one way to keep that cost under control is sketched below).
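If the candidate set grows large, one way to cap the runtime (a hedged sketch on my part, not used in this notebook) is to sample a fixed number of combinations from a bigger grid with RandomizedSearchCV instead of trying them all:

# Hedged sketch (not run here): try only 10 random combinations from a larger grid.
from sklearn.model_selection import RandomizedSearchCV

big_grid = {"max_depth": [5, 8, 10, 15, 20, None],
            "max_features": [2, 4, 6, 8, 10, 12]}
rand_search = RandomizedSearchCV(DecisionTreeRegressor(), big_grid,
                                 n_iter=10, cv=5,
                                 scoring="neg_mean_squared_error",
                                 random_state=42)
# rand_search.fit(train_X_prepared, train_y)
# rand_search.best_params_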

In [42]:
cv_results = grid_search.cv_results_
for mean_score, params in zip(cv_results['mean_test_score'],
                              cv_results['params']):
    print(np.sqrt(-mean_score), params)
86757.34430466479 {'max_depth': 5, 'max_features': 2}
74255.1496877548 {'max_depth': 5, 'max_features': 5}
70013.81721000801 {'max_depth': 5, 'max_features': 8}
76453.35269735787 {'max_depth': 10, 'max_features': 2}
64694.85416648749 {'max_depth': 10, 'max_features': 5}
63196.29853467694 {'max_depth': 10, 'max_features': 8}
80520.13049318209 {'max_depth': 20, 'max_features': 2}
74100.52042887225 {'max_depth': 20, 'max_features': 5}
72120.28962193917 {'max_depth': 20, 'max_features': 8}
In [43]:
tree_reg = DecisionTreeRegressor(max_depth=10,max_features=8)
tree_reg.fit(train_X_prepared, train_y)
Out[43]:
DecisionTreeRegressor(criterion='mse', max_depth=10, max_features=8,
                      max_leaf_nodes=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=1,
                      min_samples_split=2, min_weight_fraction_leaf=0.0,
                      presort=False, random_state=None, splitter='best')

Ensemble methods

Here we look at the random forest, one of the simplest ensemble models.

In [44]:
forest_reg = RandomForestRegressor()
forest_reg.fit(train_X_prepared, train_y)
Out[44]:
RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
                      max_features='auto', max_leaf_nodes=None,
                      min_impurity_decrease=0.0, min_impurity_split=None,
                      min_samples_leaf=1, min_samples_split=2,
                      min_weight_fraction_leaf=0.0, n_estimators=10,
                      n_jobs=None, oob_score=False, random_state=None,
                      verbose=0, warm_start=False)
In [45]:
forest_predict = forest_reg.predict(train_X_prepared)
forest_rmse = np.sqrt(mean_squared_error(train_y,forest_predict))
In [46]:
print(forest_rmse)
22565.545962983935

The random forest gives the smallest training RMSE of the three models, and unlike the first decision tree its error is not an implausible 0. So, intuitively, the random forest is the best model.

Evaluation on the test set

Finally, compare how the models perform on the test set.

In [47]:
test_lin_predict = lin_reg.predict(test_X_prepared)
test_tree_predict = tree_reg.predict(test_X_prepared)
test_forest_predict = forest_reg.predict(test_X_prepared)
test_lin_rmse = np.sqrt(mean_squared_error(test_y, test_lin_predict))
test_tree_rmse = np.sqrt(mean_squared_error(test_y, test_tree_predict))
test_forest_rmse = np.sqrt(mean_squared_error(test_y, test_forest_predict))

print('Linear Regression rmse on test set:', test_lin_rmse)
print('Decision Tree regression rmse on test set:', test_tree_rmse)
print('Random Forest regression rmse on test set:', test_forest_rmse)
Linear Regression rmse on test set: 67907.85659044748
Decision Tree regression rmse on test set: 62741.426654136616
Random Forest regression rmse on test set: 52558.16304460965

As we can see, the random forest performs best on the test set.