Tuning a Random Forest on the Titanic Dataset

Basics of Hyperparameter Tuning

Model complexity and generalization error: when a model is too complex, it overfits and generalizes poorly, so the generalization error is large; when it is too simple, it underfits and lacks fitting capacity, so the generalization error is also large. Only when the complexity is just right does the generalization error reach its minimum.

[Figure: generalization error vs. model complexity]
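
This trade-off can be made precise by the standard bias-variance decomposition of squared error (a textbook identity added here for reference; it is not spelled out in the original post). Assuming noisy labels $y = f(x) + \varepsilon$ with $\mathbb{E}[\varepsilon] = 0$ and $\operatorname{Var}(\varepsilon) = \sigma^2$:

$$
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
= \underbrace{\big(f(x) - \mathbb{E}[\hat{f}(x)]\big)^2}_{\text{bias}^2}
+ \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
+ \underbrace{\sigma^2}_{\text{noise}}
$$

Complex models (deep, unpruned trees) have low bias but high variance; overly simple models have the reverse; their sum is minimized at some intermediate complexity, which is exactly the U-shaped curve in the figure.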

Tree-based models are inherently high-complexity models, so tuning mostly pushes the model toward the left of that curve (lower complexity). Parameters are generally tuned in order of importance; for random forests the importance ranking is shown below:

[Figure: random forest parameter importance ranking — the order followed in this post is n_estimators, max_depth, min_samples_leaf, min_samples_split, max_features, criterion]


Imports

from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score, train_test_split
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

import warnings
warnings.filterwarnings('ignore')

Read and Explore the Data

data = pd.read_csv(r'data.csv')
print(data.info())

# output
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId 891 non-null int64
Survived 891 non-null int64
Pclass 891 non-null int64
Name 891 non-null object
Sex 891 non-null object
Age 714 non-null float64
SibSp 891 non-null int64
Parch 891 non-null int64
Ticket 891 non-null object
Fare 891 non-null float64
Cabin 204 non-null object
Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
data.head()

[Figure: data.head() preview]

# move the label column 'Survived' to the last position
tmp = data.pop('Survived')
data.insert(len(data.columns), 'Survived', tmp)
data.head()

[Figure: data.head() preview with 'Survived' as the last column]

Data Preprocessing

# Feature selection: drop columns with too many missing values and columns
# that, on inspection, have no bearing on the target y
data.drop(["Cabin", "Name", "Ticket"], axis=1, inplace=True)
data.head()

[Figure: data.head() preview after dropping Cabin, Name, Ticket]

data.info()

# output
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 9 columns):
PassengerId 891 non-null int64
Pclass 891 non-null int64
Sex 891 non-null object
Age 714 non-null float64
SibSp 891 non-null int64
Parch 891 non-null int64
Fare 891 non-null float64
Embarked 889 non-null object
Survived 891 non-null int64
dtypes: float64(2), int64(5), object(2)
memory usage: 62.7+ KB
# drop the rows where 'Embarked' is missing
data.dropna(subset=['Embarked'], inplace=True)
data.info()

# output
<class 'pandas.core.frame.DataFrame'>
Int64Index: 889 entries, 0 to 890
Data columns (total 9 columns):
PassengerId 889 non-null int64
Pclass 889 non-null int64
Sex 889 non-null object
Age 712 non-null float64
SibSp 889 non-null int64
Parch 889 non-null int64
Fare 889 non-null float64
Embarked 889 non-null object
Survived 889 non-null int64
dtypes: float64(2), int64(5), object(2)
memory usage: 69.5+ KB
# Convert string features to numeric. This is only the simple case; richer
# encodings exist (see the one-hot sketch below).
labels = data['Embarked'].unique().tolist()
data['Embarked'] = data['Embarked'].apply(lambda x: labels.index(x))

# 'Sex' takes only two values, so a boolean comparison is a handy shortcut
data['Sex'] = (data['Sex'] == 'male').astype("int")

data.head()

[Figure: data.head() preview with numeric 'Sex' and 'Embarked']
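
As an aside, a more general alternative to the index mapping above is one-hot encoding, which avoids imposing an artificial ordering on the ports; a minimal sketch with pandas.get_dummies (not used in the rest of this post, where the integer-coded columns are kept):

# Sketch only: replace 'Embarked' with one indicator column per port.
# Normally this is applied to the raw string column, before any index mapping.
data_ohe = pd.get_dummies(data, columns=['Embarked'], prefix='Embarked')
print(data_ohe.columns.tolist())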

# Use a random forest regressor to impute the missing values in 'Age'

age_index = data.columns.get_loc('Age')

# Build a new feature matrix and a new target
df = data
y_new = df.iloc[:, age_index]
X_new = df.iloc[:, df.columns != 'Age']

# Rows with a known Age form the training set; rows with a missing Age
# form the set to predict
y_train = y_new[y_new.notnull()]
y_test = y_new[y_new.isnull()]
x_train = X_new.loc[y_train.index, :]
x_test = X_new.loc[y_test.index, :]

# Fit random forest regression to fill in the missing values
rtf = RandomForestRegressor(n_estimators=100)
rtf = rtf.fit(x_train, y_train)
y_predict = rtf.predict(x_test)

# Write the imputed values back into the original DataFrame
data.loc[data.iloc[:, age_index].isnull(), 'Age'] = y_predict

data.info()

# output
<class 'pandas.core.frame.DataFrame'>
Int64Index: 889 entries, 0 to 890
Data columns (total 9 columns):
PassengerId 889 non-null int64
Pclass 889 non-null int64
Sex 889 non-null int32
Age 889 non-null float64
SibSp 889 non-null int64
Parch 889 non-null int64
Fare 889 non-null float64
Embarked 889 non-null int64
Survived 889 non-null int64
dtypes: float64(2), int32(1), int64(6)
memory usage: 106.0 KB
# separate the features and the label
X = data.iloc[:, data.columns != 'Survived']
y = data.iloc[:, data.columns == 'Survived']

X.shape, y.shape # ((889, 8), (889, 1))
# Reset the row indices so they run 0..n-1 after the dropped rows
for x in [X, y]:
    x.index = range(x.shape[0])
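
A side note: scikit-learn estimators expect y as a 1-D array, and passing the one-column DataFrame above raises a DataConversionWarning (plausibly the reason warnings are silenced in the imports). An optional flattening step, if the warning is unwanted:

# Optional: flatten the (889, 1) label DataFrame to a 1-D array
# to avoid sklearn's column-vector warning.
y = y.values.ravel()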

Fit a quick baseline model to see how the untuned random forest performs on this dataset

rtf = RandomForestClassifier(n_estimators=100, random_state=90)
score_pre = cross_val_score(rtf, X, y, cv=10).mean()

score_pre # 0.8323799795709907

Tuning

For a parameter whose useful values lie in a bounded range, grid search works well; for a parameter with no natural upper bound (such as n_estimators), use a learning curve instead. The parameters are tuned below in order of importance; a helper that captures the repeated learning-curve pattern is sketched right after this paragraph.
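
Every learning-curve run below repeats the same loop; purely as a convenience, that pattern could be wrapped in a helper like this (a hypothetical sketch, not in the original post; learning_curve_scores is an invented name, and it assumes the X, y, 10-fold CV, and random_state=90 used throughout):

# Hypothetical helper: cross-validated mean accuracy for one hyperparameter
# swept across a range, with all other settings held fixed.
def learning_curve_scores(param_name, param_range, **fixed_params):
    scores = []
    for value in param_range:
        model = RandomForestClassifier(**{param_name: value}, **fixed_params,
                                       n_jobs=-1, random_state=90)
        scores.append(cross_val_score(model, X, y, cv=10).mean())
    return scores

For example, the first sweep below would be learning_curve_scores('n_estimators', range(1, 201, 10)).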

1. n_estimators

# Use a learning curve to find a rough range for n_estimators
scores = []

for i in range(1, 201, 10):
    rtf = RandomForestClassifier(n_estimators=i, n_jobs=-1, random_state=90)
    score = cross_val_score(rtf, X, y, cv=10).mean()
    scores.append(score)

print("Best score: {}, n_estimators = {}".format(max(scores), scores.index(max(scores)) * 10 + 1))
plt.figure(figsize=[20, 5])
plt.plot(range(1, 201, 10), scores)
plt.show()

# output
# Best score: 0.8357507660878447, n_estimators = 41

[Figure: learning curve of cross-validated accuracy, n_estimators from 1 to 200 in steps of 10]

# Search the identified range for the best n_estimators
scores = []

for i in range(25, 50):
    rtf = RandomForestClassifier(n_estimators=i, n_jobs=-1, random_state=90)
    score = cross_val_score(rtf, X, y, cv=10).mean()
    scores.append(score)

print("Best score: {}, n_estimators = {}".format(max(scores), [*range(25, 50)][scores.index(max(scores))]))
plt.figure(figsize=[20, 5])
plt.plot(range(25, 50), scores)
plt.show()

# output
# Best score: 0.8391215526046987, n_estimators = 47

[Figure: learning curve of cross-validated accuracy, n_estimators from 25 to 49]

Summary: accuracy improves from 0.8323 to 0.8391; the generalization error shrinks, and the model moves left on the curve.

2. max_depth

# scale the search range for max_depth to the size of the dataset
param_grid = {'max_depth': np.arange(1, 20, 1)}

rtf = RandomForestClassifier(n_estimators=47, n_jobs=-1, random_state=90)
GS = GridSearchCV(rtf, param_grid, cv=10)
GS.fit(X, y)

print(GS.best_params_, GS.best_score_) # {'max_depth': 10} 0.8413948256467941

Summary: accuracy improves from 0.8391 to 0.84139; the generalization error shrinks, and the model moves further left on the curve.

3. min_samples_leaf

param_grid = {'min_samples_leaf': np.arange(1, 20, 1)}

rtf = RandomForestClassifier(n_estimators=47, max_depth=10, n_jobs=-1, random_state=90)
GS = GridSearchCV(rtf, param_grid, cv=10)
GS.fit(X, y)

print(GS.best_params_, GS.best_score_) # {'min_samples_leaf': 3} 0.84251968503937

Summary: accuracy improves from 0.84139 to 0.84252; the generalization error shrinks, and the model moves further left on the curve.

4. min_samples_split

param_grid = {'min_samples_split': np.arange(2, 22, 1)}

rtf = RandomForestClassifier(n_estimators=47, max_depth=10, min_samples_leaf=3, n_jobs=-1, random_state=90)
GS = GridSearchCV(rtf, param_grid, cv=10)
GS.fit(X, y)

print(GS.best_params_, GS.best_score_) # {'min_samples_split': 12} 0.8481439820022497

Summary: accuracy improves from 0.84252 to 0.84814; the generalization error shrinks, and the model moves further left on the curve.

5. max_features

param_grid = {'max_features': np.arange(3, 8, 1)}

rtf = RandomForestClassifier(n_estimators=47, max_depth=10, min_samples_leaf=3,
                             min_samples_split=12, n_jobs=-1, random_state=90)
GS = GridSearchCV(rtf, param_grid, cv=10)
GS.fit(X, y)

print(GS.best_params_, GS.best_score_) # {'max_features': 3} 0.843644544431946

Summary: tuning max_features lowers the accuracy, which suggests the model is already at (or very near) the minimum of the generalization error curve for this parameter: any change pushes the error up, so this parameter is left at its default.

6. criterion

param_grid = {'criterion': ['gini', 'entropy']}

rtf = RandomForestClassifier(n_estimators=47, max_depth=10, min_samples_leaf=3,
                             min_samples_split=12, n_jobs=-1, random_state=90)
GS = GridSearchCV(rtf, param_grid, cv=10)
GS.fit(X, y)

print(GS.best_params_, GS.best_score_) # {'criterion': 'gini'} 0.8481439820022497

Summary: accuracy does not improve; 'gini' is already the default criterion, so no change is needed.

Conclusion

Through tuning, the cross-validated accuracy improved from the initial 0.832 to 0.848.

The final tuned model:

rtf = RandomForestClassifier(n_estimators=47,
                             max_depth=10,
                             min_samples_leaf=3,
                             min_samples_split=12,
                             n_jobs=-1,
                             random_state=90)
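
As a sanity check, the tuned model can be rescored with the same 10-fold cross-validation used throughout; given the grid-search results above, this should come out at roughly 0.848 (assuming the same X and y as earlier):

# Rescore the final tuned model under the same 10-fold CV protocol.
final_score = cross_val_score(rtf, X, y, cv=10).mean()
print(final_score)  # expected to be about 0.848 per the results above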