Sun Yat-sen University Undergraduate Summer Camp Project

1. Data Analysis

In [1]:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
%matplotlib inline
In [2]:
import warnings
warnings.filterwarnings('ignore')
In [3]:
path = './input/'
# Read the data
test_data = pd.read_csv(path + 'BigML-orginal.csv')
# Set how many columns and rows pandas will display
pd.set_option("display.max_columns", 1000)
pd.set_option("display.max_rows", 300)

1.1 A First Look at the Data

In [4]:
test_data.head()
Out[4]:
Date Temperature Humidity Operator Measure1 Measure2 Measure3 Measure4 Measure5 Measure6 Measure7 Measure8 Measure9 Measure10 Measure11 Measure12 Measure13 Measure14 Measure15 Hours Since Previous Failure Failure Date.year Date.month Date.day-of-month Date.day-of-week Date.hour Date.minute Date.second
0 2016-01-01 00:00:00 67 82 Operator1 291 1 1 1041 846 334 706 1086 256 1295 766 968 1185 1355 1842 90 No 2016 1 1 5 0 0 0
1 2016-01-01 01:00:00 68 77 Operator1 1180 1 1 1915 1194 637 1093 524 919 245 403 723 1446 719 748 91 No 2016 1 1 5 1 0 0
2 2016-01-01 02:00:00 64 76 Operator1 1406 1 1 511 1577 1121 1948 1882 1301 273 1927 1123 717 1518 1689 92 No 2016 1 1 5 2 0 0
3 2016-01-01 03:00:00 63 80 Operator1 550 1 1 1754 1834 1413 1151 945 1312 1494 1755 1434 502 1336 711 93 No 2016 1 1 5 3 0 0
4 2016-01-01 04:00:00 65 81 Operator1 1928 1 2 1326 1082 233 1441 1736 1033 1549 802 1819 1616 1507 507 94 No 2016 1 1 5 4 0 0

Show summary statistics for the numeric features

In [5]:
test_data.describe()
Out[5]:
Temperature Humidity Measure1 Measure2 Measure3 Measure4 Measure5 Measure6 Measure7 Measure8 Measure9 Measure10 Measure11 Measure12 Measure13 Measure14 Measure15 Hours Since Previous Failure Date.year Date.month Date.day-of-month Date.day-of-week Date.hour Date.minute Date.second
count 8784.000000 8784.000000 8784.000000 8784.000000 8784.000000 8784.000000 8784.000000 8784.000000 8784.000000 8784.000000 8784.000000 8784.000000 8784.000000 8784.000000 8784.000000 8784.000000 8784.000000 8784.000000 8784.0 8784.000000 8784.000000 8784.000000 8784.000000 8784.0 8784.0
mean 64.026412 83.337090 1090.900387 1.489868 0.999203 1071.629895 1075.822860 1076.023793 1086.897086 1077.277209 1082.014572 1082.403005 1088.719148 1088.329349 1076.755806 1088.307377 1082.392304 217.341872 2016.0 6.513661 15.756831 4.008197 11.500000 0.0 0.0
std 2.868833 4.836256 537.097769 1.115605 0.816473 536.518466 533.158826 534.004966 538.195156 537.187671 532.983115 537.582829 534.995992 533.299486 535.111353 537.264847 537.527604 151.751750 0.0 3.451430 8.812031 1.998047 6.922581 0.0 0.0
min 5.000000 65.000000 155.000000 0.000000 0.000000 155.000000 155.000000 155.000000 155.000000 155.000000 155.000000 155.000000 155.000000 155.000000 155.000000 155.000000 155.000000 1.000000 2016.0 1.000000 1.000000 1.000000 0.000000 0.0 0.0
25% 62.000000 80.000000 629.000000 0.000000 0.000000 608.750000 606.000000 623.000000 621.000000 612.000000 631.000000 619.000000 627.000000 627.000000 609.000000 617.000000 614.000000 90.000000 2016.0 4.000000 8.000000 2.000000 5.750000 0.0 0.0
50% 64.000000 83.000000 1096.000000 1.000000 1.000000 1058.000000 1077.000000 1072.000000 1089.000000 1074.000000 1078.000000 1080.000000 1093.000000 1082.000000 1067.000000 1088.500000 1076.000000 195.000000 2016.0 7.000000 16.000000 4.000000 11.500000 0.0 0.0
75% 66.000000 87.000000 1555.000000 2.000000 2.000000 1533.000000 1541.000000 1537.000000 1558.000000 1541.000000 1532.000000 1547.000000 1550.000000 1552.000000 1539.000000 1560.000000 1550.000000 324.000000 2016.0 10.000000 23.000000 6.000000 17.250000 0.0 0.0
max 78.000000 122.000000 2011.000000 3.000000 2.000000 2011.000000 2011.000000 2011.000000 2011.000000 2011.000000 2011.000000 2011.000000 2011.000000 2011.000000 2011.000000 2011.000000 2011.000000 666.000000 2016.0 12.000000 31.000000 7.000000 23.000000 0.0 0.0

Show summary statistics for the non-numeric features

In [6]:
test_data.describe(include='object')
Out[6]:
Date Operator Failure
count 8784 8784 8784
unique 8784 8 2
top 2016-02-04 13:00:00 Operator2 No
freq 1 1952 8703

1. This is a binary classification problem

2. From the description above we can see that the table contains no missing values, which saves us the missing-value handling step

3. The Date information has already been decomposed into year, month, day, and even day of week, so the original Date column should be dropped later

4. The features differ noticeably in scale, so before fitting the linear model and the neural network later we should first normalize or standardize the data

Reasons:

  • Linear models mostly rely on gradients to find the best descent direction, whereas tree models only split on specific threshold values of a feature and are therefore unaffected by scale; neural networks likewise rely on gradient computation for back-propagation

  • Rescaling the data speeds up model convergence (the figure below is from Andrew Ng's machine learning course slides)

  • It avoids numerical problems when the gradients are updated

  • It makes the learning rate easier to tune (the same step size produces a comparable move in every direction)

[Figure: gradient-descent contours before and after feature scaling, from Andrew Ng's machine learning course slides]
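
To make the two options concrete, here is a minimal sketch on a made-up toy array (not the project data) showing min-max normalization and standardization, first by hand and then with scikit-learn:

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

toy = np.array([[155.0], [1000.0], [2011.0]])   # made-up values, roughly the range of the Measure columns

# Min-max normalization: x' = (x - min) / (max - min), mapped into [0, 1]
print((toy - toy.min()) / (toy.max() - toy.min()))
print(MinMaxScaler().fit_transform(toy))

# Standardization: z = (x - mean) / std, giving zero mean and unit variance
print((toy - toy.mean()) / toy.std())
print(StandardScaler().fit_transform(toy))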

1.2 List each feature's type and number of unique values to guide later processing

In [7]:
test_data_info = pd.DataFrame()
test_data_info['col'] = test_data.columns
unique_size = []
fea_type = []
for i in test_data.columns:
    unique_size.append(len(test_data[i].unique()))
    fea_type.append(test_data[i].dtype)
test_data_info['unique_num'] = np.array(unique_size)
test_data_info['fea_type'] = np.array(fea_type)
test_data_info
Out[7]:
col unique_num fea_type
0 Date 8784 object
1 Temperature 23 int64
2 Humidity 35 int64
3 Operator 8 object
4 Measure1 1843 int64
5 Measure2 4 int64
6 Measure3 3 int64
7 Measure4 1837 int64
8 Measure5 1839 int64
9 Measure6 1843 int64
10 Measure7 1842 int64
11 Measure8 1851 int64
12 Measure9 1839 int64
13 Measure10 1837 int64
14 Measure11 1839 int64
15 Measure12 1842 int64
16 Measure13 1841 int64
17 Measure14 1843 int64
18 Measure15 1837 int64
19 Hours Since Previous Failure 666 int64
20 Failure 2 object
21 Date.year 1 int64
22 Date.month 12 int64
23 Date.day-of-month 31 int64
24 Date.day-of-week 7 int64
25 Date.hour 24 int64
26 Date.minute 1 int64
27 Date.second 1 int64

From the table above we can conclude:

  1. Date, Operator and Failure are string-valued features, so they should later be converted to numeric values (each has few categories and the mapping is one-to-one)

  2. Date.year, Date.minute and Date.second each contain only a single repeated value, so they carry no information and should be dropped (the sketch below shows how such columns could also be found programmatically)
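
As a side note, constant columns like these could also be found programmatically instead of being read off the table; a minimal sketch (the notebook simply drops them by name in the next subsection):

# Sketch: detect columns with only one distinct value
constant_cols = [c for c in test_data.columns if test_data[c].nunique() == 1]
print(constant_cols)   # from the table above this should be ['Date.year', 'Date.minute', 'Date.second']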

1.2.1 Converting the non-numeric variables

In [8]:
test_data.Date.unique()
Out[8]:
array(['2016-01-01 00:00:00', '2016-01-01 01:00:00',
       '2016-01-01 02:00:00', ..., '2016-12-31 21:00:00',
       '2016-12-31 22:00:00', '2016-12-31 23:00:00'], dtype=object)
In [9]:
test_data.Failure.unique()
Out[9]:
array(['No', 'Yes'], dtype=object)
In [10]:
test_data.Operator.unique()
Out[10]:
array(['Operator1', 'Operator3', 'Operator5', 'Operator2', 'Operator4',
       'Operator6', 'Operator7', 'Operator8'], dtype=object)

Map Failure and Operator to numeric values

In [11]:
test_data.Failure = test_data.Failure.apply(lambda x: 0  if x == 'No' else 1) 
Operator2id = dict(zip(sorted(list(set(test_data.Operator))), range(1, len(sorted(list(set(test_data.Operator))))+1)))
test_data.Operator = test_data.Operator.apply(lambda x: Operator2id[x])

Drop the features we no longer need

In [12]:
test_data.drop(columns=['Date', 'Date.year', 'Date.minute', 'Date.second'], inplace=True)

Move Failure to the last column

In [13]:
temp = test_data.Failure
test_data.drop(columns=['Failure'], inplace=True)
test_data['Failure'] = temp
test_data.to_csv('./input/After_preprocessing.csv', index=False)
In [14]:
test_data.head()
Out[14]:
Temperature Humidity Operator Measure1 Measure2 Measure3 Measure4 Measure5 Measure6 Measure7 Measure8 Measure9 Measure10 Measure11 Measure12 Measure13 Measure14 Measure15 Hours Since Previous Failure Date.month Date.day-of-month Date.day-of-week Date.hour Failure
0 67 82 1 291 1 1 1041 846 334 706 1086 256 1295 766 968 1185 1355 1842 90 1 1 5 0 0
1 68 77 1 1180 1 1 1915 1194 637 1093 524 919 245 403 723 1446 719 748 91 1 1 5 1 0
2 64 76 1 1406 1 1 511 1577 1121 1948 1882 1301 273 1927 1123 717 1518 1689 92 1 1 5 2 0
3 63 80 1 550 1 1 1754 1834 1413 1151 945 1312 1494 1755 1434 502 1336 711 93 1 1 5 3 0
4 65 81 1 1928 1 2 1326 1082 233 1441 1736 1033 1549 802 1819 1616 1507 507 94 1 1 5 4 0
In [15]:
test_data_info = pd.DataFrame()
test_data_info['col'] = test_data.columns
unique_size = []
fea_type = []
for i in test_data.columns:
    unique_size.append(len(test_data[i].unique()))
    fea_type.append(test_data[i].dtype)
test_data_info['unique_num'] = np.array(unique_size)
test_data_info['fea_type'] = np.array(fea_type)
test_data_info
Out[15]:
col unique_num fea_type
0 Temperature 23 int64
1 Humidity 35 int64
2 Operator 8 int64
3 Measure1 1843 int64
4 Measure2 4 int64
5 Measure3 3 int64
6 Measure4 1837 int64
7 Measure5 1839 int64
8 Measure6 1843 int64
9 Measure7 1842 int64
10 Measure8 1851 int64
11 Measure9 1839 int64
12 Measure10 1837 int64
13 Measure11 1839 int64
14 Measure12 1842 int64
15 Measure13 1841 int64
16 Measure14 1843 int64
17 Measure15 1837 int64
18 Hours Since Previous Failure 666 int64
19 Date.month 12 int64
20 Date.day-of-month 31 int64
21 Date.day-of-week 7 int64
22 Date.hour 24 int64
23 Failure 2 int64

2. Feature Visualization

In [16]:
import seaborn as sns

2.1 Check the Failure ratio, i.e. the proportion of positive and negative samples

In [17]:
plt.figure(figsize=(8, 6))
sns.countplot(test_data.Failure)
Out[17]:
<matplotlib.axes._subplots.AxesSubplot at 0x1d5a4156a58>
In [18]:
test_data.Failure.value_counts()
Out[18]:
0    8703
1      81
Name: Failure, dtype: int64

The class counts above show that the positive and negative samples are severely imbalanced, at roughly the 100:1 level, and this must be dealt with later.

  • If nothing is done, the machine runs normally in about 99% of the records, so simply predicting "running normally" for every sample already yields more than 99% accuracy (see the sketch below). We therefore need to rebalance the training samples and also choose a more informative evaluation metric.
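
As a quick sanity check of that claim, the following sketch measures the accuracy of a trivial classifier that always predicts the majority class:

# Sketch: always predict the majority class (0 = no failure)
baseline_pred = np.zeros(len(test_data), dtype=int)
print((baseline_pred == test_data.Failure).mean())   # 8703 / 8784 ≈ 0.9908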

2.2 Plot each feature's distribution for Failure = 0 and Failure = 1

In [19]:
def plot_feature_distribution(df1, df2, label1, label2, features):
    # Overlay the density curve of each feature for the two groups (df1 vs df2)
    i = 0
    sns.set_style('whitegrid')
    fig, ax = plt.subplots(4, 6, figsize=(24, 14))

    for feature in features:
        i += 1
        plt.subplot(4, 6, i)
        sns.distplot(df1[feature], hist=False, label=label1)
        sns.distplot(df2[feature], hist=False, label=label2)
        plt.xlabel(feature, fontsize=9)
        plt.tick_params(axis='x', which='major', labelsize=6, pad=-6)
        plt.tick_params(axis='y', which='major', labelsize=6)
    plt.show()
In [20]:
t0 = test_data.loc[test_data['Failure'] == 0]
t1 = test_data.loc[test_data['Failure'] == 1]
features = test_data.columns.values[:-1]
plot_feature_distribution(t0, t1, '0', '1', features)

1. Because there are very few positive samples (Failure = 1), their density curves fluctuate a lot; we should therefore only compare the peaks and the rough overall trends of the two curves

2. We can see that Temperature, Humidity and Hours Since Previous Failure are strongly related to the value of Failure

3. Model Building

In [21]:
from sklearn.feature_selection import SelectFromModel, VarianceThreshold, SelectKBest, chi2, mutual_info_classif, f_classif
from sklearn.preprocessing import Imputer
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, accuracy_score, f1_score
from sklearn.model_selection import StratifiedKFold
from imblearn.over_sampling import SMOTE
Using TensorFlow backend.

3.1 Splitting the Dataset and Preprocessing

(training set + validation set) : test set = 7 : 3

training set : validation set = 2 : 1

3-fold cross-validation

In [22]:
estimate = pd.DataFrame(index=['f1_score', 'accuracy'])
In [23]:
X, X_test, y, y_test = train_test_split(test_data[test_data.columns[:-1]],test_data.Failure,test_size=0.3)

The classes are imbalanced, so we oversample with SMOTE

How SMOTE works:

  1. Randomly pick a sample a from the minority class

  2. From the n minority-class samples closest to a, randomly pick one sample b

  3. Join a and b with a line segment and take a random point on that segment as the new minority-class sample
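
To make step 3 concrete, here is a minimal sketch (with made-up two-dimensional points, not taken from this dataset) of the interpolation SMOTE performs:

# Sketch of the interpolation step: x_new = x_a + lam * (x_b - x_a), lam drawn from [0, 1]
rng = np.random.RandomState(0)
x_a = np.array([1.0, 2.0])    # a minority-class sample (made-up values)
x_b = np.array([3.0, 5.0])    # one of its nearest minority-class neighbours
lam = rng.uniform(0, 1)
print(x_a + lam * (x_b - x_a))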

In [24]:
# Apply SMOTE to the training set
sm = SMOTE(random_state=42, n_jobs=-1)
X, y = sm.fit_sample(X, y)
In [25]:
# Split into training and validation folds (3-fold stratified CV)
skf = StratifiedKFold(n_splits=3, shuffle=True)
In [26]:
# Custom F1 metric for use with LightGBM below (threshold the predicted probabilities at 0.5)
def f1_eval(preds, train_data):
    labels = (preds >= 0.5).astype(int)
    return 'f1_score', f1_score(train_data.get_label(), labels), True
In [27]:
# Rescale the data with min-max normalization
from sklearn.preprocessing import MinMaxScaler
standard = MinMaxScaler()
standard.fit(X)
X_standard = standard.transform(X)
X_test_standard = standard.transform(X_test)

Below, all models are evaluated with both the F1 score and accuracy

  • F1 score

TP (True Positive): a positive sample correctly predicted as positive

FP (False Positive): a sample from another class wrongly predicted as this class

FN (False Negative): a sample from this class wrongly predicted as another class

Precision is defined as precision = TP / (TP + FP).

Recall is defined as recall = TP / (TP + FN), and the F1 score is their harmonic mean: F1 = 2 * precision * recall / (precision + recall).
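
As a quick sanity check of these definitions, here is a small sketch (with a made-up toy prediction) that computes the F1 score by hand and compares it with sklearn's f1_score:

# Sketch: compute F1 by hand on a toy prediction and check against sklearn
y_true_toy = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred_toy = np.array([1, 0, 0, 1, 0, 1, 1, 0])
tp = ((y_pred_toy == 1) & (y_true_toy == 1)).sum()
fp = ((y_pred_toy == 1) & (y_true_toy == 0)).sum()
fn = ((y_pred_toy == 0) & (y_true_toy == 1)).sum()
precision = tp / (tp + fp)
recall = tp / (tp + fn)
print(2 * precision * recall / (precision + recall))   # 0.75
print(f1_score(y_true_toy, y_pred_toy))                # agrees: 0.75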

3.2 Logistic Regression

In [28]:
from sklearn.linear_model import LogisticRegression
In [29]:
# Use the lbfgs solver (a quasi-Newton method that exploits second-order information) on the rescaled data
lr = LogisticRegression(solver='lbfgs')
In [30]:
lr.fit(X_standard, y)
lr.score(X_test_standard, y_test)
Out[30]:
0.9389226100151745
In [31]:
lr_result = lr.predict(X_test_standard)
lr_result_data = pd.DataFrame(data=lr_result, columns=['lr_result'])
lr_result_data['true'] = y_test.values
In [32]:
estimate['logistic regression'] = np.array([f1_score(y_test, lr.predict(X_test_standard)), accuracy_score(y_test, lr.predict(X_test_standard))])
print('f1_score:', f1_score(y_test, lr.predict(X_test_standard)))
print('accuracy:', accuracy_score(y_test, lr.predict(X_test_standard)))
f1_score: 0.19095477386934673
accuracy: 0.9389226100151745
In [33]:
lr_result_data.lr_result.value_counts()
Out[33]:
0    2461
1     175
Name: lr_result, dtype: int64
In [34]:
lr_result_data.true.value_counts()
Out[34]:
0    2612
1      24
Name: true, dtype: int64
In [35]:
lr_result_data[lr_result_data['true'] == 1]
Out[35]:
lr_result true
13 1 1
188 1 1
228 1 1
286 1 1
381 0 1
426 1 1
585 0 1
722 1 1
811 1 1
814 1 1
853 0 1
1017 1 1
1053 0 1
1143 1 1
1223 1 1
1263 1 1
1339 1 1
1349 0 1
1392 1 1
1399 1 1
1436 1 1
1794 1 1
1827 1 1
2085 1 1

3.3 Random Forest (a Bagging Model)

Bagging: sampling with replacement

1. For the training data: draw bootstrap samples with replacement (some samples may be drawn several times, others not at all); repeating this k times gives k training sets. (Much of what hurts a model in the end is noise; since any particular noisy point is unlikely to be drawn in every resample, repeated resampling lets us see the true data distribution more clearly and weakens the influence of noise.)

2. Train a model on each of the k training sets, obtaining k models, and combine them by voting to get the final result, with every model given the same weight. (A minimal hand-rolled sketch of this procedure follows.)
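
The following is a minimal hand-rolled sketch of the bagging procedure described above, using scikit-learn decision trees on toy data from make_classification; the number of rounds k and the tree settings are illustrative only:

# Sketch of bagging by hand: k bootstrap samples -> k trees -> equal-weight majority vote
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X_toy, y_toy = make_classification(n_samples=500, random_state=0)
rng = np.random.RandomState(0)
k = 10
votes = np.zeros((k, len(y_toy)))
for m in range(k):
    idx = rng.randint(0, len(y_toy), len(y_toy))           # bootstrap: sample with replacement
    tree = DecisionTreeClassifier(max_depth=3, random_state=m).fit(X_toy[idx], y_toy[idx])
    votes[m] = tree.predict(X_toy)
majority = (votes.mean(axis=0) >= 0.5).astype(int)         # every tree gets the same weight
print((majority == y_toy).mean())                          # accuracy of the combined vote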

In [36]:
from sklearn.ensemble import RandomForestClassifier

The model uses 50 decision trees, 'sqrt' feature sampling at each split, and a maximum tree depth of 10

In [37]:
rf_valid_result = np.zeros((X.shape[0], skf.n_splits))
sub_preds = np.zeros((X_test.shape[0], skf.n_splits))
for n_fold, (train_index, vali_index) in enumerate(skf.split(X, y), start=1):
    print('fold:', n_fold)
    x_train, y_train, x_vali, y_vali = np.array(X)[train_index], np.array(y)[train_index], np.array(X)[vali_index], np.array(y)[vali_index]
    rf = RandomForestClassifier(n_estimators=50, max_features='sqrt', max_depth=10, n_jobs=2)
    rf.fit(x_train, y_train)
    print('accuracy on valid dataset:', rf.score(x_vali, y_vali))
    rf_valid_result[:, n_fold - 1] = rf.predict(X)
    sub_preds[:, n_fold - 1] = rf.predict(X_test)
fold: 1
accuracy on valid dataset: 0.9940915805022157
fold: 2
accuracy on valid dataset: 0.9953201970443349
fold: 3
accuracy on valid dataset: 0.9985221674876847

Show the importance of each feature

In [38]:
plt.figure(figsize=[30, 10])
plt.bar(features, rf.feature_importances_)
Out[38]:
<BarContainer object of 23 artists>
In [39]:
rf_result = np.mean(sub_preds, axis=1)
estimate['random forest'] = np.array([f1_score(y_test, np.round(rf_result)), accuracy_score(y_test, np.round(rf_result))])
print('f1_score of test dataset:', f1_score(y_test, np.round(rf_result)))
print('accuracy_score of test dataset:', accuracy_score(y_test, np.round(rf_result)))
f1_score of test dataset: 0.76
accuracy_score of test dataset: 0.9954476479514416
In [40]:
rf_result_data = pd.DataFrame(data=np.round(rf_result), columns=['rf_result'])
rf_result_data['true'] = y_test.values
rf_result_data.rf_result = rf_result_data.rf_result.astype(int)
rf_result_data[rf_result_data['true'] == 1]
Out[40]:
rf_result true
13 1 1
188 1 1
228 1 1
286 1 1
381 0 1
426 1 1
585 1 1
722 1 1
811 1 1
814 1 1
853 0 1
1017 1 1
1053 0 1
1143 1 1
1223 1 1
1263 1 1
1339 1 1
1349 0 1
1392 1 1
1399 1 1
1436 0 1
1794 1 1
1827 1 1
2085 1 1
In [41]:
print(rf_result_data.rf_result.value_counts())
print(rf_result_data.true.value_counts())
0    2610
1      26
Name: rf_result, dtype: int64
0    2612
1      24
Name: true, dtype: int64

3.4 LightGBM (a Boosting Model)

Boosting model

In short, every sample and every base classifier carries a weight:

  1. The initial distribution is uniform: if the training set has n samples, each sample starts with probability 1/n.

  2. After each round, the weights of misclassified samples are increased, so they take up a larger share of the training distribution and the next base classifier concentrates on getting them right.

  3. The final strong classifier is a combination of the base classifiers. Because base classifiers with different accuracy should contribute differently, each one's contribution is controlled by a weight: the higher its accuracy, the larger its weight; the lower its accuracy, the smaller its weight. (A small AdaBoost-style sketch of this weighting scheme follows.)
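
The weight-update scheme described above is essentially AdaBoost. LightGBM itself is a gradient-boosting method, so the following is only an illustrative sketch of points 1-3 on toy data, not what LightGBM does internally:

# AdaBoost-style sketch of the re-weighting scheme, with decision stumps on toy data
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X_toy, y_toy = make_classification(n_samples=300, random_state=1)
y_pm = 2 * y_toy - 1                        # recode labels as -1 / +1
w = np.full(len(y_toy), 1.0 / len(y_toy))   # 1. start from a uniform distribution
F = np.zeros(len(y_toy))
for t in range(10):
    stump = DecisionTreeClassifier(max_depth=1).fit(X_toy, y_toy, sample_weight=w)
    pred = 2 * stump.predict(X_toy) - 1
    err = w[pred != y_pm].sum()
    alpha = 0.5 * np.log((1 - err) / max(err, 1e-10))   # 3. more accurate learners get larger weight
    w = w * np.exp(-alpha * y_pm * pred)                # 2. misclassified samples gain weight
    w = w / w.sum()
    F = F + alpha * pred
print((np.sign(F) == y_pm).mean())          # training accuracy of the weighted vote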

In [42]:
import lightgbm as lgb

The model parameters are as follows

In [43]:
param = {
    'learning_rate': 0.05,
    'boosting_type': 'gbdt',
    'objective': 'binary',
    'metric': 'auc',
    'num_leaves': 15,
    'nthread': 2,
}
In [44]:
sub_preds = np.zeros((X_test.shape[0], skf.n_splits))
lgb_valid_result = np.zeros((X.shape[0], skf.n_splits))
for n_fold, (train_index, vali_index) in enumerate(skf.split(X, y), start=1):
    print(n_fold)
    x_train, y_train, x_vali, y_vali = np.array(X)[train_index], np.array(y)[train_index], np.array(X)[vali_index], np.array(y)[vali_index]
    train = lgb.Dataset(x_train, label=y_train)
    vali = lgb.Dataset(x_vali, label=y_vali)
    print("training start...")
    model = lgb.train(param, train, num_boost_round=1000, valid_sets=[train, vali], early_stopping_rounds=20, verbose_eval=10, feval=f1_eval)
    lgb_valid_result[:, n_fold - 1] = model.predict(X, num_iteration=model.best_iteration)
    sub_preds[:, n_fold - 1] = model.predict(X_test,num_iteration=model.best_iteration)
1
training start...
Training until validation scores don't improve for 20 rounds.
[10]	training's auc: 0.99662	training's f1_score: 0.973941	valid_1's auc: 0.995966	valid_1's f1_score: 0.970447
[20]	training's auc: 0.998351	training's f1_score: 0.977278	valid_1's auc: 0.998317	valid_1's f1_score: 0.974088
[30]	training's auc: 0.999228	training's f1_score: 0.987605	valid_1's auc: 0.999126	valid_1's f1_score: 0.981038
[40]	training's auc: 0.999394	training's f1_score: 0.992463	valid_1's auc: 0.999295	valid_1's f1_score: 0.986363
[50]	training's auc: 0.999637	training's f1_score: 0.995069	valid_1's auc: 0.999496	valid_1's f1_score: 0.990622
[60]	training's auc: 0.999705	training's f1_score: 0.996553	valid_1's auc: 0.999573	valid_1's f1_score: 0.992852
[70]	training's auc: 0.999789	training's f1_score: 0.997415	valid_1's auc: 0.999732	valid_1's f1_score: 0.994331
[80]	training's auc: 0.999892	training's f1_score: 0.998031	valid_1's auc: 0.999796	valid_1's f1_score: 0.995814
[90]	training's auc: 0.999945	training's f1_score: 0.998523	valid_1's auc: 0.999843	valid_1's f1_score: 0.995818
[100]	training's auc: 0.999979	training's f1_score: 0.998769	valid_1's auc: 0.999889	valid_1's f1_score: 0.996063
[110]	training's auc: 0.999994	training's f1_score: 0.999138	valid_1's auc: 0.999902	valid_1's f1_score: 0.996559
[120]	training's auc: 0.999999	training's f1_score: 0.999508	valid_1's auc: 0.999925	valid_1's f1_score: 0.996805
[130]	training's auc: 1	training's f1_score: 0.999877	valid_1's auc: 0.999943	valid_1's f1_score: 0.997052
[140]	training's auc: 1	training's f1_score: 0.999877	valid_1's auc: 0.99996	valid_1's f1_score: 0.997544
Early stopping, best iteration is:
[122]	training's auc: 1	training's f1_score: 0.999877	valid_1's auc: 0.999926	valid_1's f1_score: 0.99656
2
training start...
Training until validation scores don't improve for 20 rounds.
[10]	training's auc: 0.996542	training's f1_score: 0.974956	valid_1's auc: 0.995626	valid_1's f1_score: 0.972514
[20]	training's auc: 0.998214	training's f1_score: 0.979526	valid_1's auc: 0.997668	valid_1's f1_score: 0.976814
[30]	training's auc: 0.999261	training's f1_score: 0.987119	valid_1's auc: 0.998915	valid_1's f1_score: 0.982107
[40]	training's auc: 0.999429	training's f1_score: 0.993086	valid_1's auc: 0.999279	valid_1's f1_score: 0.988881
[50]	training's auc: 0.999632	training's f1_score: 0.996185	valid_1's auc: 0.999545	valid_1's f1_score: 0.993355
[60]	training's auc: 0.999714	training's f1_score: 0.996555	valid_1's auc: 0.999648	valid_1's f1_score: 0.993107
[70]	training's auc: 0.999825	training's f1_score: 0.997662	valid_1's auc: 0.999719	valid_1's f1_score: 0.99385
[80]	training's auc: 0.999891	training's f1_score: 0.998524	valid_1's auc: 0.999806	valid_1's f1_score: 0.995333
[90]	training's auc: 0.999931	training's f1_score: 0.999016	valid_1's auc: 0.999827	valid_1's f1_score: 0.995575
[100]	training's auc: 0.999959	training's f1_score: 0.999262	valid_1's auc: 0.999838	valid_1's f1_score: 0.997052
[110]	training's auc: 0.999992	training's f1_score: 0.999262	valid_1's auc: 0.999853	valid_1's f1_score: 0.997298
[120]	training's auc: 1	training's f1_score: 0.999631	valid_1's auc: 0.999872	valid_1's f1_score: 0.997543
[130]	training's auc: 1	training's f1_score: 0.999754	valid_1's auc: 0.999898	valid_1's f1_score: 0.997788
[140]	training's auc: 1	training's f1_score: 0.999877	valid_1's auc: 0.999928	valid_1's f1_score: 0.998033
Early stopping, best iteration is:
[128]	training's auc: 1	training's f1_score: 0.999754	valid_1's auc: 0.99989	valid_1's f1_score: 0.997298
3
training start...
Training until validation scores don't improve for 20 rounds.
[10]	training's auc: 0.996393	training's f1_score: 0.968288	valid_1's auc: 0.994391	valid_1's f1_score: 0.955815
[20]	training's auc: 0.997642	training's f1_score: 0.980397	valid_1's auc: 0.99634	valid_1's f1_score: 0.975025
[30]	training's auc: 0.999048	training's f1_score: 0.987608	valid_1's auc: 0.998155	valid_1's f1_score: 0.982891
[40]	training's auc: 0.999359	training's f1_score: 0.993224	valid_1's auc: 0.998853	valid_1's f1_score: 0.98816
[50]	training's auc: 0.999644	training's f1_score: 0.995691	valid_1's auc: 0.999332	valid_1's f1_score: 0.991626
[60]	training's auc: 0.999739	training's f1_score: 0.996678	valid_1's auc: 0.999446	valid_1's f1_score: 0.994328
[70]	training's auc: 0.999825	training's f1_score: 0.997663	valid_1's auc: 0.99955	valid_1's f1_score: 0.994576
[80]	training's auc: 0.999883	training's f1_score: 0.998524	valid_1's auc: 0.999613	valid_1's f1_score: 0.99655
[90]	training's auc: 0.999951	training's f1_score: 0.998893	valid_1's auc: 0.999633	valid_1's f1_score: 0.996305
[100]	training's auc: 0.999992	training's f1_score: 0.999262	valid_1's auc: 0.999673	valid_1's f1_score: 0.996552
[110]	training's auc: 0.999999	training's f1_score: 0.999385	valid_1's auc: 0.999695	valid_1's f1_score: 0.997044
[120]	training's auc: 1	training's f1_score: 0.999508	valid_1's auc: 0.999731	valid_1's f1_score: 0.996797
Early stopping, best iteration is:
[109]	training's auc: 0.999999	training's f1_score: 0.999262	valid_1's auc: 0.999697	valid_1's f1_score: 0.997044
In [45]:
plt.figure(figsize=[30, 10])
plt.bar(features, model.feature_importance(importance_type='split'))
Out[45]:
<BarContainer object of 23 artists>
In [46]:
lgb_result = np.mean(sub_preds, axis=1)
estimate['lightgbm'] = np.array([f1_score(y_test, np.round(lgb_result)), accuracy_score(y_test, np.round(lgb_result))])
print('f1 score:', f1_score(y_test, np.round(lgb_result)))
print('accuracy:', accuracy_score(y_test, np.round(lgb_result)))
f1 score: 0.6440677966101694
accuracy: 0.9920333839150227
In [47]:
lgb_result_data = pd.DataFrame(data=np.round(lgb_result), columns=['lgb_result'])
lgb_result_data.lgb_result = lgb_result_data.lgb_result.astype(int)
lgb_result_data['true'] = y_test.values
lgb_result_data[lgb_result_data['true'] == 1]
Out[47]:
lgb_result true
13 1 1
188 1 1
228 1 1
286 1 1
381 0 1
426 1 1
585 1 1
722 1 1
811 1 1
814 1 1
853 0 1
1017 1 1
1053 0 1
1143 1 1
1223 1 1
1263 1 1
1339 1 1
1349 0 1
1392 1 1
1399 1 1
1436 0 1
1794 1 1
1827 1 1
2085 1 1
In [48]:
print(lgb_result_data.lgb_result.value_counts())
print(lgb_result_data.true.value_counts())
0    2601
1      35
Name: lgb_result, dtype: int64
0    2612
1      24
Name: true, dtype: int64

3.5 A Fully Connected Neural Network

Using the rescaled dataset

In [49]:
import torch
import torch.nn.functional as f
from torch.autograd import Variable
import matplotlib.pyplot as plt
# Training set
x0 = torch.FloatTensor(X_standard)
y0 = torch.LongTensor(y)
# Test set
x1 = torch.FloatTensor(X_test_standard)
y1 = torch.LongTensor(y_test.values)
x_, y_ = Variable(x0), Variable(y0)  

The network structure is as follows: a fully connected neural network

In [50]:
class Net(torch.nn.Module):
    def __init__(self, n_feature, n_hidden, n_output):
        super(Net, self).__init__()
        self.hidden = torch.nn.Linear(n_feature, n_hidden)
        self.out = torch.nn.Linear(n_hidden, n_output)
        
    def forward(self, x):
        x = f.relu(self.hidden(x))
        y = self.out(x)
        return y

net = Net(n_feature=23, n_hidden=100, n_output=2)
net.train()  # make sure the network is in training mode
optimizer = torch.optim.SGD(net.parameters(), lr=0.02)
loss_func = torch.nn.CrossEntropyLoss()
for i in range(4000):
    out = net(x_)
    loss = loss_func(out, y_)
    if i % 200 == 0:
        print(loss)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
tensor(0.7033, grad_fn=<NllLossBackward>)
tensor(0.6324, grad_fn=<NllLossBackward>)
tensor(0.5529, grad_fn=<NllLossBackward>)
tensor(0.4753, grad_fn=<NllLossBackward>)
tensor(0.4174, grad_fn=<NllLossBackward>)
tensor(0.3747, grad_fn=<NllLossBackward>)
tensor(0.3408, grad_fn=<NllLossBackward>)
tensor(0.3128, grad_fn=<NllLossBackward>)
tensor(0.2894, grad_fn=<NllLossBackward>)
tensor(0.2696, grad_fn=<NllLossBackward>)
tensor(0.2528, grad_fn=<NllLossBackward>)
tensor(0.2387, grad_fn=<NllLossBackward>)
tensor(0.2266, grad_fn=<NllLossBackward>)
tensor(0.2163, grad_fn=<NllLossBackward>)
tensor(0.2074, grad_fn=<NllLossBackward>)
tensor(0.1996, grad_fn=<NllLossBackward>)
tensor(0.1927, grad_fn=<NllLossBackward>)
tensor(0.1866, grad_fn=<NllLossBackward>)
tensor(0.1811, grad_fn=<NllLossBackward>)
tensor(0.1762, grad_fn=<NllLossBackward>)
In [51]:
X_predicteds = []
_, predicted = torch.max(net(x_), 1)
X_predicteds.extend(predicted.int().tolist())
In [52]:
predicteds = []
_, predicted = torch.max(net(x1), 1)
predicteds.extend(predicted.int().tolist())
In [53]:
estimate['dnn'] = np.array([f1_score(y_test.values, np.array(predicteds)), accuracy_score(y_test.values, np.array(predicteds))])
print('f1_score', f1_score(y_test.values, np.array(predicteds)))
print('accuracy_score', accuracy_score(y_test.values, np.array(predicteds)))
f1_score 0.21276595744680848
accuracy_score 0.9438543247344461
In [54]:
dnn_result_data = pd.DataFrame(data=np.array(predicteds), columns=['dnn_result'])
dnn_result_data['true'] = y_test.values
dnn_result_data[dnn_result_data['true'] == 1]
Out[54]:
dnn_result true
13 1 1
188 1 1
228 1 1
286 1 1
381 0 1
426 1 1
585 0 1
722 1 1
811 1 1
814 1 1
853 0 1
1017 1 1
1053 1 1
1143 1 1
1223 1 1
1263 1 1
1339 1 1
1349 0 1
1392 1 1
1399 1 1
1436 1 1
1794 1 1
1827 1 1
2085 1 1
In [55]:
print(dnn_result_data.dnn_result.value_counts())
print(dnn_result_data.true.value_counts())
0    2472
1     164
Name: dnn_result, dtype: int64
0    2612
1      24
Name: true, dtype: int64

3.6 Model Fusion with Stacking

The approach is to combine the outputs of the four models above and then train a random forest as a second-level classifier
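
For reference, newer versions of scikit-learn (0.22 and later) ship a built-in StackingClassifier that wraps the same idea; the sketch below is only an alternative formulation under that assumption, whereas the cells that follow build the second-level features by hand:

# Hedged sketch: the same idea with sklearn's StackingClassifier (requires scikit-learn >= 0.22)
from sklearn.ensemble import StackingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression

stack = StackingClassifier(
    estimators=[('lr', LogisticRegression(solver='lbfgs', max_iter=1000)),
                ('rf', RandomForestClassifier(n_estimators=50, max_depth=10))],
    final_estimator=RandomForestClassifier(n_estimators=10, max_depth=2),
    cv=3)
# stack.fit(X_standard, y); print(stack.score(X_test_standard, y_test))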

In [56]:
rf_valid = np.mean(rf_valid_result, axis=1)
lgb_valid = np.mean(lgb_valid_result, axis=1)
new_X = pd.DataFrame()
new_X['lr'] = lr.predict(X_standard)
new_X['rf'] = rf_valid
new_X['lgb'] = lgb_valid
new_X['dnn'] = np.array(X_predicteds)
new_X_test = pd.DataFrame()
new_X_test['lr'] = lr.predict(X_test_standard)
new_X_test['rf'] = rf_result
new_X_test['lgb'] = lgb_result
new_X_test['dnn'] = np.array(predicteds)
In [57]:
sub_preds = np.zeros((new_X_test.shape[0], skf.n_splits))
for n_fold, (train_index, vali_index) in enumerate(skf.split(new_X, y), start=1):
    print(n_fold)
    x_train, y_train, x_vali, y_vali = np.array(new_X)[train_index], np.array(y)[train_index], np.array(new_X)[vali_index], np.array(y)[vali_index]
    rf = RandomForestClassifier(n_estimators=10, max_features=2, max_depth=2, n_jobs=2)
    rf.fit(x_train, y_train)
    print(rf.score(x_vali, y_vali))
    sub_preds[:, n_fold - 1] = rf.predict(new_X_test)
1
0.999015263417036
2
0.9992610837438424
3
0.9997536945812808
In [58]:
stacking_result = np.mean(sub_preds, axis=1)
estimate['stacking'] = np.array([f1_score(y_test, np.round(stacking_result)), accuracy_score(y_test, np.round(stacking_result))])
print('f1_score:', f1_score(y_test, np.round(stacking_result)))
print('accuracy:', accuracy_score(y_test, np.round(stacking_result)))
f1_score: 0.7037037037037038
accuracy: 0.9939301972685888
In [59]:
stacking_result_data = pd.DataFrame(data=np.round(stacking_result), columns=['stacking_result'])
stacking_result_data['true'] = y_test.values
stacking_result_data.stacking_result = stacking_result_data.stacking_result.astype(int)
stacking_result_data[stacking_result_data['true'] == 1]
Out[59]:
stacking_result true
13 1 1
188 1 1
228 1 1
286 1 1
381 0 1
426 1 1
585 1 1
722 1 1
811 1 1
814 1 1
853 0 1
1017 1 1
1053 0 1
1143 1 1
1223 1 1
1263 1 1
1339 1 1
1349 0 1
1392 1 1
1399 1 1
1436 0 1
1794 1 1
1827 1 1
2085 1 1
In [60]:
print(stacking_result_data.stacking_result.value_counts())
print(stacking_result_data.true.value_counts())
0    2606
1      30
Name: stacking_result, dtype: int64
0    2612
1      24
Name: true, dtype: int64
In [61]:
estimate
Out[61]:
logistic regression random forest lightgbm dnn stacking
f1_score 0.190955 0.760000 0.644068 0.212766 0.703704
accuracy 0.938923 0.995448 0.992033 0.943854 0.993930
In [62]:
result_data = pd.DataFrame(data=np.round(lr_result), columns=['lr_result'])
result_data['rf_result'] = np.round(rf_result)
result_data['lgb_result'] = np.round(lgb_result)
result_data['dnn_result'] = np.array(predicteds)
result_data['stacking_result'] = np.round(stacking_result)
result_data['true'] = y_test.values
result_data.lr_result = result_data.lr_result.astype(int)
result_data.rf_result = result_data.rf_result.astype(int)
result_data.stacking_result = result_data.stacking_result.astype(int)
result_data.lgb_result = result_data.lgb_result.astype(int)
result_data[result_data['true'] == 1]
Out[62]:
lr_result rf_result lgb_result dnn_result stacking_result true
13 1 1 1 1 1 1
188 1 1 1 1 1 1
228 1 1 1 1 1 1
286 1 1 1 1 1 1
381 0 0 0 0 0 1
426 1 1 1 1 1 1
585 0 1 1 0 1 1
722 1 1 1 1 1 1
811 1 1 1 1 1 1
814 1 1 1 1 1 1
853 0 0 0 0 0 1
1017 1 1 1 1 1 1
1053 0 0 0 1 0 1
1143 1 1 1 1 1 1
1223 1 1 1 1 1 1
1263 1 1 1 1 1 1
1339 1 1 1 1 1 1
1349 0 0 0 0 0 1
1392 1 1 1 1 1 1
1399 1 1 1 1 1 1
1436 1 0 0 1 0 1
1794 1 1 1 1 1 1
1827 1 1 1 1 1 1
2085 1 1 1 1 1 1

4. Summary

  1. What I learned the most from this project was how to handle an imbalanced class distribution. I did not notice the problem at first: the random forest reached about 99% accuracy, but when I looked for the predicted failures I found that not a single sample was predicted as 1. Tracing the cause, I discovered that the positive-to-negative ratio was about 1:100, so I had to think about how to deal with an imbalanced dataset. After reading up on it I learned about oversampling and undersampling; after applying oversampling the accuracy dropped slightly, but roughly 70% of the true failures were now predicted correctly.

  2. Previously I was most comfortable with linear models and tree models; in this project I used a neural network for the first time. Although it is very simple, it was convenient and quick to use. However, probably because of the model or its hyper-parameters, the neural network's results were not very encouraging.

  3. I also tried model fusion for the first time, stacking the four models above, and obtained a better result, which was exciting. Since I only built a handful of models, I should try fusing more models in the future.

  4. Thanks to my teacher and senior for their guidance; I learned a great deal from this project. Thank you!