sklearn | Decision Trees & Random Forests

A worked example with random forests

The idea behind random forests is simple, so I won't belabor it; rather than implementing it from scratch, let's just call the library.
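That said, the core idea (bagging plus per-split feature subsampling) fits in a dozen lines. A minimal sketch for intuition only, not how sklearn implements it; it assumes X and y are NumPy arrays with 0/1 labels:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_forest(X, y, n_trees = 10, seed = 0):
    rng = np.random.RandomState(seed)
    trees = []
    for _ in range(n_trees):
        idx = rng.randint(0, len(X), len(X))               # bootstrap: sample rows with replacement
        t = DecisionTreeClassifier(max_features = 'sqrt')  # each split searches a random feature subset
        trees.append(t.fit(X[idx], y[idx]))
    return trees

def predict_forest(trees, X):
    votes = np.stack([t.predict(X) for t in trees])    # shape: (n_trees, n_samples)
    return (votes.mean(axis = 0) >= 0.5).astype(int)   # majority vote (binary 0/1 labels assumed)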

1. Load the data

The data come from https://www.kaggle.com/ludobenistant/hr-analytics

  • satisfaction_level : satisfaction score

  • last_evaluation : performance evaluation score

  • number_project : number of projects completed

  • average_montly_hours : average monthly working hours

  • time_spend_company : years at the company

  • Work_accident : whether the employee had a work accident

  • left : whether the employee left (our dependent variable; a quick-and-dirty attrition analysis)

  • promotion_last_5years : whether promoted in the last 5 years

  • sales : department

  • salary : salary level

import pandas as pd
import numpy as np
import os
%matplotlib inline

## Display every expression's result in a cell, not just the last one (applies to the whole notebook)
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

## Read in the data
os.chdir("/Users/mac/Desktop/快乐研一/python/机器学习/集成学习-例子")
df0 = pd.read_csv("HR_comma_sep.csv")
df = df0.rename(columns={'sales' : 'department'})   # the 'sales' column actually stores the department


## Check for missing values
df.isnull().any()
df.head()
df.shape
(14999, 10)

2. Descriptive statistics

2.1 Continuous variables

df.describe()
satisfaction_level last_evaluation number_project average_montly_hours time_spend_company Work_accident left promotion_last_5years
count 14999.000000 14999.000000 14999.000000 14999.000000 14999.000000 14999.000000 14999.000000 14999.000000
mean 0.612834 0.716102 3.803054 201.050337 3.498233 0.144610 0.238083 0.021268
std 0.248631 0.171169 1.232592 49.943099 1.460136 0.351719 0.425924 0.144281
min 0.090000 0.360000 2.000000 96.000000 2.000000 0.000000 0.000000 0.000000
25% 0.440000 0.560000 3.000000 156.000000 3.000000 0.000000 0.000000 0.000000
50% 0.640000 0.720000 4.000000 200.000000 3.000000 0.000000 0.000000 0.000000
75% 0.820000 0.870000 5.000000 245.000000 4.000000 0.000000 0.000000 0.000000
max 1.000000 1.000000 7.000000 310.000000 10.000000 1.000000 1.000000 1.000000

2.2 Correlation matrix

corr = df.select_dtypes(include=[np.number]).corr()   # keep only the numeric columns (department/salary are strings)
import seaborn as sns

sns.heatmap(corr, 
            xticklabels=corr.columns.values,
            yticklabels=corr.columns.values)

Painfully true to corporate life: performance evaluation is positively correlated with hours worked and number of projects.
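To read the heatmap numerically, one can sort each feature's correlation with left (a small follow-up using the corr computed above; not in the original):

# correlation of each numeric feature with leaving, from most negative to most positive
corr['left'].drop('left').sort_values()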

2.3 Categorical variables

print(df.department.value_counts() / len(df), "\n")   # department distribution
print(df.salary.value_counts() / len(df))   # salary level distribution
sales          0.276018
technical      0.181345
support        0.148610
IT             0.081805
product_mng    0.060137
marketing      0.057204
RandD          0.052470
accounting     0.051137
hr             0.049270
management     0.042003
Name: department, dtype: float64 

low       0.487766
medium    0.429762
high      0.082472
Name: salary, dtype: float64

2.4 Attrition

Attrition rate

# attrition rate: left is 0/1, so the mean is the share who left
print("Attrition rate: {}".format(df.left.mean()))

# mean of each feature, grouped by left
df.groupby('left').mean()
Attrition rate: 0.2380825388359224
satisfaction_level last_evaluation number_project average_montly_hours time_spend_company Work_accident promotion_last_5years
left
0 0.666810 0.715473 3.786664 199.060203 3.380032 0.175009 0.026251
1 0.440098 0.718113 3.855503 207.419210 3.876505 0.047326 0.005321

Roughly speaking, those who left were less satisfied, worked longer hours, and hadn't been promoted.

Next, let's see whether the two populations (left vs. not left) differ in the distributions of some of the features.

Satisfaction - left vs. not left

import matplotlib.pyplot as plt
import matplotlib.font_manager as mfm

# a CJK-capable font, kept for rendering any Chinese text in figures
font_path = r"/Users/mac/Library/Fonts/字体管家方萌简(非商业使用)v1.1.ttf"
prop = mfm.FontProperties(fname = font_path)

fig = plt.figure(figsize = (15, 4))
ax1 = sns.kdeplot(df.loc[(df['left'] == 0), 'satisfaction_level'], color = 'orange', shade = True, label = 'not left')
ax1 = sns.kdeplot(df.loc[(df['left'] == 1), 'satisfaction_level'], color = 'dodgerblue', shade = True, label = 'left')
ax1.set(xlabel = 'satisfaction', ylabel = 'density')
plt.title('Satisfaction - left vs. not left', fontproperties = prop, fontsize = 30)

Performance evaluation - left vs. not left

fig = plt.figure(figsize = (15, 4))
ax1 = sns.kdeplot(df.loc[(df['left'] == 0), 'last_evaluation'], color = 'orange', shade = True, label = 'not left')
ax1 = sns.kdeplot(df.loc[(df['left'] == 1), 'last_evaluation'], color = 'dodgerblue', shade = True, label = 'left')
ax1.set(xlabel = 'last_evaluation', ylabel = 'density')
plt.title('Performance evaluation - left vs. not left', fontproperties = prop, fontsize = 30)

So the picture is bimodal: both the underperformers and the top performers leave, and the ones who stay cluster in the middle.
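The KDE plots are suggestive; a nonparametric two-sample test can confirm that the groups really differ. A quick sketch using scipy.stats (an addition, assuming scipy is available):

from scipy import stats

for col in ['satisfaction_level', 'last_evaluation']:
    stayed = df.loc[df['left'] == 0, col]
    gone = df.loc[df['left'] == 1, col]
    u, p = stats.mannwhitneyu(stayed, gone, alternative = 'two-sided')   # rank-based, no normality assumption
    print("{}: U = {:.0f}, p = {:.3g}".format(col, u, p))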

3. Modeling

# 1. preprocessing
from sklearn.model_selection import train_test_split

# 2. decision tree and random forest
from sklearn import tree  # decision tree
from sklearn.ensemble import RandomForestClassifier  # random forest
from sklearn.metrics import accuracy_score # prediction accuracy

3.1 Preprocessing

  • Extract X and y
  • Convert the string variables to the category dtype, then encode them as integer codes
  • Split into training and test sets
# 1. extract features and target
feature_names = [col for col in df.columns if col != 'left']
X = df.drop('left', axis = 1)
y = df['left']


# 2. encode the string variables as integer codes
print(dict(enumerate(df["department"].astype('category').cat.categories)))   # code -> department mapping
X["department"] = X["department"].astype('category').cat.codes
X["salary"] = X["salary"].astype('category').cat.codes


# 3. split into training and test sets
# stratify = y keeps the share of leavers in both splits equal to the share in the full data
X_train, X_test, y_train, y_test = train_test_split( X, y, test_size = 0.2, stratify = y)
X_train.shape, X_test.shape
X_train.head()
{0: 'IT', 1: 'RandD', 2: 'accounting', 3: 'hr', 4: 'management', 5: 'marketing', 6: 'product_mng', 7: 'sales', 8: 'support', 9: 'technical'}


((11999, 9), (3000, 9))
satisfaction_level last_evaluation number_project average_montly_hours time_spend_company Work_accident promotion_last_5years department salary
13660 0.50 0.77 3 267 2 0 0 4 0
5606 0.66 0.92 4 239 3 0 0 3 2
12322 0.61 0.47 2 253 3 0 0 7 1
2 0.11 0.88 7 272 4 0 0 7 2
587 0.80 0.83 2 211 3 0 0 8 1
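One caveat: cat.codes imposes an arbitrary integer ordering on department. Trees can usually split around this, but one-hot encoding is a common alternative; a sketch with pd.get_dummies (not what this walkthrough uses):

# alternative encoding: one dummy column per category instead of integer codes
X_onehot = pd.get_dummies(df.drop('left', axis = 1), columns = ['department', 'salary'])
X_onehot.shape   # 7 numeric features + 10 department dummies + 3 salary dummies = 20 columns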

3.2 Decision tree

# decision tree
clf = tree.DecisionTreeClassifier(
    criterion = 'gini',   # 'entropy' would use information gain instead (ID3-style)
    max_depth = 5,   # limit the depth of the tree
    min_weight_fraction_leaf = 0.005  # minimum weighted fraction of samples required at a leaf
)
clf = clf.fit(X_train, y_train)     # train
clf.score(X_train, y_train)       # training accuracy
y_pred_clf = clf.predict(X_test)    # predict on the test set
accuracy_score(y_test, y_pred_clf)  # test accuracy
0.9674139511625969

0.9646666666666667
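A single train/test split gives a noisy accuracy estimate; cross-validation averages over several splits. A small sketch using sklearn's cross_val_score, reusing the clf configuration above (an addition to the original walkthrough):

from sklearn.model_selection import cross_val_score

scores = cross_val_score(clf, X, y, cv = 5, scoring = 'accuracy')   # clf is cloned and refit on each fold
print("CV accuracy: {:.3f} +/- {:.3f}".format(scores.mean(), scores.std()))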
## Visualization:
from io import StringIO   # sklearn.externals.six was removed in sklearn 0.23
from sklearn.tree import export_graphviz
import pydotplus
from IPython.display import Image

dot_data = StringIO()
export_graphviz(clf, out_file = dot_data,
                filled = True, rounded = True,
                special_characters = True,
                feature_names = feature_names,
                class_names = ['0','1'])
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())    # build the graph
Image(graph.create_png())     # display it
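If graphviz/pydotplus isn't installed, sklearn >= 0.21 ships its own plotter; a drop-in alternative (an addition, not in the original):

from sklearn.tree import plot_tree

plt.figure(figsize = (20, 8))
plot_tree(clf, feature_names = feature_names, class_names = ['0', '1'],
          filled = True, rounded = True, max_depth = 2)   # truncate the display for readability
plt.show()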

3.3 Random forest

# random forest
rf = RandomForestClassifier(
    criterion = 'gini',
    n_estimators = 10,     # build 10 trees in total
    max_depth = None,      # no depth limit
    min_samples_split = 5, # a node needs at least 5 samples to be split
)

rf = rf.fit(X_train, y_train)     # train
rf.score(X_train, y_train)       # training accuracy
y_pred_rf = rf.predict(X_test)    # predict on the test set
accuracy_score(y_test, y_pred_rf)  # test accuracy
0.9947495624635386

0.9866666666666667
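The settings above were picked by hand; a small grid search is the usual next step. A sketch with GridSearchCV (the grid values are illustrative, not tuned):

from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_estimators': [10, 50, 100],
    'max_depth': [None, 10, 20],
}
search = GridSearchCV(RandomForestClassifier(), param_grid, cv = 3, scoring = 'accuracy')
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)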

3.4 Comparing the decision tree and the random forest

Feature importance

# feature_importances_ is computed from the impurity (entropy / Gini) decrease achieved at each split
importances = pd.DataFrame({
    "feature": feature_names,
    "rf_importances": rf.feature_importances_,
    "tree_importances": clf.feature_importances_,
})
importances

plt.figure(figsize = (12, 6))
plt.title("Feature importances by RandomForest")
plt.bar(range(len(importances)), importances['rf_importances'], color='lightblue',  align="center")
plt.step(range(len(importances)), np.cumsum(importances['rf_importances']), where='mid', label='Cumulative')
plt.xticks(range(len(importances)), feature_names, rotation='vertical', fontsize=14)
plt.xlim([-1, len(importances)])
plt.show()

plt.figure(figsize = (12, 6))
plt.title("Feature importances by RandomForest")
plt.bar(range(len(importances)), importances['tree_importances'], color='lightblue',  align="center")
plt.step(range(len(importances)), np.cumsum(importances['tree_importances']), where='mid', label='Cumulative')
plt.xticks(range(len(importances)), feature_names, rotation='vertical', fontsize=14)
plt.xlim([-1, len(importances)])
plt.show()
feature rf_importances tree_importances
0 satisfaction_level 0.343161 0.563355
1 last_evaluation 0.145291 0.141229
2 number_project 0.173969 0.101539
3 average_montly_hours 0.136404 0.054792
4 time_spend_company 0.176457 0.138497
5 Work_accident 0.008887 0.000000
6 promotion_last_5years 0.000986 0.000000
7 department 0.009074 0.000589
8 salary 0.005771 0.000000

As you can see, the random forest spreads its feature importances much more evenly.

That's because each split only searches a randomly chosen subset of the features, which gives otherwise less important attributes more chances to be selected as split variables.
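This knob is max_features. Turning the subsampling off (max_features = None lets every split search all features) should pull the forest's importances back toward the single tree's profile; a quick experiment (an addition; results vary from run to run):

# same forest, but every split may consider all features
rf_all = RandomForestClassifier(n_estimators = 10, max_features = None).fit(X_train, y_train)
pd.DataFrame({
    'feature': feature_names,
    'random_subset': rf.feature_importances_,    # default: sqrt-sized feature subset per split
    'all_features': rf_all.feature_importances_,
}).round(3)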

ROC curves

from sklearn.metrics import roc_curve
rf_fpr, rf_tpr, rf_thresholds = roc_curve(y_test, rf.predict_proba(X_test)[:,1])
dt_fpr, dt_tpr, dt_thresholds = roc_curve(y_test, clf.predict_proba(X_test)[:,1])

plt.figure()
plt.plot(rf_fpr, rf_tpr, label='Random Forest')
plt.plot(dt_fpr, dt_tpr, label='Decision Tree')
plt.title('ROC Graph')
plt.legend(loc="lower right")
plt.show()
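To summarize each curve with one number, compute the area under it; a follow-up sketch using roc_auc_score (an addition):

from sklearn.metrics import roc_auc_score

print("RF   AUC: {:.4f}".format(roc_auc_score(y_test, rf.predict_proba(X_test)[:, 1])))
print("Tree AUC: {:.4f}".format(roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])))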