Classification Models: LDA & SVM
- Loading and preparing the data
SVM in sklearn: LinearSVC, NuSVC, and SVC
- LinearSVC
- NuSVC
- SVC
- Common methods:
Example: Gaussian-kernel SVM
Example: LDA

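Classification Models: LDA & SVM

We work through one example, the speed dating dataset, and for now use only two predictor variables to keep things easy to follow.

Loading and preparing the data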

# Load libraries
import os
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

## Show the result of every expression in a cell, not just the last one (applies to the whole notebook)
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"


## Suppress warnings
import warnings
warnings.filterwarnings("ignore")

## Read the data
os.chdir("/Users/mac/Desktop/狗熊会/分类model/约会数据集")
yuehui0 = pd.read_csv("Speed Dating Data.csv", encoding="gbk")
yuehui0.head()

## Select the variables: decision dec (accept or not), liking like, attractiveness attr
## .copy() gives an independent DataFrame rather than a view of yuehui0
yuehui1 = yuehui0[["dec", "like", "attr"]].copy()
yuehui1.shape

## Drop rows with any missing value
## (recent pandas rejects passing both how and thresh, so only how is given)
yuehui1 = yuehui1.dropna(axis=0, how="any")
yuehui1.shape
yuehui1.info()

	iid	id	idg	condtn	wave	round	position	positin1	order	...	attr3_3	sinc3_3	intel3_3	fun3_3	amb3_3	attr5_3	sinc5_3	intel5_3	fun5_3	amb5_3
0	1	1.0	1	1	1	10	7	NaN	4	...	5.0	7.0	7.0	7.0	7.0	NaN	NaN	NaN	NaN	NaN
1	1	1.0	1	1	1	10	7	NaN	3	...	5.0	7.0	7.0	7.0	7.0	NaN	NaN	NaN	NaN	NaN
2	1	1.0	1	1	1	10	7	NaN	10	...	5.0	7.0	7.0	7.0	7.0	NaN	NaN	NaN	NaN	NaN
3	1	1.0	1	1	1	10	7	NaN	5	...	5.0	7.0	7.0	7.0	7.0	NaN	NaN	NaN	NaN	NaN
4	1	1.0	1	1	1	10	7	NaN	7	...	5.0	7.0	7.0	7.0	7.0	NaN	NaN	NaN	NaN	NaN

5 rows × 195 columns

(8378, 3)

(8122, 3)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8122 entries, 0 to 8377
Data columns (total 3 columns):
dec     8122 non-null int64
like    8122 non-null float64
attr    8122 non-null float64
dtypes: float64(2), int64(1)
memory usage: 253.8 KB

## Split into training and test sets
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(yuehui1[["like", "attr"]], yuehui1["dec"], test_size=0.25, random_state=0)
print('Training set shape: {}, test set shape: {} \n'.format(y_train.shape, y_test.shape))

Training set shape: (6091,), test set shape: (2031,)

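SVM in sklearn: LinearSVC, NuSVC, and SVC

LinearSVC

penalty: the norm used for regularization, 'l1' or 'l2'; only LinearSVC has this parameter.
loss: the loss function, either 'hinge' or 'squared_hinge'. The former is also called the L1 loss and the latter the L2 loss. The default is 'squared_hinge'. hinge is the standard SVM loss and squared_hinge is its square.
dual: whether to solve the dual formulation of the problem; the default is True in the sklearn version used here.
tol: tolerance for the stopping criterion, default 0.0001, the same as in LR.
C: the penalty coefficient on the loss term. A larger C penalizes misclassified training points more heavily, so C acts as the inverse of the regularization strength, like C in LR.
multi_class: the strategy for multiclass problems, either 'ovr' or 'crammer_singer', default 'ovr'. 'ovr' (one-vs-rest) trains one binary classifier per class, treating that class as positive and all others as negative, and predicts the class whose classifier gives the highest decision value. 'crammer_singer' optimizes a single joint objective over all classes and obtains the parameters for every class at once.
fit_intercept: whether to fit an intercept, with the same meaning as in the LR model.
class_weight: as in other models, used to handle imbalanced data. Per-class weights can be given as a dict, or the value 'balanced' can be used.
verbose: whether to print verbose output; off by default.
random_state: the random seed.
max_iter: the maximum number of iterations, default 1000.

Attributes

coef_: the coefficient (weight) of each feature.
intercept_: the intercept (constant term).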
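
To make the parameters and attributes above concrete, here is a minimal sketch. It does not use the speed dating data: the two-feature dataset from `make_classification` is an assumed stand-in, and the parameter values are illustrative only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

# Synthetic stand-in (an assumption, not the article's data): 2 features, 2 classes
X, y = make_classification(n_samples=500, n_features=2, n_informative=2,
                           n_redundant=0, n_clusters_per_class=1,
                           random_state=0)

# L2 penalty with squared hinge loss, solved in the dual
clf = LinearSVC(penalty="l2", loss="squared_hinge", dual=True, C=1.0,
                tol=1e-4, max_iter=10000, random_state=0)
clf.fit(X, y)

print(clf.coef_)       # one weight per feature, shape (1, 2)
print(clf.intercept_)  # the constant term, shape (1,)

# The decision function of a linear SVM is the linear form w.x + b
manual = X @ clf.coef_.ravel() + clf.intercept_[0]
print(np.allclose(manual, clf.decision_function(X)))
```

The last line checks that `coef_` and `intercept_` fully describe the fitted model: recomputing w.x + b by hand reproduces `decision_function`.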

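NuSVC

nu: an upper bound on the fraction of training errors and a lower bound on the fraction of support vectors; takes values in (0, 1], default 0.5.
kernel: the kernel function, a way of turning a nonlinear problem into a linear one. The default is 'rbf'. Common kernels are 'linear', 'poly', 'rbf', and 'sigmoid'.
degree: the degree of the polynomial when the kernel is 'poly'. (The polynomial kernel maps the low-dimensional input space into a higher-dimensional feature space.)
gamma: the kernel coefficient. In the version used here the default is 'auto', the reciprocal of the number of features; since sklearn 0.22 the default is 'scale'.
coef0: the constant term of the kernel function (analogous to b in y=kx+b); only used by the 'poly' and 'sigmoid' kernels, default 0.
max_iter: the maximum number of iterations, default -1, meaning no limit.
probability: whether to enable probability estimates, default False.
decision_function_shape: similar in meaning to the multi_class parameter; either 'ovo' or 'ovr'.
cache_size: the size of the kernel cache in MB, default 200.

Attributes

support_: the indices of the support vectors, as an array.
support_vectors_: the support vectors themselves.
n_support_: the number of support vectors for each class.
dual_coef_: the coefficients of the support vectors in the decision function.
coef_: the coefficient (weight) of each feature; only available when the kernel is 'linear'.
intercept_: the intercept (constant term).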
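
The attributes listed above can be inspected as in the sketch below. The synthetic dataset and the choice `nu = 0.3` are assumptions made for illustration; they do not come from the speed dating example.

```python
from sklearn.datasets import make_classification
from sklearn.svm import NuSVC

# Synthetic stand-in (an assumption, not the article's data)
X, y = make_classification(n_samples=400, n_features=2, n_informative=2,
                           n_redundant=0, n_clusters_per_class=1,
                           random_state=0)

nu = 0.3  # illustrative value
clf = NuSVC(nu=nu, kernel="rbf", gamma="scale").fit(X, y)

print(clf.support_.shape)          # indices of the support vectors
print(clf.support_vectors_.shape)  # the support vectors themselves
print(clf.n_support_)              # count per class
print(clf.dual_coef_.shape)        # one coefficient per support vector

# nu is a lower bound on the fraction of support vectors
frac_sv = clf.support_.size / X.shape[0]
print(frac_sv, ">= nu =", nu)

# coef_ is only defined for the linear kernel
print(hasattr(clf, "coef_"))
linear_clf = NuSVC(nu=nu, kernel="linear").fit(X, y)
print(linear_clf.coef_.shape)
```

With the RBF kernel the model lives in an implicit feature space, so there is no per-feature weight vector to report; switching to `kernel="linear"` makes `coef_` available.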

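SVC

C: the penalty coefficient.
The rest is essentially the same as NuSVC. The two differ only in how the error penalty is parameterized: SVC uses C, NuSVC uses nu.

Common methods:

decision_function(X): returns the signed decision value of each sample in X, which is proportional to its distance from the separating hyperplane.
fit(X, y): fits the SVM model on the dataset (X, y).
get_params([deep]): returns the parameters of the model.
predict(X): predicts the labels of the samples in X.
score(X, y): returns the mean accuracy on the given test data and labels.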
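
A short sketch of how these methods relate to each other, again on an assumed synthetic dataset rather than the article's data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic stand-in (an assumption, not the article's data)
X, y = make_classification(n_samples=400, n_features=2, n_informative=2,
                           n_redundant=0, n_clusters_per_class=1,
                           random_state=0)

clf = SVC(C=1.0, kernel="rbf", gamma="scale").fit(X, y)

scores = clf.decision_function(X)  # signed decision values
pred = clf.predict(X)              # predicted labels

# In a binary problem a positive decision value means class clf.classes_[1]
from_sign = clf.classes_[(scores > 0).astype(int)]
agree = np.array_equal(pred, from_sign)

# score() is the mean accuracy of predict() against the true labels
acc = clf.score(X, y)
same_acc = np.isclose(acc, np.mean(pred == y))

print(agree, same_acc)
print(clf.get_params()["C"])
```

In other words, `predict` is the sign of `decision_function` mapped to class labels, and `score` is just the share of correct predictions.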

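Example: Gaussian-kernel SVM

Tuning tip: C

The larger C is, the higher the accuracy on the training samples, but the weaker the generalization (risk of overfitting).
A smaller C regularizes more strongly and usually generalizes better, although a C that is too small underfits.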
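
One way to see this trade-off is to fit the same model with several values of C and compare training and test accuracy. The sketch below assumes a noisy synthetic dataset and an arbitrary list of C values; the exact numbers depend on the data and the trend is not guaranteed to be monotone.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Noisy synthetic stand-in (an assumption): flip_y mislabels about 10% of points
X, y = make_classification(n_samples=600, n_features=2, n_informative=2,
                           n_redundant=0, flip_y=0.1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25,
                                          random_state=0)

results = []
for C in [0.01, 1, 100, 1000]:  # illustrative values
    clf = SVC(C=C, kernel="rbf", gamma="scale").fit(X_tr, y_tr)
    results.append((C, clf.score(X_tr, y_tr), clf.score(X_te, y_te)))

for C, train_acc, test_acc in results:
    print(f"C={C:<6} train={train_acc:.3f} test={test_acc:.3f}")
```

A widening gap between the training and test columns as C grows is the sign of overfitting described above.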

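With a fixed hyperparameter value

# Load the svm module
from sklearn import svm
my_svm = svm.SVC(C=88, kernel='rbf')   # Gaussian (RBF) kernel SVM with penalty coefficient C = 88
my_svm.fit(x_train, y_train)   # fit on the training samples

# Inspect the support vectors
my_svm.support_vectors_    # the support vectors
my_svm.n_support_            # number of support vectors per class (quite a lot)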

SVC(C=88, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

array([[7., 7.],
       [8., 6.],
       [8., 6.],
       ...,
       [6., 6.],
       [5., 6.],
       [7., 6.]])

array([1512, 1504], dtype=int32)

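## Prediction
y_pred = my_svm.predict(x_test)   # predicted labels on the test set

# Model evaluation
print('Training accuracy: '+str(round(my_svm.score(x_train, y_train),4)))  # accuracy on the training set
print('Test accuracy: '+str(round(my_svm.score(x_test, y_test),4)))    # accuracy on the test set

# Load the confusion matrix function
from sklearn.metrics import confusion_matrix
c_m = confusion_matrix(y_test,y_pred)
# Print the confusion matrix (rows: true class, columns: predicted class)
print(c_m)

Training accuracy: 0.7611
Test accuracy: 0.7622
[[978 192]
 [291 570]]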

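Automatic hyperparameter search with GridSearchCV

GridSearchCV systematically tries every combination in a parameter grid and uses cross-validation to pick the combination that performs best.

from sklearn.model_selection import GridSearchCV

# Parameter range for cross-validation
C_range = np.logspace(-2, 8, 12, base=2)   # 12 values from 2^-2 to 2^8; range() also works, but C must be strictly greater than 0
param_grid = [{'kernel': ['rbf'], 'C': C_range}]

# The classifier
my_svm1 = svm.SVC(kernel='rbf')    # Gaussian (RBF) kernel SVM

# Search for the best parameters automatically
grid_search= GridSearchCV(my_svm1, param_grid, cv = 3, n_jobs = -1)  # 3-fold cross-validation

# fit
grid_search.fit(x_train,y_train)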

GridSearchCV(cv=3, error_score='raise',
       estimator=SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False),
       fit_params=None, iid=True, n_jobs=-1,
       param_grid=[{'kernel': ['rbf'], 'C': array([2.50000e-01, 4.69465e-01, 8.81591e-01, 1.65551e+00, 3.10881e+00,
       5.83792e+00, 1.09628e+01, 2.05866e+01, 3.86589e+01, 7.25960e+01,
       1.36325e+02, 2.56000e+02])}],
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

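## Prediction
y_pred2 = grid_search.predict(x_test)   # predicted labels on the test set

# Model evaluation
print('Training accuracy: '+str(round(grid_search.score(x_train, y_train),4)))  # accuracy on the training set
print('Test accuracy: '+str(round(grid_search.score(x_test, y_test),4)))    # accuracy on the test set

# Load the confusion matrix function
from sklearn.metrics import confusion_matrix
c_m2 = confusion_matrix(y_test,y_pred2)
# Print the confusion matrix (rows: true class, columns: predicted class)
print(c_m2)

Training accuracy: 0.7577
Test accuracy: 0.7578
[[1004  166]
 [ 326  535]]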
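
The run above does not show which C the search selected. A minimal sketch of how the chosen value and the cross-validation scores could be read from a fitted GridSearchCV object follows; the synthetic data is an assumed stand-in, so the selected C here says nothing about the speed dating data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in (an assumption, not the article's data)
X, y = make_classification(n_samples=300, n_features=2, n_informative=2,
                           n_redundant=0, n_clusters_per_class=1,
                           random_state=0)

C_range = np.logspace(-2, 8, 12, base=2)  # same grid as in the text
search = GridSearchCV(SVC(kernel="rbf", gamma="scale"),
                      {"C": C_range}, cv=3)
search.fit(X, y)

print(search.best_params_)            # the C with the best mean CV accuracy
print(round(search.best_score_, 4))   # that mean CV accuracy

# Mean cross-validated accuracy for every candidate C
mean_scores = search.cv_results_["mean_test_score"]
for C, s in zip(C_range, mean_scores):
    print(f"C={C:8.3f}  mean CV accuracy={s:.4f}")
```

Because `refit=True` by default, calling `predict` or `score` on the search object uses a model refitted on all the training data with `best_params_`.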

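Example: LDA

sklearn.discriminant_analysis.LinearDiscriminantAnalysis

Parameters:

solver: singular value decomposition "svd" (the default), least squares "lsqr", or eigenvalue decomposition "eigen". As a rule of thumb, "svd" is recommended when there are very many features and "eigen" when there are few.
shrinkage: the regularization parameter, default None; it only takes effect with the "lsqr" and "eigen" solvers.
priors: the prior probabilities of the classes (class weights).
n_components: the number of dimensions to reduce to when LDA is used for dimensionality reduction. If it is not used for dimensionality reduction, leave it at the default None.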

# Load and fit
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
lda = LinearDiscriminantAnalysis(solver = "eigen")
lda.fit(x_train, y_train)   # fit on the training samples

# Predict and display
y_pred3 = lda.predict(x_test)   # predicted labels on the test set

# Model evaluation
print('Training accuracy: '+str(round(lda.score(x_train, y_train),4)))  # accuracy on the training set
print('Test accuracy: '+str(round(lda.score(x_test, y_test),4)))    # accuracy on the test set

# Load the confusion matrix function
from sklearn.metrics import confusion_matrix
c_m3 = confusion_matrix(y_test,y_pred3)

# Print the confusion matrix (rows: true class, columns: predicted class)
print(c_m3)

LinearDiscriminantAnalysis(n_components=None, priors=None, shrinkage=None,
              solver='eigen', store_covariance=False, tol=0.0001)

Training accuracy: 0.7519
Test accuracy: 0.7558
[[862 308]
 [188 673]]
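
The example above uses LDA only as a classifier. The n_components and shrinkage parameters from the list can be sketched as follows. This assumes a synthetic three-class, five-feature dataset, because with two classes LDA can project onto at most one dimension.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Synthetic stand-in (an assumption): 3 classes, 5 features
X, y = make_classification(n_samples=300, n_features=5, n_informative=3,
                           n_redundant=0, n_classes=3,
                           n_clusters_per_class=1, random_state=0)

# Dimensionality reduction: n_components can be at most
# min(n_classes - 1, n_features), which is 2 here
lda_2d = LinearDiscriminantAnalysis(solver="svd", n_components=2)
X_2d = lda_2d.fit(X, y).transform(X)
print(X_2d.shape)  # (300, 2)

# Shrinkage requires the "lsqr" or "eigen" solver;
# "auto" picks the shrinkage intensity automatically
lda_shrunk = LinearDiscriminantAnalysis(solver="lsqr", shrinkage="auto")
lda_shrunk.fit(X, y)
shrunk_acc = lda_shrunk.score(X, y)
print(round(shrunk_acc, 4))
```

Shrinkage is mainly useful when there are few samples relative to the number of features, which is not the case for the two-variable example in this article.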
