Data Preprocessing Demo

Experiment 6: Data Preprocessing Demo

I. Objectives

1. Understand z-score standardization of a dataset.
2. Master the different ways of reducing a dataset's dimensionality.

II. Content

1. Call StandardScaler to apply z-score standardization to a dataset.
2. Call PCA and LDA to reduce the dimensionality of a dataset.
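
As a quick illustration of the first item, the sketch below checks that StandardScaler produces the usual z-score, (x - mean) / std, computed per feature. It assumes scikit-learn's bundled copy of the Wine data (load_wine) rather than the UCI download used in the procedure, and the variable names are only illustrative.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler

# Assumption: use the Wine data bundled with scikit-learn (no network access needed)
X, _ = load_wine(return_X_y=True)

# z-score standardization with StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# The same computation by hand: subtract each column's mean and
# divide by its population standard deviation (ddof=0, NumPy's default)
X_manual = (X - X.mean(axis=0)) / X.std(axis=0)

print("matches manual z-score:", np.allclose(X_scaled, X_manual))
print("column means after scaling:", X_scaled.mean(axis=0).round(6))
print("column stds after scaling:", X_scaled.std(axis=0).round(6))
```

After scaling, every feature has mean 0 and standard deviation 1, so features measured in large units no longer dominate variance-based methods such as PCA.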

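III. Procedure

1. Dimensionality reduction. sklearn provides two common dimensionality reduction methods: PCA and LDA. To see how much variance each component explains, fit the reducer without limiting the number of components and read its explained_variance_ratio_ attribute; the ratios help decide how many dimensions to keep. To check the effect of the reduction, loop over the target dimension in reverse order, from high to low, and watch the classifier's score.
2. PCA is unsupervised. It ranks the principal components by explained variance ratio and discards those whose ratio falls clearly below some threshold. Because it ignores the class labels, it can throw away low-variance directions that separate the classes, so it works poorly on some datasets.
3. LDA is supervised. It projects the data onto a lower-dimensional space so that the projected points of each class lie as close together as possible, while the class centers lie as far apart as possible. It works well when the number of classes is much smaller than the number of features. With k classes it can reduce to at most k-1 dimensions. Note that the data should not be preprocessed before LDA here, since doing so may lower classification accuracy; LDA itself is insensitive to per-feature rescaling, so standardization is not required.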
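
The k-1 limit in item 3 can be seen directly. The following is a minimal sketch, again assuming scikit-learn's bundled Wine data (3 classes, 13 features) instead of the UCI file; recent scikit-learn versions reject a larger n_components with a ValueError, which the sketch catches.

```python
from sklearn.datasets import load_wine
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Assumption: bundled Wine data, 178 samples, 13 features, 3 classes
X, y = load_wine(return_X_y=True)

n_classes = len(set(y))
max_dims = min(n_classes - 1, X.shape[1])  # upper bound on LDA components

# Reducing to the maximum allowed number of dimensions works
lda = LinearDiscriminantAnalysis(n_components=max_dims)
X_lda = lda.fit_transform(X, y)
print("projected shape:", X_lda.shape)

# Asking for one more component than allowed is rejected
try:
    LinearDiscriminantAnalysis(n_components=max_dims + 1).fit(X, y)
    raised = False
except ValueError as err:
    raised = True
    print("rejected:", err)
```

This is why a loop over LDA dimensions for the Wine data has only two meaningful settings, 2 and 1, no matter how many features the data has.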

import pandas as pd
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
import numpy as np

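# Load the UCI Wine data: column 0 is the class label, columns 1-13 are the features
df = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data", header=None)
x = df.loc[:, 1:].values
y = df.loc[:, 0].values

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=42)

# Fit PCA with all components kept to inspect each component's variance ratio
# (PCA is unsupervised, so no labels are passed; the features are unscaled here)
fullPCA = PCA()
fullPCA.fit(x_train)

print(fullPCA.explained_variance_ratio_)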

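# Reduce to i dimensions with PCA, from high to low, and score the classifier each time
for i in reversed(range(1, x.shape[1])):
    pipe_lr = Pipeline([
        ('scl', StandardScaler()),
        ('pca', PCA(n_components=i)),
        ('clr', LogisticRegression())
    ])
    pipe_lr.fit(x_train, y_train)
    print("Unsupervised reduction, n_components =", i, "score:", pipe_lr.score(x_test, y_test))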

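# Fit LDA with its default number of components to inspect the variance ratio
fullLDA = LinearDiscriminantAnalysis()
fullLDA.fit(x_train, y_train)

print(fullLDA.explained_variance_ratio_)

# LDA keeps at most min(n_classes - 1, n_features) components (2 for the 3 wine classes),
# so loop only over the valid dimensions and pass each one to n_components
max_lda_dims = min(len(np.unique(y_train)) - 1, x.shape[1])
for i in range(max_lda_dims, 0, -1):
    pipe_lr = Pipeline([
        ('reduce_dim', LinearDiscriminantAnalysis(n_components=i)),
        ('clr', LogisticRegression())
    ])
    pipe_lr.fit(x_train, y_train)
    print("Supervised reduction, n_components =", i, "score:", pipe_lr.score(x_test, y_test))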
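
The variance ratios can also be turned into a choice of dimension directly, instead of trying every value in a loop. The sketch below is one possible approach, not part of the original experiment: it accumulates explained_variance_ratio_ on standardized data and keeps the smallest number of components that reaches a threshold. The 0.95 threshold is a hypothetical value chosen for illustration, and the bundled load_wine data is assumed in place of the UCI download.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Assumption: bundled Wine data; standardize first so no single feature dominates
X, _ = load_wine(return_X_y=True)
X_std = StandardScaler().fit_transform(X)

# Keep every component, then accumulate the explained variance ratios
pca = PCA().fit(X_std)
cumulative = np.cumsum(pca.explained_variance_ratio_)

threshold = 0.95  # hypothetical target share of total variance
# Smallest number of components whose cumulative ratio exceeds the threshold
n_keep = int(np.searchsorted(cumulative, threshold, side="right") + 1)
print("cumulative ratios:", cumulative.round(3))
print("components to keep:", n_keep)

# Passing a float in (0, 1) as n_components asks PCA to make the same choice itself
pca_auto = PCA(n_components=threshold, svd_solver="full").fit(X_std)
print("chosen by PCA:", pca_auto.n_components_)
```

A variance threshold says nothing about class separation, so it is still worth confirming the chosen dimension with the classifier score, as the loops above do.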