Classic Datasets: The Iris Dataset


1. Overview

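Anderson's Iris data set, also known as the Iris flower data set or Fisher's Iris data set, is a multivariate data set. It originated as morphological variation data that Edgar Anderson collected from iris flowers on the Gaspé Peninsula in Canada [1]. Ronald Fisher later used it as an example of discriminant analysis [2], which brought it into statistics.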

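The data set contains 150 samples, 50 from each of three species of the genus Iris: Iris setosa, Iris versicolor, and Iris virginica. One species (Iris setosa) is linearly separable from the other two; the other two are not linearly separable from each other.
Each sample has 5 attributes: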

  • Sepal.Length: sepal length, in cm;
  • Sepal.Width: sepal width, in cm;
  • Petal.Length: petal length, in cm;
  • Petal.Width: petal width, in cm;
  • Species: Iris Setosa, Iris Versicolour (the spelling used in the data set description for Iris versicolor), or Iris Virginica.
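As a quick check of the attributes listed above, the following minimal sketch loads the copy bundled with scikit-learn and prints its feature names, class names, and per-class sample counts. The variable name `class_counts` is chosen here for illustration. Note that scikit-learn stores the four measurements in `data` and the species label separately in `target`.

```python
import numpy as np
from sklearn import datasets

iris = datasets.load_iris()

# Four numeric features per sample; the species label is stored separately.
print(iris.feature_names)
print(iris.target_names)
print(iris.data.shape)

# Number of samples per class (labels are the integers 0, 1, 2).
class_counts = np.bincount(iris.target)
print(dict(zip(iris.target_names, class_counts.tolist())))
```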

2. Loading the Iris Dataset

The script below loads the Iris dataset through scikit-learn and splits it into training and test sets.

from sklearn import datasets
from sklearn.model_selection import train_test_split

# Load the Iris dataset (a copy ships with scikit-learn)
iris = datasets.load_iris()
print(iris['DESCR'])  # print the dataset description
x = iris.data         # feature matrix
y = iris.target       # class labels

# Split into training and test sets (40% of the samples go to the test set)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.4)

print('Training data shape:', x_train.shape)
print('Training label shape:', y_train.shape)
print('Test data shape:', x_test.shape)
print('Test label shape:', y_test.shape)

The description that ships with the Iris dataset is reproduced below.

Iris Plants Database
====================

Notes
-----
Data Set Characteristics:
    :Number of Instances: 150 (50 in each of three classes)
    :Number of Attributes: 4 numeric, predictive attributes and the class
    :Attribute Information:
        - sepal length in cm
        - sepal width in cm
        - petal length in cm
        - petal width in cm
        - class:
                - Iris-Setosa
                - Iris-Versicolour
                - Iris-Virginica
    :Summary Statistics:

    ============== ==== ==== ======= ===== ====================
                    Min  Max   Mean    SD   Class Correlation
    ============== ==== ==== ======= ===== ====================
    sepal length:   4.3  7.9   5.84   0.83    0.7826
    sepal width:    2.0  4.4   3.05   0.43   -0.4194
    petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
    petal width:    0.1  2.5   1.20  0.76     0.9565  (high!)
    ============== ==== ==== ======= ===== ====================

    :Missing Attribute Values: None
    :Class Distribution: 33.3% for each of 3 classes.
    :Creator: R.A. Fisher
    :Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
    :Date: July, 1988

This is a copy of UCI ML iris datasets.
http://archive.ics.uci.edu/ml/datasets/Iris

The famous Iris database, first used by Sir R.A Fisher

This is perhaps the best known database to be found in the
pattern recognition literature.  Fisher's paper is a classic in the field and
is referenced frequently to this day.  (See Duda & Hart, for example.)  The
data set contains 3 classes of 50 instances each, where each class refers to a
type of iris plant.  One class is linearly separable from the other 2; the
latter are NOT linearly separable from each other.

References
----------
   - Fisher,R.A. "The use of multiple measurements in taxonomic problems"
     Annual Eugenics, 7, Part II, 179-188 (1936); also in "Contributions to
     Mathematical Statistics" (John Wiley, NY, 1950).
   - Duda,R.O., & Hart,P.E. (1973) Pattern Classification and Scene Analysis.
     (Q327.D83) John Wiley & Sons.  ISBN 0-471-22361-1.  See page 218.
   - Dasarathy, B.V. (1980) "Nosing Around the Neighborhood: A New System
     Structure and Classification Rule for Recognition in Partially Exposed
     Environments".  IEEE Transactions on Pattern Analysis and Machine
     Intelligence, Vol. PAMI-2, No. 1, 67-71.
   - Gates, G.W. (1972) "The Reduced Nearest Neighbor Rule".  IEEE Transactions
     on Information Theory, May 1972, 431-433.
   - See also: 1988 MLC Proceedings, 54-64.  Cheeseman et al"s AUTOCLASS II
     conceptual clustering system finds 3 classes in the data.
   - Many, many more ...
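The description states that one class is linearly separable from the other two. A minimal sketch of what that means in practice: a single threshold on petal length already isolates the first class (setosa). The threshold value 2.5 cm is an illustrative choice made here, not something given in the description. The last check only shows that the same single-feature trick fails for the other two classes; it does not by itself prove that no separating hyperplane exists in four dimensions.

```python
import numpy as np
from sklearn import datasets

iris = datasets.load_iris()
petal_length = iris.data[:, 2]   # third column: petal length in cm
is_setosa = iris.target == 0     # class 0 is setosa

# Illustrative threshold (an assumption for this sketch).
threshold = 2.5
predicted_setosa = petal_length < threshold

# True if the threshold separates setosa from the other two classes exactly.
setosa_separated = np.array_equal(predicted_setosa, is_setosa)
print(setosa_separated)

# Versicolor (1) and virginica (2) overlap on this feature, so a single
# petal-length threshold cannot split them.
overlap = petal_length[iris.target == 1].max() >= petal_length[iris.target == 2].min()
print(overlap)
```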

After the description, the script prints the shapes of the training and test sets.

Training data shape: (90, 4)
Training label shape: (90,)
Test data shape: (60, 4)
Test label shape: (60,)
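The split is random, so the shapes are fixed by `test_size=0.4` (60 of the 150 samples go to the test set) but the class mix in each part varies from run to run. If a reproducible, class-balanced split is wanted, one option is to pass `stratify` and `random_state` to `train_test_split`. The sketch below assumes that is desirable; the seed value 0 is arbitrary.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
x, y = iris.data, iris.target

# stratify=y keeps the class proportions equal in both parts;
# random_state fixes the shuffle so the split is repeatable.
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.4, stratify=y, random_state=0
)

train_counts = np.bincount(y_train)
test_counts = np.bincount(y_test)
print('Training samples per class:', train_counts)
print('Test samples per class:', test_counts)
```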

3. Visualizing the Iris Dataset

import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # noqa: F401 (registers the 3D projection on older Matplotlib)
from sklearn import datasets
from sklearn.decomposition import PCA

# Import some data to play with
iris = datasets.load_iris()
X = iris.data[:, :2]  # use only the first two features (sepal length and width)
y = iris.target

x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + .5
y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + .5

plt.figure(2, figsize=(8, 6))
plt.clf()

# Plot the samples, colored by class
plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.Set1,
            edgecolor='k')
plt.xlabel('Sepal length')
plt.ylabel('Sepal width')

plt.xlim(x_min, x_max)
plt.ylim(y_min, y_max)
plt.xticks(())
plt.yticks(())

# To get a better understanding of how the dimensions interact,
# plot the first three PCA dimensions
fig = plt.figure(1, figsize=(8, 6))
# Create the 3D axes through the figure; constructing Axes3D(fig) directly
# no longer attaches the axes to the figure in recent Matplotlib releases.
ax = fig.add_subplot(111, projection='3d', elev=-150, azim=110)
X_reduced = PCA(n_components=3).fit_transform(iris.data)
ax.scatter(X_reduced[:, 0], X_reduced[:, 1], X_reduced[:, 2], c=y,
           cmap=plt.cm.Set1, edgecolor='k', s=40)
ax.set_title("First three PCA directions")
# ax.xaxis replaces the w_xaxis alias, which recent Matplotlib releases removed
ax.set_xlabel("1st eigenvector")
ax.xaxis.set_ticklabels([])
ax.set_ylabel("2nd eigenvector")
ax.yaxis.set_ticklabels([])
ax.set_zlabel("3rd eigenvector")
ax.zaxis.set_ticklabels([])

plt.show()

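The script produces two figures: a 2D scatter plot of sepal length against sepal width, and a 3D scatter plot of the samples projected onto the first three PCA directions, both colored by class.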
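To judge how much information the three-component PCA projection retains, one can inspect the fitted model's `explained_variance_ratio_` attribute. This is a small sketch added for illustration; the variable name `ratios` is chosen here.

```python
from sklearn import datasets
from sklearn.decomposition import PCA

iris = datasets.load_iris()

# Fit PCA on all four features, keeping three components as in the 3D plot.
pca = PCA(n_components=3).fit(iris.data)

# Fraction of the total variance captured by each component, largest first.
ratios = pca.explained_variance_ratio_
print(ratios)
print('Total variance retained:', ratios.sum())
```

On this data the first component carries most of the variance, which is why the classes already spread out along the first axis of the 3D plot.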

References

  1. Anderson E. The irises of the Gaspé Peninsula[J]. Bulletin of the American Iris Society, 1935, 59: 2-5.
  2. Fisher R A. The use of multiple measurements in taxonomic problems[J]. Annals of Eugenics, 1936, 7(2): 179-188.
  3. https://zh.wikipedia.org/wiki/安德森鸢尾花卉数据集
  4. https://baike.baidu.com/item/IRIS/4061453
  5. http://scikit-learn.org/stable/auto_examples/datasets/plot_iris_dataset.html

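The 进化学习 (Evolutionary Learning) team will keep revising, maintaining, and updating this post based on readers' comments and suggestions. When reposting, please credit the source (进化学习: https://www.evolutionarylearn.com/paper/dataset-iris/).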
