2. Multiclass classification

2.1. Introduction

In this chapter, we will use the ‘Iris-dataset’, which is available in the ‘scikit-learn’ library. We will train a ‘KNeighborsClassifier’ on the training data and then use the trained model to predict the outputs for the test data. Finally, the predicted outputs are compared with the desired outputs.

2.2. Iris-dataset

2.2.1. Load the dataset

Let's look at the Iris-dataset, which has the following features and targets. Listing 2.1 shows how to load and inspect it.

  • Features:

    • sepal length in cm
    • sepal width in cm
    • petal length in cm
    • petal width in cm
  • Targets:

    • Iris Setosa
    • Iris Versicolor
    • Iris Virginica
Listing 2.1 Iris-dataset
>>> from sklearn.datasets import load_iris
>>> iris = load_iris()
>>> iris.keys()
dict_keys(['data', 'target', 'target_names', 'DESCR', 'feature_names'])
>>> iris.feature_names
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
>>> iris.target_names
array(['setosa', 'versicolor', 'virginica'],
      dtype='<U10')
>>> iris.data.shape  # 150 samples with 4 features
(150, 4)
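
As a quick check on the loaded data, the sketch below counts how many samples belong to each target. It relies only on ‘np.bincount’; the variable name ‘counts’ is illustrative and not part of Listing 2.1.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()

# iris.target holds one integer label (0, 1 or 2) per sample
counts = np.bincount(iris.target)

# pair each target name with its number of samples
for name, count in zip(iris.target_names, counts):
    print(name, count)
```

Each of the three targets has 50 samples, so every target makes up one third of the dataset.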

2.2.2. Split the data as ‘training’ and ‘test’ data

We have 150 samples in our data. We can divide them into two parts, i.e. a ‘training dataset’ and a ‘test dataset’. A good choice is 80% training data and 20% test data.
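
A minimal sketch of such an 80/20 split is given below, assuming ‘train_test_split’ from ‘sklearn.model_selection’. The value ‘random_state=0’ is an arbitrary choice made here only so that the split is repeatable.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()

# hold out 20% of the 150 samples for testing; the rest is used for training
train_features, test_features, train_targets, test_targets = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=0)

print(train_features.shape)  # (120, 4)
print(test_features.shape)   # (30, 4)
```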

Important

The training dataset must include all the possible ‘targets’; otherwise the machine will not be trained for the missing ‘targets’ and will generate large errors when samples with those targets appear in the test data. The ‘stratify’ argument of ‘train_test_split’ takes care of this by keeping the proportion of each target the same in the training and test data, as shown in Listing 2.2.
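
To illustrate the effect of ‘stratify’, the sketch below compares the number of samples per target in a plain random split and in a stratified split. The seed ‘random_state=0’ and the variable names are assumptions made for this example only.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
features, targets = iris.data, iris.target

# plain random split: target counts in the test set depend on the shuffle
_, _, _, plain_test = train_test_split(
    features, targets, test_size=0.2, random_state=0)

# stratified split: each target keeps its share of the data in both parts
_, _, strat_train, strat_test = train_test_split(
    features, targets, test_size=0.2, random_state=0, stratify=targets)

# minlength=3 keeps a zero entry even if a target is missing from a split
print("plain test counts:      ", np.bincount(plain_test, minlength=3))
print("stratified test counts: ", np.bincount(strat_test, minlength=3))
print("stratified train counts:", np.bincount(strat_train, minlength=3))
```

With 50 samples per target and a 20% test share, the stratified split places 10 samples of each target in the test data and 40 in the training data. The plain split gives no such guarantee.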

Here we will use the ‘KNeighborsClassifier’ class of ‘sklearn’ for training the machine. Save the code in a file named ‘multiclass_ex.py’. Lines 17-27 (the ‘train_test_split’ call) create the training and test datasets. Line 36 instantiates an object of KNeighborsClassifier, and Line 38 fits the model to the training data. Next, the trained model is used to predict the outcome for the test data at Line 40. Finally, the prediction accuracy is calculated at Line 44.

Listing 2.2 Training and test data
# multiclass_ex.py

import numpy as np
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

# load the Iris dataset ('load_iris' is a function that returns the data)
iris = load_iris()

# save features and targets from the 'iris'
features, targets = iris.data, iris.target

# both train_size and test_size are defined when we do not want to
# use all the data for training and testing e.g. in below example we can
# use train_size=0.4 and test_size=0.2
train_features, test_features, train_targets, test_targets = train_test_split(
        features, targets,
        train_size=0.8,
        test_size=0.2,
        # fixed seed, so every run gives the same split; accuracy depends on
        # which samples are selected, e.g. with random_state=10 the accuracy
        # is 1.0 in this example
        random_state=23,
        # keep the same proportion of 'targets' in the training and test data
        stratify=targets
    )

print("Proportion of 'targets' in the dataset")
print("All data:", np.bincount(targets) / float(len(targets)))
print("Training:", np.bincount(train_targets) / float(len(train_targets)))
print("Test:", np.bincount(test_targets) / float(len(test_targets)))


# use KNeighborsClassifier for classification
classifier = KNeighborsClassifier()
# training using 'training data'
classifier.fit(train_features, train_targets) # fit the model for training data
# predict the 'target' for 'test data'
prediction_targets = classifier.predict(test_features)

# check the accuracy of the model
print("Accuracy:", end=' ')
print(np.sum(prediction_targets == test_targets) / float(len(test_targets)))

  • The output of the code is shown below,
$ python multiclass_ex.py

Proportion of 'targets' in the dataset
All data: [ 0.33333333  0.33333333  0.33333333]
Training: [ 0.33333333  0.33333333  0.33333333]
Test: [ 0.33333333  0.33333333  0.33333333]

Accuracy: 0.933333333333
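
The accuracy above is the fraction of test samples whose predicted target matches the desired target. As a hedged sketch, the same quantity can also be obtained with scikit-learn's ‘accuracy_score’ function or the classifier's ‘score’ method. The split parameters below repeat those of Listing 2.2; the names ‘manual’, ‘from_metric’ and ‘from_score’ are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
train_features, test_features, train_targets, test_targets = train_test_split(
    iris.data, iris.target, train_size=0.8, test_size=0.2,
    random_state=23, stratify=iris.target)

classifier = KNeighborsClassifier()
classifier.fit(train_features, train_targets)
prediction_targets = classifier.predict(test_features)

# three ways to compute the fraction of correct predictions
manual = np.sum(prediction_targets == test_targets) / float(len(test_targets))
from_metric = accuracy_score(test_targets, prediction_targets)
from_score = classifier.score(test_features, test_targets)

print(manual, from_metric, from_score)
```

All three values agree, so the library helpers can replace the manual computation when that is more convenient.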

Note

We need to follow the steps below for training and testing the machine,

  • Get the inputs, i.e. ‘features’, from the dataset.
  • Get the desired outputs, i.e. ‘targets’, from the dataset.
  • Next, split the dataset into ‘training’ and ‘test’ data.
  • Then train the model using the ‘fit’ method on the ‘training’ data.
  • Finally, predict the outputs for the ‘test data’, and print and plot the outputs in different formats. The printing and plotting operations will be discussed in the next chapter.
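
The steps above can be wrapped in a small helper, as sketched below. The function name ‘split_fit_score’ is hypothetical, and the seeds are arbitrary. The loop repeats the steps for several ‘random_state’ values because, as noted in the comments of Listing 2.2, the accuracy depends on which samples end up in the test data.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier


def split_fit_score(features, targets, random_state):
    """Split the data, fit a KNeighborsClassifier, return the test accuracy."""
    train_x, test_x, train_y, test_y = train_test_split(
        features, targets, test_size=0.2,
        random_state=random_state, stratify=targets)
    classifier = KNeighborsClassifier()
    classifier.fit(train_x, train_y)
    # score() returns the fraction of correctly predicted test samples
    return classifier.score(test_x, test_y)


iris = load_iris()
features, targets = iris.data, iris.target

# the seeds are arbitrary; each one selects a different test set
accuracies = {seed: split_fit_score(features, targets, seed)
              for seed in (0, 10, 23)}
for seed, accuracy in accuracies.items():
    print("random_state =", seed, "accuracy =", accuracy)
```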

2.3. Conclusion

In this chapter, we learned to split the dataset into ‘training’ and ‘test’ data. The training data is then used to fit the model, and finally the model is used to predict the outputs for the test data in a ‘classification problem’. In the next chapter, we will discuss the ‘binary classification problem’. Also, we will read the data from a file instead of using a built-in dataset of scikit-learn.