12. More examples of Supervised learning

12.1. Introduction

In this chapter, some more examples of supervised learning are presented.

12.2. Visualizing the Iris dataset

In this section, we will visualize the Iris dataset, which is available in scikit-learn, using ‘numpy’ and ‘matplotlib’.

12.2.1. Load the Iris dataset

  • First, load the dataset and quickly have a look at its contents,
# visualization_ex1.py

# plotting the Iris dataset

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

iris = load_iris() # load the iris dataset
print("Keys:", iris.keys()) # print keys of dataset

# shape of data and target
print("Data shape", iris.data.shape) # (150, 4)
print("Target shape", iris.target.shape) # (150,)

print("data:", iris.data[:4]) # first 4 elements

# unique targets
print("Unique targets:", np.unique(iris.target)) # [0, 1, 2]
# counts of each target
print("Bin counts for targets:", np.bincount(iris.target))

print("Feature names:", iris.feature_names)
print("Target names:", iris.target_names)
  • Below is the output of the above code,
$ python visualization_ex1.py

Keys: dict_keys(['data', 'target', 'target_names', 'DESCR', 'feature_names'])

Data shape (150, 4)

Target shape (150,)

data: [[ 5.1  3.5  1.4  0.2]
 [ 4.9  3.   1.4  0.2]
 [ 4.7  3.2  1.3  0.2]
 [ 4.6  3.1  1.5  0.2]]

Unique targets: [0 1 2]

Bin counts for targets: [50 50 50]

Feature names: ['sepal length (cm)', 'sepal width (cm)',
                'petal length (cm)', 'petal width (cm)']

Target names: ['setosa' 'versicolor' 'virginica']
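
  • The ‘DESCR’ key seen in the output above holds a plain-text description of the dataset. As an optional check (not part of the original example), a part of it can be printed by adding the lines below to ‘visualization_ex1.py’,
# optional sketch: peek at the dataset description via the 'DESCR' key
print(iris.DESCR[:200]) # print only the first 200 characters (the length is arbitrary)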

12.2.2. Histogram

  • Let’s plot the histogram of the ‘targets’ with respect to each feature of the dataset,
# visualization_ex1.py

# plotting the Iris dataset

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

iris = load_iris() # load the iris dataset
# print("Keys:", iris.keys()) # print keys of dataset

# # shape of data and target
# print("Data shape", iris.data.shape) # (150, 4)
# print("Target shape", iris.target.shape) # (150,)

# print("data:", iris.data[:4]) # first 4 elements

# # unique targets
# print("Unique targets:", np.unique(iris.target)) # [0, 1, 2]
# # counts of each target
# print("Bin counts for targets:", np.bincount(iris.target))

# print("Feature names:", iris.feature_names)
# print("Target names:", iris.target_names)

colors = ['blue', 'red', 'green']
# plot histogram
for feature in range(iris.data.shape[1]): # (shape = 150, 4)
    plt.subplot(2, 2, feature+1) # subplot starts from 1 (not 0)
    for label, color in zip(range(len(iris.target_names)), colors):
        # find the label and plot the corresponding data
        plt.hist(iris.data[iris.target==label, feature],
                 label=iris.target_names[label],
                 color=color)
    plt.xlabel(iris.feature_names[feature])
    plt.legend()
plt.show()
  • The Fig. 12.1 shows the histogram of the targets with respect to each feature. We can clearly see that the feature ‘petal width’ can distinguish the targets better than the other features (a small numerical check is shown after Fig. 12.1).
../_images/iris_hist.png

Fig. 12.1 Histogram of targets with respect to each feature
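
  • To back up this visual impression with numbers, below is a small optional sketch (not part of the original example) which prints the per-class mean and standard deviation of each feature; a feature separates the targets well when the class means are far apart compared to the spreads,
# separability_check.py (optional sketch, not part of the original example)

import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
# per-class mean and standard deviation of each feature
for feature in range(iris.data.shape[1]):
    print(iris.feature_names[feature])
    for label in range(len(iris.target_names)):
        values = iris.data[iris.target == label, feature]
        print("  {0}: mean = {1:.2f}, std = {2:.2f}".format(
              iris.target_names[label], values.mean(), values.std()))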

12.2.3. Scatter plot

  • Now, we will plot the scatter-plot between ‘petal-width’ and ‘all other features’.
# visualization_ex1.py

# plotting the Iris dataset

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

iris = load_iris() # load the iris dataset
# print("Keys:", iris.keys()) # print keys of dataset

# # shape of data and target
# print("Data shape", iris.data.shape) # (150, 4)
# print("Target shape", iris.target.shape) # (150,)

# print("data:", iris.data[:4]) # first 4 elements

# # unique targets
# print("Unique targets:", np.unique(iris.target)) # [0, 1, 2]
# # counts of each target
# print("Bin counts for targets:", np.bincount(iris.target))

# print("Feature names:", iris.feature_names)
# print("Target names:", iris.target_names)

colors = ['blue', 'red', 'green']
# # plot histogram
# for feature in range(iris.data.shape[1]): # (shape = 150, 4)
    # plt.subplot(2, 2, feature+1) # subplot starts from 1 (not 0)
    # for label, color in zip(range(len(iris.target_names)), colors):
        # # find the label and plot the corresponding data
        # plt.hist(iris.data[iris.target==label, feature],
                 # label=iris.target_names[label],
                 # color=color)
    # plt.xlabel(iris.feature_names[feature])
    # plt.legend()

# plot scatter plot : petal-width vs all features
feature_x= 3 # petal width
for feature_y in range(iris.data.shape[1]):
    plt.subplot(2, 2, feature_y+1) # subplot starts from 1 (not 0)
    for label, color in zip(range(len(iris.target_names)), colors):
        # find the label and plot the corresponding data
        plt.scatter(iris.data[iris.target==label, feature_x],
                    iris.data[iris.target==label, feature_y],
                    label=iris.target_names[label],
                    alpha = 0.45, # transparency
                    color=color)
    plt.xlabel(iris.feature_names[feature_x])
    plt.ylabel(iris.feature_names[feature_y])
    plt.legend()
plt.show()
  • The Fig. 12.2 shows the scatter plots between ‘petal width’ and ‘all other features’. Here we can see that the ‘setosa’ samples can be clearly distinguished from ‘versicolor’ and ‘virginica’; but ‘versicolor’ and ‘virginica’ cannot be completely separated from each other with any combination of the ‘x’ and ‘y’ axes (see also the sketch after Fig. 12.2).
../_images/iris_scat.png

Fig. 12.2 Scatter plot : ‘petal-width’ vs ‘all other features’
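
  • The overlap between ‘versicolor’ and ‘virginica’ can also be checked numerically; the optional sketch below (not part of the original example) prints the range of ‘petal width’ for each class, and the ranges of these two classes should overlap, while ‘setosa’ should stay well separated,
# overlap_check.py (optional sketch, not part of the original example)

from sklearn.datasets import load_iris

iris = load_iris()
feature = 3 # petal width
# print the range of 'petal width' for each class
for label in range(len(iris.target_names)):
    values = iris.data[iris.target == label, feature]
    print("{0}: petal width from {1} to {2}".format(
          iris.target_names[label], values.min(), values.max()))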

12.2.4. Scatter-matrix plot

  • In Fig. 12.2, we plotted the scatter plots between ‘petal width’ and ‘all other features’; however, many other combinations are still possible, e.g. ‘petal length’ and ‘all other features’. The Pandas library provides the method ‘scatter_matrix’, which plots the scatter plots for all possible combinations, along with the histograms, as shown below,
# visualization_ex1.py

# plotting the Iris dataset

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

iris = load_iris() # load the iris dataset
# print("Keys:", iris.keys()) # print keys of dataset

# # shape of data and target
# print("Data shape", iris.data.shape) # (150, 4)
# print("Target shape", iris.target.shape) # (150,)

# print("data:", iris.data[:4]) # first 4 elements

# # unique targets
# print("Unique targets:", np.unique(iris.target)) # [0, 1, 2]
# # counts of each target
# print("Bin counts for targets:", np.bincount(iris.target))

# print("Feature names:", iris.feature_names)
# print("Target names:", iris.target_names)

# colors = ['blue', 'red', 'green']
# # plot histogram
# for feature in range(iris.data.shape[1]): # (shape = 150, 4)
    # plt.subplot(2, 2, feature+1) # subplot starts from 1 (not 0)
    # for label, color in zip(range(len(iris.target_names)), colors):
        # # find the label and plot the corresponding data
        # plt.hist(iris.data[iris.target==label, feature],
                 # label=iris.target_names[label],
                 # color=color)
    # plt.xlabel(iris.feature_names[feature])
    # plt.legend()

# plot scatter plot : petal-width vs all features
# feature_x= 3 # petal width
# for feature_y in range(iris.data.shape[1]):
    # plt.subplot(2, 2, feature_y+1) # subplot starts from 1 (not 0)
    # for label, color in zip(range(len(iris.target_names)), colors):
        # # find the label and plot the corresponding data
        # plt.scatter(iris.data[iris.target==label, feature_x],
                    # iris.data[iris.target==label, feature_y],
                    # label=iris.target_names[label],
                    # alpha = 0.45, # transparency
                    # color=color)
    # plt.xlabel(iris.feature_names[feature_x])
    # plt.ylabel(iris.feature_names[feature_y])
    # plt.legend()

# create Pandas-dataframe
iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)
# print(iris_df.head())
pd.plotting.scatter_matrix(iris_df, c=iris.target, figsize=(8, 8));
plt.show()
  • Below are the histograms and scatter plots generated by the above code,
../_images/iris_scatter_matrix.png

Fig. 12.3 Scatter matrix for Iris dataset
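
  • Note that newer versions of scikit-learn (0.23 or later) can return the dataset directly as a Pandas DataFrame, which avoids creating the DataFrame manually; below is a minimal sketch, assuming such a version is installed,
# scatter_matrix_frame.py (optional sketch, requires scikit-learn >= 0.23)

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True) # 'data' is returned as a Pandas DataFrame
pd.plotting.scatter_matrix(iris.data, c=iris.target, figsize=(8, 8))
plt.show()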

12.2.5. Fit a model and test accuracy

  • Next, split the data into ‘training’ and ‘test’ sets. Then, we will fit the training data to the ‘KNeighborsClassifier’ model, and check the accuracy of the model on the test data.
# visualization_ex1.py

# plotting the Iris dataset

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris() # load the iris dataset
# print("Keys:", iris.keys()) # print keys of dataset

# # shape of data and target
# print("Data shape", iris.data.shape) # (150, 4)
# print("Target shape", iris.target.shape) # (150,)

# print("data:", iris.data[:4]) # first 4 elements

# # unique targets
# print("Unique targets:", np.unique(iris.target)) # [0, 1, 2]
# # counts of each target
# print("Bin counts for targets:", np.bincount(iris.target))

# print("Feature names:", iris.feature_names)
# print("Target names:", iris.target_names)

# colors = ['blue', 'red', 'green']
# # plot histogram
# for feature in range(iris.data.shape[1]): # (shape = 150, 4)
    # plt.subplot(2, 2, feature+1) # subplot starts from 1 (not 0)
    # for label, color in zip(range(len(iris.target_names)), colors):
        # # find the label and plot the corresponding data
        # plt.hist(iris.data[iris.target==label, feature],
                 # label=iris.target_names[label],
                 # color=color)
    # plt.xlabel(iris.feature_names[feature])
    # plt.legend()

# plot scatter plot : petal-width vs all features
# feature_x= 3 # petal width
# for feature_y in range(iris.data.shape[1]):
    # plt.subplot(2, 2, feature_y+1) # subplot starts from 1 (not 0)
    # for label, color in zip(range(len(iris.target_names)), colors):
        # # find the label and plot the corresponding data
        # plt.scatter(iris.data[iris.target==label, feature_x],
                    # iris.data[iris.target==label, feature_y],
                    # label=iris.target_names[label],
                    # alpha = 0.45, # transparency
                    # color=color)
    # plt.xlabel(iris.feature_names[feature_x])
    # plt.ylabel(iris.feature_names[feature_y])
    # plt.legend()

# # create Pandas-dataframe
# iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)
# # print(iris_df.head())
# pd.plotting.scatter_matrix(iris_df, c=iris.target, figsize=(8, 8));
# plt.show()


# save 'features' and 'targets' in X and y respectively
X, y = iris.data, iris.target

# split data into 'test' and 'train' data
train_X, test_X, train_y, test_y = train_test_split(X, y,
        train_size=0.5,
        test_size=0.5,
        random_state=23,
        stratify=y
    )

# select classifier
cls = KNeighborsClassifier()
cls.fit(train_X, train_y)

# predict the 'target' for 'test data'
pred_y = cls.predict(test_X)
test_accuracy = accuracy_score(test_y, pred_y)
print("Accuracy for test data:", test_accuracy)
  • Below is the accuracy of the model,
$ python visualization_ex1.py

Accuracy for test data: 0.946666666667
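
  • Note that the accuracy above depends on the particular split selected by ‘random_state’. As an optional check (not part of the original example), the accuracy can also be estimated with cross-validation, which averages the score over several different splits,
# knn_cross_val.py (optional sketch, not part of the original example)

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
# 5-fold cross-validation: accuracy is averaged over 5 different splits
scores = cross_val_score(KNeighborsClassifier(), iris.data, iris.target, cv=5)
print("Cross-validation scores:", scores)
print("Mean accuracy:", scores.mean())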

12.2.6. Plot the incorrect prediction

  • Finally, we will plot the incorrectly detected test samples as shown below,
# visualization_ex1.py

# plotting the Iris dataset

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris() # load the iris dataset
# print("Keys:", iris.keys()) # print keys of dataset

# # shape of data and target
# print("Data shape", iris.data.shape) # (150, 4)
# print("Target shape", iris.target.shape) # (150,)

# print("data:", iris.data[:4]) # first 4 elements

# # unique targets
# print("Unique targets:", np.unique(iris.target)) # [0, 1, 2]
# # counts of each target
# print("Bin counts for targets:", np.bincount(iris.target))

# print("Feature names:", iris.feature_names)
# print("Target names:", iris.target_names)

# colors = ['blue', 'red', 'green']
# # plot histogram
# for feature in range(iris.data.shape[1]): # (shape = 150, 4)
    # plt.subplot(2, 2, feature+1) # subplot starts from 1 (not 0)
    # for label, color in zip(range(len(iris.target_names)), colors):
        # # find the label and plot the corresponding data
        # plt.hist(iris.data[iris.target==label, feature],
                 # label=iris.target_names[label],
                 # color=color)
    # plt.xlabel(iris.feature_names[feature])
    # plt.legend()

# plot scatter plot : petal-width vs all features
# feature_x= 3 # petal width
# for feature_y in range(iris.data.shape[1]):
    # plt.subplot(2, 2, feature_y+1) # subplot starts from 1 (not 0)
    # for label, color in zip(range(len(iris.target_names)), colors):
        # # find the label and plot the corresponding data
        # plt.scatter(iris.data[iris.target==label, feature_x],
                    # iris.data[iris.target==label, feature_y],
                    # label=iris.target_names[label],
                    # alpha = 0.45, # transparency
                    # color=color)
    # plt.xlabel(iris.feature_names[feature_x])
    # plt.ylabel(iris.feature_names[feature_y])
    # plt.legend()

# # create Pandas-dataframe
# iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)
# # print(iris_df.head())
# pd.plotting.scatter_matrix(iris_df, c=iris.target, figsize=(8, 8));
# plt.show()


# save 'features' and 'targets' in X and y respectively
X, y = iris.data, iris.target

# split data into 'test' and 'train' data
train_X, test_X, train_y, test_y = train_test_split(X, y,
        train_size=0.5,
        test_size=0.5,
        random_state=23,
        stratify=y
    )

# select classifier
cls = KNeighborsClassifier()
cls.fit(train_X, train_y)

# predict the 'target' for 'test data'
pred_y = cls.predict(test_X)
# test_accuracy = accuracy_score(test_y, pred_y)
# print("Accuracy for test data:", test_accuracy)

incorrect_idx = np.where(pred_y != test_y)[0] # indices of the wrong predictions
print('Wrongly detected samples:', incorrect_idx)

# scatter plot to show correct and incorrect prediction
# plot scatter plot : sepal-width vs all features
colors = ['blue', 'orange', 'green']
feature_x= 1 # sepal width
for feature_y in range(iris.data.shape[1]):
    plt.subplot(2, 2, feature_y+1) # subplot starts from 1 (not 0)
    for i, color in enumerate(colors):
        # indices for each target i.e. 0, 1 & 2
        idx = np.where(test_y == i)[0]
        # find the label and plot the corresponding data
        plt.scatter(test_X[idx, feature_x],
                    test_X[idx, feature_y],
                    label=iris.target_names[i],
                    alpha = 0.6, # transparency
                    color=color
                    )

    # overwrite the test-data with red-color for wrong prediction
    plt.scatter(test_X[incorrect_idx, feature_x],
            test_X[incorrect_idx, feature_y],
            color="red",
            marker='^',
            alpha=0.5,
            label="Incorrect detection",
            s=120 # size of marker
            )

    plt.xlabel('{0}'.format(iris.feature_names[feature_x]))
    plt.ylabel('{0}'.format(iris.feature_names[feature_y]))
    plt.legend()
plt.show()
  • Results for the above code are shown in Fig. 12.4. In two of the subplots only 3 triangles are visible, as two of them overlap each other; also, the overlapping triangles look darker, since we are using the ‘alpha’ parameter.
../_images/iris_loc_error.png

Fig. 12.4 Correct and incorrect prediction

Overlapped points

The overlapping points can be understood from the results below.

  • The first command below prints the indices of the incorrectly detected samples, and the following commands print the feature values at those indices.
  • The samples ‘test_X[11]’ and ‘test_X[72]’ have the same ‘sepal-width (col 1)’ and ‘petal-width (col 3)’, therefore their two triangles overlap in the scatter plot of “sepal-width vs petal-width”.
  • Similarly, since these two samples have the same ‘sepal-width (col 1)’, their triangles also overlap in the scatter plot of “sepal-width vs sepal-width”.
$ python -i visualization_ex1.py

>>> print(np.where(pred_y != test_y)[0]) # error locations
[11 48 66 72]

>>> test_X[11] # see values at error locations
array([ 6.1,  3. ,  4.9,  1.8])
>>> test_X[48]
array([ 6.3,  2.8,  5.1,  1.5])
>>> test_X[66]
array([ 6.3,  2.7,  4.9,  1.8])
>>> test_X[72]
array([ 6. ,  3. ,  4.8,  1.8])
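
  • To see which classes are confused with each other, we can also print the confusion matrix for the test data, e.g. by adding the lines below to ‘visualization_ex1.py’ (an optional sketch, not part of the original example),
# optional sketch: confusion matrix for the test data
from sklearn.metrics import confusion_matrix
print(confusion_matrix(test_y, pred_y)) # rows: true targets, columns: predicted targets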

12.3. Linear and Nonlinear classification

In this section, we will see the classification boundaries of ‘linear’ and ‘nonlinear’ classification models.

12.3.1. Create ‘make_blobs’ dataset

  • Let’s create a dataset with two centers using ‘make_blobs’, and plot the scatter plot for it,
# make_blob_ex.py

import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs

X, y = make_blobs(centers=2, random_state=0)

print('X.shape (samples x features):', X.shape)
print('y.shape (samples):', y.shape)

print('First 5 samples:\n', X[:5, :])
print('First 5 labels:', y[:5])

plt.scatter(X[y == 0, 0], X[y == 0, 1], c='red', s=40, label='0')
plt.scatter(X[y == 1, 0], X[y == 1, 1], c='green', s=40, label='1')

plt.xlabel('first feature')
plt.ylabel('second feature')
plt.legend()
plt.show()
  • Below is the output of the above code, and Fig. 12.5 shows the scatter plot generated by it,
$ python make_blob_ex.py

X.shape (samples x features): (100, 2)

y.shape (samples): (100,)

First 5 samples:
 [[ 4.21850347  2.23419161]
 [ 0.90779887  0.45984362]
 [-0.27652528  5.08127768]
 [ 0.08848433  2.32299086]
 [ 3.24329731  1.21460627]]

First 5 labels: [1 1 0 0 1]
../_images/mkblob_sct.png

Fig. 12.5 Scatter plot for the make_blobs dataset

12.3.2. Linear classification

Let’s use the model ‘LogisticRegression()’ to perform the linear classification,

# make_blob_ex.py

import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X, y = make_blobs(centers=2, random_state=0)

# print('X.shape (samples x features):', X.shape)
# print('y.shape (samples):', y.shape)

# print('First 5 samples:\n', X[:5, :])
# print('First 5 labels:', y[:5])

# plt.scatter(X[y == 0, 0], X[y == 0, 1], c='red', s=40, label='0')
# plt.scatter(X[y == 1, 0], X[y == 1, 1], c='green', s=40, label='1')

# plt.xlabel('first feature')
# plt.ylabel('second feature')
# plt.legend()
# plt.show()

X_train, X_test, y_train, y_test = train_test_split(X, y,
        test_size=0.2,
        random_state=23,
        stratify=y)

# Linear classifier
cls = LogisticRegression()
cls.fit(X_train, y_train)
prediction = cls.predict(X_test)
score = cls.score(X_test, y_test)
print("Accuracy:", score)
  • Below is the accuracy for the above model,
$ python make_blob_ex.py

Accuracy: 0.9

12.3.3. Classification boundary for linear classifier

Since the model is linear, it will use a straight line to define the classification boundary. The boundary can be drawn using ‘plot_2d_separator’, as shown in the code below,

Listing 12.1 figures.py
# figures.py

import numpy as np
import matplotlib.pyplot as plt

def plot_2d_separator(classifier, X, fill=False, ax=None, eps=None):
    if eps is None:
        eps = X.std() / 2.
    x_min, x_max = X[:, 0].min() - eps, X[:, 0].max() + eps
    y_min, y_max = X[:, 1].min() - eps, X[:, 1].max() + eps
    xx = np.linspace(x_min, x_max, 100)
    yy = np.linspace(y_min, y_max, 100)

    X1, X2 = np.meshgrid(xx, yy)
    X_grid = np.c_[X1.ravel(), X2.ravel()]
    try:
        decision_values = classifier.decision_function(X_grid)
        levels = [0]
        fill_levels = [decision_values.min(), 0, decision_values.max()]
    except AttributeError:
        # no decision_function
        decision_values = classifier.predict_proba(X_grid)[:, 1]
        levels = [.5]
        fill_levels = [0, .5, 1]

    if ax is None:
        ax = plt.gca()
    if fill:
        ax.contourf(X1, X2, decision_values.reshape(X1.shape),
                    levels=fill_levels, colors=['blue', 'red'])
    else:
        ax.contour(X1, X2, decision_values.reshape(X1.shape), levels=levels,
                   colors="black")
    ax.set_xlim(x_min, x_max)
    ax.set_ylim(y_min, y_max)
    ax.set_xticks(())
    ax.set_yticks(())
Listing 12.2 Decision boundary for linear classifier
# make_blob_ex.py

import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from figures import plot_2d_separator

X, y = make_blobs(centers=2, random_state=0)

# print('X.shape (samples x features):', X.shape)
# print('y.shape (samples):', y.shape)

# print('First 5 samples:\n', X[:5, :])
# print('First 5 labels:', y[:5])

# plt.scatter(X[y == 0, 0], X[y == 0, 1], c='red', s=40, label='0')
# plt.scatter(X[y == 1, 0], X[y == 1, 1], c='green', s=40, label='1')

# plt.xlabel('first feature')
# plt.ylabel('second feature')
# plt.legend()
# plt.show()

X_train, X_test, y_train, y_test = train_test_split(X, y,
        test_size=0.2,
        random_state=23,
        stratify=y)

# Linear classifier
cls = LogisticRegression()
cls.fit(X_train, y_train)
prediction = cls.predict(X_test)
score = cls.score(X_test, y_test)
print("Accuracy:", score)

plt.scatter(X_test[y_test == 0, 0], X_test[y_test == 0, 1],
        c='red', s=40, label='0')
plt.scatter(X_test[y_test == 1, 0], X_test[y_test == 1, 1],
        c='green', s=40, label='1')
plot_2d_separator(cls, X_test) # plot the boundary
plt.xlabel('first feature')
plt.ylabel('second feature')
plt.legend()
plt.show()
  • The Fig. 12.6 shows the decision boundary generated by the above code,
../_images/dc_bon_linear_mkblob.png

Fig. 12.6 Decision boundary for linear classifier
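
  • Since ‘LogisticRegression’ is a linear model, the boundary in Fig. 12.6 is simply the line where w0*x0 + w1*x1 + b = 0. The learned coefficients can be inspected directly, e.g. by adding the lines below to ‘make_blob_ex.py’ after fitting the classifier (an optional sketch, not part of the original example),
# optional sketch: inspect the linear decision boundary
w = cls.coef_[0]       # weights w0 and w1
b = cls.intercept_[0]  # bias b
print("Boundary: {0:.2f}*x0 + {1:.2f}*x1 + {2:.2f} = 0".format(w[0], w[1], b))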

12.3.4. Nonlinear classification and boundary

Let’s use a nonlinear classifier, i.e. ‘KNeighborsClassifier’, and see the decision boundary for it,

Listing 12.3 Decision boundary for nonlinear classifier
# make_blob_ex.py

import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from figures import plot_2d_separator

X, y = make_blobs(centers=2, random_state=0)

# print('X.shape (samples x features):', X.shape)
# print('y.shape (samples):', y.shape)

# print('First 5 samples:\n', X[:5, :])
# print('First 5 labels:', y[:5])

# plt.scatter(X[y == 0, 0], X[y == 0, 1], c='red', s=40, label='0')
# plt.scatter(X[y == 1, 0], X[y == 1, 1], c='green', s=40, label='1')

# plt.xlabel('first feature')
# plt.ylabel('second feature')
# plt.legend()
# plt.show()

X_train, X_test, y_train, y_test = train_test_split(X, y,
        test_size=0.2,
        random_state=23,
        stratify=y)

# Linear classifier
# cls = LogisticRegression()

# Nonlinear classifier
cls = KNeighborsClassifier()
cls.fit(X_train, y_train)
prediction = cls.predict(X_test)
score = cls.score(X_test, y_test)
print("Accuracy:", score)

plt.scatter(X_test[y_test == 0, 0], X_test[y_test == 0, 1],
        c='red', s=40, label='0')
plt.scatter(X_test[y_test == 1, 0], X_test[y_test == 1, 1],
        c='green', s=40, label='1')
plot_2d_separator(cls, X_test) # plot the boundary
plt.xlabel('first feature')
plt.ylabel('second feature')
plt.legend()
plt.show()
  • Below is the output of the above code. The Fig. 12.7 shows the nonlinear decision boundary generated by the code.
$ python make_blob_ex.py
Accuracy: 1.0
../_images/dc_bon_nonlinear_mkblob.png

Fig. 12.7 Decision boundary for nonlinear classifier

Note

  • Now, increase the noise (i.e. ‘cluster_std’) in the make_blobs dataset by replacing the ‘make_blobs’ call of Listing 12.3 with the line below, and see the decision boundary again,
X, y = make_blobs(centers=2, random_state=0, cluster_std=2.0)
  • Note that we may get multiple boundaries in nonlinear classification when the noise is high, which will reduce the performance of the system. Those multiple boundaries can be removed by increasing the number of neighbors in the ‘KNeighborsClassifier’ call of Listing 12.3, as shown below,
cls = KNeighborsClassifier(n_neighbors=25)

Warning

Increasing ‘n_neighbors’ in ‘KNeighborsClassifier’ does not always improve the performance; it may reduce the performance as well.

For better results, we should have a higher number of samples, to reduce the variability in the performance metrics.
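
To see this effect, we can compare the test accuracy for several values of ‘n_neighbors’ on the noisy dataset; below is a minimal optional sketch (not part of the original example),

# knn_neighbors_sweep.py (optional sketch, not part of the original example)

from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# noisy dataset with two centers
X, y = make_blobs(centers=2, random_state=0, cluster_std=2.0)
X_train, X_test, y_train, y_test = train_test_split(X, y,
        test_size=0.2, random_state=23, stratify=y)

# compare accuracies for different numbers of neighbors
for n in [1, 5, 15, 25, 45]:
    cls = KNeighborsClassifier(n_neighbors=n)
    cls.fit(X_train, y_train)
    print("n_neighbors = {0:2d}, accuracy = {1}".format(n, cls.score(X_test, y_test)))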