# 12. More examples on Supervised learning

## 12.1. Introduction

In this chapter, some more examples of supervised learning are presented.

## 12.2. Visualizing the Iris dataset

In this section, we will visualize the Iris dataset, which is available in scikit-learn, using ‘numpy’ and ‘matplotlib’.

### 12.2.1. Load the Iris dataset

• First, load the dataset and have a quick look at its contents,

```python
# visualization_ex1.py

# plotting the Iris dataset

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

iris = load_iris()  # load the iris dataset

print("Keys:", iris.keys())  # print keys of dataset

# shape of data and target
print("Data shape", iris.data.shape)      # (150, 4)
print("Target shape", iris.target.shape)  # (150,)

print("data:", iris.data[:4])  # first 4 elements

# unique targets
print("Unique targets:", np.unique(iris.target))  # [0, 1, 2]
# counts of each target
print("Bin counts for targets:", np.bincount(iris.target))

print("Feature names:", iris.feature_names)
print("Target names:", iris.target_names)
```

• Below is the output of the above code,

```text
$ python visualization_ex1.py
Keys: dict_keys(['data', 'target', 'target_names', 'DESCR', 'feature_names'])
Data shape (150, 4)
Target shape (150,)
data: [[ 5.1  3.5  1.4  0.2]
 [ 4.9  3.   1.4  0.2]
 [ 4.7  3.2  1.3  0.2]
 [ 4.6  3.1  1.5  0.2]]
Unique targets: [0 1 2]
Bin counts for targets: [50 50 50]
Feature names: ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
Target names: ['setosa' 'versicolor' 'virginica']
```

### 12.2.2. Histogram

• Let’s plot the histogram of the ‘targets’ with respect to each feature of the dataset,

```python
# visualization_ex1.py

# plotting the Iris dataset

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

iris = load_iris()  # load the iris dataset

# ... print statements of Section 12.2.1 are commented out here ...

colors = ['blue', 'red', 'green']

# plot histogram
for feature in range(iris.data.shape[1]):  # (shape = 150, 4)
    plt.subplot(2, 2, feature+1)  # subplot starts from 1 (not 0)
    for label, color in zip(range(len(iris.target_names)), colors):
        # find the label and plot the corresponding data
        plt.hist(iris.data[iris.target==label, feature],
                 label=iris.target_names[label],
                 color=color)
    plt.xlabel(iris.feature_names[feature])
    plt.legend()
plt.show()
```

• Fig. 12.1 shows the histograms of the targets with respect to each feature. We can clearly see that the feature ‘petal width’ distinguishes the targets better than the other features.

### 12.2.3. Scatter plot

• Now, we will plot the scatter plots between ‘petal width’ and all other features,
```python
# visualization_ex1.py

# plotting the Iris dataset

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

iris = load_iris()  # load the iris dataset

# ... print statements and histogram plot of the previous sections are commented out here ...

colors = ['blue', 'red', 'green']

# plot scatter plot : petal-width vs all features
feature_x = 3  # petal width
for feature_y in range(iris.data.shape[1]):
    plt.subplot(2, 2, feature_y+1)  # subplot starts from 1 (not 0)
    for label, color in zip(range(len(iris.target_names)), colors):
        # find the label and plot the corresponding data
        plt.scatter(iris.data[iris.target==label, feature_x],
                    iris.data[iris.target==label, feature_y],
                    label=iris.target_names[label],
                    alpha=0.45,  # transparency
                    color=color)
    plt.xlabel(iris.feature_names[feature_x])
    plt.ylabel(iris.feature_names[feature_y])
    plt.legend()
plt.show()
```

• Fig. 12.2 shows the scatter plots between ‘petal width’ and all other features. Here we can see that ‘setosa’ can be clearly distinguished from ‘versicolor’ and ‘virginica’; but ‘versicolor’ and ‘virginica’ cannot be completely separated from each other with any combination of ‘x’ and ‘y’ axes.

### 12.2.4. Scatter-matrix plot

• In Fig. 12.2, we plotted the scatter plots between ‘petal width’ and all other features; however, many other combinations are still possible, e.g. ‘petal length’ and all other features. The Pandas library provides the method ‘scatter_matrix’, which plots the scatter plots for all possible feature combinations, along with the histograms, as shown below,

```python
# visualization_ex1.py

# plotting the Iris dataset

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

iris = load_iris()  # load the iris dataset

# ... print statements and plots of the previous sections are commented out here ...

# create Pandas-dataframe
iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)
# print(iris_df.head())
pd.plotting.scatter_matrix(iris_df, c=iris.target, figsize=(8, 8))
plt.show()
```

• Below are the histograms and scatter plots generated by the above code,

### 12.2.5. Fit a model and test accuracy

• Next, split the data into ‘training’ and ‘test’ data. Then, we will fit the training data to the model ‘KNeighborsClassifier’, and check the accuracy of the model on the test data,
```python
# visualization_ex1.py

# plotting the Iris dataset

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()  # load the iris dataset

# ... print statements and plots of the previous sections are commented out here ...

# save 'features' and 'targets' in X and y respectively
X, y = iris.data, iris.target

# split data into 'test' and 'train' data
train_X, test_X, train_y, test_y = train_test_split(X, y,
                                                    train_size=0.5,
                                                    test_size=0.5,
                                                    random_state=23,
                                                    stratify=y)

# select classifier
cls = KNeighborsClassifier()
cls.fit(train_X, train_y)

# predict the 'target' for 'test data'
pred_y = cls.predict(test_X)
test_accuracy = accuracy_score(test_y, pred_y)
print("Accuracy for test data:", test_accuracy)
```

• Below is the accuracy of the model,

```text
$ python visualization_ex1.py
Accuracy for test data: 0.946666666667
```
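• As a quick numeric companion to this accuracy (an illustrative addition, not part of visualization_ex1.py), a confusion matrix shows which classes are being confused; the sketch below assumes the ‘test_y’ and ‘pred_y’ arrays from the listing above,

```python
# confusion-matrix sketch (illustrative addition, not part of visualization_ex1.py)
from sklearn.metrics import confusion_matrix

# rows: true classes (setosa, versicolor, virginica); columns: predicted classes
cm = confusion_matrix(test_y, pred_y)
print(cm)  # the off-diagonal entries are the wrongly detected samples
```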


### 12.2.6. Plot the incorrect prediction

• Finally, we will plot the incorrectly detected test samples, as shown below,
```python
# visualization_ex1.py

# plotting the Iris dataset

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()  # load the iris dataset

# ... print statements and plots of the previous sections are commented out here ...

# save 'features' and 'targets' in X and y respectively
X, y = iris.data, iris.target

# split data into 'test' and 'train' data
train_X, test_X, train_y, test_y = train_test_split(X, y,
                                                    train_size=0.5,
                                                    test_size=0.5,
                                                    random_state=23,
                                                    stratify=y)

# select classifier
cls = KNeighborsClassifier()
cls.fit(train_X, train_y)

# predict the 'target' for 'test data'
pred_y = cls.predict(test_X)
# test_accuracy = accuracy_score(test_y, pred_y)
# print("Accuracy for test data:", test_accuracy)

incorrect_idx = np.where(pred_y != test_y)
print('Wrongly detected samples:', incorrect_idx[0])

# scatter plot to show correct and incorrect prediction
# plot scatter plot : sepal-width vs all features
colors = ['blue', 'orange', 'green']
feature_x = 1  # sepal width
for feature_y in range(iris.data.shape[1]):
    plt.subplot(2, 2, feature_y+1)  # subplot starts from 1 (not 0)
    for i, color in enumerate(colors):
        # indices for each target i.e. 0, 1 & 2
        idx = np.where(test_y == i)[0]
        # find the label and plot the corresponding data
        plt.scatter(test_X[idx, feature_x],
                    test_X[idx, feature_y],
                    label=iris.target_names[i],
                    alpha=0.6,  # transparency
                    color=color)
    # overwrite the test-data with red-color for wrong prediction
    plt.scatter(test_X[incorrect_idx, feature_x],
                test_X[incorrect_idx, feature_y],
                color="red",
                marker='^',
                alpha=0.5,
                label="Incorrect detection",
                s=120  # size of marker
                )
    plt.xlabel('{0}'.format(iris.feature_names[feature_x]))
    plt.ylabel('{0}'.format(iris.feature_names[feature_y]))
    plt.legend()
plt.show()
```
• Results for the above code are shown in Fig. 12.4. In two of the subplots only 3 triangles are visible, as two of them overlap each other; also, the overlapping triangles look darker because we are using the ‘alpha’ parameter.

Overlapping points

The overlapping points can be understood from the results below.

• The first output line below shows the indices of the incorrectly detected test samples.
• The samples ‘test_X[11]’ and ‘test_X[72]’ have the same ‘sepal width (col 1)’ and ‘petal width (col 3)’, therefore their two triangles overlap in the scatter plot “sepal width vs petal width”.
• Similarly, ‘test_X[11]’ and ‘test_X[72]’ have the same ‘sepal width (col 1)’, therefore their triangles also overlap in the scatter plot “sepal width vs sepal width”.
```text
$ python -i visualization_ex1.py
>>> print(np.where(pred_y != test_y)[0])  # error locations
[11 48 66 72]
>>> test_X[11]  # see values at error locations
array([ 6.1,  3. ,  4.9,  1.8])
>>> test_X[48]
array([ 6.3,  2.8,  5.1,  1.5])
>>> test_X[66]
array([ 6.3,  2.7,  4.9,  1.8])
>>> test_X[72]
array([ 6. ,  3. ,  4.8,  1.8])
```

## 12.3. Linear and nonlinear classification

In this section, we will see the classification boundaries of ‘linear’ and ‘nonlinear’ classification models.

### 12.3.1. Create the ‘make_blobs’ dataset

• Let’s create a dataset with ‘make_blobs’ using two centers, and plot the scatter plot for it,

```python
# make_blob_ex.py

import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs

X, y = make_blobs(centers=2, random_state=0)

print('X.shape (samples x features):', X.shape)
print('y.shape (samples):', y.shape)
print('First 5 samples:\n', X[:5, :])
print('First 5 labels:', y[:5])

plt.scatter(X[y == 0, 0], X[y == 0, 1], c='red', s=40, label='0')
plt.scatter(X[y == 1, 0], X[y == 1, 1], c='green', s=40, label='1')
plt.xlabel('first feature')
plt.ylabel('second feature')
plt.legend()
plt.show()
```

• Below is the output of the above code. Fig. 12.5 shows the scatter plot generated by it,

```text
$ python make_blob_ex.py
X.shape (samples x features): (100, 2)
y.shape (samples): (100,)
First 5 samples:
[[ 4.21850347  2.23419161]
[ 0.90779887  0.45984362]
[-0.27652528  5.08127768]
[ 0.08848433  2.32299086]
[ 3.24329731  1.21460627]]
First 5 labels: [1 1 0 0 1]
```


### 12.3.2. Linear classification

Let’s use the model ‘LogisticRegression’ to perform linear classification,

```python
# make_blob_ex.py

import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X, y = make_blobs(centers=2, random_state=0)

# ... print statements and scatter plot of the previous section are commented out here ...

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.2,
                                                    random_state=23,
                                                    stratify=y)

# Linear classifier
cls = LogisticRegression()
cls.fit(X_train, y_train)

prediction = cls.predict(X_test)
score = cls.score(X_test, y_test)
print("Accuracy:", score)
```
• Below is the accuracy for the above model,
```text
$ python make_blob_ex.py
Accuracy: 0.9
```

### 12.3.3. Classification boundary for linear classifier

Since the model is linear, it will use a ‘straight line’ to define the classification boundary. The boundary can be drawn using ‘plot_2d_separator’, as shown in the code below,

Listing 12.1 figures.py

```python
# figures.py

import numpy as np
import matplotlib.pyplot as plt


def plot_2d_separator(classifier, X, fill=False, ax=None, eps=None):
    if eps is None:
        eps = X.std() / 2.
    x_min, x_max = X[:, 0].min() - eps, X[:, 0].max() + eps
    y_min, y_max = X[:, 1].min() - eps, X[:, 1].max() + eps
    xx = np.linspace(x_min, x_max, 100)
    yy = np.linspace(y_min, y_max, 100)

    X1, X2 = np.meshgrid(xx, yy)
    X_grid = np.c_[X1.ravel(), X2.ravel()]
    try:
        decision_values = classifier.decision_function(X_grid)
        levels = [0]
        fill_levels = [decision_values.min(), 0, decision_values.max()]
    except AttributeError:
        # no decision_function
        decision_values = classifier.predict_proba(X_grid)[:, 1]
        levels = [.5]
        fill_levels = [0, .5, 1]

    if ax is None:
        ax = plt.gca()
    if fill:
        ax.contourf(X1, X2, decision_values.reshape(X1.shape),
                    levels=fill_levels, colors=['blue', 'red'])
    else:
        ax.contour(X1, X2, decision_values.reshape(X1.shape),
                   levels=levels, colors="black")
    ax.set_xlim(x_min, x_max)
    ax.set_ylim(y_min, y_max)
    ax.set_xticks(())
    ax.set_yticks(())
```

Listing 12.2 Decision boundary for linear classifier

```python
# make_blob_ex.py

import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

from figures import plot_2d_separator

X, y = make_blobs(centers=2, random_state=0)

# ... print statements and scatter plot of the previous sections are commented out here ...

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.2,
                                                    random_state=23,
                                                    stratify=y)

# Linear classifier
cls = LogisticRegression()
cls.fit(X_train, y_train)

prediction = cls.predict(X_test)
score = cls.score(X_test, y_test)
print("Accuracy:", score)

plt.scatter(X_test[y_test == 0, 0], X_test[y_test == 0, 1],
            c='red', s=40, label='0')
plt.scatter(X_test[y_test == 1, 0], X_test[y_test == 1, 1],
            c='green', s=40, label='1')
plot_2d_separator(cls, X_test)  # plot the boundary

plt.xlabel('first feature')
plt.ylabel('second feature')
plt.legend()
plt.show()
```

• Fig. 12.6 shows the decision boundary generated by the above code,

### 12.3.4. Nonlinear classification and boundary

Let’s use a nonlinear classifier, i.e. ‘KNeighborsClassifier’, and see the decision boundary for it,

Listing 12.3 Decision boundary for nonlinear classifier

```python
# make_blob_ex.py

import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

from figures import plot_2d_separator

X, y = make_blobs(centers=2, random_state=0)

# ... print statements and scatter plot of the previous sections are commented out here ...

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.2,
                                                    random_state=23,
                                                    stratify=y)

# Linear classifier
# cls = LogisticRegression()

# Nonlinear classifier
cls = KNeighborsClassifier()
cls.fit(X_train, y_train)

prediction = cls.predict(X_test)
score = cls.score(X_test, y_test)
print("Accuracy:", score)

plt.scatter(X_test[y_test == 0, 0], X_test[y_test == 0, 1],
            c='red', s=40, label='0')
plt.scatter(X_test[y_test == 1, 0], X_test[y_test == 1, 1],
            c='green', s=40, label='1')
plot_2d_separator(cls, X_test)  # plot the boundary

plt.xlabel('first feature')
plt.ylabel('second feature')
plt.legend()
plt.show()
```

• Below is the output of the above code. Fig. 12.7 shows the nonlinear decision boundary generated by it,

```text
$ python make_blob_ex.py
Accuracy: 1.0
```
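• For contrast with the nonlinear boundary above: the boundary drawn for the linear model of Listing 12.2 (Fig. 12.6) is just a straight line, whose equation can be read from the fitted model. Below is a minimal sketch (an illustrative addition, not part of make_blob_ex.py); it assumes ‘cls’ is the LogisticRegression fitted in Listing 12.2,

```python
# linear-boundary sketch (illustrative addition, not part of make_blob_ex.py)
# assumes 'cls' is the LogisticRegression model fitted in Listing 12.2
w = cls.coef_[0]        # weights for the two features
b = cls.intercept_[0]   # bias term
# the straight-line boundary of Fig. 12.6 is:  w[0]*x0 + w[1]*x1 + b = 0
print("Boundary: {:.3f}*x0 + {:.3f}*x1 + {:.3f} = 0".format(w[0], w[1], b))
```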


Note

• Now, increase the noise (i.e. ‘cluster_std’) in the make_blobs dataset by replacing the ‘make_blobs(...)’ call of Listing 12.3 with the line below, and see the decision boundary again,

```python
X, y = make_blobs(centers=2, random_state=0, cluster_std=2.0)
```

• Note that we may get multiple boundaries in nonlinear classification when the noise is high, which will reduce the performance of the system. Those multiple boundaries can be removed by increasing the number of neighbors of the ‘KNeighborsClassifier’ in Listing 12.3, as shown below,

```python
cls = KNeighborsClassifier(n_neighbors=25)
```
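• To see this effect directly, below is a minimal sketch (an illustrative addition, not part of make_blob_ex.py) that reuses ‘plot_2d_separator’ from Listing 12.1 and draws the boundary for a few values of ‘n_neighbors’ on the noisy data,

```python
# knn_boundary_sketch.py (illustrative addition, not part of the original listings)
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

from figures import plot_2d_separator  # Listing 12.1

X, y = make_blobs(centers=2, random_state=0, cluster_std=2.0)  # noisy dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=23, stratify=y)

for i, k in enumerate([1, 5, 25]):  # small to large number of neighbors
    cls = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    ax = plt.subplot(1, 3, i+1)
    ax.scatter(X_test[:, 0], X_test[:, 1], c=y_test, s=40)
    plot_2d_separator(cls, X_test, ax=ax)  # draw the decision boundary
    ax.set_title("n_neighbors={}, score={:.2f}".format(k, cls.score(X_test, y_test)))
plt.show()
```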


Warning

Increasing the ‘n_neighbors’ in ‘KNeighborsClassifier’ does not mean that it will increase the performance all the time. It may reduce the performance as well.

For better results, we need a larger number of samples to reduce the variability in the performance metrics.
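• One way to reduce this variability with the same amount of data is cross-validation, which averages the score over several train/test splits. Below is a minimal sketch (an illustrative addition, not part of make_blob_ex.py), reusing the noisy make_blobs data and the ‘KNeighborsClassifier’ settings from the Note above,

```python
# cross_validation_sketch.py (illustrative addition, not part of the original listings)
from sklearn.datasets import make_blobs
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_blobs(centers=2, random_state=0, cluster_std=2.0)  # noisy dataset
cls = KNeighborsClassifier(n_neighbors=25)

scores = cross_val_score(cls, X, y, cv=5)  # accuracy on 5 different splits
print("Fold accuracies:", scores)
print("Mean accuracy:", scores.mean())
```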