3. Binary classification

3.1. Introduction

In Chapter 2, we saw an example of ‘classification’, which was performed on a dataset that is already available in SciKit. In this chapter, we will read the data from an external file. Here the “Hill-Valley” dataset is used, which is available at the UCI Repository and contains 100 input points (i.e. features) per sample. Based on these points, the output (i.e. ‘target’) is assigned one of two values, i.e. “1 for Hill” or “0 for Valley”. Fig. 3.1 shows the graph of these points for a hill and a valley. Further, we will use the “LogisticRegression” model for classification in this chapter. It is a linear model, which finds a line to separate the ‘hill’ from the ‘valley’.

Note that different versions of the dataset are available on the website, i.e. with noise and without noise. In this chapter, we will use the dataset without any noise. Lastly, we can download other datasets from the website according to our study, e.g. data for regression problems, classification problems or mixed problems.


Fig. 3.1 Hill and valley according to the input points

3.2. Dataset

Let’s quickly see the contents of the dataset “Hill_Valley_without_noise_Training.data”, as shown in Listing 3.1. Fig. 3.2 shows the plot of Rows 10 and 11 of the data, which represent a “hill” and a “valley” respectively.

In Listing 3.1, the data is read, cleaned (i.e. the header line and line-breaks are removed) and changed into the desired format (i.e. a list of lists is made and then converted into a NumPy array). This process is known as data-cleaning and data-transformation, which constitutes 70%-90% of the work in machine-learning tasks.

Listing 3.1 Quick analysis of data in “Hill_Valley_without_noise_Training.data”
# hill_valley.py

# 1:hill, 0:valley

import matplotlib.pyplot as plt
import numpy as np

f = open("data/Hill_Valley_without_noise_Training.data", 'r')
data = f.read()
f.close()

data = data.split() # split on \n
data = data[1:-1] # remove 0th row as it is header

# save data as list i.e. list of list will be created
data_list = []
for d in data:
    # split on comma
    row = d.split(",")
    data_list.append(row)

# convert list into numpy array, as it allows more direct-operations
data_list = np.array(data_list, float)

print("Number of samples:", len(data_list))
print("(row, column):", data_list.shape) # 100 features + 1 target = 101

# print the last value at row = 10
row = 10
row_last_element = data_list[row][-1] # 1:hill, 0:valley
print("data_list[{0}][100]: {1}".format(row,row_last_element)) # 1

# plot row and row+1 i.e 10 and 11 here
plt.subplot(2,1,1) # plot row
plt.plot(data_list[row][:-1], label="row = {}".format(row)) # exclude the target in the last column
plt.legend() # show legends

plt.subplot(2,1,2) # plot row+1
plt.plot(data_list[row+1][:-1], label="row = {}".format(row+1))
plt.legend() # show legends

plt.show()

Following is the output of the above code,

$ python hill_valley.py

Number of samples: 607
(row, column): (607, 101)
data_list[10][100]: 1.0

Fig. 3.2 Plot for data at Rows 10 and 11
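
As a side note, the manual reading and cleaning of Listing 3.1 can also be performed in a single call with NumPy. The snippet below is only a sketch of this alternative (it is not used in the rest of the chapter); it assumes the same file layout, i.e. one header line followed by comma-separated rows,

# alternative loading with NumPy (sketch, not used in the chapter)
import numpy as np

# 'skiprows=1' drops the header line; 'delimiter=","' splits each row on commas
data_array = np.loadtxt("data/Hill_Valley_without_noise_Training.data",
                        delimiter=",", skiprows=1)
print("(row, column):", data_array.shape) # 100 features + 1 target = 101 columns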

3.3. Extract the data i.e. ‘features’ and ‘targets’

In Chapter 2, it was shown that machine-learning tasks require the ‘features’ and ‘targets’. In the current dataset, both are available in combined form, i.e. the ‘target’ is stored at the end of each data sample. Now, our task is to extract the ‘features’ and ‘targets’ into separate variables, so that the further code can be written easily. This can be done as shown in Listing 3.2,

Listing 3.2 Extract the data i.e. ‘features’ and ‘targets’
# hill_valley.py

# 1:hill, 0:valley

import matplotlib.pyplot as plt
import numpy as np

f = open("data/Hill_Valley_without_noise_Training.data", 'r')
data = f.read()
f.close()

data = data.split() # split on \n
data = data[1:-1] # remove 0th row as it is header

# save data as list i.e. list of list will be created
data_list = []
for d in data:
    # split on comma
    row = d.split(",")
    data_list.append(row)

# convert list into numpy array, as it allows more direct-operations
data_list = np.array(data_list, float)

# print("Number of samples:", len(data_list))
# print("(row, column):", data_list.shape) # 100 features + 1 target = 101

# # print the last value at row = 10
# row = 10
# row_last_element = data_list[row][-1] # 1:hill, 0:valley
# print("data_list[{0}][100]: {1}".format(row,row_last_element)) # 1

# # plot row and row+1 i.e 10 and 11 here
# plt.subplot(2,1,1) # plot row
# plt.plot(data_list[row][:-1], label="row = {}".format(row))
# plt.legend() # show legends

# plt.subplot(2,1,2) # plot row+1
# plt.plot(data_list[row+1][:-1], label="row = {}".format(row+1))
# plt.legend() # show legends

# plt.show()


# extract targets
row_sample, col_sample = data_list.shape # extract row and columns in dataset

# features : last column i.e. target value will be removed from the dataset
features = np.zeros((row_sample, col_sample-1), float)
# target : store only last column
targets = np.zeros(row_sample, int)

for i, data in enumerate(data_list):
    targets[i] = data[-1]
    features[i] = data[:-1]
# print(targets)
# print(features)

# recheck the plot
row = 10
plt.subplot(2,1,1) # plot row
plt.plot(features[row], label="row = {}".format(row))
plt.legend() # show legends

plt.subplot(2,1,2) # plot row+1
plt.plot(features[row + 1], label="row = {}".format(row+1))
plt.legend() # show legends

plt.show()
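
Note that the for-loop in Listing 3.2 is written for clarity. Since ‘data_list’ is already a NumPy array, the same extraction can also be done with array slicing; the following is a minimal sketch of that equivalent (not used in the rest of the chapter),

# equivalent extraction using NumPy slicing (sketch)
features = data_list[:, :-1]            # all columns except the last one
targets = data_list[:, -1].astype(int)  # last column only, converted to int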

3.4. Prediction

Once the data is transformed into the desired format, the prediction task is quite straightforward, as shown in Listing 3.3. The following steps are performed for prediction,

  • Split the data into training and test sets (using ‘train_test_split’).
  • Select the classifier for modeling, and fit the training data (‘LogisticRegression’ here).
  • Check the accuracy of the predictions on the training set itself.
  • Finally, check the accuracy of the predictions on the test data.

Note

The ‘accuracy_score’ function is used here to calculate the accuracy of the predictions on both the training and the test data.
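
For this binary problem, ‘accuracy_score’ simply returns the fraction of predictions that match the true targets. The following sketch shows the equivalent NumPy computation, for understanding only (Listing 3.3 itself uses ‘accuracy_score’),

import numpy as np

def manual_accuracy(true_targets, predicted_targets):
    # fraction of samples where the prediction equals the true target
    return np.mean(np.array(true_targets) == np.array(predicted_targets))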

Listing 3.3 Prediction
# hill_valley.py

# 1:hill, 0:valley

import matplotlib.pyplot as plt
import numpy as np

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


f = open("data/Hill_Valley_without_noise_Training.data", 'r')
data = f.read()
f.close()

data = data.split() # split on \n
data = data[1:-1] # remove 0th row as it is header

# save data as list i.e. list of list will be created
data_list = []
for d in data:
    # split on comma
    row = d.split(",")
    data_list.append(row)

# convert list into numpy array, as it allows more direct-operations
data_list = np.array(data_list, float)

# print("Number of samples:", len(data_list))
# print("(row, column):", data_list.shape) # 100 features + 1 target = 101

# # print the last value at row = 10
# row = 10
# row_last_element = data_list[row][-1] # 1:hill, 0:valley
# print("data_list[{0}][100]: {1}".format(row,row_last_element)) # 1

# # plot row and row+1 i.e 10 and 11 here
# plt.subplot(2,1,1) # plot row
# plt.plot(data_list[row][:-1], label="row = {}".format(row))
# plt.legend() # show legends

# plt.subplot(2,1,2) # plot row+1
# plt.plot(data_list[row+1][:-1], label="row = {}".format(row+1))
# plt.legend() # show legends

# plt.show()


# extract targets
row_sample, col_sample = data_list.shape # extract row and columns in dataset

# features : last column i.e. target value will be removed from the dataset
features = np.zeros((row_sample, col_sample-1), float)
# target : store only last column
targets = np.zeros(row_sample, int)

for i, data in enumerate(data_list):
    targets[i] = data[-1]
    features[i] = data[:-1]
# print(targets)
# print(features)

# # recheck the plot
# row = 10
# plt.subplot(2,1,1) # plot row
# plt.plot(features[row], label="row = {}".format(row))
# plt.legend() # show legends

# plt.subplot(2,1,2) # plot row+1
# plt.plot(features[row + 1], label="row = {}".format(row+1))
# plt.legend() # show legends

# plt.show()


# split the training and test data
train_features, test_features, train_targets, test_targets = train_test_split(
        features, targets,
        train_size=0.8,
        test_size=0.2,
        # random but the same for every run; note that the accuracy also
        # depends on the selection of data, e.g. if we put 10 here then the
        # accuracy will be 1.0 in this example
        random_state=23,
        # keep the same proportion of 'targets' in the test and training data
        stratify=targets
    )

# use LogisticRegression
classifier = LogisticRegression()
# training using 'training data'
classifier.fit(train_features, train_targets) # fit the model for training data

# predict the 'target' for 'training data'
prediction_training_targets = classifier.predict(train_features)
self_accuracy = accuracy_score(train_targets, prediction_training_targets)
print("Accuracy for training data (self accuracy):", self_accuracy)

# predict the 'target' for 'test data'
prediction_test_targets = classifier.predict(test_features)
test_accuracy = accuracy_score(test_targets, prediction_test_targets)
print("Accuracy for test data:", test_accuracy)

Following are the results for the above code,

$ python hill_valley.py
Accuracy for training data (self accuracy): 0.997933884298
Accuracy for test data: 1.0
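
As noted in the comments of Listing 3.3, the accuracy depends on how the samples are split into the training and test sets, i.e. on the ‘random_state’ value. The following sketch (not part of Listing 3.3) repeats the split and fit for a few different values; it assumes that the ‘features’ and ‘targets’ variables from the listing are already defined,

# check how the test accuracy varies with the train/test split (sketch)
for seed in (10, 23, 42):
    tr_f, te_f, tr_t, te_t = train_test_split(
            features, targets,
            train_size=0.8, test_size=0.2,
            random_state=seed, stratify=targets
        )
    clf = LogisticRegression()
    clf.fit(tr_f, tr_t) # fit the model for this split
    # 'score' returns the mean accuracy on the given test data
    print("random_state =", seed, ": test accuracy =", clf.score(te_f, te_t))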

Note

In the Iris dataset in Chapter 2, the target depends directly on the input features, i.e. the width and length of the petal and sepal. But in the Hill-Valley problem, the output does not depend directly on the values of the individual inputs, but on the relative positions of certain inputs with respect to all the other inputs.

LogisticRegression assigns a weight to each of the features and then calculates the weighted sum for making decisions, e.g. if the sum is greater than 0 then ‘hill’, and if it is less than 0 then ‘valley’. The coefficients which are assigned to each feature can be seen as below,

$ python -i hill_valley.py

Accuracy for training data (self accuracy): 0.997933884298
Accuracy for test data: 1.0
>>> classifier.coef_
array([[-0.75630448, -0.70813863, -0.64901487, -0.57633845, -0.48687761,
        [...]
        -0.6593235 , -0.719707  , -0.76843887, -0.8077998 , -0.83961794]])
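
This decision rule can be reproduced from the stored coefficients: the weighted sum of the features plus the intercept is positive for ‘hill’ and negative for ‘valley’. The following sketch (run after Listing 3.3, using its ‘classifier’ and ‘test_features’ variables) verifies that this sum matches the model’s ‘decision_function’ and its predictions,

import numpy as np

# weighted sum of the features + intercept, for each test sample
scores = test_features @ classifier.coef_.ravel() + classifier.intercept_

# same values as computed internally by the model
print(np.allclose(scores, classifier.decision_function(test_features))) # True

# sign of the score gives the class: 1 (hill) if score > 0, else 0 (valley)
print(np.array_equal((scores > 0).astype(int), classifier.predict(test_features))) # True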

Also, the KNeighborsClassifier will not work here, as it looks for the training samples whose features are nearest to a given sample, and then decides the class based on their targets. But in the Hill-Valley case, a valley can be at the top of the graph, as shown in Fig. 3.1, or at the bottom of the graph; similarly, a hill can be at the top of the graph or at the bottom. Therefore it is not possible to find nearest points for the Hill-Valley problem which can distinguish a hill from a valley. Hence, KNeighborsClassifier will have an accuracy_score of about 0.5 (i.e. a random guess). We can verify this by importing “KNeighborsClassifier” and replacing “LogisticRegression” with “KNeighborsClassifier” in Listing 3.3, as sketched below.
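
Only the import and the classifier line of Listing 3.3 need to be changed for this check; a sketch of the modification is shown below (the rest of the listing stays the same),

from sklearn.neighbors import KNeighborsClassifier

# replace "classifier = LogisticRegression()" in Listing 3.3 with,
classifier = KNeighborsClassifier()
# the remaining fit/predict/accuracy code is unchanged; the test accuracy is
# expected to be close to 0.5 (i.e. a random guess), as discussed above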

3.5. Rock vs Mine example

The file “sonar.all-data” contains the patterns obtained by bouncing sonar signals off a metal cylinder, and off rocks under similar conditions. The last column contains the target names, i.e. ‘R’ and ‘M’, where ‘R’ and ‘M’ denote rock and mine (metal cylinder) respectively.

Note

Remember that, in classification problems, the targets must be discrete; they can be ‘string’ or ‘number’ values, as shown in Table 1.4.

As opposed to the previous section, here the ‘targets’ have a direct relationship with the ‘features’; therefore we can use both classifiers, i.e. “LogisticRegression” and “KNeighborsClassifier”, as shown in Listing 3.4.

Since the targets are not numeric values, they are stored in a list instead of a NumPy array, as shown in Listing 3.4. Select any one of the two classifiers, i.e. “LogisticRegression” or “KNeighborsClassifier”, by uncommenting the corresponding line, and run the code to see the prediction accuracy.

Listing 3.4 Rock vs Mine
# rock_mine.py

# 'R': Rock, 'M': Mine

import matplotlib.pyplot as plt
import numpy as np

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier


f = open("data/sonar.all-data", 'r')
data = f.read()
f.close()

data = data.split() # split on \n

# save data as list i.e. list of list will be created
data_list = []
for d in data:
    # split on comma
    row = d.split(",")
    data_list.append(row)

# extract targets
row_sample, col_sample = len(data_list), len(data_list[0])

# features : last column i.e. target value will be removed from the dataset
features = np.zeros((row_sample, col_sample-1), float)
# target : store only last column
targets = []  # targets are 'R' and 'M'

for i, data in enumerate(data_list):
    targets.append(data[-1])
    features[i] = data[:-1]
# print(targets)
# print(features)

# split the training and test data
train_features, test_features, train_targets, test_targets = train_test_split(
        features, targets,
        train_size=0.8,
        test_size=0.2,
        # random but the same for every run; note that the accuracy also
        # depends on the selection of data i.e. on the random_state value
        random_state=23,
        # keep the same proportion of 'targets' in the test and training data
        stratify=targets
    )

# select classifier
classifier = LogisticRegression()
# classifier = KNeighborsClassifier()

# training using 'training data'
classifier.fit(train_features, train_targets) # fit the model for training data

# predict the 'target' for 'training data'
prediction_training_targets = classifier.predict(train_features)
self_accuracy = accuracy_score(train_targets, prediction_training_targets)
print("Accuracy for training data (self accuracy):", self_accuracy)

# predict the 'target' for 'test data'
prediction_test_targets = classifier.predict(test_features)
test_accuracy = accuracy_score(test_targets, prediction_test_targets)
print("Accuracy for test data:", test_accuracy)

Following are the outputs for the above code,

(for LogisticRegression)
$ python rock_mine.py
Accuracy for training data (self accuracy): 0.795180722892
Accuracy for test data: 0.761904761905

(for KNeighborsClassifier)
$ python rock_mine.py
Accuracy for training data (self accuracy): 0.843373493976
Accuracy for test data: 0.785714285714
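
Note that SciKit classifiers accept string targets such as ‘R’ and ‘M’ directly, which is why the list of strings can be passed to ‘fit’ in Listing 3.4. If numeric labels are preferred, they can be generated with ‘LabelEncoder’; the sketch below is an optional alternative, not used in the listing,

from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
# labels are encoded in sorted order, i.e. 'M' -> 0 and 'R' -> 1
numeric_targets = encoder.fit_transform(targets)
print(encoder.classes_) # ['M' 'R']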

3.6. Conclusion

In this chapter, we read the data from a file, and then converted it into the format which is used by the SciKit library for further operations. Further, we used the class ‘LogisticRegression’ for modeling the system, and checked the accuracy of the model for the training and test data.