5. Cross validation

5.1. Introduction

In this chapter, we will enhance Listing 2.2 to understand the concept of ‘cross validation’. Let’s comment out Line 24 of Listing 2.2 (the ‘random_state’ setting) as shown below, and execute the code 7 times.

# multiclass_ex.py

import numpy as np
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

# create object of class 'load_iris'
iris = load_iris()

# save features and targets from the 'iris'
features, targets = iris.data, iris.target

# both train_size and test_size are defined when we do not want to
# use all the data for training and testing e.g. in below example we can
# use train_size=0.4 and test_size=0.2
train_features, test_features, train_targets, test_targets = train_test_split(
        features, targets,
        train_size=0.8,
        test_size=0.2,
        # random but same for all runs; also accuracy depends on the
        # selection of data e.g. if we put 10 then accuracy will be 1.0
        # in this example
        # random_state=23,
        # keep same proportion of 'target' in test and target data
        stratify=targets
    )

# print("Proportion of 'targets' in the dataset")
# print("All data:", np.bincount(train_targets) / float(len(train_targets)))
# print("Training:", np.bincount(train_targets) / float(len(train_targets)))
# print("Training:", np.bincount(test_targets)/ float(len(test_targets)))


# use KNeighborsClassifier for classification
classifier = KNeighborsClassifier()
# training using 'training data'
classifier.fit(train_features, train_targets) # fit the model for training data
# predict the 'target' for 'test data'
prediction_targets = classifier.predict(test_features)

# check the accuracy of the model
print("Accuracy:", end=' ')
print(np.sum(prediction_targets == test_targets) / float(len(test_targets)))
  • Now execute the code 7 times; we will get a different ‘accuracy’ on each run.
$ python multiclass_ex.py
Accuracy: 0.966666666667

$ python multiclass_ex.py
Accuracy: 1.0

$ python multiclass_ex.py
Accuracy: 1.0

$ python multiclass_ex.py
Accuracy: 0.966666666667

$ python multiclass_ex.py
Accuracy: 1.0

$ python multiclass_ex.py
Accuracy: 0.966666666667

$ python multiclass_ex.py
Accuracy: 0.933333333333

Note

  • The ‘accuracy’ may change dramatically for other datasets, depending on which ‘train’ and ‘test’ split is chosen. Therefore it is not a good measure for comparing two models.
  • Also, in this method of finding the accuracy, only a small part of the data is used as the ‘test data’. Further, we have less training data as well, due to the splitting. The sketch at the end of this section quantifies the spread of the accuracy over many random splits.

To avoid these problems, the ‘cross-validation’ method is used for calculating the accuracy.
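
Before moving to cross-validation, the spread of the hold-out accuracy can be quantified with a short sketch. The script below (the file name and the 100-run loop are only for illustration, not one of the numbered listings) repeats the split-train-test cycle many times and prints the mean and standard deviation of the accuracies,

# holdout_variance.py (illustrative sketch)

import numpy as np
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

iris = load_iris()
features, targets = iris.data, iris.target

# repeat the random split 100 times and collect the accuracies
accuracies = []
for _ in range(100):
    train_features, test_features, train_targets, test_targets = train_test_split(
            features, targets,
            train_size=0.8,
            test_size=0.2,
            stratify=targets
        )
    classifier = KNeighborsClassifier()
    classifier.fit(train_features, train_targets)
    accuracies.append(classifier.score(test_features, test_targets))

# the standard deviation shows how much the accuracy moves from
# one random split to the next
print("Mean accuracy:", np.mean(accuracies))
print("Std of accuracy:", np.std(accuracies))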

5.2. Cross validation

In the below code, the cross-validation value is set to 7, i.e. ‘cv=7’ in the call to cross_val_score.

Note

The following operations are performed by the cross-validation method,

  • The ‘cv=7’ will partition the data into 7 parts.
  • Then it will use the ‘first’ part as the ‘test set’ and the others as the ‘training set’.
  • Next, it will use the ‘second’ part as the ‘test set’ and the others as the ‘training set’, and so on.
  • In this way, each sample will be in the test dataset exactly once (the sketch after the outputs below verifies this).
  • Also, in this method, more of the data is used for both training and testing.
  • Lastly, we do not need to split the data manually in the cross-validation method.
# multiclass_ex.py

import numpy as np
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split

# create object of class 'load_iris'
iris = load_iris()

# save features and targets from the 'iris'
features, targets = iris.data, iris.target

# both train_size and test_size are defined when we do not want to
# use all the data for training and testing e.g. in below example we can
# use train_size=0.4 and test_size=0.2
# train_features, test_features, train_targets, test_targets = train_test_split(
        # features, targets,
        # train_size=0.8,
        # test_size=0.2,
        # # random but same for all runs; also accuracy depends on the
        # # selection of data e.g. if we put 10 then accuracy will be 1.0
        # # in this example
        # # random_state=23,
        # # keep same proportion of 'target' in test and target data
        # stratify=targets
    # )

# print("Proportion of 'targets' in the dataset")
# print("All data:", np.bincount(train_targets) / float(len(train_targets)))
# print("Training:", np.bincount(train_targets) / float(len(train_targets)))
# print("Training:", np.bincount(test_targets)/ float(len(test_targets)))


# use KNeighborsClassifier for classification
classifier = KNeighborsClassifier()
# training using 'training data'
# classifier.fit(train_features, train_targets) # fit the model for training data
# predict the 'target' for 'test data'
# prediction_targets = classifier.predict(test_features)

# check the accuracy of the model
# print("Accuracy:", end=' ')
# print(np.sum(prediction_targets == test_targets) / float(len(test_targets)))

# cross-validation
scores = cross_val_score(classifier, features, targets, cv=7)
print("Cross validation scores:", scores)
print("Mean score:", np.mean(scores))
  • Below is the output of the above code, which is the same for each run,
$ python multiclass_ex.py
Cross validation scores: [ 0.95833333  1.  0.95238095
    0.9047619   0.95238095  1.  1. ]
Mean score: 0.966836734694

$ python multiclass_ex.py
Cross validation scores: [ 0.95833333  1.  0.95238095
     0.9047619   0.95238095  1.  1. ]
Mean score: 0.966836734694

$ python multiclass_ex.py
Cross validation scores: [ 0.95833333  1.  0.95238095
     0.9047619   0.95238095  1.  1. ]
Mean score: 0.966836734694
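
To see what ‘cv=7’ is doing internally, the folds can be built explicitly. The sketch below (an illustrative script, not one of the numbered listings) uses StratifiedKFold, which scikit-learn documents as the splitter used by cross_val_score for a classifier when ‘cv’ is an integer, and verifies that every sample lands in a test fold exactly once,

# kfold_inspect.py (illustrative sketch)

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold

iris = load_iris()
features, targets = iris.data, iris.target

# build the same kind of folds as 'cv=7' in cross_val_score
cv = StratifiedKFold(n_splits=7)
times_tested = np.zeros(len(targets), dtype=int)
for fold, (train_index, test_index) in enumerate(cv.split(features, targets)):
    print("Fold", fold, ": train size =", len(train_index),
          ", test size =", len(test_index))
    times_tested[test_index] += 1

# each sample is used for testing exactly once
print("Every sample tested exactly once:", np.all(times_tested == 1))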

5.3. Splitting of data

Warning

  • Note that in cross-validation the data is not split randomly; therefore it can perform poorly on data where the ‘targets’ are nicely arranged, e.g. sorted by class. Therefore, it is good to shuffle the targets before applying ‘cross-validation’, as shown in Listing 5.1; the sketch after this warning demonstrates the problem.
  • Further, cross-validation does not create a model for predicting new samples; it only gives an idea about the accuracy of the model.
  • Lastly, cross-validation takes more time, as the number of training runs increases with ‘cv’; e.g. for cv=7, the data will be split into 7 parts and the model will be trained and tested 7 times, once with each part as the ‘test set’.
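
The first point of this warning can be reproduced in a few lines. In the sketch below (illustrative; the KFold splitter is discussed in Section 5.3.2), the iris targets are left in their sorted order, so a plain KFold without shuffling puts one whole class into each test fold and the classifier never sees that class during training,

# sorted_targets_demo.py (illustrative sketch)

from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score, KFold

iris = load_iris()
features, targets = iris.data, iris.target  # targets are sorted: 0...0 1...1 2...2

classifier = KNeighborsClassifier()

# no shuffling: each test fold contains one class that is absent from
# the training folds, so every score should come out as 0
scores = cross_val_score(classifier, features, targets, cv=KFold(n_splits=3))
print("Without shuffle:", scores)

# shuffling mixes all the classes into every fold
scores = cross_val_score(classifier, features, targets,
                         cv=KFold(n_splits=3, shuffle=True, random_state=0))
print("With shuffle:", scores)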

5.3.1. Manual shuffling

  • Targets can be shuffled manually as shown below; an equivalent built-in helper is sketched after the output.
Listing 5.1 Shuffle the targets
# multiclass_ex.py

import numpy as np
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split

# create object of class 'load_iris'
iris = load_iris()

# save features and targets from the 'iris'
features, targets = iris.data, iris.target

# both train_size and test_size are defined when we do not want to
# use all the data for training and testing e.g. in below example we can
# use train_size=0.4 and test_size=0.2
# train_features, test_features, train_targets, test_targets = train_test_split(
        # features, targets,
        # train_size=0.8,
        # test_size=0.2,
        # # random but same for all runs; also accuracy depends on the
        # # selection of data e.g. if we put 10 then accuracy will be 1.0
        # # in this example
        # # random_state=23,
        # # keep same proportion of 'target' in test and target data
        # stratify=targets
    # )

# print("Proportion of 'targets' in the dataset")
# print("All data:", np.bincount(train_targets) / float(len(train_targets)))
# print("Training:", np.bincount(train_targets) / float(len(train_targets)))
# print("Training:", np.bincount(test_targets)/ float(len(test_targets)))


# use KNeighborsClassifier for classification
classifier = KNeighborsClassifier()
# training using 'training data'
# classifier.fit(train_features, train_targets) # fit the model for training data
# predict the 'target' for 'test data'
# prediction_targets = classifier.predict(test_features)

# check the accuracy of the model
# print("Accuracy:", end=' ')
# print(np.sum(prediction_targets == test_targets) / float(len(test_targets)))

print("Targets before shuffle:\n", targets)
rng = np.random.RandomState(0)
permutation = rng.permutation(len(features))
features, targets = features[permutation], targets[permutation]
print("Targets after shuffle:\n", targets)

# cross-validation
scores = cross_val_score(classifier, features, targets, cv=7)
print("Cross validation scores:", scores)
print("Mean score:", np.mean(scores))
  • Below is the output of the above code. In the iris dataset we have an equal number of samples for each target, therefore the effect of shuffle vs. no-shuffle is almost the same; it may differ when the targets are not equally distributed.
$ python multiclass_ex.py
Targets before shuffle:
 [0 0 0 0 0 0 0 ... 0 0 0 0 0
 1 1 1 1 1 1 1 1 ... 1 1 1 1 1
 2 2 2 2 2 2 2 2 ... 2 2 2 2 2
 ]
Targets after shuffle:
 [2 1 0 2 0 2 0 1 1 1 2 1 1 1 ...
 1 1 1 2 0 2 0 0 1 2 2 2 2 1 2 ...
 1 0 2 1 0 1 2 1 0 2 2 2 2 0 0 ...
 ]

Cross validation scores: [ 1. 0.95238095  0.9047619   1.
    1. 0.95238095 0.95238095]

Mean score: 0.965986394558
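The same effect can be achieved with the helper sklearn.utils.shuffle, which permutes several arrays with one consistent permutation. Below is a short sketch, equivalent in spirit to the manual permutation of Listing 5.1,

# shuffle_helper.py (illustrative sketch)

import numpy as np
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score
from sklearn.utils import shuffle

iris = load_iris()
features, targets = iris.data, iris.target

# shuffle 'features' and 'targets' with the same permutation
features, targets = shuffle(features, targets, random_state=0)

scores = cross_val_score(KNeighborsClassifier(), features, targets, cv=7)
print("Mean score:", np.mean(scores))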

5.3.2. Automatic shuffling (KFold, StratifiedKFold and ShuffleSplit)

The shuffling can also be performed using built-in functions, as shown in the code below.

Note

The data are not reordered in Listing 5.2; instead, the samples are chosen randomly while splitting the data into the ‘training data’ and ‘test data’. The following 3 options are available for splitting (select any one of the three ‘cv = ...’ lines in the listing),

  • KFold(n_splits=3, shuffle=True) : Shuffle the data and split it into 3 equal parts (same effect as Listing 5.1).
  • StratifiedKFold(n_splits=3, shuffle=True) : KFold with the ‘stratify’ option (see Listing 2.2 for details).
  • ShuffleSplit(n_splits=3, test_size=0.2) : Randomly splits the data; it also has the option to define the size of the test data.

Warning

Note that in the Iris dataset the targets are stored sorted by class; therefore if we use the option KFold(n_splits=3), i.e. no shuffling, then we will get an accuracy of ‘0’, as each test fold will contain only the one class that was never seen during training. Hence it is a good idea to keep shuffling on.

Listing 5.2 KFold, StratifiedKFold and ShuffleSplit
# multiclass_ex.py

import numpy as np
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold, StratifiedKFold, ShuffleSplit

# create object of class 'load_iris'
iris = load_iris()

# save features and targets from the 'iris'
features, targets = iris.data, iris.target

# both train_size and test_size are defined when we do not want to
# use all the data for training and testing e.g. in below example we can
# use train_size=0.4 and test_size=0.2
# train_features, test_features, train_targets, test_targets = train_test_split(
        # features, targets,
        # train_size=0.8,
        # test_size=0.2,
        # # random but same for all runs; also accuracy depends on the
        # # selection of data e.g. if we put 10 then accuracy will be 1.0
        # # in this example
        # # random_state=23,
        # # keep same proportion of 'target' in test and target data
        # stratify=targets
    # )

# print("Proportion of 'targets' in the dataset")
# print("All data:", np.bincount(train_targets) / float(len(train_targets)))
# print("Training:", np.bincount(train_targets) / float(len(train_targets)))
# print("Training:", np.bincount(test_targets)/ float(len(test_targets)))


# use KNeighborsClassifier for classification
classifier = KNeighborsClassifier()
# training using 'training data'
# classifier.fit(train_features, train_targets) # fit the model for training data
# predict the 'target' for 'test data'
# prediction_targets = classifier.predict(test_features)

# check the accuracy of the model
# print("Accuracy:", end=' ')
# print(np.sum(prediction_targets == test_targets) / float(len(test_targets)))

# print("Targets before shuffle:\n", targets)
# rng = np.random.RandomState(0)
# permutation = rng.permutation(len(features))
# features, targets = features[permutation], targets[permutation]
# print("Targets after shuffle:\n", targets)

# cross-validation
# cv = KFold(n_splits=3, shuffle=True) # shuffle and divide in 3 equal parts
cv = StratifiedKFold(n_splits=3, shuffle=True) # KFold with 'stratify' option
# # test_size is available in ShuffleSplit
# cv = ShuffleSplit(n_splits=3, test_size=0.2)
scores = cross_val_score(classifier, features, targets, cv=cv)
print("Cross validation scores:", scores)
print("Mean score:", np.mean(scores))

Important

In ‘ShuffleSplit’, the data do not appear in the ‘test set’ equally; across the splits, a sample may be selected for testing several times or not at all.

It is always better to use “KFold with shuffling”, i.e. cv = KFold(n_splits=3, shuffle=True) or cv = StratifiedKFold(n_splits=3, shuffle=True).
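
The difference between the three splitters can also be inspected directly. The sketch below (illustrative, not a numbered listing) prints the class counts of each test fold: StratifiedKFold keeps the classes balanced in every fold, while with ShuffleSplit the counts vary from split to split,

# splitter_compare.py (illustrative sketch)

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, StratifiedKFold, ShuffleSplit

iris = load_iris()
features, targets = iris.data, iris.target

splitters = [
    KFold(n_splits=3, shuffle=True, random_state=0),
    StratifiedKFold(n_splits=3, shuffle=True, random_state=0),
    ShuffleSplit(n_splits=3, test_size=0.2, random_state=0),
]

for cv in splitters:
    print(type(cv).__name__)
    for train_index, test_index in cv.split(features, targets):
        # class counts inside this test fold
        print("  test class counts:", np.bincount(targets[test_index]))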

5.4. Template for comparing algorithms

As discussed before, the main usage of cross-validation is to compare various algorithms. This can be done as shown below, where 4 algorithms (the models appended to the ‘models’ list) are compared.

# cross_valid_ex.py

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import StratifiedKFold

from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# create object of class 'load_iris'
iris = load_iris()

# save features and targets from the 'iris'
features, targets = iris.data, iris.target

models = []
models.append(('LogisticRegression', LogisticRegression()))
models.append(('KNeighborsClassifier', KNeighborsClassifier()))
models.append(('SVC', SVC()))
models.append(('DecisionTreeClassifier', DecisionTreeClassifier()))

# KFold with 'stratify' option
cv = StratifiedKFold(n_splits=7, shuffle=True, random_state=23)
for name, model in models:
    score = cross_val_score(model, features, targets, cv=cv)
    print("Model:{0}, Score: mean={1:0.5f}, var={2:0.5f}".format(
        name,
        score.mean(),
        score.var()
        )
    )
  • Below is the output of the above code, where we can see that SVC performs better than the other algorithms.
$ python cross_valid_ex.py
Model:LogisticRegression, Score: mean=0.96088, var=0.00141
Model:KNeighborsClassifier, Score: mean=0.96088, var=0.00141
Model:SVC, Score: mean=0.97449, var=0.00164
Model:DecisionTreeClassifier, Score: mean=0.95408, var=0.00115

Warning

Note that different values of ‘cv’ will give different results, e.g. if we pass cv=3 to cross_val_score (instead of cv=cv), then we will get the following results, which show that ‘KNeighborsClassifier’ has the best performance.

$ python cross_valid_ex.py
Model:LogisticRegression, Score: mean=0.94690, var=0.00032
Model:KNeighborsClassifier, Score: mean=0.98693, var=0.00009
Model:SVC, Score: mean=0.97345, var=0.00008
Model:DecisionTreeClassifier, Score: mean=0.96732, var=0.00111
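
One way to make such a comparison less sensitive to one particular choice of folds is to repeat the stratified splitting several times with different randomization and average over all runs. Below is a sketch using RepeatedStratifiedKFold from sklearn.model_selection (the number of repeats is an arbitrary illustration),

# repeated_cv_ex.py (illustrative sketch)

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, RepeatedStratifiedKFold

from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
features, targets = iris.data, iris.target

models = [
    ('LogisticRegression', LogisticRegression()),
    ('KNeighborsClassifier', KNeighborsClassifier()),
    ('SVC', SVC()),
    ('DecisionTreeClassifier', DecisionTreeClassifier()),
]

# 7 folds repeated 10 times -> 70 scores per model
cv = RepeatedStratifiedKFold(n_splits=7, n_repeats=10, random_state=23)
for name, model in models:
    score = cross_val_score(model, features, targets, cv=cv)
    print("Model:{0}, Score: mean={1:0.5f}, var={2:0.5f}".format(
        name, score.mean(), score.var()))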