7. Dimensionality reduction

7.1. Introduction

During the data collection process, our aim is to collect as much data as possible. In this process, it is quite possible that some of the ‘features’ are correlated with each other. If the dataset has a large number of features, then it is good to remove some of the correlated features so that the data can be processed faster; but at the same time the accuracy of the model may be reduced.

7.2. Principal component analysis (PCA)

PCA is one of the techniques to reduce the dimensionality of the data, as shown in this section.

7.2.1. Create dataset

  • Let’s create a dataset first,
# dimension_ex.py

import numpy as np
import pandas as pd

# feature values
x = np.random.randn(1000)
y = 2*x
z = np.random.randn(1000)

# target values
t=len(x)*[0] # list of len(x)
for i, val in enumerate(z):
    if x[i]+y[i]+z[i] < 0:
        t[i] = 'N' # negative
    else:
        t[i] = 'P'

# create the dataframe
df = np.column_stack((x, y, z, t))
df = pd.DataFrame(df)
print(df.head())

Warning

The output ‘t’ depends on the variables ‘x’, ‘y’ and ‘z’; therefore, if these variables are not correlated with each other, then dimensionality reduction will result in severe performance degradation, as shown later in this chapter.

  • Following is the output of the above code,
$ python dimension_ex.py
                      0                    1                    2  3
0     1.619558594848966    3.239117189697932  -1.7181741395151733  P
1    0.7926656328473467   1.5853312656946934  -0.5003026519806438  P
2  -0.40666904321652636  -0.8133380864330527  -0.5233957097467451  N
3    -1.813173189559588   -3.626346379119176   -1.418416461398814  N
4    0.4357818365640018   0.8715636731280036   1.7840245820080853  P
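Note that ‘np.column_stack’ converts every column to a string here (since ‘t’ contains strings), which is why the numbers in the above output are stored as strings rather than floats. To verify the correlation mentioned in the Warning, we can keep the numeric features in a separate DataFrame and use the ‘corr’ method; the following is only a small sketch (the variable name is ours, not part of the above listing),

# correlation check (sketch): 'x' and 'y' are perfectly correlated,
# while 'z' is (nearly) uncorrelated with both
features_only = pd.DataFrame({'x': x, 'y': y, 'z': z})
print(features_only.corr())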

7.2.2. Reduce dimension using PCA

Now, we create the PCA model as shown in Listing 7.1, which will transform the above dataset into a new dataset with only 2 features (instead of 3).

Note

PCA can only take ‘numeric features’ as input; therefore we need to ‘drop’ the ‘categorical’ feature before fitting, as shown in Listing 7.1.

Next, we need to instantiate an object of class PCA and then apply the ‘fit’ method.

Finally, we can transform our data using the ‘transform’ method.

Listing 7.1 Dimensionality reduction using PCA
# dimension_ex.py

import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# feature values
x = np.random.randn(1000)
y = 2*x
z = np.random.randn(1000)

# target values
t=len(x)*[0] # list of len(x)
for i, val in enumerate(z):
    if x[i]+y[i]+z[i] < 0:
        t[i] = 'N' # negative
    else:
        t[i] = 'P'

# create the dataframe
df = np.column_stack((x, y, z, t))
df = pd.DataFrame(df)
# print(df.head())

# dataframe for PCA : PCA can not have 'categorical' features
df_temp = df.drop(3, axis=1) # drop 'categorical' feature
pca = PCA(n_components=2) # 2 dimensional PCA
pca.fit(df_temp)
df_pca = pca.transform(df_temp)
print(df_pca)
  • Following is the output of the above code, where the dataset now has only two features,
$ python dimension_ex.py
[[-2.54693351 -0.07879497]
 [ 0.42820972 -0.90158131]
 [-1.94145497 -1.70738801]
 ...,
 [-0.92088711  0.54590025]
 [-2.44899588 -1.403821  ]
 [-1.94568343 -0.50371273]]
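
Further, we can check how much of the original variance is retained by these two new features. The fitted PCA object provides the ‘explained_variance_ratio_’ attribute for this purpose; the lines below are a small sketch which can be added after the ‘transform’ step of Listing 7.1,

# fraction of the total variance captured by each of the 2 components;
# since y = 2*x, the two components retain (almost) all of the variance here
print(pca.explained_variance_ratio_)
print(sum(pca.explained_variance_ratio_))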

7.2.3. Compare the performances

Now, we will compare the performances of the system with and without dimensionality reduction.

Note

Please note the following points in this section,

  • If the features are highly correlated, then the performance after ‘dimensionality reduction’ will be the same as without ‘dimensionality reduction’.
  • If the features have good (but not perfect) correlation, then the performance after ‘dimensionality reduction’ will be slightly lower than without ‘dimensionality reduction’.
  • If the features have no correlation, then the performance after ‘dimensionality reduction’ will be significantly lower than without ‘dimensionality reduction’.

The code added to Listing 7.1 is exactly the same as the code discussed in Listing 3.3, i.e. split the dataset into ‘test’ and ‘training’ parts and then check the score, as shown in the code below.

Here, the first block of Listing 7.2 calculates the score for the ‘without dimensionality reduction’ case, whereas the second block calculates the score for the ‘dimensionality reduction using PCA’ case.

Listing 7.2 Performance comparison with and without dimensionality reduction
# dimension_ex.py

import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


# feature values
x = np.random.randn(1000)
y = 2*x
z = np.random.randn(1000)

# target values
t=len(x)*[0] # list of len(x)
for i, val in enumerate(z):
    if x[i]+y[i]+z[i] < 0:
        t[i] = 'N' # negative
    else:
        t[i] = 'P'

# create the dataframe
df = np.column_stack((x, y, z, t))
df = pd.DataFrame(df)
# print(df.head())

# dataframe for PCA : PCA can not have 'categorical' features
df_temp = df.drop(3, axis=1) # drop 'categorical' feature
pca = PCA(n_components=2) # 2 dimensional PCA
pca.fit(df_temp)
df_pca = pca.transform(df_temp)
# print(df_pca)

# assign targets and features values
# targets
targets = df[3]
# features
features = pd.concat([df[0], df[1], df[2]], axis=1)

#### Results for the without reduction case
# split the training and test data
train_features, test_features, train_targets, test_targets = train_test_split(
        features, targets,
        train_size=0.8,
        test_size=0.2,
        # random but same for all run, also accuracy depends on the
        # selection of data e.g. if we put 10 then accuracy will be 1.0
        # in this example
        random_state=23,
        # keep same proportion of 'target' in train and test data
        stratify=targets
    )

# use LogisticRegression
classifier = LogisticRegression()
# training using 'training data'
classifier.fit(train_features, train_targets) # fit the model for training data

print("Without dimensionality reduction:")
# predict the 'target' for 'training data'
prediction_training_targets = classifier.predict(train_features)
self_accuracy = accuracy_score(train_targets, prediction_training_targets)
print("Accuracy for training data (self accuracy):", self_accuracy)

# predict the 'target' for 'test data'
prediction_test_targets = classifier.predict(test_features)
test_accuracy = accuracy_score(test_targets, prediction_test_targets)
print("Accuracy for test data:", test_accuracy)



#### Results for the 'dimensionality reduction using PCA' case
# updated features after dimensionality reduction
features = df_pca
# split the training and test data
train_features, test_features, train_targets, test_targets = train_test_split(
        features, targets,
        train_size=0.8,
        test_size=0.2,
        # random but same for all run, also accuracy depends on the
        # selection of data e.g. if we put 10 then accuracy will be 1.0
        # in this example
        random_state=23,
        # keep same proportion of 'target' in train and test data
        stratify=targets
    )

# use LogisticRegression
classifier = LogisticRegression()
# training using 'training data'
classifier.fit(train_features, train_targets) # fit the model for training data

print("After dimensionality reduction:")
# predict the 'target' for 'training data'
prediction_training_targets = classifier.predict(train_features)
self_accuracy = accuracy_score(train_targets, prediction_training_targets)
print("Accuracy for training data (self accuracy):", self_accuracy)

# predict the 'target' for 'test data'
prediction_test_targets = classifier.predict(test_features)
test_accuracy = accuracy_score(test_targets, prediction_test_targets)
print("Accuracy for test data:", test_accuracy)
  • Following is the output of the above code.

Note

Since ‘x’ and ‘y’ are completely correlated (i.e. y = 2*x), the performance after dimensionality reduction is exactly the same as in the without-reduction case.

Also, we will get different results on each execution of the code, as ‘x’, ‘y’ and ‘z’ are randomly generated on each run.

$ python dimension_ex.py

Without dimensionality reduction:
Accuracy for training data (self accuracy): 0.99875
Accuracy for test data: 1.0

After dimensionality reduction:
Accuracy for training data (self accuracy): 0.99875
Accuracy for test data: 1.0
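If reproducible numbers are desired across runs, one option (our addition, not part of Listing 7.2) is to fix NumPy’s random seed near the top of the script,

# make the randomly generated 'x' and 'z' identical on every run;
# the value 42 is arbitrary
np.random.seed(42)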
  • Next, replace the value of ‘y’ (i.e. the line ‘y = 2*x’) in Listing 7.2 with the following value, and run the code again,
[...]
y = 2*x + np.random.randn(1000)
[...]

Since noise is now added to ‘x’ when computing ‘y’, the variables ‘x’ and ‘y’ are no longer completely correlated (but still highly correlated); therefore the performance of the system reduces slightly, as shown in the results below,

Note

Remember, the ‘target’ variable depends on ‘x’, ‘y’ and ‘z’, i.e. it is the sign of the sum of these variables. Therefore, as the correlation between the ‘features’ reduces, the performance after dimensionality reduction also reduces.

$ python dimension_ex.py

Without dimensionality reduction:
Accuracy for training data (self accuracy): 0.9925
Accuracy for test data: 0.99

After dimensionality reduction:
Accuracy for training data (self accuracy): 0.9775
Accuracy for test data: 0.97
  • Again, replace the value of ‘y’ in Listing 7.2 with the following value, and run the code again,
[...]
y = np.random.randn(1000)
[...]

Now ‘x’, ‘y’ and ‘z’ are completely independent of each other; therefore the performance reduces significantly, as shown below,

Note

Each run will give a different result; below is a worst-case result, where the test data accuracy is 0.575 (i.e. close to 0.5), which is barely better than a random guess of the target.

$ python dimension_ex.py

Without dimensionality reduction:
Accuracy for training data (self accuracy): 0.995
Accuracy for test data: 0.995

After dimensionality reduction:
Accuracy for training data (self accuracy): 0.64125
Accuracy for test data: 0.575
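
Note that in Listing 7.2 the PCA is fitted on the complete dataset before the train/test split. A cleaner alternative (a sketch of ours, not from the listing) is to chain PCA and the classifier in a scikit-learn ‘Pipeline’, so that PCA is fitted only on the training data; the sketch below reuses ‘df_temp’ and ‘targets’ from Listing 7.2,

from sklearn.pipeline import Pipeline

# split the original (untransformed) numeric features
train_features, test_features, train_targets, test_targets = train_test_split(
        df_temp, targets, train_size=0.8, test_size=0.2,
        random_state=23, stratify=targets
    )

# PCA is fitted only on the training data inside the pipeline,
# and the same transformation is then applied to the test data
model = Pipeline([
    ('pca', PCA(n_components=2)),
    ('clf', LogisticRegression()),
])
model.fit(train_features, train_targets)
print("Accuracy for test data:", model.score(test_features, test_targets))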

7.3. Usage of PCA for dimensionality reduction

Important

Below are the usages of the dimensionality reduction technique,

  • Dimensionality reduction is used to reduce the complexity of data.
  • It allows faster data processing, but may reduce the accuracy of the model.
  • It can be used as a noise reduction process (see the sketch after this list).
  • It can be used as a ‘preprocessor of the data’ for supervised learning processes, i.e. regression and classification.
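
As a rough illustration of the ‘noise reduction’ point above (a sketch of ours, not from the chapter), the components discarded by PCA can be treated as noise: projecting the data onto fewer components and mapping it back with ‘inverse_transform’ gives a smoothed approximation of the original features. The sketch reuses ‘pca’ and ‘df_temp’ from Listing 7.1,

# project the numeric features onto 2 components and map them back;
# the reconstruction keeps the dominant structure and drops the rest
df_denoised = pca.inverse_transform(pca.transform(df_temp))
print(df_denoised.shape) # same shape as 'df_temp', i.e. (1000, 3)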

7.4. PCA limitations

Warning

Note that PCA is very sensitive to the scaling of the features; more specifically, it maximizes variability based on the variances of the features.

For this reason, it gives more weight to ‘high variance’ features, i.e. a high-variance feature will dominate the overall result.

To avoid this problem, it is better to normalize the features before applying the PCA model, as shown in Section 8.4.
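
A minimal sketch of such normalization is shown below; it assumes scikit-learn’s ‘StandardScaler’ and reuses ‘df_temp’ from Listing 7.1 (the exact method used in Section 8.4 may differ),

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# scale each feature to zero mean and unit variance before PCA,
# so that no single high-variance feature dominates the components
scaler = StandardScaler()
df_scaled = scaler.fit_transform(df_temp)

pca = PCA(n_components=2) # 2 dimensional PCA on the scaled features
df_pca = pca.fit_transform(df_scaled)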

7.5. Conclusion

In this chapter, we learned the concept of dimensionality reduction and PCA. In the next chapter, we will see the usage of PCA in a practical problem.