8. Preprocessing of the data using Pandas and SciKit

In previous chapters, we did some minor preprocessing of the data so that it could be used by the SciKit library. In this chapter, we will preprocess the data to change its ‘statistics’ and ‘format’, to improve the results of the data analysis.

8.1. Chronic kidney disease

The “chronic_kidney_disease.arff” dataset, which is available at the UCI Repository, is used for this tutorial.

  • Let’s read and clean the data first,
Listing 8.1 Read the data
# kidney_dis.py

import pandas as pd
import numpy as np

# create header for dataset
header = ['age','bp','sg','al','su','rbc','pc','pcc',
    'ba','bgr','bu','sc','sod','pot','hemo','pcv',
    'wbcc','rbcc','htn','dm','cad','appet','pe','ane',
    'classification']
# read the dataset
df = pd.read_csv("data/chronic_kidney_disease.arff",
        header=None,
        names=header
       )
# dataset has '?' in it, convert these into NaN
df = df.replace('?', np.nan)
# drop the NaN
df = df.dropna(axis=0, how="any")

# print total samples
print("Total samples:", len(df))
# print 4-rows and 6-columns
print("Partial data\n", df.iloc[0:4, 0:6])
  • Below is the output of the above code,
$ python kidney_dis.py
Total samples: 157
Partial data
    age  bp     sg al su       rbc
30  48  70  1.005  4  0    normal
36  53  90  1.020  2  0  abnormal
38  63  70  1.010  3  0  abnormal
41  68  80  1.010  3  2    normal

8.2. Saving targets with different color names

In this dataset we have two ‘targets’, i.e. ‘ckd’ and ‘notckd’, in the last column (‘classification’). For a classification problem, it is better to save the ‘targets’ as ‘color-names’ for plotting purposes. This helps in visualizing the scatter-plots shown later in this chapter.

Listing 8.2 Alias the ‘target-values’ with ‘color-values’
# kidney_dis.py

import pandas as pd
import numpy as np

# create header for dataset
header = ['age','bp','sg','al','su','rbc','pc','pcc',
    'ba','bgr','bu','sc','sod','pot','hemo','pcv',
    'wbcc','rbcc','htn','dm','cad','appet','pe','ane',
    'classification']
# read the dataset
df = pd.read_csv("data/chronic_kidney_disease.arff",
        header=None,
        names=header
       )
# dataset has '?' in it, convert these into NaN
df = df.replace('?', np.nan)
# drop the NaN
df = df.dropna(axis=0, how="any")

# print total samples
# print("Total samples:", len(df))
# print 4-rows and 6-columns
# print("Partial data\n", df.iloc[0:4, 0:6])

targets = df['classification'].astype('category')
# save target-values as color for plotting
# red: disease,  green: no disease
label_color = ['red' if i=='ckd' else 'green' for i in targets]
print(label_color[0:3], label_color[-3:-1])

Note

We can convert the ‘categorical targets’ (i.e. the strings ‘ckd’ and ‘notckd’) into ‘numeric targets’ (i.e. 0 and 1) using the “.cat.codes” attribute, as shown below,

# convert 'ckd' and 'notckd' labels to '0' and '1'
targets = df['classification'].astype('category').cat.codes
# save target-values as color for plotting
# red: disease,  green: no disease
label_color = ['red' if i==0 else 'green' for i in targets]
print(label_color[0:3], label_color[-3:-1])
  • Below are the first three and two of the last samples of ‘label_color’,
$ python kidney_dis.py
['red', 'red', 'red'] ['green', 'green']

8.3. Basic PCA analysis

Let’s perform the dimensionality reduction using PCA, which is discussed in Section 7.2.

8.3.1. Preparing data for PCA analysis

Note that for PCA the features should be ‘numeric’ only. Therefore, we need to remove the ‘categorical’ features from the dataset.

Listing 8.3 Drop categorical features
# kidney_dis.py

import pandas as pd
import numpy as np

# create header for dataset
header = ['age','bp','sg','al','su','rbc','pc','pcc',
    'ba','bgr','bu','sc','sod','pot','hemo','pcv',
    'wbcc','rbcc','htn','dm','cad','appet','pe','ane',
    'classification']
# read the dataset
df = pd.read_csv("data/chronic_kidney_disease.arff",
        header=None,
        names=header
       )
# dataset has '?' in it, convert these into NaN
df = df.replace('?', np.nan)
# drop the NaN
df = df.dropna(axis=0, how="any")

# print total samples
# print("Total samples:", len(df))
# print 4-rows and 6-columns
# print("Partial data\n", df.iloc[0:4, 0:6])

targets = df['classification'].astype('category')
# save target-values as color for plotting
# red: disease,  green: no disease
label_color = ['red' if i=='ckd' else 'green' for i in targets]
# print(label_color[0:3], label_color[-3:-1])

# list of categorical features
categorical_ = ['rbc', 'pc', 'pcc', 'ba', 'htn',
        'dm', 'cad', 'appet', 'pe', 'ane'
        ]

# drop the "categorical" features
# drop the classification column
df = df.drop(labels=['classification'], axis=1)
# drop using 'inplace' which is equivalent to df = df.drop()
df.drop(labels=categorical_, axis=1, inplace=True)
print("Partial data\n", df.iloc[0:4, 0:6]) # print partial data
  • Below is the output of the above code. If we compare these results with the results of Listing 8.1, we can see that the ‘rbc’ column has been removed.
$ python kidney_dis.py
Partial data
    age  bp     sg al su  bgr
30  48  70  1.005  4  0  117
36  53  90  1.020  2  0   70
38  63  70  1.010  3  0  380
41  68  80  1.010  3  2  157

8.3.2. Dimensionality reduction

Let’s perform dimensionality reduction using the PCA model as shown in Listing 8.4. The results are shown in Fig. 8.1, where we can see that the model can fairly separate the kidney-disease samples based on the provided features. In the next section we will improve this performance with some more preprocessing of the data.

Listing 8.4 Dimensionality reduction using PCA
# kidney_dis.py

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# create header for dataset
header = ['age','bp','sg','al','su','rbc','pc','pcc',
    'ba','bgr','bu','sc','sod','pot','hemo','pcv',
    'wbcc','rbcc','htn','dm','cad','appet','pe','ane',
    'classification']
# read the dataset
df = pd.read_csv("data/chronic_kidney_disease.arff",
        header=None,
        names=header
       )
# dataset has '?' in it, convert these into NaN
df = df.replace('?', np.nan)
# drop the NaN
df = df.dropna(axis=0, how="any")

# print total samples
# print("Total samples:", len(df))
# print 4-rows and 6-columns
# print("Partial data\n", df.iloc[0:4, 0:6])

targets = df['classification'].astype('category')
# save target-values as color for plotting
# red: disease,  green: no disease
label_color = ['red' if i=='ckd' else 'green' for i in targets]
# print(label_color[0:3], label_color[-3:-1])

# list of categorical features
categorical_ = ['rbc', 'pc', 'pcc', 'ba', 'htn',
        'dm', 'cad', 'appet', 'pe', 'ane'
        ]

# drop the "categorical" features
# drop the classification column
df = df.drop(labels=['classification'], axis=1)
# drop using 'inplace' which is equivalent to df = df.drop()
df.drop(labels=categorical_, axis=1, inplace=True)
# print("Partial data\n", df.iloc[0:4, 0:6]) # print partial data

pca = PCA(n_components=2)
pca.fit(df)
T = pca.transform(df) # transformed data
# change 'T' to Pandas-DataFrame to plot using Pandas-plots
T = pd.DataFrame(T)

# plot the data
T.columns = ['PCA component 1', 'PCA component 2']
T.plot.scatter(x='PCA component 1', y='PCA component 2', marker='o',
        alpha=0.7, # opacity
        color=label_color,
        title="red: ckd, green: not-ckd" )
plt.show()

Fig. 8.1 Chronic Kidney Disease

8.4. Preprocessing using SciKit library

It is shown in Section 7.4 that the overall performance of PCA is dominated by ‘high-variance features’. Therefore the features should be normalized before using the PCA model.
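This dominance can be seen on a small standalone example (with made-up data, not the kidney dataset): when one feature has a much larger variance than another, the first principal component aligns almost entirely with the high-variance feature.

```python
import numpy as np
from sklearn.decomposition import PCA

# toy data: two uncorrelated features, one with a much larger variance
rng = np.random.default_rng(0)
X = np.column_stack([rng.normal(0, 100, size=500),   # high-variance feature
                     rng.normal(0, 1, size=500)])    # low-variance feature

pca = PCA(n_components=2).fit(X)
# the first component points almost entirely along the high-variance
# feature, which therefore dominates the projection
print(pca.components_[0])
print(pca.explained_variance_ratio_)
```

Here the first component explains nearly all of the variance, even though the two features are unrelated; this is why normalization matters before PCA.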

In the below code the ‘StandardScaler’ preprocessing class is used to normalize the features, which sets ‘mean=0’ and ‘variance=1’ for every feature. Note the improvement in the results in Fig. 8.2, achieved just by adding one line in Listing 8.5.

Important

Currently, we are using preprocessing for the ‘unsupervised learning’.

If we want to use preprocessing in ‘supervised learning’, then it is better to first ‘split’ the dataset into ‘train and test’ sets, and then apply the preprocessing to the ‘training data’ only. This is good practice because, in real-life problems, we will not have the future (test) data available during preprocessing.
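This practice can be sketched as follows (using a hypothetical feature matrix X and labels y, not the kidney dataset): fit the scaler on the training data only, and reuse the fitted statistics for the test data.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# hypothetical data: 100 samples, 5 numeric features
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
y = rng.integers(0, 2, size=100)

# split first ...
X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=0)

# ... then fit the scaler on the training data only, and apply the
# fitted mean/variance to both sets
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)   # no re-fitting on the test data
```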

Listing 8.5 Scale the features using “StandardScaler”
# kidney_dis.py

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn import preprocessing

# create header for dataset
header = ['age','bp','sg','al','su','rbc','pc','pcc',
    'ba','bgr','bu','sc','sod','pot','hemo','pcv',
    'wbcc','rbcc','htn','dm','cad','appet','pe','ane',
    'classification']
# read the dataset
df = pd.read_csv("data/chronic_kidney_disease.arff",
        header=None,
        names=header
       )
# dataset has '?' in it, convert these into NaN
df = df.replace('?', np.nan)
# drop the NaN
df = df.dropna(axis=0, how="any")

# print total samples
# print("Total samples:", len(df))
# print 4-rows and 6-columns
# print("Partial data\n", df.iloc[0:4, 0:6])

targets = df['classification'].astype('category')
# save target-values as color for plotting
# red: disease,  green: no disease
label_color = ['red' if i=='ckd' else 'green' for i in targets]
# print(label_color[0:3], label_color[-3:-1])

# list of categorical features
categorical_ = ['rbc', 'pc', 'pcc', 'ba', 'htn',
        'dm', 'cad', 'appet', 'pe', 'ane'
        ]

# drop the "categorical" features
# drop the classification column
df = df.drop(labels=['classification'], axis=1)
# drop using 'inplace' which is equivalent to df = df.drop()
df.drop(labels=categorical_, axis=1, inplace=True)
# print("Partial data\n", df.iloc[0:4, 0:6]) # print partial data

# StandardScaler: mean=0, variance=1
df = preprocessing.StandardScaler().fit_transform(df)

pca = PCA(n_components=2)
pca.fit(df)
T = pca.transform(df) # transformed data
# change 'T' to Pandas-DataFrame to plot using Pandas-plots
T = pd.DataFrame(T)

# plot the data
T.columns = ['PCA component 1', 'PCA component 2']
T.plot.scatter(x='PCA component 1', y='PCA component 2', marker='o',
        alpha=0.7, # opacity
        color=label_color,
        title="red: ckd, green: not-ckd" )
plt.show()

Fig. 8.2 Chronic Kidney Disease results using “StandardScaler”

8.5. Preprocessing using Pandas library

Note that in Section 8.3.1 we dropped several ‘categorical features’, as these cannot be used by PCA directly. But we can convert these features into ‘numeric features’ and use them in the PCA model.
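To see what this conversion does, here is a small standalone example (with made-up values, not the actual dataset rows): “get_dummies” replaces a categorical column by one 0/1 indicator column per category.

```python
import pandas as pd

# a tiny frame with one categorical feature (made-up values)
tiny = pd.DataFrame({'htn': ['yes', 'no', 'yes', 'no']})

# get_dummies replaces 'htn' by one indicator column per category
dummies = pd.get_dummies(tiny, columns=['htn'])
print(dummies.columns.tolist())   # ['htn_no', 'htn_yes']
```

Each categorical feature thus becomes a few numeric columns, which PCA can use.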

Again, see the further improvement in the results in Fig. 8.3, achieved just by adding one line in Listing 8.6.

Listing 8.6 Convert ‘categorical features’ to ‘numeric features’ using Pandas
# kidney_dis.py

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn import preprocessing

# create header for dataset
header = ['age','bp','sg','al','su','rbc','pc','pcc',
    'ba','bgr','bu','sc','sod','pot','hemo','pcv',
    'wbcc','rbcc','htn','dm','cad','appet','pe','ane',
    'classification']
# read the dataset
df = pd.read_csv("data/chronic_kidney_disease.arff",
        header=None,
        names=header
       )
# dataset has '?' in it, convert these into NaN
df = df.replace('?', np.nan)
# drop the NaN
df = df.dropna(axis=0, how="any")

# print total samples
# print("Total samples:", len(df))
# print 4-rows and 6-columns
# print("Partial data\n", df.iloc[0:4, 0:6])

targets = df['classification'].astype('category')
# save target-values as color for plotting
# red: disease,  green: no disease
label_color = ['red' if i=='ckd' else 'green' for i in targets]
# print(label_color[0:3], label_color[-3:-1])

# list of categorical features
categorical_ = ['rbc', 'pc', 'pcc', 'ba', 'htn',
        'dm', 'cad', 'appet', 'pe', 'ane'
        ]

# drop the "categorical" features
# drop the classification column
df = df.drop(labels=['classification'], axis=1)
# drop using 'inplace' which is equivalent to df = df.drop()
# df.drop(labels=categorical_, axis=1, inplace=True)

# convert categorical features into dummy variable
df = pd.get_dummies(df, columns=categorical_)
# print("Partial data\n", df.iloc[0:4, 0:6]) # print partial data

# StandardScaler: mean=0, variance=1
df = preprocessing.StandardScaler().fit_transform(df)

pca = PCA(n_components=2)
pca.fit(df)
T = pca.transform(df) # transformed data
# change 'T' to Pandas-DataFrame to plot using Pandas-plots
T = pd.DataFrame(T)

# plot the data
T.columns = ['PCA component 1', 'PCA component 2']
T.plot.scatter(x='PCA component 1', y='PCA component 2', marker='o',
        alpha=0.7, # opacity
        color=label_color,
        title="red: ckd, green: not-ckd" )
plt.show()

Fig. 8.3 Chronic Kidney Disease results using “get_dummies”

Important

Let’s summarize what we did in this chapter. We had a dataset with a large number of features. PCA looks for the correlations between these features and reduces the dimensionality. In this example, we reduced the number of features to 2 using PCA.

After the dimensionality reduction we had only 2 features, so we could draw a scatter-plot, which is easy to visualize. For example, we can clearly see the difference between the ‘ckd’ and ‘notckd’ samples in the current example.

In conclusion, dimensionality reduction methods such as PCA and Isomap are used to reduce the dimensionality of the features to 2 or 3. These 2 or 3 features can then be plotted to visualize the information.

It is important that the plot is in 2D or 3D format, otherwise it is very difficult for the eye to visualize and interpret the information.