# 8. Preprocessing of the data using Pandas and SciKit

In previous chapters, we did some minor preprocessing of the data so that it could be used by the SciKit library. In this chapter, we will preprocess the data to change its ‘statistics’ and ‘format’, in order to improve the results of the data analysis.

## 8.1. Chronic kidney disease

The “chronic_kidney_disease.arff” dataset, which is available at the UCI Repository, is used for this tutorial.

• Let’s read and clean the data first,

Listing 8.1 Read and clean the dataset

# kidney_dis.py

import pandas as pd
import numpy as np

# create header for dataset
header = ['age','bp','sg','al','su','rbc','pc','pcc',
          'ba','bgr','bu','sc','sod','pot','hemo','pcv',
          'wbcc','rbcc','htn','dm','cad','appet','pe','ane',
          'classification']

# read the dataset
df = pd.read_csv("data/chronic_kidney_disease.arff",
                 header=None,
                 names=header
                 )

# dataset has '?' in it, convert these into NaN
df = df.replace('?', np.nan)
# drop the NaN
df = df.dropna(axis=0, how="any")

# print total samples
print("Total samples:", len(df))
# print 4-rows and 6-columns
print("Partial data\n", df.iloc[0:4, 0:6])

• Below is the output of the above code,
$ python kidney_dis.py
Total samples: 157
Partial data
    age  bp     sg al su       rbc
30   48  70  1.005  4  0    normal
36   53  90  1.020  2  0  abnormal
38   63  70  1.010  3  0  abnormal
41   68  80  1.010  3  2    normal

## 8.2. Saving targets with different color names

In this dataset we have two ‘targets’, i.e. ‘ckd’ and ‘notckd’, in the last column (‘classification’). It is better to save the ‘targets’ of a classification problem as ‘color names’ for plotting purposes. This helps in visualizing the scatter plot, as shown in this chapter.

Listing 8.2 Alias the ‘target values’ with ‘color values’

# kidney_dis.py

import pandas as pd
import numpy as np

# create header for dataset
header = ['age','bp','sg','al','su','rbc','pc','pcc',
          'ba','bgr','bu','sc','sod','pot','hemo','pcv',
          'wbcc','rbcc','htn','dm','cad','appet','pe','ane',
          'classification']

# read the dataset
df = pd.read_csv("data/chronic_kidney_disease.arff",
                 header=None,
                 names=header
                 )

# dataset has '?' in it, convert these into NaN
df = df.replace('?', np.nan)
# drop the NaN
df = df.dropna(axis=0, how="any")

# print total samples
# print("Total samples:", len(df))
# print 4-rows and 6-columns
# print("Partial data\n", df.iloc[0:4, 0:6])

targets = df['classification'].astype('category')
# save target-values as color for plotting
# red: disease, green: no disease
label_color = ['red' if i=='ckd' else 'green' for i in targets]
print(label_color[0:3], label_color[-3:-1])

Note

We can convert the ‘categorical’ targets (i.e. the strings ‘ckd’ and ‘notckd’) into ‘numeric’ targets (i.e. 0 and 1) using the “.cat.codes” attribute, as shown below,

# convert 'ckd' and 'notckd' labels into '0' and '1'
targets = df['classification'].astype('category').cat.codes

# save target-values as color for plotting
# red: disease, green: no disease
label_color = ['red' if i==0 else 'green' for i in targets]
print(label_color[0:3], label_color[-3:-1])

• Below are the first three and last two samples of the ‘label_color’,

$ python kidney_dis.py
['red', 'red', 'red'] ['green', 'green']


## 8.3. Basic PCA analysis

Let’s perform the dimensionality reduction using PCA, which is discussed in Section 7.2.

### 8.3.1. Preparing data for PCA analysis

Note that, for PCA, the features should be ‘numeric’ only. Therefore we need to remove the ‘categorical’ features from the dataset.

Listing 8.3 Drop categorical features
# kidney_dis.py

import pandas as pd
import numpy as np

# create header for dataset
header = ['age','bp','sg','al','su','rbc','pc','pcc',
          'ba','bgr','bu','sc','sod','pot','hemo','pcv',
          'wbcc','rbcc','htn','dm','cad','appet','pe','ane',
          'classification']

# read the dataset
df = pd.read_csv("data/chronic_kidney_disease.arff",
                 header=None,
                 names=header
                 )

# dataset has '?' in it, convert these into NaN
df = df.replace('?', np.nan)
# drop the NaN
df = df.dropna(axis=0, how="any")

# print total samples
# print("Total samples:", len(df))
# print 4-rows and 6-columns
# print("Partial data\n", df.iloc[0:4, 0:6])

targets = df['classification'].astype('category')
# save target-values as color for plotting
# red: disease, green: no disease
label_color = ['red' if i=='ckd' else 'green' for i in targets]
# print(label_color[0:3], label_color[-3:-1])

# list of categorical features
categorical_ = ['rbc', 'pc', 'pcc', 'ba', 'htn',
                'dm', 'cad', 'appet', 'pe', 'ane'
                ]

# drop the "categorical" features
# drop the classification column
df = df.drop(labels=['classification'], axis=1)
# drop using 'inplace' which is equivalent to df = df.drop()
df.drop(labels=categorical_, axis=1, inplace=True)

print("Partial data\n", df.iloc[0:4, 0:6]) # print partial data
• Below is the output of the above code. Note that, if we compare these results with the results of Listing 8.1, we can see that the ‘rbc’ column has been removed.
$ python kidney_dis.py
Partial data
    age  bp     sg al su  bgr
30   48  70  1.005  4  0  117
36   53  90  1.020  2  0   70
38   63  70  1.010  3  0  380
41   68  80  1.010  3  2  157
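Instead of listing the categorical features by hand, the non-numeric columns can also be detected programmatically with pandas ``select_dtypes``. Below is a minimal sketch on a toy DataFrame (the data here is illustrative, not taken from the kidney dataset):

```python
import pandas as pd

# toy DataFrame with mixed numeric and categorical columns
df = pd.DataFrame({
    'age': [48, 53, 63],                         # numeric
    'rbc': ['normal', 'abnormal', 'abnormal'],   # categorical (string)
    'bgr': [117, 70, 380],                       # numeric
    'htn': ['yes', 'no', 'no'],                  # categorical (string)
})

# columns of dtype 'object' hold strings, i.e. the categorical features
categorical_ = df.select_dtypes(include='object').columns.tolist()
print(categorical_)  # ['rbc', 'htn']

# drop them to keep only the numeric features
df_numeric = df.drop(labels=categorical_, axis=1)
print(df_numeric.columns.tolist())  # ['age', 'bgr']
```

Note that in the kidney dataset the ‘?’ entries cause many numeric columns to be read as strings as well, so this dtype-based detection would only work after first converting those columns with ``pd.to_numeric``.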


### 8.3.2. Dimensionality reduction

Let’s perform dimensionality reduction using the PCA model, as shown in Listing 8.4. The results are shown in Fig. 8.1, where we can see that the model can fairly separate the kidney-disease samples based on the provided features. In the next section, we will improve this performance with some more preprocessing of the data.

Listing 8.4 Dimensionality reduction using PCA
# kidney_dis.py

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# create header for dataset
header = ['age','bp','sg','al','su','rbc','pc','pcc',
          'ba','bgr','bu','sc','sod','pot','hemo','pcv',
          'wbcc','rbcc','htn','dm','cad','appet','pe','ane',
          'classification']

# read the dataset
df = pd.read_csv("data/chronic_kidney_disease.arff",
                 header=None,
                 names=header
                 )

# dataset has '?' in it, convert these into NaN
df = df.replace('?', np.nan)
# drop the NaN
df = df.dropna(axis=0, how="any")

# print total samples
# print("Total samples:", len(df))
# print 4-rows and 6-columns
# print("Partial data\n", df.iloc[0:4, 0:6])

targets = df['classification'].astype('category')
# save target-values as color for plotting
# red: disease, green: no disease
label_color = ['red' if i=='ckd' else 'green' for i in targets]
# print(label_color[0:3], label_color[-3:-1])

# list of categorical features
categorical_ = ['rbc', 'pc', 'pcc', 'ba', 'htn',
                'dm', 'cad', 'appet', 'pe', 'ane'
                ]

# drop the "categorical" features
# drop the classification column
df = df.drop(labels=['classification'], axis=1)
# drop using 'inplace' which is equivalent to df = df.drop()
df.drop(labels=categorical_, axis=1, inplace=True)
# print("Partial data\n", df.iloc[0:4, 0:6]) # print partial data

pca = PCA(n_components=2)
pca.fit(df)
T = pca.transform(df) # transformed data

# change 'T' to Pandas-DataFrame to plot using Pandas-plots
T = pd.DataFrame(T)

# plot the data
T.columns = ['PCA component 1', 'PCA component 2']
T.plot.scatter(x='PCA component 1', y='PCA component 2',
               marker='o', alpha=0.7, # opacity
               color=label_color,
               title="red: ckd, green: not-ckd"
               )
plt.show()
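How much of the total variance survives such a 2-component projection can be checked with PCA’s ``explained_variance_ratio_`` attribute, which reports the fraction of variance captured by each component. A minimal sketch on synthetic data (not the kidney dataset):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
# synthetic data: 100 samples, 5 features, essentially rank-2 plus small noise
X = rng.randn(100, 2) @ rng.randn(2, 5) + 0.01 * rng.randn(100, 5)

pca = PCA(n_components=2)
T = pca.fit_transform(X)  # fit and transform in one step

print(T.shape)                        # (100, 2)
print(pca.explained_variance_ratio_)  # fraction of variance per component
```

A ratio sum close to 1 means the 2D scatter plot is a faithful picture of the data; a small sum means the projection discards most of the structure.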

## 8.4. Preprocessing using SciKit library

It is shown in Section 7.4 that the overall performance of the PCA is dominated by ‘high-variance features’. Therefore, the features should be normalized before using the PCA model.

In the below code, the ‘StandardScaler’ preprocessing module is used to normalize the features, which sets ‘mean=0’ and ‘variance=1’ for all the features. Note the improvement in the results in Fig. 8.2, achieved just by adding one line in Listing 8.5.

Important

Currently, we are using preprocessing for ‘unsupervised learning’.

If we want to use preprocessing in ‘supervised learning’, then it is better to ‘split’ the dataset into ‘test and train’ sets first, and then apply the preprocessing to the ‘training data’ only. This is good practice, because in real-life problems we will not have the future data available for preprocessing.
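The ‘fit on train, apply to test’ idea from the note above can be sketched as follows (synthetic data, not the kidney dataset):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
X = rng.randn(50, 3) * [1.0, 10.0, 100.0]  # features with very different variances
y = rng.randint(0, 2, size=50)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=0)

scaler = StandardScaler()
scaler.fit(X_train)                  # learn mean/variance from training data ONLY
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)  # reuse the training statistics on test data

print(X_train_s.mean(axis=0))  # approximately 0 for each feature
```

The test set is scaled with the statistics of the training set, so its scaled mean will not be exactly zero; this is expected and mirrors how unseen future data would be handled.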

Listing 8.5 Scale the features using “StandardScaler”
# kidney_dis.py

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn import preprocessing

# create header for dataset
header = ['age','bp','sg','al','su','rbc','pc','pcc',
          'ba','bgr','bu','sc','sod','pot','hemo','pcv',
          'wbcc','rbcc','htn','dm','cad','appet','pe','ane',
          'classification']

# read the dataset
df = pd.read_csv("data/chronic_kidney_disease.arff",
                 header=None,
                 names=header
                 )

# dataset has '?' in it, convert these into NaN
df = df.replace('?', np.nan)
# drop the NaN
df = df.dropna(axis=0, how="any")

# print total samples
# print("Total samples:", len(df))
# print 4-rows and 6-columns
# print("Partial data\n", df.iloc[0:4, 0:6])

targets = df['classification'].astype('category')
# save target-values as color for plotting
# red: disease, green: no disease
label_color = ['red' if i=='ckd' else 'green' for i in targets]
# print(label_color[0:3], label_color[-3:-1])

# list of categorical features
categorical_ = ['rbc', 'pc', 'pcc', 'ba', 'htn',
                'dm', 'cad', 'appet', 'pe', 'ane'
                ]

# drop the "categorical" features
# drop the classification column
df = df.drop(labels=['classification'], axis=1)
# drop using 'inplace' which is equivalent to df = df.drop()
df.drop(labels=categorical_, axis=1, inplace=True)
# print("Partial data\n", df.iloc[0:4, 0:6]) # print partial data

# StandardScaler: mean=0, variance=1
df = preprocessing.StandardScaler().fit_transform(df)

pca = PCA(n_components=2)
pca.fit(df)
T = pca.transform(df) # transformed data

# change 'T' to Pandas-DataFrame to plot using Pandas-plots
T = pd.DataFrame(T)

# plot the data
T.columns = ['PCA component 1', 'PCA component 2']
T.plot.scatter(x='PCA component 1', y='PCA component 2',
               marker='o', alpha=0.7, # opacity
               color=label_color,
               title="red: ckd, green: not-ckd"
               )
plt.show()

Fig. 8.2 Chronic Kidney Disease results using “StandardScaler”

## 8.5. Preprocessing using Pandas library

Note that, in Section 8.3.1, we dropped several ‘categorical features’ as these cannot be used by PCA. But we can convert these features into ‘numeric features’ and use them in the PCA model.

Again, see the further improvement in the results in Fig. 8.3, achieved just by adding one line in Listing 8.6.
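What ``pd.get_dummies`` does can be seen on a toy DataFrame (illustrative data, not from the kidney dataset): each categorical column is replaced by one 0/1 indicator column per category value, while the numeric columns pass through unchanged.

```python
import pandas as pd

# toy DataFrame with one numeric and one categorical feature
df = pd.DataFrame({'age': [48, 53, 63],
                   'htn': ['yes', 'no', 'no']})

# 'htn' is replaced by the indicator columns 'htn_no' and 'htn_yes'
df = pd.get_dummies(df, columns=['htn'])
print(df.columns.tolist())  # ['age', 'htn_no', 'htn_yes']
```

Since the resulting indicator columns are numeric, the whole DataFrame can then be fed to StandardScaler and PCA as in Listing 8.6.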

Listing 8.6 Convert ‘categorical features’ to ‘numeric features’ using Pandas
# kidney_dis.py

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn import preprocessing

# create header for dataset
header = ['age','bp','sg','al','su','rbc','pc','pcc',
          'ba','bgr','bu','sc','sod','pot','hemo','pcv',
          'wbcc','rbcc','htn','dm','cad','appet','pe','ane',
          'classification']

# read the dataset
df = pd.read_csv("data/chronic_kidney_disease.arff",
                 header=None,
                 names=header
                 )

# dataset has '?' in it, convert these into NaN
df = df.replace('?', np.nan)
# drop the NaN
df = df.dropna(axis=0, how="any")

# print total samples
# print("Total samples:", len(df))
# print 4-rows and 6-columns
# print("Partial data\n", df.iloc[0:4, 0:6])

targets = df['classification'].astype('category')
# save target-values as color for plotting
# red: disease, green: no disease
label_color = ['red' if i=='ckd' else 'green' for i in targets]
# print(label_color[0:3], label_color[-3:-1])

# list of categorical features
categorical_ = ['rbc', 'pc', 'pcc', 'ba', 'htn',
                'dm', 'cad', 'appet', 'pe', 'ane'
                ]

# drop the "categorical" features
# drop the classification column
df = df.drop(labels=['classification'], axis=1)
# drop using 'inplace' which is equivalent to df = df.drop()
# df.drop(labels=categorical_, axis=1, inplace=True)

# convert categorical features into dummy variable
df = pd.get_dummies(df, columns=categorical_)
# print("Partial data\n", df.iloc[0:4, 0:6]) # print partial data

# StandardScaler: mean=0, variance=1
df = preprocessing.StandardScaler().fit_transform(df)

pca = PCA(n_components=2)
pca.fit(df)
T = pca.transform(df) # transformed data

# change 'T' to Pandas-DataFrame to plot using Pandas-plots
T = pd.DataFrame(T)

# plot the data
T.columns = ['PCA component 1', 'PCA component 2']
T.plot.scatter(x='PCA component 1', y='PCA component 2',
               marker='o', alpha=0.7, # opacity
               color=label_color,
               title="red: ckd, green: not-ckd"
               )
plt.show()

Fig. 8.3 Chronic Kidney Disease results using “get_dummies”

Important

Let’s summarize what we did in this chapter. We had a dataset with a large number of features. PCA looks for the correlation between these features and reduces the dimensionality. In this example, we reduced the number of features to 2 using PCA.

After the dimensionality reduction, we had only 2 features; therefore we could plot a scatter plot, which is easier to visualize. For example, we can clearly see the difference between ‘ckd’ and ‘notckd’ in the current example.

In conclusion, dimensionality reduction methods, such as PCA and Isomap, are used to reduce the dimensionality of the features to 2 or 3. Next, these 2 or 3 features can be plotted to visualize the information.

It is important that the plot is in 2D or 3D format; otherwise it is very difficult for the eyes to visualize and interpret the information.
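As mentioned above, Isomap can be used in the same way as PCA for 2D visualization; its interface in scikit-learn follows the same fit/transform pattern. A minimal sketch on synthetic data (not the kidney dataset):

```python
import numpy as np
from sklearn.manifold import Isomap

rng = np.random.RandomState(0)
X = rng.randn(100, 10)  # synthetic data: 100 samples, 10 features

# nonlinear dimensionality reduction to 2 components
iso = Isomap(n_neighbors=5, n_components=2)
T = iso.fit_transform(X)  # reduced to 2 features for plotting

print(T.shape)  # (100, 2)
```

Unlike PCA, Isomap is a nonlinear method based on neighborhood graphs, so the result depends on the ``n_neighbors`` parameter; the value 5 here is only an illustrative choice.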