14. Quick reference guide

14.1. Introduction

In previous chapters, we saw several examples of machine learning methods. In this chapter, we will summarize those methods along with several other useful ways to analyze the data.

14.2. Understand the data

When we get the data, we first need to look at it and at its statistics. Then we need to perform certain cleaning/transformation operations, e.g. filling the null values. In this section, we will see several steps which may be useful to understand the data.

14.2.1. Load the data and add headers

Although we can use plain Python or NumPy to load the data, it is better to use the Pandas library for this.

  • Add header to data: In the below code, the first 29 rows are skipped as these lines do not contain samples but information about each sample.
>>> import pandas as pd
>>>
>>> # create header for dataset
... header = ['age','bp','sg','al','su','rbc','pc','pcc',
...     'ba','bgr','bu','sc','sod','pot','hemo','pcv',
...     'wbcc','rbcc','htn','dm','cad','appet','pe','ane',
...     'classification']
>>>
>>> # read the dataset
... df_kidney = pd.read_csv("data/chronic_kidney_disease.arff",
...         header=None, # use header=0 to replace the existing header
...         skiprows=29, # skip first 29 rows
...         names=header
...        )
>>>
>>> df_kidney.shape # shape of data : 400 rows and 25 columns
(400, 25)
  • Replace the existing header of the data
>>> import pandas as pd
>>> # new headers
... header = ["channel", "area", "fresh", "milk", "grocery",
...         "frozen", "detergent", "delicatessen"]
>>>
>>> # replace existing headers
... df_whole_sale = pd.read_csv("data/Wholesale customers data.csv",
...                 header=0, # replace existing header; use this or below
...                 # skiprows=1, # skip the first row i.e. header
...                 names=header # use new header
...             )
>>>
>>> df_whole_sale.shape # shape of data: 440 rows and 8 columns
(440, 8)
>>>
>>> df_whole_sale.head(3) # show first three rows
   channel  area  fresh  milk  grocery  frozen  detergent  delicatessen
0        2     3  12669  9656     7561     214       2674          1338
1        2     3   7057  9810     9568    1762       3293          1776
2        2     3   6353  8808     7684    2405       3516          7844
>>>
>>> df_whole_sale.tail(2) # show last two rows
     channel  area  fresh  milk  grocery  frozen  detergent  delicatessen
438        1     3  10290  1981     2232    1038        168          2125
439        1     3   2787  1698     2510      65        477            52

14.2.2. Check for the null values

  • Check if any null values exist,
>>> df_kidney.isnull().sum()
age               0
bp                0
sg                0
al                0
su                0
rbc               0
pc                0
pcc               0
ba                0
bgr               0
bu                0
sc                0
sod               0
pot               0
hemo              0
pcv               0
wbcc              0
rbcc              0
htn               0
dm                1
cad               0
appet             0
pe                0
ane               0
classification    0
dtype: int64
  • Check the location of the null value,
>>> df_kidney[df_kidney.dm.isnull()]
    age  bp     sg al su     rbc      pc         pcc          ba  bgr  \
369  75  70  1.020  0  0  normal  normal  notpresent  notpresent  107

         ...       pcv   wbcc rbcc htn   dm cad appet    pe ane classification
369      ...        46  10300  4.8  no  NaN  no    no  good  no             no

[1 rows x 25 columns]
>>>
>>> df_kidney[df_kidney.dm.isnull()].iloc[:, 0:2] # display only two columns
    age  bp
369  75  70

14.2.3. Check the data types

Sometimes the datatypes are not correctly inferred by Pandas, therefore it is better to check the data type of each column.

  • In the below results, all the types are ‘object’ (not numeric), because some samples contain ‘?’; therefore we need to replace the ‘?’ values with some other value,
>>> df_kidney.dtypes
age               object
bp                object
sg                object
al                object
su                object
rbc               object
pc                object
pcc               object
ba                object
bgr               object
bu                object
sc                object
sod               object
pot               object
hemo              object
pcv               object
wbcc              object
rbcc              object
htn               object
dm                object
cad               object
appet             object
pe                object
ane               object
classification    object
dtype: object
  • If we perform the ‘conversion’ operation at this moment, then an error will be generated due to the ‘?’ in the data,
>>> df_kidney.bgr = pd.to_numeric(df_kidney.bgr)
Traceback (most recent call last):
ValueError: Unable to parse string "?" at position 1
  • Replace the ‘?’ with ‘NaN’ using the ‘replace’ command, and change the ‘type’ of the ‘bgr’ column (an alternative approach is sketched after the output below),
>>> import numpy as np
>>> df_kidney = df_kidney.replace('?', np.nan)
>>> df_kidney.bgr = pd.to_numeric(df_kidney.bgr)
>>> df_kidney.dtypes
[...]
ba                 object
bgr               float64
[...]
classification     object
dtype: object
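
As an alternative to replacing the ‘?’ afterwards, the ‘na_values’ option of ‘read_csv’ can mark the ‘?’ entries as missing while reading the file, which produces an equivalent DataFrame. Below is a minimal sketch, assuming the same file and header as in Section 14.2.1,

>>> # read again, treating '?' as NaN directly
... df_kidney = pd.read_csv("data/chronic_kidney_disease.arff",
...         header=None,
...         skiprows=29,
...         names=header,
...         na_values='?' # '?' entries become NaN
...        )
>>>
>>> # runs without error now, since '?' is already NaN
... df_kidney.bgr = pd.to_numeric(df_kidney.bgr)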
  • Next, we can drop or fill the ‘NaN’ values. In the below code we dropped the NaN values; a sketch of the ‘fill’ alternative is given after the output below,
>>> df_kidney.isnull().sum() # check the NaN
age                 9
bp                 12
sg                 47
[...]
classification      0
dtype: int64
>>>
>>> # drop the NaN
>>> df_kidney = df_kidney.dropna(axis=0, how="any")
>>>
>>> df_kidney.isnull().sum() # check NaN again
age               0
bp                0
sg                0
[...]
classification    0
dtype: int64
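
Instead of dropping the rows, we could also fill the ‘NaN’ values, e.g. with the mean of a numeric column or with a fixed label for a categorical column. Below is a minimal sketch of this alternative (it assumes the numeric columns, such as ‘bgr’, have already been converted as shown above),

>>> # fill the NaN of a numeric column with the column mean
... df_kidney.bgr = df_kidney.bgr.fillna(df_kidney.bgr.mean())
>>>
>>> # fill the NaN of a categorical column with a fixed label
... df_kidney.dm = df_kidney.dm.fillna("unknown")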

14.2.4. Statistics of the data

  • The ‘describe’ method can be used to see the statistics of the data,
>>> df_whole_sale.describe()
          channel        area          fresh          milk       grocery  \
count  440.000000  440.000000     440.000000    440.000000    440.000000
mean     1.322727    2.543182   12000.297727   5796.265909   7951.277273
std      0.468052    0.774272   12647.328865   7380.377175   9503.162829
min      1.000000    1.000000       3.000000     55.000000      3.000000
25%      1.000000    2.000000    3127.750000   1533.000000   2153.000000
50%      1.000000    3.000000    8504.000000   3627.000000   4755.500000
75%      2.000000    3.000000   16933.750000   7190.250000  10655.750000
max      2.000000    3.000000  112151.000000  73498.000000  92780.000000

             frozen     detergent  delicatessen
count    440.000000    440.000000    440.000000
mean    3071.931818   2881.493182   1524.870455
std     4854.673333   4767.854448   2820.105937
min       25.000000      3.000000      3.000000
25%      742.250000    256.750000    408.250000
50%     1526.000000    816.500000    965.500000
75%     3554.250000   3922.000000   1820.250000
max    60869.000000  40827.000000  47943.000000
  • See the output for the first 2 columns only,
>>> df_whole_sale.iloc[:, 0:2].describe()
          channel        area
count  440.000000  440.000000
mean     1.322727    2.543182
std      0.468052    0.774272
min      1.000000    1.000000
25%      1.000000    2.000000
50%      1.000000    3.000000
75%      2.000000    3.000000
max      2.000000    3.000000


>>> df_whole_sale.describe().iloc[:,0:2]
          channel      area
count  440.000000  440.000000
mean     1.322727    2.543182
std      0.468052    0.774272
min      1.000000    1.000000
25%      1.000000    2.000000
50%      1.000000    3.000000
75%      2.000000    3.000000
max      2.000000    3.000000
  • Display the output for specific columns,
>>> df_whole_sale[['milk', 'fresh']].describe()
               milk          fresh
count    440.000000     440.000000
mean    5796.265909   12000.297727
std     7380.377175   12647.328865
min       55.000000       3.000000
25%     1533.000000    3127.750000
50%     3627.000000    8504.000000
75%     7190.250000   16933.750000
max    73498.000000  112151.00000

14.2.5. Output distribution for classification problem

It is better to see the distribution of the outputs for a classification problem. In the below output, we can see that we have more data for ‘no chronic kidney disease (notckd)’ than for ‘chronic kidney disease (ckd)’; the same counts can also be obtained with ‘value_counts’, as sketched after the output below,

>>> df_kidney.groupby("classification").size()
classification
ckd        43
notckd    114
dtype: int64
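
Below is the ‘value_counts’ alternative mentioned above; it returns the same counts, sorted in descending order,

>>> # same counts as above, sorted in descending order (notckd: 114, ckd: 43)
... counts = df_kidney["classification"].value_counts()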

14.2.6. Correlation between features

It is also good to see the correlation between the features. In the below results we can see that ‘milk’ is highly correlated with ‘grocery’ and ‘detergent’, which indicates that customers who buy ‘milk’ are more likely to buy ‘grocery’ and ‘detergent’ as well. See Chapter 10 for more details about this relationship.

>>> df_whole_sale[['fresh', 'milk', 'grocery', 'frozen',
... 'detergent', 'delicatessen']].corr()
                 fresh      milk   grocery    frozen  detergent  delicatessen
fresh         1.000000  0.100510 -0.011854  0.345881  -0.101953      0.244690
milk          0.100510  1.000000  0.728335  0.123994   0.661816      0.406368
grocery      -0.011854  0.728335  1.000000 -0.040193   0.924641      0.205497
frozen        0.345881  0.123994 -0.040193  1.000000  -0.131525      0.390947
detergent    -0.101953  0.661816  0.924641 -0.131525   1.000000      0.069291
delicatessen  0.244690  0.406368  0.205497  0.390947   0.069291      1.000000

14.3. Visualizing the data

In this tutorial, we already saw several data-visualization techniques such as the ‘histogram’ and the ‘scatter plot’. In this section, we will summarize these techniques.

Table 14.1 Types of plots
Type          Example
Univariate    Histogram, Density plot, Box and Whisker plot
Multivariate  Scatter plot, Correlation matrix plot

The plots can be divided into two categories as shown in Table 14.1. These plots are described below,

14.3.1. Univariate plots

The univariate plots are the plots which are used to visualize each feature independently. In this section we will see some of the important univariate plots.

14.3.1.1. Histogram

Histograms are the quickest way to visualize the distributions of the data as shown below,

>>> import matplotlib.pyplot as plt
>>> df_whole_sale.hist()
array([[<matplotlib.axes._subplots.AxesSubplot object at 0xa7d6af4c>,
        <matplotlib.axes._subplots.AxesSubplot object at 0xa7aa0c2c>,
        <matplotlib.axes._subplots.AxesSubplot object at 0xa7a4d6cc>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0xa7a1038c>,
        <matplotlib.axes._subplots.AxesSubplot object at 0xa79c85ac>,
        <matplotlib.axes._subplots.AxesSubplot object at 0xa79c85ec>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0xa798b96c>,
        <matplotlib.axes._subplots.AxesSubplot object at 0xa796912c>,
        <matplotlib.axes._subplots.AxesSubplot object at 0xa78b754c>]], dtype=object)
>>> plt.show()

Fig. 14.1 Histogram of wholesale data

14.3.1.2. Density Plots

Density plots can be seen as smoothed histograms, as shown below,

>>> df_whole_sale.plot(kind='density', sharex=False, subplots=True, layout=(3,3))
array([[<matplotlib.axes._subplots.AxesSubplot object at 0xa8e00eec>,
        <matplotlib.axes._subplots.AxesSubplot object at 0xa794c6cc>,
        <matplotlib.axes._subplots.AxesSubplot object at 0xa7f1aa6c>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0xa7a2acac>,
        <matplotlib.axes._subplots.AxesSubplot object at 0xa8c23b4c>,
        <matplotlib.axes._subplots.AxesSubplot object at 0xa8c2342c>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0xa7ad8aac>,
        <matplotlib.axes._subplots.AxesSubplot object at 0xa57ad5cc>,
        <matplotlib.axes._subplots.AxesSubplot object at 0xa8dd364c>]], dtype=object)
>>> plt.show()

Fig. 14.2 Density plot of wholesale data

14.3.1.3. Box and Whisker plot

Box and Whisker plots draw a line at the median value and a box between the 25th and 75th percentiles.

>>> df_whole_sale.plot(kind='box', sharex=False, subplots=True, layout=(3,3))
channel            Axes(0.125,0.653529;0.227941x0.226471)
area            Axes(0.398529,0.653529;0.227941x0.226471)
fresh           Axes(0.672059,0.653529;0.227941x0.226471)
milk               Axes(0.125,0.381765;0.227941x0.226471)
grocery         Axes(0.398529,0.381765;0.227941x0.226471)
frozen          Axes(0.672059,0.381765;0.227941x0.226471)
detergent              Axes(0.125,0.11;0.227941x0.226471)
delicatessen        Axes(0.398529,0.11;0.227941x0.226471)
dtype: object
>>> plt.show()

Fig. 14.3 Box and Whisker plot of wholesale data

14.3.2. Multivariate plots

The multivariate plots are the plots which are used to visualize the relationship between two or more features.

14.3.2.1. Scatter plot

Important

Note that we need to convert the NumPy array into a Pandas DataFrame before plotting it using Pandas. This is applicable to both ‘univariate’ and ‘multivariate’ plots.

  • Below is the code to convert the ‘numpy array’ into a ‘DataFrame’,
>>> from sklearn.datasets import load_iris
>>> iris = load_iris()
>>> features, targets = iris.data, iris.target
>>> type(features)
<class 'numpy.ndarray'>
>>>
>>> import pandas as pd
>>> df_features = pd.DataFrame(features) # convert to DataFrame
>>> type(df_features)
<class 'pandas.core.frame.DataFrame'>
>>> df_features.head()
     0    1    2    3
0  5.1  3.5  1.4  0.2
1  4.9  3.0  1.4  0.2
2  4.7  3.2  1.3  0.2
3  4.6  3.1  1.5  0.2
4  5.0  3.6  1.4  0.2
  • Now, we can plot the scatter-plot as below,
>>> from pandas.plotting import scatter_matrix
>>> scatter_matrix(df_features)
array([[<matplotlib.axes._subplots.AxesSubplot object at 0xa8166a6c>,
        <matplotlib.axes._subplots.AxesSubplot object at 0xa747948c>,
        <matplotlib.axes._subplots.AxesSubplot object at 0xa7437f2c>,
        <matplotlib.axes._subplots.AxesSubplot object at 0xa745d08c>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0xa73ac44c>,
        <matplotlib.axes._subplots.AxesSubplot object at 0xa73ac48c>,
        <matplotlib.axes._subplots.AxesSubplot object at 0xa73b842c>,
        <matplotlib.axes._subplots.AxesSubplot object at 0xa7353acc>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0xa73126ac>,
        <matplotlib.axes._subplots.AxesSubplot object at 0xa72c6b2c>,
        <matplotlib.axes._subplots.AxesSubplot object at 0xa728bd2c>,
        <matplotlib.axes._subplots.AxesSubplot object at 0xa723d32c>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0xa71fe6ac>,
        <matplotlib.axes._subplots.AxesSubplot object at 0xa71b07ec>,
        <matplotlib.axes._subplots.AxesSubplot object at 0xa71d24ec>,
        <matplotlib.axes._subplots.AxesSubplot object at 0xa71243cc>]],
        dtype=object)
>>> plt.show()

Fig. 14.4 Scatter plot for iris data

Note

We can plot the multicolor ‘Scatter plot’ and ‘Histogram’ as shown in Section 12.2, which are easier to visualize as compared to single-color plots.

For a colorful scatter_matrix plot, we can use the below code,

>>> scatter_matrix(df_features, c=iris.target) # colorful scatter plot

14.3.2.2. Correlation matrix plot

  • Below is the code which plots the correlation values of the data; this is known as a correlation-matrix plot,
>>> corr_whole_sale = df_whole_sale[['fresh', 'milk', 'grocery', 'frozen',
... 'detergent', 'delicatessen']].corr()
>>> plt.matshow(corr_whole_sale)
<matplotlib.image.AxesImage object at 0xa697f64c>
>>> plt.show()

Fig. 14.5 Correlation-matrix plot for the wholesale data

  • Also, we can add a ‘colorbar’ to see the relationship between the colors and the correlation values,
>>> plt.matshow(corr_whole_sale, vmin=-1, vmax=1)
<matplotlib.image.AxesImage object at 0xa5d9270c>
>>> plt.colorbar()
<matplotlib.colorbar.Colorbar object at 0xa5d928ec>
>>> plt.show()

Fig. 14.6 Correlation-matrix plot with ‘colorbar’ for the wholesale data

  • Finally, we can add ‘headers’ to the plot so that it is more readable. Below is the complete code for plotting the data,
>>> import numpy as np
>>> import pandas as pd
>>> import matplotlib.pyplot as plt
>>> # new headers
... header = ["channel", "area", "fresh", "milk", "grocery",
...         "frozen", "detergent", "delicatessen"]
>>>
>>> # replace existing headers
... df_whole_sale = pd.read_csv("data/Wholesale customers data.csv",
...                 header=0, # replace existing header; use this or below
...                 # skiprows=1, # skip the first row i.e. header
...                 names=header # use new header
...             )
>>>
>>>
>>> names = ['fresh', 'milk', 'grocery', 'frozen', 'detergent', 'delicatessen']
>>> corr_whole_sale = df_whole_sale[names].corr()
>>>
>>> # plot the data
>>> fig = plt.figure()
>>> ax = fig.add_subplot(111)
>>> corr_plot = ax.matshow(corr_whole_sale, vmin=-1, vmax=1)
>>> fig.colorbar(corr_plot)
>>> ticks = np.arange(0,6,1) # total 6 items
>>> ax.set_xticks(ticks)
>>> ax.set_yticks(ticks)
>>> ax.set_xticklabels(names)
>>> ax.set_yticklabels(names)
>>> plt.show()

Fig. 14.7 Correlation-matrix plot with ‘colorbar’ and ‘tick-name’ for the wholesale data

Note

From the correlation-matrix plot it is quite clear that people are buying ‘grocery’ and ‘detergent’ together.

See Chapter 10 for more details about these relationships, where scatter plot is used to visualize the relationships.

14.4. Preprocessing of the data

In Chapter 8, we saw examples of preprocessing the data and the resulting performance improvement in the model. Further, we learned that some of the algorithms are sensitive to the statistics of the features, e.g. the PCA algorithm gives more weight to the features which have high variance. In other words, the feature with the highest variance will dominate the performance of the PCA. In this section, we will summarize some of the preprocessing methods.

14.4.1. Statistics of data

  • Let’s read the samples from the ‘Wholesale’ data first; we will preprocess this data in this section,
>>> import pandas as pd
>>>
>>> # new headers
... header = ["channel", "area", "fresh", "milk", "grocery",
...         "frozen", "detergent", "delicatessen"]
>>>
>>> # replace existing headers
... df_whole_sale = pd.read_csv("data/Wholesale customers data.csv",
...                 header=0, # replace existing header; use this or below
...                 # skiprows=1, # skip the first row i.e. header
...                 names=header # use new header
...             )
  • Next, see the mean and variance of each feature,
>>> # mean and variance
... import numpy as np
>>> np.mean(df_whole_sale)
channel             1.322727
area                2.543182
fresh           12000.297727
milk             5796.265909
grocery          7951.277273
frozen           3071.931818
detergent        2881.493182
delicatessen     1524.870455
dtype: float64
>>>
>>> np.var(df_whole_sale)
channel         2.185744e-01
area            5.981353e-01
fresh           1.595914e+08
milk            5.434617e+07
grocery         9.010485e+07
frozen          2.351429e+07
detergent       2.268077e+07
delicatessen    7.934923e+06
dtype: float64

14.4.2. StandardScaler

We used the ‘StandardScaler’ in Chapter 8 and saw the performance improvement in the model with it. It sets the ‘mean = 0’ and ‘variance = 1’ for all the features,

  • Now, process the data using StandardScaler,
>>> # preprocessing StandardScaler : mean=0, var=1
... from sklearn.preprocessing import StandardScaler
>>> scaler = StandardScaler().fit(df_whole_sale)
>>> df_temp = scaler.transform(df_whole_sale)

Also, we can combine the above two steps (i.e. fit and transform) into one step as below,

>>> # preprocessing StandardScaler : mean=0, var=1
... from sklearn.preprocessing import StandardScaler
>>> df_temp = StandardScaler().fit_transform(df_whole_sale)
  • Note that the type of the ‘df_temp’ is ‘numpy.ndarray’, therefore we need to loop through each column to calculate mean and variance as shown below,
>>> type(df_temp) # numpy array
<class 'numpy.ndarray'>
>>>
>>> # mean and var of each column
... for i in range(df_temp.shape[1]):
...     print("row {0}: mean={1:<5.2f} var={2:<5.2f}".format(i,
...         np.mean(df_temp[:,i]),
...         np.var(df_temp[:,i])
...         )
...     )
...
row 0: mean=0.00  var=1.00
row 1: mean=0.00  var=1.00
row 2: mean=-0.00 var=1.00
row 3: mean=-0.00 var=1.00
row 4: mean=-0.00 var=1.00
row 5: mean=0.00  var=1.00
row 6: mean=0.00  var=1.00
row 7: mean=-0.00 var=1.00
  • Also, we can convert the numpy-array to Pandas-DataFrame and then calculate the mean and variance,
>>> # convert numpy-array to Pandas-dataframe
... df = pd.DataFrame(df_temp, columns=header)
>>>
>>> type(df) # Pandas DataFrame
<class 'pandas.core.frame.DataFrame'>
>>>
>>> np.mean(df) # mean = 0
channel        -2.523234e-18
area            2.828545e-16
fresh          -3.727684e-17
milk           -8.815549e-18
grocery        -5.197665e-17
frozen          3.587724e-17
detergent       2.618250e-17
delicatessen   -2.508450e-18
dtype: float64
>>>
>>> np.var(df)
channel         1.0
area            1.0
fresh           1.0
milk            1.0
grocery         1.0
frozen          1.0
detergent       1.0
delicatessen    1.0
dtype: float64

14.4.3. MinMax scaler

The MinMax scaler scales each feature to the range (0 to 1), i.e. the minimum and maximum values are mapped to 0 and 1 respectively.

>>> from sklearn.preprocessing import MinMaxScaler
>>> df_temp = MinMaxScaler().fit_transform(df_whole_sale)
>>> df = pd.DataFrame(df_temp, columns=header)
>>> np.min(df)
channel         0.0
area            0.0
fresh           0.0
milk            0.0
grocery         0.0
frozen          0.0
detergent       0.0
delicatessen    0.0
dtype: float64
>>> np.max(df)
channel         1.0
area            1.0
fresh           1.0
milk            1.0
grocery         1.0
frozen          1.0
detergent       1.0
delicatessen    1.0
dtype: float64
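
The MinMaxScaler output can also be verified manually with the formula (x - min) / (max - min). Below is a minimal sketch for the ‘fresh’ column, where ‘df’ is the scaled DataFrame from the above code,

>>> # manual min-max scaling of the 'fresh' column
... fresh = df_whole_sale['fresh']
>>> fresh_scaled = (fresh - fresh.min()) / (fresh.max() - fresh.min())
>>>
>>> np.allclose(fresh_scaled, df['fresh']) # same as the MinMaxScaler output
True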

14.4.4. Normalizer

The Normalizer scales each row (i.e. each sample) individually such that its Euclidean length (L2 norm) is ‘1’, as verified in the below code,

>>> from sklearn.preprocessing import Normalizer
>>> df_temp = Normalizer().fit_transform(df_whole_sale)
>>> df = pd.DataFrame(df_temp, columns=header)

>>> # check the L2 norm (printed as 'sum') of each row
>>> for i in range(df_temp.shape[0]):
...     print("row {0}:  sum={1:0.2f}".format(
...             i, # row number
...             np.sqrt(np.cumsum(df_temp[i,:]**2)[-1])
...         )
...     )
...
row 0:  sum=1.00
row 1:  sum=1.00
row 2:  sum=1.00
row 3:  sum=1.00
row 4:  sum=1.00
row 5:  sum=1.00
[...]

14.5. Feature selection

In Chapter 7, we saw an example of feature selection, where PCA analysis was done to reduce the dimension of the features.

Note

While collecting the data, our aim is to collect it without thinking about the relationship between the ‘features’ and the ‘targets’. It is possible that some of this data has no impact on the target, e.g. the ‘first name’ of a person has no relationship with ‘chronic kidney disease’. If we use this feature, i.e. the first name, to predict ‘chronic kidney disease’, then we will get wrong results.

Feature selection is the process of ‘removing’ or ‘giving less weight to’ irrelevant or partially relevant features. In this way we can achieve the following,

  1. Reduce overfitting: as the partially relevant data is removed from the dataset.
  2. Reduce training time: as we have fewer features after feature selection.

14.5.1. SelectKBest

The ‘SelectKBest’ class can be used to find the best ‘K’ features from the dataset. In the below code, ‘new_features’ contains the last two columns of ‘features’,

>>> from sklearn.datasets import load_iris
>>> iris = load_iris()
>>> features, targets = iris.data, iris.target
>>>
>>> from sklearn.feature_selection import SelectKBest
>>> selector = SelectKBest(k=2)
>>> selector.fit(features, targets)
SelectKBest(k=2, score_func=<function f_classif at 0xb3cd49bc>)
>>> new_features = selector.transform(features)
>>> print(new_features[0:5, :]) # selected last 2 columns
[[ 1.4  0.2]
 [ 1.4  0.2]
 [ 1.3  0.2]
 [ 1.5  0.2]
 [ 1.4  0.2]]
>>> print(features[0:5, :])
[[ 5.1  3.5  1.4  0.2]
 [ 4.9  3.   1.4  0.2]
 [ 4.7  3.2  1.3  0.2]
 [ 4.6  3.1  1.5  0.2]
 [ 5.   3.6  1.4  0.2]]
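
To see why the last two columns were selected, we can inspect the fitted selector; by default it uses the ANOVA F-value (‘f_classif’) as the scoring function. Below is a minimal sketch (the exact score values depend on the scikit-learn version, therefore they are not shown here),

>>> # scores assigned to the 4 features; higher means more relevant
... scores = selector.scores_
>>>
>>> # boolean mask of the selected features i.e. [False, False, True, True]
... mask = selector.get_support()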

14.5.2. Recursive Feature Elimination (RFE)

RFE recursively fits the model and removes the weakest features (i.e. the attributes which contribute least to the model) until the desired number of features is left,

>>> from sklearn.datasets import load_iris
>>> iris = load_iris()
>>> features, targets = iris.data, iris.target
>>>
>>> from sklearn.linear_model import LogisticRegression
>>> from sklearn.feature_selection import RFE
>>> model = LogisticRegression()
>>> selector = RFE(model, 2)
>>> fit = selector.fit(features, targets)
>>> new_features = fit.transform(features)
>>> print(new_features[0:5, :]) # selected 2nd and 4th column
[[ 3.5  0.2]
 [ 3.   0.2]
 [ 3.2  0.2]
 [ 3.1  0.2]
 [ 3.6  0.2]]
>>> print(features[0:5, :])
[[ 5.1  3.5  1.4  0.2]
 [ 4.9  3.   1.4  0.2]
 [ 4.7  3.2  1.3  0.2]
 [ 4.6  3.1  1.5  0.2]
 [ 5.   3.6  1.4  0.2]]
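
We can also inspect which columns were kept, using the ‘support_’ and ‘ranking_’ attributes of the fitted selector, as sketched below (recall that the 2nd and 4th columns were selected),

>>> # boolean mask of the selected features i.e. True for the 2nd and 4th columns
... mask = fit.support_
>>>
>>> # ranking of the features; the selected features have rank 1
... ranks = fit.ranking_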

14.5.3. Principal component analysis (PCA)

Please see Chapter 7, where PCA is discussed in detail. Note that it does not select the features but transforms them.
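
For quick reference, below is a minimal PCA sketch (assuming the iris data) which transforms the four features into two principal components,

>>> from sklearn.datasets import load_iris
>>> from sklearn.decomposition import PCA
>>> iris = load_iris()
>>> features = iris.data
>>>
>>> pca = PCA(n_components=2) # keep 2 principal components
>>> new_features = pca.fit_transform(features)
>>> new_features.shape # 150 samples, 2 transformed features
(150, 2)
>>>
>>> # fraction of the variance captured by each component
... var_ratio = pca.explained_variance_ratio_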

14.6. Algorithms

In this section, we will see some of the widely used algorithms for ‘classification’ and ‘regression’ problems.

Important

Note that no single model works well in all cases. Therefore, we need to check the performance of various machine learning algorithms before finalizing the model.

14.6.1. Classification algorithms

Table 14.2 shows some of the widely used classification algorithms. We already saw examples of ‘Logistic Regression (Chapter 3)’, ‘K-nearest neighbor (Chapter 2)’ and ‘SVM (Chapter 11)’. In this section we will discuss the LDA, Naive Bayes and Decision Tree algorithms; a sketch which compares several of these classifiers is given after Table 14.2.

Table 14.2 Classification algorithms
Type        Algorithm
Linear      Logistic Regression, Linear Discriminant Analysis (LDA)
Non-linear  K-nearest neighbor, Support vector machines (SVM), Naive Bayes, Decision Tree
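
As noted in the ‘Important’ box above, it is better to compare several algorithms before finalizing the model. Below is a minimal sketch which runs a few of the classifiers from Table 14.2 through the same cross-validation loop on the iris data (the file name ‘compare_classifiers.py’ is only for illustration, and the exact scores depend on the scikit-learn version),

# compare_classifiers.py

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC

# save features and targets from the 'iris' dataset
iris = load_iris()
features, targets = iris.data, iris.target

# classifiers to be compared
classifiers = [
    ("LogisticRegression", LogisticRegression()),
    ("LDA", LinearDiscriminantAnalysis()),
    ("KNN", KNeighborsClassifier()),
    ("NaiveBayes", GaussianNB()),
    ("DecisionTree", DecisionTreeClassifier()),
    ("SVM", SVC()),
]

# cross-validation score of each classifier
for name, classifier in classifiers:
    scores = cross_val_score(classifier, features, targets, cv=3)
    print("{0}: mean score = {1:0.4f}".format(name, np.mean(scores)))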

14.6.1.1. Linear Discriminant Analysis (LDA)

The below code is the same as Listing 3.4, but LDA is used instead of the ‘K-nearest’ and ‘LogisticRegression’ algorithms,

# rock_mine2.py

# 'R': Rock, 'M': Mine

import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

f = open("data/sonar.all-data", 'r')
data = f.read()
f.close()

data = data.split() # split on \n

# save data as list i.e. list of list will be created
data_list = []
for d in data:
    # split on comma
    row = d.split(",")
    data_list.append(row)

# extract targets
row_sample, col_sample = len(data_list), len(data_list[0])

# features : last column i.e. target value will be removed form the dataset
features = np.zeros((row_sample, col_sample-1), float)
# target : store only last column
targets = []  # targets are 'R' and 'M'

for i, data in enumerate(data_list):
    targets.append(data[-1])
    features[i] = data[:-1]
# print(targets)
# print(features)

# split the training and test data
train_features, test_features, train_targets, test_targets = train_test_split(
        features, targets,
        train_size=0.8,
        test_size=0.2,
        # random but same for all run, also accuracy depends on the
        # selection of data e.g. if we put 10 then accuracy will be 1.0
        # in this example
        random_state=23,
        # keep same proportion of 'target' in test and target data
        stratify=targets
    )

# select classifier
classifier = LinearDiscriminantAnalysis()

# training using 'training data'
classifier.fit(train_features, train_targets) # fit the model for training data

# predict the 'target' for 'training data'
prediction_training_targets = classifier.predict(train_features)
self_accuracy = accuracy_score(train_targets, prediction_training_targets)
print("Accuracy for training data (self accuracy):", self_accuracy)

# predict the 'target' for 'test data'
prediction_test_targets = classifier.predict(test_features)
test_accuracy = accuracy_score(test_targets, prediction_test_targets)
print("Accuracy for test data:", test_accuracy)
  • Below is the result for the above code,
$ python rock_mine2.py
Accuracy for training data (self accuracy): 0.885542168675
Accuracy for test data: 0.809523809524

Note

Both LogisticRegression and LinearDiscriminantAnalysis algorithms assume that input features have Gaussian distributions.

14.6.1.2. Naive Bayes

It assumes that all the features are independent of each other and have a Gaussian distribution. Below is an example of the Naive Bayes algorithm,

# multiclass_ex.py

import numpy as np
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score

# create object of class 'load_iris'
iris = load_iris()

# save features and targets from the 'iris'
features, targets = iris.data, iris.target

# select classifier
classifier = GaussianNB()

# cross-validation
scores = cross_val_score(classifier, features, targets, cv=3)
print("Cross validation scores:", scores)
print("Mean score:", np.mean(scores))
  • Below is the result for the above code,
$ python multiclass_ex.py
Cross validation scores: [ 0.92156863  0.90196078  0.97916667]
Mean score: 0.934232026144

14.6.1.3. Decision Tree Classifier

It creates a binary decision tree from the training data, choosing the splits which minimize a cost function,

# multiclass_ex.py

import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score


# create object of class 'load_iris'
iris = load_iris()

# save features and targets from the 'iris'
features, targets = iris.data, iris.target

# select classifier
classifier = DecisionTreeClassifier()

# cross-validation
scores = cross_val_score(classifier, features, targets, cv=3)
print("Cross validation scores:", scores)
print("Mean score:", np.mean(scores))
  • Below is the output for the above code,
$ python multiclass_ex.py
Cross validation scores: [ 0.98039216  0.92156863  1.]
Mean score: 0.96732026143

14.6.2. Regression algorithms

Table 14.3 shows some of the widely used regression algorithms. We already saw the example of ‘Linear regression (Chapter 3)’. Also, we saw examples of ‘K-nearest neighbor (Chapter 2)’, ‘SVM (Chapter 11)’ and ‘Decision Tree (Section 14.6.1.3)’ for classification problems; in this section we will use these algorithms for regression problems. Further, we will discuss the ‘Ridge’, ‘LASSO’ and ‘Elastic-net’ algorithms.

Table 14.3 Regression algorithms
Type        Algorithm
Linear      Linear regression, Ridge, LASSO, Elastic-net
Non-linear  K-nearest neighbor, Support vector machines (SVM), Decision Tree

14.6.2.1. Ridge regression

It is an extended version of Linear regression, where the coefficients minimize the residual sum of squares plus a penalty on the squared magnitude of the coefficients, known as the L2 norm. A short sketch of the penalty parameter ‘alpha’ is given after the output below.

# regression_ex.py

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge

N = 100 # 100 samples
x = np.linspace(-3, 3, N) # coordinates
noise_sample = np.random.RandomState(20)  # constant random value
# growing sinusoid with random fluctuation
sine_wave = x + np.sin(4*x) + noise_sample.uniform(N)

# convert features in 2D format i.e. list of list
features = x[:, np.newaxis]

# save sine wave in variable 'targets'
targets = sine_wave

# split the training and test data
train_features, test_features, train_targets, test_targets = train_test_split(
        features, targets,
        train_size=0.8,
        test_size=0.2,
        # random but same for all run, also accuracy depends on the
        # selection of data e.g. if we put 10 then accuracy will be 1.0
        # in this example
        random_state=23,
        # keep same proportion of 'target' in test and target data
        # stratify=targets  # cannot be used for continuous targets
    )

# training using 'training data'
regressor = Ridge()
regressor.fit(train_features, train_targets) # fit the model for training data

# predict the 'target' for 'test data'
prediction_test_targets = regressor.predict(test_features)
test_accuracy = regressor.score(test_features, test_targets)
print("Accuracy for test data:", test_accuracy)
  • Below is the output for the above code,
$ python regression_ex.py
Accuracy for test data: 0.82273039102
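
The strength of the L2 penalty is controlled by the ‘alpha’ parameter of ‘Ridge’ (its default value is 1.0); larger values shrink the coefficients more strongly. Below is a minimal sketch of how it can be changed in the above script; the resulting score will differ slightly from the output shown above,

# use a stronger L2 penalty in 'regression_ex.py'
regressor = Ridge(alpha=10.0) # default is alpha=1.0
regressor.fit(train_features, train_targets)
print("Accuracy for test data:", regressor.score(test_features, test_targets))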

14.6.2.2. LASSO regression

It is an extended version of Linear regression, where the coefficients minimize the residual sum of squares plus a penalty on the sum of the absolute values of the coefficients, known as the L1 norm.

# regression_ex.py

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Lasso

N = 100 # 100 samples
x = np.linspace(-3, 3, N) # coordinates
noise_sample = np.random.RandomState(20)  # constant random value
# growing sinusoid with random fluctuation
sine_wave = x + np.sin(4*x) + noise_sample.uniform(N)

# convert features in 2D format i.e. list of list
features = x[:, np.newaxis]

# save sine wave in variable 'targets'
targets = sine_wave

# split the training and test data
train_features, test_features, train_targets, test_targets = train_test_split(
        features, targets,
        train_size=0.8,
        test_size=0.2,
        # random but same for all run, also accuracy depends on the
        # selection of data e.g. if we put 10 then accuracy will be 1.0
        # in this example
        random_state=23,
        # keep same proportion of 'target' in test and target data
        # stratify=targets  # cannot be used for continuous targets
    )

# training using 'training data'
regressor = Lasso()
regressor.fit(train_features, train_targets) # fit the model for training data

# predict the 'target' for 'test data'
prediction_test_targets = regressor.predict(test_features)
test_accuracy = regressor.score(test_features, test_targets)
print("Accuracy for test data:", test_accuracy)
  • Below is the output for the above code,
$ python regression_ex.py
Accuracy for test data: 0.70974672729

14.6.2.3. Elastic-net regression

It combines both the L1 and L2 penalties,

# regression_ex.py

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import ElasticNet

N = 100 # 100 samples
x = np.linspace(-3, 3, N) # coordinates
noise_sample = np.random.RandomState(20)  # constant random value
# growing sinusoid with random fluctuation
sine_wave = x + np.sin(4*x) + noise_sample.uniform(N)

# convert features in 2D format i.e. list of list
features = x[:, np.newaxis]

# save sine wave in variable 'targets'
targets = sine_wave

# split the training and test data
train_features, test_features, train_targets, test_targets = train_test_split(
        features, targets,
        train_size=0.8,
        test_size=0.2,
        # random but same for all run, also accuracy depends on the
        # selection of data e.g. if we put 10 then accuracy will be 1.0
        # in this example
        random_state=23,
        # keep same proportion of 'target' in test and target data
        # stratify=targets  # cannot be used for continuous targets
    )

# training using 'training data'
regressor = ElasticNet()
regressor.fit(train_features, train_targets) # fit the model for training data

# predict the 'target' for 'test data'
prediction_test_targets = regressor.predict(test_features)
test_accuracy = regressor.score(test_features, test_targets)
print("Accuracy for test data:", test_accuracy)
  • Below is the output for the above code,
$ python regression_ex.py
Accuracy for test data: 0.744348295083

14.6.2.4. Support vector machines (SVM)

Note

Note that SVR is used for regression problems, whereas SVC was used for classification problems. The same is applicable for the ‘Decision tree’ and ‘K-nearest neighbor’ algorithms.

# regression_ex.py

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR

N = 100 # 100 samples
x = np.linspace(-3, 3, N) # coordinates
noise_sample = np.random.RandomState(20)  # constant random value
# growing sinusoid with random fluctuation
sine_wave = x + np.sin(4*x) + noise_sample.uniform(N)

# convert features in 2D format i.e. list of list
features = x[:, np.newaxis]

# save sine wave in variable 'targets'
targets = sine_wave

# split the training and test data
train_features, test_features, train_targets, test_targets = train_test_split(
        features, targets,
        train_size=0.8,
        test_size=0.2,
        # random but same for all run, also accuracy depends on the
        # selection of data e.g. if we put 10 then accuracy will be 1.0
        # in this example
        random_state=23,
        # keep same proportion of 'target' in test and target data
        # stratify=targets  # cannot be used for continuous targets
    )

# training using 'training data'
regressor = SVR()
regressor.fit(train_features, train_targets) # fit the model for training data

# predict the 'target' for 'test data'
prediction_test_targets = regressor.predict(test_features)
test_accuracy = regressor.score(test_features, test_targets)
print("Accuracy for test data:", test_accuracy)
  • Below is the output for the above code,
$ python regression_ex.py
Accuracy for test data: 0.961088256595

14.6.2.5. Decision tree regression

# regression_ex.py

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

N = 100 # 100 samples
x = np.linspace(-3, 3, N) # coordinates
noise_sample = np.random.RandomState(20)  # constant random value
# growing sinusoid with random fluctuation
sine_wave = x + np.sin(4*x) + noise_sample.uniform(N)

# convert features in 2D format i.e. list of list
features = x[:, np.newaxis]

# save sine wave in variable 'targets'
targets = sine_wave

# split the training and test data
train_features, test_features, train_targets, test_targets = train_test_split(
        features, targets,
        train_size=0.8,
        test_size=0.2,
        # random but same for all run, also accuracy depends on the
        # selection of data e.g. if we put 10 then accuracy will be 1.0
        # in this example
        random_state=23,
        # keep same proportion of 'target' in test and target data
        # stratify=targets  # cannot be used for continuous targets
    )

# training using 'training data'
regressor = DecisionTreeRegressor()
regressor.fit(train_features, train_targets) # fit the model for training data

# predict the 'target' for 'test data'
prediction_test_targets = regressor.predict(test_features)
test_accuracy = regressor.score(test_features, test_targets)
print("Accuracy for test data:", test_accuracy)
  • Below is the output for the above code,
$ python regression_ex.py
Accuracy for test data: 0.991442971888

14.6.2.6. K-nearest neighbor regression

# regression_ex.py

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

N = 100 # 100 samples
x = np.linspace(-3, 3, N) # coordinates
noise_sample = np.random.RandomState(20)  # constant random value
# growing sinusoid with random fluctuation
sine_wave = x + np.sin(4*x) + noise_sample.uniform(N)

# convert features in 2D format i.e. list of list
features = x[:, np.newaxis]

# save sine wave in variable 'targets'
targets = sine_wave

# split the training and test data
train_features, test_features, train_targets, test_targets = train_test_split(
        features, targets,
        train_size=0.8,
        test_size=0.2,
        # random but same for all run, also accuracy depends on the
        # selection of data e.g. if we put 10 then accuracy will be 1.0
        # in this example
        random_state=23,
        # keep same proportion of 'target' in test and target data
        # stratify=targets  # cannot be used for continuous targets
    )

# training using 'training data'
regressor = KNeighborsRegressor()
regressor.fit(train_features, train_targets) # fit the model for training data

# predict the 'target' for 'test data'
prediction_test_targets = regressor.predict(test_features)
test_accuracy = regressor.score(test_features, test_targets)
print("Accuracy for test data:", test_accuracy)
  • Below is the output for the above code,
$ python regression_ex.py
Accuracy for test data: 0.991613506388