10. Clustering with dimensionality reduction

10.1. Introduction

In previous chapters, we saw examples of clustering (Chapter 6), dimensionality reduction (Chapter 7 and Chapter 8), and preprocessing (Chapter 8). Further, in Chapter 8, the performance of the dimensionality reduction technique (i.e. PCA) improved significantly when the data was preprocessed.

Remember, in Chapter 7 we used the PCA model to reduce the number of features to 2, so that the data could be shown in a 2D plot, which is easy to visualize. In this chapter, we will combine these three techniques so that we can get more information from the scatter plot.

Note

In this chapter, we will use the ‘Wholesale customers’ dataset, which is available at the UCI Machine Learning Repository.

Our aim is to cluster the data so that we can see which products customers buy together. For example, if a person goes to a shop to buy groceries, then it is quite likely that they will buy milk as well, so we can put the milk near the grocery items; similarly, it is quite unlikely that the same person will buy fresh vegetables at the same time.

If we can predict such customer behavior, then we can arrange the shop accordingly, which should increase the sales of the items. In this chapter, we will do the same.

10.2. Read and clean the data

  • First, read the dataset and check whether any column has ‘Null’ values (such columns would need to be dropped),
# whole_sale.py

import pandas as pd

df = pd.read_csv('data/Wholesale customers data.csv')
print(df.isnull().sum()) # print the sum of null values
  • The output of the above code is shown below. Note that there are no ‘Null’ values, therefore we do not need to drop anything.
$ python whole_sale.py
Channel             0
Region              0
Fresh               0
Milk                0
Grocery             0
Frozen              0
Detergents_Paper    0
Delicatessen        0
dtype: int64
  • Next, since our aim is to find the buying patterns of the customers, we do not need the columns ‘Channel’ and ‘Region’ for this analysis. Hence we will drop these two columns,
# whole_sale.py

import pandas as pd

df = pd.read_csv('data/Wholesale customers data.csv')
print(df.isnull().sum()) # print the sum of null values

df = df.drop(labels=['Channel', 'Region'], axis=1)
# print(df.head())
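
The dataset above has no ‘Null’ values, so nothing is dropped. For a dataset that does contain them, the drop step could look like the minimal sketch below. The small ‘toy’ DataFrame and its values are made up for illustration and are not taken from the Wholesale customers data.

```python
import numpy as np
import pandas as pd

# Made-up frame: the column names mirror the dataset, the values do not.
toy = pd.DataFrame({
    'Fresh': [12669, 7057, 6353],
    'Milk': [9656, np.nan, 8808],   # one missing value, added on purpose
    'Grocery': [7561, 9568, 7684],
})

# count the null values in each column
null_counts = toy.isnull().sum()
print(null_counts)

# drop every column that contains at least one null value
cleaned = toy.dropna(axis=1)
print(list(cleaned.columns))
```

Dropping whole columns is only one option; ‘dropna(axis=0)’ would drop the affected rows instead, which may be preferable when only a few samples are incomplete.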

10.3. Clustering using KMeans

  • Now perform the clustering as shown below. Note that ‘Normalizer()’ is used at Line 14 for the preprocessing; it rescales each sample (row) to unit norm. We can try other preprocessing methods as well and compare the outputs.

Note

After completing the chapter, try the following as well and see the outputs,

  • Use different preprocessing methods, e.g. ‘MaxAbsScaler’ and ‘StandardScaler’, and see how the results change.
  • Use different values of n_clusters, e.g. 2, 3 and 4.
# whole_sale.py

import pandas as pd
from sklearn import preprocessing
from sklearn.cluster import KMeans

df = pd.read_csv('data/Wholesale customers data.csv')
# print(df.isnull().sum()) # print the sum of null values

df = df.drop(labels=['Channel', 'Region'], axis=1)
# print(df.head())

# preprocessing
T = preprocessing.Normalizer().fit_transform(df)

# change n_clusters to 2, 3 and 4 etc. to see the output patterns
n_clusters = 3 # number of clusters

# Clustering using KMeans
kmean_model = KMeans(n_clusters=n_clusters)
kmean_model.fit(T)
centroids, labels = kmean_model.cluster_centers_, kmean_model.labels_
# print(centroids)
# print(labels)
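
The experiments suggested in the Note above (different preprocessing methods and different values of n_clusters) can be run in one loop. The following is a rough sketch under some assumptions: it uses synthetic data in place of the CSV file so that it runs on its own, and it uses the silhouette score as a numeric summary, which the chapter itself does not use. Scores computed after different scalers live in differently scaled spaces, so treat the comparison as a rough guide only.

```python
import numpy as np
from sklearn import preprocessing
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the six spending columns (assumption: not the real data)
rng = np.random.default_rng(0)
X = rng.lognormal(mean=8.0, sigma=1.0, size=(200, 6))

# Normalizer works per row (sample); the two scalers work per column (feature)
scalers = {
    'Normalizer': preprocessing.Normalizer(),
    'MaxAbsScaler': preprocessing.MaxAbsScaler(),
    'StandardScaler': preprocessing.StandardScaler(),
}

scores = {}
for name, scaler in scalers.items():
    T = scaler.fit_transform(X)
    for k in (2, 3, 4):
        # fixed random_state so that repeated runs give the same clusters
        model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(T)
        scores[(name, k)] = silhouette_score(T, model.labels_)

for (name, k), score in scores.items():
    print(f'{name:15s} n_clusters={k}  silhouette={score:.3f}')
```

To try this on the real data, replace ‘X’ with the DataFrame obtained after dropping ‘Channel’ and ‘Region’.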

10.4. Dimensionality reduction

Now, we will perform the dimensionality reduction using PCA. We will reduce the dimensions to 2.

Important

  • Here, we perform the clustering first and the dimensionality reduction afterwards, because this example has only a few features.
  • If we have a very large number of features, then it is better to perform the dimensionality reduction first and then use the clustering algorithm, e.g. KMeans.
# whole_sale.py

import pandas as pd
from sklearn import preprocessing
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

df = pd.read_csv('data/Wholesale customers data.csv')
# print(df.isnull().sum()) # print the sum of null values

df = df.drop(labels=['Channel', 'Region'], axis=1)
# print(df.head())

# preprocessing
T = preprocessing.Normalizer().fit_transform(df)

# change n_clusters to 2, 3 and 4 etc. to see the output patterns
n_clusters = 3 # number of clusters

# Clustering using KMeans
kmean_model = KMeans(n_clusters=n_clusters)
kmean_model.fit(T)
centroids, labels = kmean_model.cluster_centers_, kmean_model.labels_
# print(centroids)
# print(labels)

# Dimensionality reduction to 2
pca_model = PCA(n_components=2)
pca_model.fit(T) # fit the model
T = pca_model.transform(T) # transform the normalized data
# transform the centroids found by KMeans
centroid_pca = pca_model.transform(centroids)
# print(centroid_pca)
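
For the opposite order mentioned in the Important note (dimensionality reduction first, clustering afterwards), a minimal sketch is shown below. It is an illustration only: the random 50-feature data, the choice of 5 components and the use of ‘make_pipeline’ are assumptions made for this example and are not part of whole_sale.py.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer

# Made-up data with many features: 300 samples, 50 features
rng = np.random.default_rng(0)
X = rng.random((300, 50))

# preprocessing -> dimensionality reduction -> clustering
pipeline = make_pipeline(
    Normalizer(),
    PCA(n_components=5),
    KMeans(n_clusters=3, n_init=10, random_state=0),
)
labels = pipeline.fit_predict(X)

# the centroids now live in the reduced 5-dimensional space
centroids = pipeline.named_steps['kmeans'].cluster_centers_
print(labels.shape)
print(centroids.shape)
```

Because KMeans runs on the reduced data here, the centroids are already expressed in PCA coordinates and do not need a separate ‘transform’ call.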

10.5. Plot the results

Finally, plot the results as shown below. The scatter plot is shown in Fig. 10.1.

  • Lines 36-39 assign a color to each sample according to its cluster label, which is generated by KMeans at Line 24.
  • Lines 41-45 plot the PCA-transformed samples as a scatter plot. Since KMeans generates 3 clusters and each point is colored by its cluster label, the plot shows 3 colors.
  • Lines 47-51 plot the centroids generated by KMeans.
  • Lines 53-66 plot the feature names along with the arrows.

Important

  • The arrows are the projections of each feature onto the principal component axes. The length of an arrow indicates how strongly that feature contributes to the two principal components. For example, ‘Frozen’ and ‘Fresh’ contribute more than the other features.
  • From Fig. 10.1 we can conclude that ‘Fresh’ items, such as fruits and vegetables, should be placed separately, whereas ‘Grocery’, ‘Detergents_Paper’ and ‘Milk’ should be placed close to each other.
# whole_sale.py

import pandas as pd
import matplotlib.pyplot as plt
from sklearn import preprocessing
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

df = pd.read_csv('data/Wholesale customers data.csv')
# print(df.isnull().sum()) # print the sum of null values

df = df.drop(labels=['Channel', 'Region'], axis=1)
# print(df.head())

# preprocessing
T = preprocessing.Normalizer().fit_transform(df)

# change n_clusters to 2, 3 and 4 etc. to see the output patterns
n_clusters = 3 # number of clusters

# Clustering using KMeans
kmean_model = KMeans(n_clusters=n_clusters)
kmean_model.fit(T)
centroids, labels = kmean_model.cluster_centers_, kmean_model.labels_
# print(centroids)
# print(labels)

# Dimensionality reduction to 2
pca_model = PCA(n_components=2)
pca_model.fit(T) # fit the model
T = pca_model.transform(T) # transform the normalized data
# transform the centroids found by KMeans
centroid_pca = pca_model.transform(centroids)
# print(centroid_pca)

# colors for plotting
colors = ['blue', 'red', 'green', 'orange', 'black', 'brown']
# assign a color to each sample according to its cluster label
features_colors = [colors[label] for label in labels]

# plot the PCA components
plt.scatter(T[:, 0], T[:, 1],
            c=features_colors, marker='o',
            alpha=0.4
        )

# plot the centroids (one color per cluster, so slice 'colors' to n_clusters)
plt.scatter(centroid_pca[:, 0], centroid_pca[:, 1],
            marker='x', s=100,
            linewidths=3, c=colors[:n_clusters]
        )

# store the values of PCA component in variable: for easy writing
xvector = pca_model.components_[0] * max(T[:,0])
yvector = pca_model.components_[1] * max(T[:,1])
columns = df.columns

# plot the 'name of individual features' along with vector length
for i in range(len(columns)):
    # plot arrows
    plt.arrow(0, 0, xvector[i], yvector[i],
                color='b', width=0.0005,
                head_width=0.02, alpha=0.75
            )
    # plot name of features
    plt.text(xvector[i], yvector[i], list(columns)[i], color='b', alpha=0.75)

plt.show()

Fig. 10.1 Scatter plot for ‘Wholesale dataset’
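
The arrows in Fig. 10.1 are built from ‘pca_model.components_’. To read the same information as numbers instead of from the plot, the components can be put in a table, as in the sketch below. It uses synthetic data with the dataset's column names so that it runs without the CSV file; the values it prints are therefore not the ones behind Fig. 10.1. The ‘length’ column is a helper added for this example, and it is computed from the unscaled components, whereas the plot scales each axis by the largest PCA coordinate.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import Normalizer

columns = ['Fresh', 'Milk', 'Grocery', 'Frozen',
           'Detergents_Paper', 'Delicatessen']

# Synthetic stand-in for the spending columns (assumption: not the real data)
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.lognormal(8.0, 1.0, size=(200, 6)), columns=columns)

T = Normalizer().fit_transform(df)
pca_model = PCA(n_components=2).fit(T)

# one row per feature, one column per principal component
loadings = pd.DataFrame(pca_model.components_.T,
                        index=columns, columns=['PC1', 'PC2'])
# unscaled arrow length: how strongly the feature loads on the 2D plane
loadings['length'] = np.sqrt(loadings['PC1']**2 + loadings['PC2']**2)
print(loadings.sort_values('length', ascending=False))

# share of the total variance kept by each of the two components
print(pca_model.explained_variance_ratio_)
```

Features at the top of the sorted table correspond to the longest arrows, and features whose ‘PC1’ and ‘PC2’ values have similar signs and sizes point in similar directions in the plot.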