6. Clustering

6.1. Introduction

In this chapter, we will look at some examples of clustering. Let's begin with a simple example. In Listing 6.1, two lists, i.e. 'x' and 'y', are plotted using a 'scatter' plot. We can see that the data can be divided into three clusters, as shown in Fig. 6.1.

Note

In Fig. 6.1, it is easy to see the clusters because the number of samples is very small; but the clusters cannot be visualized so easily if we have a huge number of samples, as we will see later in this chapter. In those cases, the machine learning approach can be quite useful.

Listing 6.1 Clusters
# cluster_ex.py

import matplotlib.pyplot as plt

x = [-3, 25, -2, 7, -1, 9]
y = [11, 66, 13, 25, 12, 27]
plt.scatter(x, y)
plt.show()

Fig. 6.1 Clusters

6.2. KMeans

Now, we will cluster our data using the "KMeans" algorithm.

  • Similar to previous chapters, first we need to transform the data into a 2-dimensional format, so that it can be used by the scikit-learn library. In the code below, the lists 'x' and 'y' are merged together, so that a 'list of lists', i.e. a 2-dimensional array, is created (an alternative construction is sketched after the output below),
# cluster_ex.py

import matplotlib.pyplot as plt
import numpy as np

x = [-3, 25, -2, 7, -1, 9]
y = [11, 66, 13, 25, 12, 27]
# plt.scatter(x, y)
# plt.show()

# convert list into array based on columns
data = np.column_stack((x, y))
print(data)
$ python cluster_ex.py
[[-3 11]
 [25 66]
 [-2 13]
 [ 7 25]
 [-1 12]
 [ 9 27]]
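
As a side note, the same 2-dimensional array can also be built without 'np.column_stack'; the short sketch below uses 'zip' instead (this variant and the file name are only an illustration, and are not used in the chapter's listings),

# column_stack_alt_ex.py

import numpy as np

x = [-3, 25, -2, 7, -1, 9]
y = [11, 66, 13, 25, 12, 27]

# same (6, 2) array as np.column_stack((x, y))
data = np.array(list(zip(x, y)))
print(data.shape)  # (6, 2)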
  • Now, we can apply the "KMeans" algorithm to the transformed data, as shown in Listing 6.2. The clusters generated by the algorithm are shown in Fig. 6.2.

Note

  • Centroids are the locations of the mean points generated by the KMeans algorithm, and they can be obtained using 'cluster_centers_'.
  • Also, each point is assigned a label, which can be obtained using 'labels_'. Note that, once we have the labels, we can use supervised learning for further analysis (a short sketch of this idea is given after Fig. 6.2).
  • The number of samples should be higher than the number of clusters. For example, we currently have 6 samples; if we use "n_clusters=7", then an error will be generated.
  • We may need to increase the value of "n_clusters" to separate outliers from the clusters. For example, in the current dataset, the point [25, 66] can be seen as an outlier, i.e. it may be present in the dataset due to measurement error or noise. Since it is present in the dataset, it will affect the final locations of the clusters. In other words, if we set "n_clusters=2", then one cluster will locate the point [25, 66], and the second cluster will take the mean of the rest of the points, which may not be desirable; therefore, we need to choose the value of "n_clusters" according to the dataset (a short sketch of this "n_clusters=2" case follows this note).
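
The below code is a minimal sketch (not part of Listing 6.2) which re-runs KMeans with "n_clusters=2" on the same data; based on the above discussion, we would expect one centroid to sit on the outlier [25, 66] itself and the other near the mean of the remaining five points, i.e. roughly [2, 17.6] (the file name is only for illustration),

# cluster_two_ex.py

import numpy as np
from sklearn.cluster import KMeans

x = [-3, 25, -2, 7, -1, 9]
y = [11, 66, 13, 25, 12, 27]

# convert list into array based on columns
data = np.column_stack((x, y))

# only two clusters: the outlier [25, 66] will dominate one of them
model = KMeans(n_clusters=2)
model.fit(data)

print("Centroids:\n", model.cluster_centers_)
print("Labels:", model.labels_)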
Listing 6.2 Clusters generated by KMeans algorithm
# cluster_ex.py

import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import KMeans


x = [-3, 25, -2, 7, -1, 9]
y = [11, 66, 13, 25, 12, 27]
# plt.scatter(x, y)
# plt.show()

# convert list into array based on columns
data = np.column_stack((x, y))
# print(data)

model = KMeans(n_clusters=3) # separate data in 3 clusters
model.fit(data)
model.predict(data)
# model.fit_predict(data) # combine above two steps in one

# locations of the means generated by the KMeans
centroids = model.cluster_centers_
print("Centroids:\n", centroids)

# each sample is labelled as well
targets = model.labels_
print("Targets or Lables:\n", targets)

# plot the data
plt.scatter(x, y)
plt.scatter(x=centroids[:, 0], y=centroids[:, 1], marker='x')
plt.show()
$ python cluster_ex.py

Centroids:
 [[ -2.  12.]
 [ 25.  66.]
 [  8.  26.]]

Targets or Labels:
 [0 1 0 2 0 2]

Fig. 6.2 Clusters generated by KMeans algorithm
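
As mentioned in the note above, the labels produced by KMeans can be used as targets for supervised learning. The below code is a minimal sketch of this idea; the choice of classifier i.e. 'KNeighborsClassifier', the new test points and the file name are our own illustration, not part of the chapter's listings,

# cluster_labels_knn_ex.py

import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier

x = [-3, 25, -2, 7, -1, 9]
y = [11, 66, 13, 25, 12, 27]
data = np.column_stack((x, y))

# cluster the data and keep the generated labels
# (the label numbering may differ between runs)
model = KMeans(n_clusters=3)
targets = model.fit_predict(data)

# train a supervised classifier on the generated labels
clf = KNeighborsClassifier(n_neighbors=1)
clf.fit(data, targets)

# predict the cluster of new (unseen) samples
new_samples = np.array([[0, 12], [8, 26]])
print("Predicted labels:", clf.predict(new_samples))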

Tip

The KMeans algorithm should be used when the number of samples is less than 10000. If there are more than 10000 samples, then the MiniBatchKMeans algorithm should be used, which converges faster than KMeans, but the quality of the results may be reduced.
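
The below code is a minimal sketch of 'MiniBatchKMeans' usage; the synthetic dataset generated with 'make_blobs', the 'batch_size' value and the file name are our own illustration, not part of the chapter's listings,

# minibatch_kmeans_ex.py

from sklearn.cluster import MiniBatchKMeans
from sklearn.datasets import make_blobs

# synthetic dataset: 20000 samples grouped around 3 centers
data, _ = make_blobs(n_samples=20000, centers=3, random_state=0)

# MiniBatchKMeans fits the model on small random batches of the data
model = MiniBatchKMeans(n_clusters=3, batch_size=100)
model.fit(data)

print("Centroids:\n", model.cluster_centers_)
print("First 10 labels:", model.labels_[:10])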