**Vahid Mirjalili, Data Mining Researcher**

Clustering algorithms has a wide range of applications. Several methods are proposed so far, and each method has drawbacks, for example in k-means clustering the knowledge of number of clusters is crucial, yet it doesn't provide a robust solution to datasets with clusters that have different sizes or densities.

In this article, an implementation of a novel clustering algorithm is provided, and applied to a synthetic dataset that contains multiple topologies. See the reference publication by Alex Rodriguez and Alessandro Laio.

The most important feature of this algorithm is that it can automatically find the number of clusters.

In the following sections, first we provide a clustsred dataset, then illustrate how the algorithm works. At the end, we conclude with comparison of this algorithm with other clustering algorithm such as k-means and DBScan.

A syntethic dataset is provided that can be downloded here. This dataset contains 2000 points of 2 features and a ground truth cluster. The dataset is generated by random normal and uniform numbers. Several noise points were also added. Total if 6 clusters exist.

In [1]:

```
%load_ext watermark
%watermark -a 'Vahid Mirjalili' -d -p scikit-learn,numpy,matplotlib,pyprind,prettytable -v
```

In [2]:

```
## Implementation of Density Peak Clustering
### Vahid Mirjalili
import numpy as np
import urllib
import csv
# download the data
#url = 'https://raw.githubusercontent.com/mirjalil/DataScience/master/data/synthetic.6clust.csv'
#content = urllib.urlopen(url)
#content = content.read()
# save the datafile locally
#with open('data/synthetic_data.csv', 'wb') as out:
# out.write(content)
# read data into numpy array
d = np.loadtxt("../data/synthetic_data.csv", delimiter=',')
y = d[:,2]
print(np.unique(y))
```

In [3]:

```
from matplotlib import pyplot as plt
%matplotlib inline
plt.figure(figsize=(10,8))
for clustId,color in zip(range(0,7),('grey', 'blue', 'red', 'green', 'purple', 'orange', 'brown')):
plt.scatter(x=d[:,0][d[:,2] == clustId],
y=d[:,1][d[:,2] == clustId],
color=color,
marker='o',
alpha=0.6
)
plt.title('Clustered Dataset', size=20)
plt.xlabel('$x_1$', size=25)
plt.ylabel('$x_2$', size=25)
plt.tick_params(axis='both', which='major', labelsize=18)
plt.xlim(xmin=np.amin(d[:,0])-0.5, xmax=np.amax(d[:,0])+0.5)
plt.ylim(ymin=np.amin(d[:,1])-0.5, ymax=np.amax(d[:,1])+0.5)
plt.show()
```

In [4]:

```
import pyprind
def euclidean_dist(p, q):
return (np.sqrt((p[0]-q[0])**2 + (p[1]-q[1])**2))
def cal_density(d, dc=0.1):
n = d.shape[0]
den_arr = np.zeros(n, dtype=np.int)
for i in range(n):
for j in range(i+1, n):
if euclidean_dist(d[i,:], d[j,:]) < dc:
den_arr[i] += 1
den_arr[j] += 1
return (den_arr)
den_arr = cal_density(d[:,0:2], 0.5)
print(max(den_arr))
```

In [5]:

```
def cal_minDist2Peaks(d, den):
n = d.shape[0]
mdist2peaks = np.repeat(999, n)
max_pdist = 0 # to store the maximum pairwise distance
for i in range(n):
mdist_i = mdist2peaks[i]
for j in range(i+1, n):
dist_ij = euclidean_dist(d[i,0:2], d[j,0:2])
max_pdist = max(max_pdist, dist_ij)
if den_arr[i] < den_arr[j]:
mdist_i = min(mdist_i, dist_ij)
elif den_arr[j] <= den_arr[i]:
mdist2peaks[j] = min(mdist2peaks[j], dist_ij)
mdist2peaks[i] = mdist_i
# Update the value for the point with highest density
max_den_points = np.argwhere(mdist2peaks == 999)
print(max_den_points)
mdist2peaks[max_den_points] = max_pdist
return (mdist2peaks)
```

In [6]:

```
mdist2peaks = cal_minDist2Peaks(d, den_arr)
def plot_decisionGraph(den_arr, mdist2peaks, thresh=None):
plt.figure(figsize=(10,8))
if thresh is not None:
centroids = np.argwhere((mdist2peaks > thresh) & (den_arr>1)).flatten()
noncenter_points = np.argwhere((mdist2peaks < thresh) )
else:
centroids = None
noncenter_points = np.arange(den_arr.shape[0])
plt.scatter(x=den_arr[noncenter_points],
y=mdist2peaks[noncenter_points],
color='red',
marker='o',
alpha=0.5,
s=50
)
if thresh is not None:
plt.scatter(x=den_arr[centroids],
y=mdist2peaks[centroids],
color='blue',
marker='o',
alpha=0.6,
s=140
)
plt.title('Decision Graph', size=20)
plt.xlabel(r'$\rho$', size=25)
plt.ylabel(r'$\delta$', size=25)
plt.ylim(ymin=min(mdist2peaks-0.5), ymax=max(mdist2peaks+0.5))
plt.tick_params(axis='both', which='major', labelsize=18)
plt.show()
plot_decisionGraph(den_arr, mdist2peaks, 1.7)
```

As shown in the previous plot, cluster centroids stand out since they have larger distances to density peaks. These points are shown in blue, and can be picked easily:

In [7]:

```
## points with highest density and distance to other high-density points
thresh = 1.0
centroids = np.argwhere((mdist2peaks > thresh) & (den_arr>1)).flatten()
print(centroids.reshape(1, centroids.shape[0]))
```

In [8]:

```
%matplotlib inline
plt.figure(figsize=(10,8))
d_centers = d[centroids,:]
for clustId,color in zip(range(0,7),('grey', 'blue', 'red', 'green', 'purple', 'orange', 'brown')):
plt.scatter(x=d[:,0][d[:,2] == clustId],
y=d[:,1][d[:,2] == clustId],
color='grey',
marker='o',
alpha=0.4
)
# plot the cluster centroids
plt.scatter(x=d_centers[:,0][d_centers[:,2] == clustId],
y=d_centers[:,1][d_centers[:,2] == clustId],
color='purple',
marker='*',
s=400,
alpha=1.0
)
plt.title('Cluster Centroids', size=20)
plt.xlabel('$x_1$', size=25)
plt.ylabel('$x_2$', size=25)
plt.tick_params(axis='both', which='major', labelsize=18)
plt.xlim(xmin=0, xmax=10)
plt.ylim(ymin=0, ymax=10)
plt.show()
```

After the cluster centroids are found, the remaining points should be assigned to their corresponding centroid. A naive way is to assign each point simply to the nearest centroid. However, this method is prone to large errors if data has some weird underlying topology.

As a result, the least error prone method is to assign points to the cluster of their nearest neigbour, as given in algorithm below:

**Algorithm:**

**1. Assign centroids to a unique cluster, and remove them from the set**

**Repeat until all the points are assigned to a cluster:**

**2. Find the maximum density in the remaining set of points, and remove it from the set**

**3. Assign it to the same cluster as its nearest neighbor in the set of already assigned points**

In [12]:

```
def assign_cluster(df, den_arr, centroids):
""" Assign points to clusters
"""
nsize = den_arr.shape[0]
#print (nsize)
cmemb = np.ndarray(shape=(nsize,2), dtype='int')
cmemb[:,:] = -1
ncm = 0
for i,cix in enumerate(centroids):
cmemb[i,0] = cix # centroid index
cmemb[i,1] = i # cluster index
ncm += 1
da = np.delete(den_arr, centroids)
inxsort = np.argsort(da)
for i in range(da.shape[0]-1, -1, -1):
ix = inxsort[i]
dist = np.repeat(999.9, ncm)
for j in range(ncm):
dist[j] = euclidean_dist(df[ix], df[cmemb[j,0]])
#print(j, ix, cmemb[j,0], dist[j])
nearest_nieghb = np.argmin(dist)
cmemb[ncm,0] = ix
cmemb[ncm,1] = cmemb[nearest_nieghb, 1]
ncm += 1
return(cmemb)
```

In [10]:

```
clust_membership = assign_cluster(d, den_arr, centroids)
```

In [11]:

```
plt.figure(figsize=(10,8))
for clustId,color in zip(range(0,6),('blue', 'red', 'green', 'purple', 'orange', 'brown')):
cset = clust_membership[clust_membership[:,1] == clustId,0]
plt.scatter(x=d[cset,0],
y=d[cset,1],
color=color,
marker='o',
alpha=0.6
)
plt.title('Clustered Dataset by Density-Peak Algorithm', size=20)
plt.xlabel('$x_1$', size=25)
plt.ylabel('$x_2$', size=25)
plt.tick_params(axis='both', which='major', labelsize=18)
plt.xlim(xmin=np.amin(d[:,0])-0.5, xmax=np.amax(d[:,0])+0.5)
plt.ylim(ymin=np.amin(d[:,1])-0.5, ymax=np.amax(d[:,1])+0.5)
plt.show()
```

The new clustering algorithm presented here, is capable of clustering dataset that contain

- noise points
- groups of different shape/topology/density

It is important to note that this method needs minimum number of parameters, which is eps for finding the density of each point. The number of clusters is found by analyzing the density peaks and minimum distances to density peaks. The application of this clustering algorithm is not limited to numeric data types, as it could also be used for text/string data types provided that a proper distance function exist.