Research VI) Cluster analysis

Cluster analysis, or clustering, is the task of grouping a set of objects in such a way that objects in the same group – that is, the cluster – are more similar to each other, in some sense or another, than to those in other groups.

By this definition alone, it’s easy to see why a single, precise definition of cluster analysis is hard to pin down. What’s certain is that clustering is a main task of the computing process called data mining, a process that discovers patterns in large data sets.

The main goal of the data mining process is to extract information from a data set and transform it into an understandable structure for further use.

Cluster analysis itself isn’t one specific algorithm, but the “general task to be solved“, so it can be achieved by various algorithms that differ in their notion of what constitutes a cluster and in how efficiently they find clusters. Clustering can also be formulated as a multi-objective optimization problem: depending on the individual data set and the intended use of the results, an appropriate clustering algorithm is selected, along with its parameter settings, such as the distance function.

Moreover, cluster analysis isn’t an automatic task, but an interactive process. As said before, defining what a cluster is is difficult, because a cluster can be defined under many different conditions depending on the situation. There is, however, at least a common denominator: a cluster is a group of data objects.

So a “clustering” is essentially a set of clusters, usually containing all the objects in the data set. Clustering can be roughly distinguished as:

  • hard clustering: each object either belongs to a cluster or not.
    Also known as non-fuzzy clustering: the data is divided into distinct clusters, and each data point belongs to exactly one cluster.
  • soft clustering: each object belongs to each cluster to a certain degree.
    Also known as fuzzy clustering: each data point can potentially belong to multiple clusters. Clusters are identified via similarity measures, such as distance, connectivity, and intensity; different similarity measures may be chosen based on the data or the application.
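The difference can be made concrete with a tiny sketch (the membership numbers below are made up, purely for illustration): a hard clustering assigns each point a single label, while a soft clustering assigns each point a degree of membership in every cluster, with the degrees for one point summing to 1.

```python
# Illustrative only: 3 data points, 2 clusters (values are made up).

# Hard clustering: each point carries exactly one cluster label.
hard_labels = [0, 0, 1]

# Soft (fuzzy) clustering: each point has a membership degree in every cluster.
soft_memberships = [
    [0.9, 0.1],   # mostly cluster 0
    [0.7, 0.3],   # leaning toward cluster 0
    [0.2, 0.8],   # mostly cluster 1
]

# A hard clustering can be recovered from a soft one by taking, for each
# point, the cluster with the highest membership degree.
hardened = [row.index(max(row)) for row in soft_memberships]
print(hardened)  # → [0, 0, 1]
```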

Clustering algorithms can be categorized based on their cluster model. Unfortunately, there’s no objectively correct algorithm for every cluster analysis: the algorithm often needs to be chosen experimentally, unless there’s a mathematical reason to prefer one algorithm over another.
To give two examples, K-means and Fuzzy C-Means are widely used clustering algorithms for hard clustering and soft clustering, respectively.

K-means clustering is a method of vector quantization, originally from signal processing, that is popular for cluster analysis in data mining. K-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, serving as a prototype of the cluster.
The most common algorithm uses an iterative refinement technique; due to its ubiquity, it is often simply called the k-means algorithm. Given an initial set of k means m_1^(1), …, m_k^(1), the algorithm proceeds by alternating between two steps:

  • Assignment step: assign each observation to the cluster whose mean has the least squared Euclidean distance (intuitively, the “nearest” mean):
    S_i^(t) = { x_p : ‖x_p − m_i^(t)‖² ≤ ‖x_p − m_j^(t)‖² for all j, 1 ≤ j ≤ k },
    where each x_p is assigned to exactly one S^(t), even if it could be assigned to two or more of them.
  • Update step: recalculate the means to be the centroids of the observations in the new clusters:
    m_i^(t+1) = (1 / |S_i^(t)|) · Σ_{x_j ∈ S_i^(t)} x_j

The algorithm has converged when the assignments no longer change. However, there is no guarantee that this algorithm finds the global optimum.
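The two alternating steps can be sketched in plain Python. This is a minimal, unoptimized illustration: the initialization is deliberately naive (the first k points become the initial means), whereas real implementations use random or k-means++ seeding.

```python
def kmeans(points, k, max_iter=100):
    """Minimal k-means sketch; points are lists of floats of equal length."""
    # Naive deterministic initialization: first k points become the means.
    means = [list(p) for p in points[:k]]
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster whose mean has the
        # least squared Euclidean distance.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda j: sum((a - b) ** 2 for a, b in zip(p, means[j])))
            clusters[nearest].append(p)
        # Update step: each mean becomes the centroid of its new cluster.
        new_means = [[sum(coord) / len(cl) for coord in zip(*cl)] if cl else means[j]
                     for j, cl in enumerate(clusters)]
        if new_means == means:  # assignments stable -> converged
            break
        means = new_means
    return means, clusters

# Two well-separated groups around (0, 0) and (5, 5).
points = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
          [5.0, 5.0], [5.1, 4.9], [4.9, 5.1]]
means, clusters = kmeans(points, 2)
print(sorted(len(c) for c in clusters))  # → [3, 3]
```

Note how the loop stops exactly as described above: once the assignments (and therefore the means) no longer change. On this toy data the algorithm recovers the two groups, but a bad initialization can still leave it in a local optimum.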

The FCM algorithm, also known as Fuzzy ISODATA, is one of the most frequently used methods in pattern recognition. It works as follows:

  • Choose a number of clusters.
  • Assign coefficients randomly to each data point for being in the clusters.
  • Repeat until the algorithm has converged (that is, the change in the coefficients between two iterations is no more than ε, the given sensitivity threshold):
    • Compute the centroid for each cluster. The centroid of a cluster is the mean of all points, weighted by their degree of belonging to the cluster:
      c_k = (Σ_x w_k(x)^m · x) / (Σ_x w_k(x)^m)
    • For each data point, compute its coefficients of being in the clusters.
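The steps above can be sketched in plain Python. This is a minimal illustration: the membership update w_k(x) = 1 / Σ_j (‖x − c_k‖ / ‖x − c_j‖)^(2/(m−1)) is the standard FCM coefficient formula, while the deterministic starting memberships are an assumption made only to keep the example reproducible (real implementations assign the initial coefficients randomly, as described above).

```python
import math

def fuzzy_c_means(points, c, m=2.0, eps=1e-5, max_iter=100):
    """Minimal Fuzzy C-Means sketch; m is the fuzzifier (m > 1)."""
    n, dim = len(points), len(points[0])
    # Deterministic starting memberships for reproducibility (an assumption;
    # real implementations initialize the coefficients randomly).
    w = [[1.0 if i % c == k else 0.0 for k in range(c)] for i in range(n)]

    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    for _ in range(max_iter):
        # Centroids: means of all points, weighted by membership^m.
        centers = []
        for k in range(c):
            denom = sum(w[i][k] ** m for i in range(n))
            centers.append([sum(w[i][k] ** m * points[i][d] for i in range(n)) / denom
                            for d in range(dim)])
        # Membership update: w_ik = 1 / sum_j (d_ik / d_ij)^(2 / (m - 1)).
        new_w = []
        for i in range(n):
            d = [max(dist(points[i], centers[k]), 1e-12) for k in range(c)]
            new_w.append([1.0 / sum((d[k] / d[j]) ** (2.0 / (m - 1.0)) for j in range(c))
                          for k in range(c)])
        # Converged when no coefficient changed by more than eps.
        converged = max(abs(new_w[i][k] - w[i][k])
                        for i in range(n) for k in range(c)) <= eps
        w = new_w
        if converged:
            break
    return centers, w

points = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
          [5.0, 5.0], [5.1, 4.9], [4.9, 5.1]]
centers, w = fuzzy_c_means(points, 2)
```

On this toy data the centroids settle near the two groups, and each membership row sums to 1, which is exactly what distinguishes the soft result from a hard label assignment.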

It shouldn’t be forgotten, however, that an algorithm designed for one kind of cluster model will generally fail on a data set containing a radically different kind of model.

For further interesting aspects of hard clustering and soft clustering, here is the link to a video in which the two are compared to each other.

Here are also some interesting cluster analysis lectures.
