Constrained Clustering — An Introduction

credit: https://www.data-science-hub.com/

As the name suggests, Constrained Clustering is the study and the development of clustering algorithms which try to integrate previous knowledge of the data in the clustering process.

I decided to write this article because I think that the field of Constrained Clustering is not very known, it is not mainstream as many other fields of Machine Learning but it can be useful in some situations. So I think that it is important to at least know about its existence.

Introduction

In constrained clustering algorithms, the process is not totally unsupervised, in addition to the original dataset D, we provide some additional information about the desired clusters, such as, but not limited to, must link and cannot link constraints.

In this article I will focus on must link and cannot link constraints but some constrained clustering algorithms allow to specify other type of constraints, such as size constraints.

  • (a,b) ∈ must-link ⊆ DxD ↔ the instance a and b must be in the same cluster
  • (a,b) ∈ cannot-link ⊆ DxD ↔ the instance a and b must not be in the same cluster
  • The must-link is a symmetric, reflexive and transitive relation (it is an equivalence relation)
  • The cannot-link relation is a symmetric relation
  • If (a,b) ∈ must-link ∧ (b,c) ∈ cannot-link → (a,c) ∈ cannot-link

These type of constraints are also very common in practice, for example you can extract them from a labeled subset of the data, they naturally provide a combination of must link and cannot link constraints.

A lot of algorithms have been proposed to consider these types of constraints but in this article I will focus on the two that I think are the most known: COP-KMeans and PCK-Means. These algorithms, as the name suggests, are variation of the well known K-Means clustering algorithm, the only differences are in one of the step of the original K-Means algorithm.

Constrained K-Means (COP-KMeans)

The main problems of the COP-KMeans algorithm is that it could fail if the constrained conditions can not be satisfied during the assignment step (1), it is also very sensible to noises in the constraints, the misspecification of some constraints highly influence the resulting clusters.

Another problem with COP-KMeans is that the assignment step (1) depends on the order in which the istances are visited. If there are not constraints, COP-KMeans is equivalent to the standard K-Means.

Pairwise Constrained K-Means (PC-KMeans)

If we have some information of how much the single constraints are important, we can differentiate the value of the penalization terms w_{i,j} and \hat{w}_{i,j}, but generally these information are not available and all of them are set to the same value.

As in the case of the COP-KMeans algorithm, in the PC-KMeans algorithm, the assignment step (1) depends on the order in which the istances are visited. Another problem is that we have more parameters to be chosen, the penalization terms.

If the penalization terms are set to zero, PC-KMeans is equivalent to the standard K-Means.

References

Basu, S., Banerjee, A., & Mooney, R. J. (2004, April). Active Semi-Supervision for Pairwise Constrained Clustering. In Proceedings of the 2004 SIAM international conference on data mining (pp. 333–344). Society for Industrial and Applied Mathematics.

Acknowledgements

Federico Ricciuti, Data Scientist

Feel free to add me on Linkedin:
linkedin.com/in/federico-ricciuti-b490ab59

In collaboration with:

www.data-science-hub.com

Data Scientist