Constrained Clustering — An Introduction

Ricciuti Federico
Mar 27, 2021

As the name suggests, Constrained Clustering is the study and development of clustering algorithms that integrate prior knowledge about the data into the clustering process.

I decided to write this article because the field of Constrained Clustering is not very well known: it is not as mainstream as many other areas of Machine Learning, but it can be useful in some situations, so it is worth at least knowing that it exists.

Introduction

Generally speaking, in standard clustering algorithms we try to divide the original dataset into non-intersecting groups (clusters) such that the distances between instances of the same group are smaller than the distances to instances of other groups.

In constrained clustering algorithms, the process is not totally unsupervised: in addition to the original dataset D, we provide some extra information about the desired clusters, such as, but not limited to, must-link and cannot-link constraints.

In this article I will focus on must-link and cannot-link constraints, but some constrained clustering algorithms allow other types of constraints to be specified, such as size constraints.

  • (a,b) ∈ must-link ⊆ D×D ↔ the instances a and b must be in the same cluster
  • (a,b) ∈ cannot-link ⊆ D×D ↔ the instances a and b must not be in the same cluster
  • The must-link relation is symmetric, reflexive and transitive (it is an equivalence relation)
  • The cannot-link relation is symmetric
  • If (a,b) ∈ must-link ∧ (b,c) ∈ cannot-link → (a,c) ∈ cannot-link, so constraints can be propagated (see the sketch below)
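
The last three properties can be used to propagate constraints before (or during) clustering. Below is a minimal sketch in plain Python (the helper name propagate_constraints is made up for illustration): must-link pairs are merged into components by transitive closure, and cannot-link pairs are entailed across those components.

from itertools import combinations

def propagate_constraints(must_link, cannot_link):
    # Build must-link components by transitive closure (union of overlapping sets).
    components = []
    for a, b in must_link:
        merged = {a, b}
        rest = []
        for comp in components:
            if comp & merged:
                merged |= comp
            else:
                rest.append(comp)
        components = rest + [merged]

    def component_of(x):
        for comp in components:
            if x in comp:
                return comp
        return {x}

    # Entail cannot-links: if a and b cannot link, every point must-linked to a
    # cannot link with every point must-linked to b.
    entailed_cl = set()
    for a, b in cannot_link:
        ca, cb = component_of(a), component_of(b)
        if ca & cb:
            raise ValueError("Inconsistent constraints between %s and %s" % (a, b))
        for x in ca:
            for y in cb:
                entailed_cl.add((x, y))

    full_ml = {(x, y) for comp in components for x, y in combinations(comp, 2)}
    return full_ml, entailed_cl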

These types of constraints are also very common in practice: for example, you can extract them from a labeled subset of the data, which naturally provides a combination of must-link and cannot-link constraints.
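
As a hedged sketch of that idea, the snippet below derives the two sets of pairs from a small labeled subset (the indices and labels are invented for the example): instances that share a label become must-link pairs, instances with different labels become cannot-link pairs.

from itertools import combinations

labeled = {0: "A", 3: "A", 7: "B", 12: "B", 20: "C"}  # instance index -> known class

must_link, cannot_link = set(), set()
for i, j in combinations(sorted(labeled), 2):
    if labeled[i] == labeled[j]:
        must_link.add((i, j))
    else:
        cannot_link.add((i, j))

print(must_link)    # {(0, 3), (7, 12)}
print(cannot_link)  # pairs with different labels, e.g. (0, 7), (3, 20), ...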

Many algorithms have been proposed to handle these types of constraints, but in this article I will focus on the two that I think are the best known: COP-KMeans and PC-KMeans. As their names suggest, both are variations of the well-known K-Means clustering algorithm; the only difference lies in one of the steps of the original K-Means algorithm, the assignment step.

Constrained K-Means (COP-KMeans)

COP-KMeans treats the constraints as hard constraints: they must be respected. If the algorithm does not fail, the resulting clusters will respect the user-defined constraints.

The main problem of the COP-KMeans algorithm is that it can fail if the constraints cannot all be satisfied during the assignment step. It is also very sensitive to noise in the constraints: misspecifying even a few constraints can strongly influence the resulting clusters.

Another problem with COP-KMeans is that the assignment step depends on the order in which the instances are visited. If there are no constraints, COP-KMeans is equivalent to standard K-Means.
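
To make the behaviour concrete, here is a minimal, simplified sketch of a COP-KMeans-style assignment step (not the authors' reference implementation; the constraints are assumed to be dicts mapping an instance index to the indices it is constrained with): each instance is assigned to the nearest centroid that does not violate any constraint, and the algorithm fails if no such centroid exists. The loop also makes the order dependence visible, since earlier assignments restrict the feasible clusters of later instances.

import numpy as np

def violates_constraints(i, cluster, assignment, must_link, cannot_link):
    """Check whether assigning instance i to `cluster` breaks any constraint,
    given the partial assignment of the instances visited so far."""
    for j in must_link.get(i, ()):
        if assignment.get(j) is not None and assignment[j] != cluster:
            return True
    for j in cannot_link.get(i, ()):
        if assignment.get(j) == cluster:
            return True
    return False

def cop_assignment_step(X, centroids, must_link, cannot_link):
    """One assignment pass: each instance goes to the nearest centroid that
    does not violate a constraint; if none exists, the algorithm fails."""
    assignment = {}
    for i, x in enumerate(X):
        order = np.argsort(np.linalg.norm(centroids - x, axis=1))
        for c in order:
            if not violates_constraints(i, c, assignment, must_link, cannot_link):
                assignment[i] = int(c)
                break
        else:
            raise RuntimeError("COP-KMeans failed: no feasible cluster for instance %d" % i)
    return assignment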

Pairwise Constrained K-Means (PC-KMeans)

This algorithm can be seen as a relaxed version of COP-KMeans: the constraints are treated as soft, so some of the initial constraints may not be satisfied at the end, but the algorithm will not fail if some of them are unsatisfiable; violated constraints are simply penalized. The PC-KMeans paper also proposes an active-learning procedure to choose the most informative constraints to use in the clustering step, but in this article I decided to focus only on the clustering algorithm.

If we have some information about how important the individual constraints are, we can differentiate the values of the penalization terms w_{i,j} and \hat{w}_{i,j} (for must-link and cannot-link violations, respectively), but generally this information is not available and all of them are set to the same value.

As with COP-KMeans, in PC-KMeans the assignment step depends on the order in which the instances are visited. Another problem is that we have more parameters to choose: the penalization terms.

If the penalization terms are set to zero, PC-KMeans is equivalent to the standard K-Means.
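
For comparison, here is a similarly simplified sketch of a PC-KMeans-style assignment step (again, not the reference implementation; a single penalty value is used for all w_{i,j} and another for all \hat{w}_{i,j}, as discussed above): each instance goes to the cluster minimizing the squared distance plus the penalties for the constraints it would violate. With both penalties set to zero, it reduces to the standard K-Means assignment.

import numpy as np

def pck_assignment_step(X, centroids, must_link, cannot_link, w=1.0, w_bar=1.0):
    """One PC-KMeans-style assignment pass (sketch): each instance is assigned
    to the cluster minimizing squared distance plus penalties for the
    constraints it would violate, given the instances assigned so far."""
    assignment = {}
    for i, x in enumerate(X):
        costs = np.sum((centroids - x) ** 2, axis=1).astype(float)
        for c in range(len(centroids)):
            # Penalty w for each already-assigned must-link neighbor placed elsewhere.
            costs[c] += w * sum(1 for j in must_link.get(i, ())
                                if j in assignment and assignment[j] != c)
            # Penalty w_bar for each already-assigned cannot-link neighbor in the same cluster.
            costs[c] += w_bar * sum(1 for j in cannot_link.get(i, ())
                                    if j in assignment and assignment[j] == c)
        assignment[i] = int(np.argmin(costs))
    return assignment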

References

Wagstaff, K., Cardie, C., Rogers, S., & Schroedl, S. (2001, June). Constrained K-Means Clustering with Background Knowledge. In ICML (Vol. 1, pp. 577–584).

Basu, S., Banerjee, A., & Mooney, R. J. (2004, April). Active Semi-Supervision for Pairwise Constrained Clustering. In Proceedings of the 2004 SIAM international conference on data mining (pp. 333–344). Society for Industrial and Applied Mathematics.

Acknowledgements

This story was written by:

Federico Ricciuti, Data Scientist

Feel free to add me on LinkedIn:
linkedin.com/in/federico-ricciuti-b490ab59

In collaboration with:

www.data-science-hub.com
