A Notion of Stability for Sample-based Clustering

Speaker: Shai Ben-David

Clustering is one of the most widely used techniques for exploratory data analysis. Across all disciplines, from social sciences over biology to computer science, people try to get a first understanding of their data by identifying meaningful groups among the data points. There has been extensive work on algorithms for clustering over the last several decades. However, the vast majority of that work is heuristic. Despite the large number of algorithms and applications, the goal and the proper interpretation of clustering remain fuzzy and vague. While there exists a significant body of work on computational complexity issues of clustering tasks, the theoretical foundations of clustering, especially in statistical settings, seem to be distressingly meager. Common analysis of clustering algorithms do not address the notion of clustering in a principled way.

In this talk I shall describe research aiming to draft theoretical foundations of statistical clustering and to develop formal tools for the analysis, comparison, and evaluation of clustering algorithmic paradigms. We believe, even though the question ``what clustering is'' is difficult to answer in such generality, there are important sub-questions which are well defined and can and should be investigated in a general statistical framework. As a first step in this agenda, I'll describe a notion of stability of clustering as a necessary condition for successful/meaningful clustering.

I shall explain and demonstrate how this notion can serve as a model-selection principle for choosing clustering algorithms and clustering parameters (like the number of clusters) for any given clustering task.