A Notion of Stability for Sample-based Clustering
Speaker: Shai Ben-David
Clustering is one of the most widely used techniques for
exploratory data analysis. Across all disciplines, from social
sciences over biology to computer science, people try to get a
first understanding of their data by identifying meaningful
groups among the data points. There has been extensive work on
algorithms for clustering over the last several decades.
However, the vast majority of that work is heuristic. Despite
the large number of algorithms and applications, the goal and
the proper interpretation of clustering remain fuzzy and vague.
While there exists a significant body of work on computational
complexity issues of clustering tasks, the theoretical
foundations of clustering, especially in statistical settings,
seem to be distressingly meager. Common analysis of clustering
algorithms do not address the notion of clustering in a
principled way.
In this talk I shall describe research aiming to draft
theoretical foundations of statistical clustering and to
develop formal tools for the analysis, comparison, and
evaluation of clustering algorithmic paradigms. We believe,
even though the question ``what clustering is'' is difficult to
answer in such generality, there are important sub-questions
which are well defined and can and should be investigated in a
general statistical framework. As a first step in this agenda,
I'll describe a notion of stability of clustering as a
necessary condition for successful/meaningful clustering.
I shall explain and demonstrate how this notion can serve as a
model-selection principle for choosing clustering algorithms
and clustering parameters (like the number of clusters) for any
given clustering task.