A Study of Data Clusterability

Speaker: Rita Ackerman

We investigate measures of the clusterability of data sets. Namely, ways to define how `strong' or `conclusive' is the clustering structure of a given data set.

Our analysis discovers an interesting phenomenon; Although most of the common clustering tasks are NP-hard, finding a close-to-optimal clustering for well clusterable data sets is easy (computationally). We prove instances of this general claim with respect to the various clusterability notions that we discuss.

We survey several notions of clusterability that have been discussed in the literature, as well as propose a new notion of data clusterability. Our comparison of these notions reveals that, although they all attempt to evaluate the same intuitive property, they are pairwise inconsistent.

Finally, we investigate how hard it is to determine the clusterability value of a given data set. In most cases, it turns out that this is an NP-hard problem.

Joint work with Shai Ben-David.