Learning when training input and test data come from different distributions

Speaker: Shai Ben-David

Common machine learning theory makes some simplifying assumptions about the learning setup. One problematic simplification is the assumption that the data available to the learner for training is a faithful representative of the data it will later be tested on. In this work we demonstrate that, counter to common belief, some performance guarantees can be made even without such an assumption. Clearly, if our training data is not assumed to be an unbiased sample of the target domain, learning must rely on some other knowledge about the test data distribution (or depend on the validity of some strong assumptions concerning the function to be learned).

In this work we address the scenario in which the knowledge about the test data distribution is acquired by viewing unlabeled data generated by that distribution. In contrast with previously published approaches, we also refrain from assuming any relationship between the labels and the structure of the unlabeled data. We present a learning paradigm for that setting and prove test error bounds in terms of parameters that can be reliably estimated from the learner's input data. I shall discuss the task of automated part-of-speech tagging to demonstrate the applicability of this work.
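To give a flavor of the kind of quantity such bounds can depend on, the sketch below estimates how distinguishable two unlabeled samples are by training a classifier to tell them apart: the harder the domains are to separate, the closer the distributions. This is only an illustrative proxy built around a toy threshold classifier on one-dimensional data; the specific classifier, the scaling `2 * (1 - 2 * eps)`, and all names here are illustrative choices, not the talk's actual bound or algorithm.

```python
import random
from statistics import mean

def proxy_domain_distance(source, target):
    """Illustrative proxy for the divergence between the distributions
    behind two unlabeled samples: fit a trivial threshold classifier to
    separate source from target, and map its error rate eps to
    2 * (1 - 2 * eps), which is near 0 when the samples are
    indistinguishable and near 2 when they are perfectly separable.
    (Toy construction for exposition; not the method from the talk.)"""
    threshold = (mean(source) + mean(target)) / 2
    flip = mean(target) < mean(source)  # orient the decision rule
    errors = 0
    for x in source:
        # Predicted "target" for a source point is an error.
        errors += (x > threshold) != flip
    for x in target:
        # Predicted "source" for a target point is an error.
        errors += not ((x > threshold) != flip)
    eps = errors / (len(source) + len(target))
    return 2 * (1 - 2 * min(eps, 1 - eps))

random.seed(0)
src = [random.gauss(0, 1) for _ in range(2000)]
far = [random.gauss(2, 1) for _ in range(2000)]   # shifted domain
near = [random.gauss(0, 1) for _ in range(2000)]  # same distribution

print(round(proxy_domain_distance(src, far), 2))   # well-separated domains: large
print(round(proxy_domain_distance(src, near), 2))  # matching domains: near zero
```

The key point the example mirrors is that this quantity is computable from the learner's input alone, since separating source from target requires no labels.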

This is joint work with John Blitzer, Koby Crammer, and Fernando Pereira from the University of Pennsylvania.