Learning when training input and test data come from different distributions
Speaker: Shai Ben-David
Common machine learning theory makes simplifying assumptions about the
learning setup. One problematic simplification is the assumption that
the data available to the learner for training faithfully represents
the data on which the learner will later be tested. In
this work we demonstrate that, counter to common belief, some
performance guarantees can be made even without such an assumption.
Clearly, if our training data is not assumed to be an unbiased sample
of the target domain, learning must rely on some other knowledge about
the test data distribution (or depend on the validity of some strong
assumptions concerning the function to be learned).
In this work we address the scenario in which the knowledge about the
test data distribution is acquired by viewing unlabeled data generated
by that distribution. In contrast to previously published
approaches, we also refrain from assuming any relationship between the
labels and the structure of the unlabeled data.
We present a learning paradigm for that setting and prove test error
bounds in terms of parameters that can be reliably estimated from the
learner's input data. I shall discuss the task of automated
part-of-speech tagging to demonstrate the applicability of this work.
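Although the abstract does not spell out the paradigm, the kind of
"parameter that can be reliably estimated from the learner's input
data" alluded to above can be illustrated with a minimal sketch: a
proxy distance between the unlabeled source and target samples,
measured by how well a classifier can tell the two samples apart. The
function name, the logistic-regression choice, and the scikit-learn
dependency below are illustrative assumptions, not taken from the
talk.

    # Illustrative sketch, not the authors' algorithm: estimate a proxy
    # distance between two unlabeled samples by training a domain
    # classifier to distinguish them. Test error near 0.5 means the
    # samples look alike (distance near 0); perfect separation gives 2.
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    def proxy_distance(source_X, target_X, seed=0):
        X = np.vstack([source_X, target_X])
        # Label each point by the domain it came from, not by its class.
        y = np.concatenate([np.zeros(len(source_X)), np.ones(len(target_X))])
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=0.5, random_state=seed, stratify=y)
        err = 1.0 - LogisticRegression(max_iter=1000).fit(X_tr, y_tr).score(X_te, y_te)
        return 2.0 * (1.0 - 2.0 * err)

A quantity of this form is computable from exactly the inputs the
abstract assumes the learner has: labeled source data (labels unused
here) and unlabeled target data.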
This is joint work with John Blitzer, Koby Crammer, and Fernando
Pereira from UPENN.