Speaker: Mohammad Ghavamzadeh

We present a variant of the classification-based approach to policy iteration which uses a cost-sensitive loss function weighing each classification mistake by its actual regret, i.e., the difference between the action-value of the greedy action and of the action chosen by the classifier. For this algorithm, we provide a full finite-sample analysis. Our results state a performance bound in terms of the number of policy improvement steps, the number of rollouts used in each iteration, the capacity of the considered policy space (classifier), and a capacity measure which indicates how well the policy space can approximate policies that are greedy w.r.t. any of its members. The analysis reveals a tradeoff between the estimation and approximation errors in this classification-based policy iteration setting. Furthermore, it confirms the intuition that classification-based policy iteration algorithms can be favorably compared to value function based approaches when the good policies are easier to be represented and learned than their corresponding value functions. We also study the consistency of the algorithm when there exists a sequence of policy spaces with increasing capacity.

Bio:

Mohammad Ghavamzadeh received a Ph.D. degree in Computer Science from the University of Massachusetts Amherst in 2005. He was a postdoctoral fellow at the Department of Computing Science at the University of Alberta from 2005 to 2008. Since 2008 he has been a researcher at INRIA Lille - Nord Europe, team SequeL. His research interests lie primarily in Artificial Intelligence and Machine Learning, with emphasis on decision making under uncertainty using principled mathematical tools from probability theory, decision theory, and statistics. His current research is mostly focused on using recent advances in statistical machine learning to develop more efficient reinforcement learning algorithms.