GSO-2011: Naftali Tishby

The value of future-information - the missing piece in reinforcement learning

One of the most striking characterizations of life is the ability to efficiently extract information through sensory perception and to exploit it through behavior. There is growing empirical evidence that information seeking is as important in biology as reward seeking. Yet our basic algorithms for describing planning and behavior, in particular reinforcement learning (RL), have so far ignored this component. In this talk I will describe new extensions of reinforcement learning that combine information-seeking and reward-seeking behaviors in an optimal way. I will argue that Shannon's information measures provide the only consistent way of trading off information against expected future reward, and show how the two can be naturally combined in the frameworks of Markov Decision Processes (MDPs) and Dynamic Programming (DP). This new framework unifies techniques from information theory (like the Huffman coding algorithm) with methods of optimal control (like the Bellman equation). We show that the resulting optimization problem has a unique global minimum and guaranteed convergence, even though it lacks convexity. Moreover, the tradeoff between information and value is shown to be robust to fluctuations in the reward values by using the PAC-Bayes generalization bound, providing another interesting justification for its biological relevance.

Based on joint work with Daniel Polani, Jonathan Rubin and Ohad Shamir.