Reward Schemes for Learning Systems

Tom Westerdale

The credit assignment problem for production systems in delayed payoff situations has been formalized in a conceptual model in which the environment of the system is a finite automaton.   A reward scheme has been exhibited which avoids detrimental biases even when eligibility sets overlap.  The demonstration that such biases are avoided is an analogue of the proof of Fisher's fundamental theorem of natural selection.  To help provide a conceptual testbed for reward schemes, a new cascade decomposition of finite automata has been obtained. Sometimes current payoff is due to a system action taken a long time ago.  The question is, how long ago might such an action have occurred?

This question can be usefully re-formulated if the environment is decomposed using the cascade decomposition.   In the re-formulation it is often possible to ignore all but the first component of the decomposition.

Subgoal reward schemes avoid the necessity of explicitly remembering past system actions.   Current project work is investigating the differences between genetic schemes and subgoal reward schemes.  Holland's Bucket Brigade is a subgoal reward scheme particularly suited to the investigation of these differences.

It is hoped that a point of view can be found from which the Bucket Brigade approach and the genetic approach will not look as different as they currently appear.

Subgoal reward schemes are vulnerable to freeloaders.  Obviously, group selection would combat freeloading.  We have indicated how group selection might occur when the bucket brigade is used.

Classifier systems can combine subgoal reward schemes and genetic schemes, and so can highlight the differences between schemes.  We have made a tentative start at investigating the interaction between these two types of schemes in classifier systems.

Q-learning is another subgoal reward scheme.  We have shown that it has some disadvantages vis a vis the bucket brigade.