Reinforcement Behavior in Repeated Games

Jonathan Bendor, Dilip Mookherjee and Debraj Ray

 

September 1998

 

Abstract:

This paper examines the long run implications of `satisficing' behavior of players in a repeated game setting. Such behavior is also known as `stimulus learning' in the psychology literature, or `reinforcement learning' in recent game theoretic literature. It is based on the notion that individuals adapt their behavior in the light of experience, rather than maximize some utility function. The standard by which experiences are judged to be satisfactory is defined by an aspiration level, against which achieved payoffs are evaluated. Aspirations are socially inherited, or evolve slowly as a player gains experience. Such behavior rules are informationally and cognitively less demanding than those entailing maximization of some utility function. They appear particularly plausible when players are unable to form a coherent ``model'' of the environment they operate in.

For instance, they might lack information concerning the strategic structure of the game or the actions selected by others in the past. Alternatively, they may not process such information even when it is available. However, relatively little is known at a theoretical level about the long run implications of such forms of behavior in settings where a given set of players interacts repeatedly, in contrast, for instance, to the rich literature on fictitious play and similar models of boundedly rational learning in which players maximize some payoff function.

Section 2 presents the formal model, involving play of a repeated game by two players. Each player has an aspiration: a payoff level that the player seeks to attain. A player's psychological state is represented by a probability mixture over different actions. As play evolves and players acquire experience, actions whose payoffs exceed the aspiration are reinforced. We capture this form of reinforcement behavior through a reinforcement behavior rule, which describes how choice probabilities evolve with experience.
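For concreteness, one simple rule of this kind is a Bush-Mosteller style updating scheme, given here for illustration only (the notation is ours, and the class of rules admitted in Section 2 is more general). If $a$ is the action chosen in period $t$, $p_t(a)$ the probability currently attached to it, $\pi_t$ the realized payoff, $\alpha$ the aspiration level, and $\lambda \in (0,1]$ a speed-of-adjustment parameter, then

\[
p_{t+1}(a) \;=\;
\begin{cases}
p_t(a) + \lambda\,[\,1 - p_t(a)\,] & \text{if } \pi_t \ge \alpha \quad \text{(payoff satisfactory: reinforce } a\text{)},\\[2pt]
(1-\lambda)\,p_t(a) & \text{if } \pi_t < \alpha \quad \text{(payoff unsatisfactory: inhibit } a\text{)},
\end{cases}
\]

with the probabilities of the remaining actions rescaled proportionally so that they continue to sum to one.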

We assume that aspirations are held fixed during the repeated interaction. However, they are determined endogenously, by a consistency condition that requires the equality of aspirations with long run average payoffs (induced in turn by those aspirations). This may be thought of as the steady state of a model where successive generations of players play the game, each with a fixed aspiration determined by the long run average payoff experience of previous generations.
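In the illustrative notation above, consistency requires that each player $i$'s aspiration $\alpha_i$ coincide with the long run average payoff generated by the reinforcement dynamics when both players hold those aspirations:

\[
\alpha_i \;=\; \lim_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} \pi_{i,t} \qquad \text{almost surely}, \quad i = 1, 2.
\]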

A consistent (long-run) outcome of this game is thus a long run limit distribution over players' states resulting from some initial state, in which average payoffs converge almost surely to the aspiration level of each player. However, the criterion of consistency is satisfied by every pure strategy outcome, and is thus insufficient to produce any useful predictions. We therefore introduce an additional restriction: a consistent long-run outcome, or behavior convention for short, is said to be stable if it satisfies a criterion of robustness with respect to perturbations of players' states. This criterion requires that if, starting from the convention, the state of exactly one player is perturbed just once to a neighboring (totally mixed) state, then the same convention is eventually re-established. Despite the simplicity of this stability notion, it turns out to be closely related to the selection provided by persistent small `trembles' to players' states, in the manner discussed in the stochastic evolution literature, e.g., Kandori, Mailath and Rob (1993) and Young (1993). The precise relationship between the two notions of stability is discussed in detail in Appendix A.
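Somewhat more formally (suppressing technical details): if $\mu$ is the limit distribution associated with a convention, and $s'$ is a state obtained from a state in the support of $\mu$ by replacing exactly one player's mixture with a nearby totally mixed mixture, then stability requires that the process restarted at $s'$ converge back to the same limit distribution $\mu$ with probability one.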

Stable conventions may be degenerate, in that the process converges to a single state. In that case it must converge to a pure strategy state, so that there is no randomness in selected actions at all (Proposition 1). Our main result, Proposition 2, provides a complete characterization of all pure stable conventions. Broadly speaking, such conventions involve a pure strategy state which is either efficient or a protected Nash equilibrium, the latter concept referring to a strict Nash equilibrium with the additional property that a single player's deviation cannot hurt the other player. The implication runs both ways: every such state is a stable convention.
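In payoff terms (our notation, with $\pi_i$ denoting player $i$'s stage payoff): a pure strategy pair $(a_1^*, a_2^*)$ is a protected Nash equilibrium if

\[
\pi_1(a_1^*, a_2^*) \;>\; \pi_1(a_1, a_2^*)
\qquad \text{and} \qquad
\pi_2(a_1, a_2^*) \;\ge\; \pi_2(a_1^*, a_2^*)
\qquad \text{for all } a_1 \ne a_1^*,
\]

and symmetrically for deviations by player 2; that is, the pair is a strict Nash equilibrium and no unilateral deviation lowers the opponent's payoff.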

A corollary of this characterization is an existence result for pure stable states in ``generic'' games (Proposition 3).

We also establish restrictions on nondegenerate stable conventions, which involve a distribution over different mixed strategy states. Proposition 4 shows that the average payoffs from such a distribution must be individually rational (i.e., players must achieve at least their pure strategy maxmin payoffs), and cannot be Pareto dominated by any pure strategy pair. An informal discussion of the reasoning underlying these results is provided in Section 4.
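In symbols (again our notation): if $v_i$ denotes player $i$'s long run average payoff under such a convention, Proposition 4 requires

\[
v_i \;\ge\; \max_{a_i} \min_{a_j} \pi_i(a_i, a_j), \qquad i = 1, 2,
\]

and that there be no pure strategy pair $(a_1, a_2)$ with $\pi_i(a_1, a_2) \ge v_i$ for both players and strictly greater for at least one.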

These characterization results yield sharp predictions for several classes of games, which are discussed in Section 5. In a game of common interest with common payoffs, there is a unique stable convention, concentrated on the efficient pure strategy pair. In games of common interest, the only stable conventions are pure; one of these is concentrated on the efficient pure strategy pair, and any other must constitute a protected Nash equilibrium of the game. In the Prisoners' Dilemma, mutual cooperation and mutual defection are the only stable pure outcomes. A similar result holds for a class of supermodular collective action games with more than two effort levels: maximal cooperation and maximal defection are the only possible pure outcomes. In Cournot or Bertrand oligopoly without capacity constraints, the only pure stable conventions involve maximal collusion. In a Downsian model of electoral competition, the policy most preferred by the median voter is the unique stable outcome.
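To illustrate the characterization in the Prisoners' Dilemma, consider the standard (hypothetical) payoffs

\[
\begin{array}{c|cc}
 & C & D \\ \hline
C & 3,\,3 & 0,\,4 \\
D & 4,\,0 & 1,\,1
\end{array}
\]

(row player's payoff listed first). Mutual cooperation $(C,C)$ is the efficient pure strategy pair, while mutual defection $(D,D)$ is a protected Nash equilibrium: it is a strict Nash equilibrium, and a unilateral switch by either player from $D$ to $C$ raises the opponent's payoff from 1 to 4, so the deviation cannot hurt the other player. Both outcomes therefore fit the characterization of Proposition 2, consistent with the claim above that they are the only stable pure outcomes in this game.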