In the game “matching pennies,” a player and a predictor simultaneously choose either Heads or Tails. The player wins a point if his play differs from the predictor’s and loses a point otherwise. We also offer the player the ability to opt out of the game for a small penalty.
Personally, after losing at matching pennies a relatively small number of times, I will suspect that the predictor is able to reliably predict my behavior, and so I will begin opting out. We expect that a reasonable AI will behave similarly. But we run into difficulties when we try to formalize the reasoning that led to our behavior.
For example, consider any Bayesian using a causal decision theory, who maintains an implicit distribution over possible world models and chooses an approximately utility-maximizing action with respect to this distribution.
Regardless of the current distribution maintained by such an algorithm, at least one of the predictor’s plays must have probability at least 50%. If heads has probability at least 50%, then playing tails has nonnegative expected payoff, so the player would rather choose tails than pay the penalty to opt out. Otherwise, the probability of tails is at least 50%, and by the same argument the player would rather choose heads than opt out. Such an agent therefore never opts out, no matter how often it has lost.
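The argument can be checked numerically. The following is a minimal sketch, not a model of any particular agent: the function name `cdt_best_action` and the penalty value of 0.1 are assumptions chosen for illustration, and the agent's beliefs are collapsed to a single number, the probability it assigns to the predictor playing heads.

```python
def cdt_best_action(p_heads, opt_out_penalty=0.1):
    """Pick the action with the highest expected payoff.

    p_heads is the agent's probability that the predictor plays heads,
    treated as independent of the agent's own choice (the causal view).
    opt_out_penalty is an assumed stand-in for the "small penalty".
    """
    expected_utility = {
        # Playing heads wins (+1) when the predictor plays tails, else loses (-1).
        "heads": (1 - p_heads) - p_heads,
        # Playing tails wins (+1) when the predictor plays heads, else loses (-1).
        "tails": p_heads - (1 - p_heads),
        "opt_out": -opt_out_penalty,
    }
    best = max(expected_utility, key=expected_utility.get)
    return best, expected_utility[best]


# Sweep over beliefs: opting out is never the best action.
for i in range(0, 101, 10):
    p = i / 100
    action, value = cdt_best_action(p)
    print(f"P(heads)={p:.1f} -> {action} (expected payoff {value:+.2f})")
```

Whatever value of `p_heads` is supplied, one of the two plays has expected payoff at least zero, which beats any positive penalty, so `"opt_out"` is never returned.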
In order to choose to opt out, an agent must be able to model non-causal relationships between its behavior and the world model. A UDT agent would handle this decision properly, if UDT could be suitably pinned down, but this still leaves us with an interesting situation with respect to beliefs: *whatever* the agent believes about the predictor’s play, the agent will be wrong (or miscalibrated), at least in the sense that those beliefs should not be used to guide action. In order to motivate action, the agent should be reasoning about the predictor’s behavior *conditioned* on the agent’s choice.
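To see how conditioning changes the decision, here is a small sketch under an added assumption: the agent summarizes the predictor by a single number `accuracy`, the probability that the predictor matches the agent's play given whatever the agent chooses. Neither that parameter nor the name `conditional_best_action` comes from the discussion above; they are hypothetical and only illustrate the shape of the calculation.

```python
def conditional_best_action(accuracy, opt_out_penalty=0.1):
    """Pick an action using beliefs conditioned on the agent's own choice.

    accuracy is the assumed probability that the predictor's play matches
    the agent's play, whichever play the agent makes.
    opt_out_penalty is an assumed stand-in for the "small penalty".
    """
    # Conditioned on playing either side, the agent loses (-1) with
    # probability `accuracy` and wins (+1) with probability 1 - accuracy.
    eu_play = (1 - accuracy) - accuracy
    expected_utility = {
        "heads": eu_play,
        "tails": eu_play,
        "opt_out": -opt_out_penalty,
    }
    return max(expected_utility, key=expected_utility.get)


for accuracy in (0.5, 0.7, 0.9, 1.0):
    print(f"accuracy={accuracy:.1f} -> {conditional_best_action(accuracy)}")
```

In this toy setup, opting out becomes the best action once `1 - 2 * accuracy` falls below `-opt_out_penalty`, that is, once the predictor is believed to be reliably better than chance by a margin that depends on the penalty.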
This may not seem so troubling. After all, in general we need to reason about conditional rather than unconditional probabilities in decision theory, and the inclusion of a predictor seems to make this an unfair testing ground for beliefs as such.
But consider a “math intuition module,” which assigns subjective probabilities to mathematical facts. We generally imagine such a module as dealing with beliefs rather than actions, but a math intuition module can be presented with an identical problem. For example, a math intuition module reasoning about arithmetic may face an arithmetic problem Q which encodes the module’s own behavior, and the answer to the problem may be controlled by the module’s output in such a way that the module is always wrong about Q. (Q is the direct analog of Gödel’s paradoxical sentence.) So it seems that any math intuition module must be systematically miscalibrated about certain facts, and may be aware of that miscalibration while being unable to do anything about it.
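A toy version of this construction can be written down directly. The sketch below makes a strong simplifying assumption: instead of encoding the module's behavior in arithmetic, it simply defines Q to be true exactly when the module reports a probability below 1/2 for Q. The names `truth_of_q` and `calibration_error` are hypothetical.

```python
def truth_of_q(reported_probability):
    """Q is defined as: 'the module assigns Q a probability below 1/2'."""
    return reported_probability < 0.5


def calibration_error(reported_probability):
    """Distance between the reported probability and Q's actual truth value."""
    truth = 1.0 if truth_of_q(reported_probability) else 0.0
    return abs(truth - reported_probability)


# Try every report on a grid from 0 to 1 and record the smallest error.
errors = [calibration_error(i / 100) for i in range(101)]
min_error = min(errors)
print(f"smallest achievable error on Q: {min_error}")
```

No report does better than an error of one half: a low probability makes Q true, and a probability of one half or more makes Q false, so the module's answer about Q is always far from the truth.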
This suggests that we should, at the very least, probably cast the AI reflection problem not as “How can an agent avoid going crazy while being convinced of its own correctness?” but rather as “How can an agent justify expanding its own influence when it is convinced of its own fallibility?”