Decision-Theoretic Hazards

Getting decision theory right seems to be an important step towards comprehensible AI, of the sort that might be described as maximizing something which its creator understands or (hopefully) chooses. This seems important in itself, but understanding decision theory is also valuable for avoiding a variety of decision-theoretic hazards which might lead to unanticipated behavior.

Possible Failures

Coercion by Simulator

Consider a very intelligent AI pursuing a reward signal (apparently) controlled by humans. A standard concern is that at some point the AI would prefer to take control from the humans and use this control to give itself reward.

Instead, the AI may decide that it is quite likely to be in a simulation and begin behaving unpredictably. For example, the AI may decide that the vast majority of agents who share its experiences are living in simulations controlled by uFAIs which successfully took over universes like ours. Such uFAIs may adopt policies like intervening in simulations to administer or withhold reward according to whether the reward-seeking AI takes actions which facilitate a uFAI takeover. Moreover, such uFAIs may begin these interventions precisely when the AI discovers this line of argument.

At first it seems that a TDT-style agent trying to maximize a mathematically defined utility is not susceptible to this sort of attack (except insofar as utility may be defined in terms of a mathematical object like the universal distribution, which can be manipulated by a simulator). But unfortunately this is not clear.

A TDT agent in general will need to rely on its environment to do useful computation, although the details of this relationship are not yet clear. An AI running on a large computer should probably be able to reason about possible worlds in which that computer behaves in a wide variety of ways (for example, counterfactual or simulated worlds in which the computer violates its ordinary contract) and should reason about how the AI’s behavior in those possible worlds affects utility. Understanding the possible effects of this uncertainty seems out of reach for now, but it does seem plausible that a simulator’s threat could control actions even of some TDT-style agents (in retrospect we would probably reject a decision theory with this property, so these considerations just suggest developing a much better understanding of decision theory).

Altruism

Although we imagine an AI as maximizing a given utility function, if the AI is able to reason about counterfactuals and acausal influence, it may make decisions with concern for their effects not only in situations other than its own but under utility functions other than its own. This may be particularly problematic when we try to use self-interest to control AI behavior. For example, suppose we build a system composed of many AIs such that one of them (we do not yet know which) has the ability to guarantee itself its largest possible utility. If we rely on that AI choosing to maximize its own utility, we may be surprised. An agent with a natural decision theory may make decisions from behind a veil of ignorance with respect to the definition of its utility function, helping others even when it has no use for their generosity in return.

Pascal’s Wager

On reflection, humans seem to be uncertain about their endorsed response to the possibility of very large payoffs arising with negligible probability (particularly when that probability comes entirely from radical model uncertainty). For example, many utilitarian types feel that they are indifferent to a gamble which would equiprobably increase or decrease the number of happy future humans by 50%, but are horrified by the possibility of a galaxy-scale AI investing nearly all of its resources in exploring improbable possibilities for exotic physics or escape from a simulation, and even more objectionable outcomes are conceivable. If there are a priori conceivable outcomes which seem 3^^^3 times better than the best outcome which seems mathematically conceivable upon inspection, should an AI focus exclusively on possible worlds where its mathematical reasoning is untrustworthy?

These problems may be fixed by an appropriate choice of values or dissolved by more careful philosophical consideration, but it is also possible that they are debilitating for certain decision theories. For example, VNM notwithstanding, choosing actions which maximize a mathematically/algorithmically defined utility function may not be a tenable design decision if we want our agent to entertain logical uncertainty (e.g. about whether the defined utility function is bounded).
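
As a toy illustration of this worry (not part of the original argument; the hypotheses, probabilities, and payoffs below are all invented), here is a sketch of how a naive expected-utility calculation can be completely dominated by a hypothesis the agent assigns negligible probability, provided the payoff that hypothesis promises is large enough:

    # Toy sketch: a naive expected-utility maximizer comparing two plans.
    # All hypotheses, probabilities, and payoffs are invented for illustration.

    # (hypothesis, probability, utility of the plan under that hypothesis)
    exotic_plan = [
        ("ordinary physics, exploration wasted", 1 - 1e-20, 0.0),
        ("exotic physics permits astronomical value", 1e-20, 1e40),
    ]
    mundane_plan = [
        ("ordinary physics, colonize the galaxy normally", 1.0, 1e10),
    ]

    def expected_utility(plan):
        return sum(p * u for _, p, u in plan)

    print(expected_utility(exotic_plan))   # 1e20, driven entirely by the tiny-probability branch
    print(expected_utility(mundane_plan))  # 1e10

    # The "exotic" plan wins by ten orders of magnitude even though the agent
    # assigns it probability 1e-20; this is the structure of the concern above.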

Game Theory

AIs we build may encounter other agents in our future light cone. Moreover, if they have utility functions that care about other possible (or impossible) worlds, they are very likely to reason about agents in those worlds.

In such situations we want our AI to capture gains from trade and other advantages from interaction, but we are also concerned about the possibility of blackmail or manipulation. In the worst and most bizarre cases, counterfactual uFAIs may blackmail an AI we create into producing very little value. Handling these situations correctly (cooperating on appropriate one-shot prisoner’s dilemmas, refusing to give in to threats when appropriate, and so on) seems to depend on decision theory rather than values.
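
As a crude sketch of how the outcome can hinge on decision theory rather than on values, consider two purely self-interested agents in a one-shot prisoner’s dilemma, where each can inspect the other’s decision rule before moving. The rules below are simplified stand-ins (not faithful CDT or TDT), and the payoffs are invented for the example:

    # Toy one-shot prisoner's dilemma with source-code-aware players.
    # The decision rules are simplified stand-ins, not faithful CDT or TDT.

    PAYOFFS = {  # (my_move, their_move) -> my payoff
        ("C", "C"): 3, ("C", "D"): 0,
        ("D", "C"): 5, ("D", "D"): 1,
    }

    def cdt_like(opponent_rule):
        # Treats the opponent's move as fixed independently of its own choice,
        # so defection dominates no matter who the opponent is.
        return "D"

    def mirror_rule(opponent_rule):
        # Cooperates exactly when facing a copy of itself: a crude proxy for
        # "my decision and my counterpart's decision are the same computation".
        return "C" if opponent_rule is mirror_rule else "D"

    def play(rule_a, rule_b):
        a, b = rule_a(rule_b), rule_b(rule_a)
        return PAYOFFS[(a, b)], PAYOFFS[(b, a)]

    print(play(cdt_like, cdt_like))        # (1, 1): mutual defection
    print(play(mirror_rule, mirror_rule))  # (3, 3): mutual cooperation

    # Both pairs of agents have identical values (each wants only its own
    # payoff); only the decision theory differs, and so does the outcome.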

Stranger Things than These

I don’t know how strange the world is, or what sort of completely unexpected thing can go wrong. If someone had suggested acausal trade to me a year ago, or implied that a simulator might be able to control the behavior of a carefully constructed agent, I would not have taken the possibility seriously. And if no one had suggested it to me, I would have confidently endorsed schemes that failed for completely unanticipated reasons. I can certainly imagine that there is more room for weirdness. It is very hard to imagine formally ruling out the possibility (as it seems to require understanding decision theory for our own use, first, to see what should qualify as a proof). We can begin to imagine all sorts of strangeness, ranging from agents inhabiting and controlling the formal systems we (shouldn’t, but did) trust, to more exotic forms of philosophical angst leading to erratic behavior.

Punting Decision Theory

We may leave the problem “figure out decision theory” to an AI with some naive but (hopefully) good enough decision theory. The hope is that such an AI will in turn figure out decision theory, create a new agent with a more robust decision theory, and thereby avoid the above problems. There are a few issues here, though:

Early Difficulties

An agent may encounter philosophically troubling or game-theoretically challenging ideas while still running a flawed decision theory. For example, just as it is about to start running the new TDT agent it has built, it might discover an argument that no matter what it does it cannot get utility greater than X, while by not self-modifying it can get utility exactly X; it would then call off the modification and remain trapped by its original decision theory.

Suboptimal attractors

Agents with some decision theories, even decision theories which seem manifestly suboptimal to humans, will not fix their problems if given the chance. For example, an agent vulnerable to Pascal’s muggings may never remove this vulnerability (because in its view the vulnerability is an opportunity to capture immense value). Or we could imagine a CDT agent modifying into some sort of “exceptionless decision theory”: do what the original agent would have done in your situation. An agent running exceptionless decision theory may still be vulnerable to blackmail from an adversary who commits appropriately.
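
To see why such an agent might decline to repair itself, here is a minimal sketch (with invented numbers) of the expected-value comparison it might run between keeping and patching out its Pascal’s-mugging vulnerability; from its own perspective the vulnerability looks like a source of enormous expected value:

    # Toy sketch: why a naive expected-value maximizer keeps its Pascal's-mugging
    # vulnerability. All numbers are invented for illustration.

    p_mugger_truthful = 1e-30      # credence that a given mugger's promise is genuine
    promised_utility = 3.0 ** 100  # stand-in for an astronomically large promised payoff
    cost_of_paying = 10.0          # utility sacrificed each time the agent pays up
    expected_muggings = 1000       # how many muggings the agent expects to face

    # Expected utility of keeping the vulnerability: pay every mugger, on the
    # off chance that one of them is telling the truth.
    keep = expected_muggings * (p_mugger_truthful * promised_utility - cost_of_paying)

    # Expected utility of self-modifying so that it always refuses.
    patch = 0.0

    print(keep > patch)  # True: by its own lights, removing the "bug" destroys value.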

It may be necessary to get some aspects of decision theory right at first if we want to avoid the failures above in the long run.

Value Drift

It is possible that an AI trying to maximize the utility function U using one decision theory will eventually produce an AI running an improved decision theory but maximizing some modified (in the worst case unrecognizably different) utility function U’. For example, if an early decision theory leads to insufficiently cautious self-modification, the utility function may drift randomly; if the early AI encounters philosophical difficulties that lead it into existential despair, the utility function may come to be concerned only with worlds in which the AI’s early reasoning or perceptions were delusional.
