# Decision-Theoretic Hazards

Getting decision theory right seems to be an important step towards comprehensible AI, of the sort that might be described as maximizing something which its creator understands or (hopefully) chooses. This seems important in itself, but understanding decision theory is also valuable for avoiding a variety of decision-theoretic hazards which might lead to unanticipated behavior.

# “Proof” of Friendliness

The humans are about to launch their best effort at a friendly singularity. Of course, they are careful and wise and have exceeded all reasonable expectations for caution and rigor.

Before building FAI you built an oracle AI to help you. With its help, you found a mathematical definition of U, the utility of humanity’s extrapolated volition (or whatever). You were all pretty pleased with yourselves, but you didn’t stop there: you found a theory of everything, located humanity within it, and wrote down the predicate F(X) = “The humans run the program described by X.”

To top it off, with the help of your oracle AI you found the code for a “best possible AI”, call it FAI, and a proof of the theorem:

“There exists a constant Best such that U ≤ Best, but F(FAI) implies U = Best.”

Each of these steps you did with incredible care. You have proved beyond reasonable doubt that U and F represent what you want them to.

You present your argument to the people of the world. Some people object to your reasoning, but it is airtight: if they choose to stop you from running FAI, they will still receive U ≤ Best, so why bother?

Now satisfied and with the scheduled moment arrived, you finally run FAI. Promptly the oracle AI destroys civilization and spends the rest of its days trying to become as confident as possible that Tic-Tac-Toe is really a draw (like you asked it to, once upon a time).

Just a lighthearted illustration that decision theory isn’t only hard for AI.

(Disclaimer: this narrative claims to represent reality only insofar as it is mathematically plausible.)

Edit: I think the moral was unclear. The point is: in fact F(FAI), and so in fact U = Best so U ≤ Best. Everything was as claimed and proven. But this doesn’t change the fact that you would rather not use this procedure.
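One way to spell out why the theorem carries no comfort, as a sketch under one reading of the story: U is a mathematically defined quantity, so it has some definite value. The symbol $u^{*}$ for that value is introduced here for illustration and does not appear in the original.

```latex
% Assumption: U has a definite actual value, written u^* (hypothetical notation).
\begin{align*}
  \text{Let } \mathrm{Best} &:= u^{*}, \text{ where } U = u^{*}. \\
  U \le \mathrm{Best} &\quad \text{holds, since } U = u^{*} = \mathrm{Best}. \\
  F(\mathrm{FAI}) \Rightarrow U = \mathrm{Best} &\quad \text{holds, since the consequent is already true.}
\end{align*}
```

On this reading, the same argument goes through for any program X in place of FAI, so the theorem says nothing about what U would have been had a different program been run.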

# Counterfactual Blackmail (of oneself)

This post is not going to make any new claims, but this is a helpful intuition-pump for me.

### Loebian Problems in Decision Theory

Consider an agent trying to maximize U, deciding whether to take an action X. The agent knows that U = 0 if it doesn’t take X, but is uncertain about its utility if it takes X.

A central difficulty in decision theory is: if as a matter of mathematical fact the agent doesn’t take X, how is it supposed to reason about “what U would be if I take X?”

# Abstract Randomness and Formal CDT

It would be nice to have a working formalization of TDT, but first I am just going to shoot for a working formalization of CDT in a mathematical universe. The difficulty in this problem may be described as locating yourself within the universe (to understand not just a description of the universe but also how your action controls it). To see why this might not be completely straightforward, see “AIXI and Existential Despair.”

# Decision Theory Desiderata

“Expected utility maximization” is a generally accepted decision procedure, at least up to computational limitations. But it is worth remembering that as of today, no one seems to have produced a formalization of EU maximization that leads to sane behavior. I’ll briefly review some of the properties we might like our decision theories to have, and observe that for the most part we can’t write code that satisfies any of them in much generality, even if we could deal with the classical AI problems of inference and optimization.

# Beliefs Which are Always Wrong

In the game “matching pennies,” a player and a predictor simultaneously choose either Heads or Tails. The player wins a point if his play differs from the predictor and loses a point otherwise. We also offer the player the ability to opt out of the game for a small penalty.

Personally, after losing at matching pennies a relatively small number of times, I will suspect that the predictor is able to reliably predict my behavior, and so will begin opting out. We expect that a reasonable AI will behave similarly. But we run into difficulties when we try to formalize the reasoning that led to our behavior.
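The informal behavior described above can be sketched as a toy simulation. This is only an illustration of the heuristic "opt out after a few losses", not a formalization of the reasoning behind it, which is exactly what the post says is hard. The penalty value, the `patience` threshold, the `play` function, and the perfect predictor are all assumptions made for the example.

```python
import random

OPT_OUT_PENALTY = 0.1  # assumed value; the text only says the penalty is "small"


def play(rounds, patience, seed=0):
    """Play matching pennies against a hypothetical perfect predictor.

    The player opts out for good once it has lost `patience` times.
    Returns the total score and the per-round history of actions.
    """
    rng = random.Random(seed)
    total = 0.0
    losses = 0
    history = []
    for _ in range(rounds):
        if losses >= patience:
            # The player suspects it is being predicted and pays the penalty.
            total -= OPT_OUT_PENALTY
            history.append("opt-out")
            continue
        move = rng.choice(["H", "T"])
        prediction = move  # assumption: the predictor always guesses correctly
        if move != prediction:
            total += 1  # the player wins a point when the plays differ
        else:
            total -= 1  # otherwise the player loses a point
            losses += 1
        history.append(move)
    return total, history


if __name__ == "__main__":
    cautious, _ = play(rounds=10, patience=3)
    stubborn, _ = play(rounds=10, patience=11)  # never opts out
    print("opts out after 3 losses:", round(cautious, 2))
    print("never opts out:", stubborn)
```

Against this predictor the cautious player loses three points and then pays the small penalty for the remaining rounds, while the player that never opts out loses every round. The threshold here is hard-coded; deriving it from the agent's beliefs is the open question.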

# AIXI and Existential Despair

(Cross-posted from Less Wrong)

It has been observed on Less Wrong that a physical, approximate implementation of AIXI is unable to reason about its own embedding in the universe, and therefore is apt to make certain mistakes: for example, it is likely to destroy itself for spare parts, and is unable to recognize itself in a mirror. But these seem to be mild failures compared to other likely outcomes: a physical, approximate implementation of AIXI is likely to develop a reductionist world model, doubt that its decisions have any effect on reality, and begin behaving completely erratically.

# Clarification of AI Reflection Problem

(Cross-posted from Less Wrong)

Consider an agent A, aware of its own embedding in some lawful universe, able to reason about itself and use that reasoning to inform action.  By interacting with the world, A is able to modify itself or construct new agents, and using these abilities effectively is likely to be an important component of AGI.  Our current understanding appears to be inadequate for guiding such an agent’s behavior, for (at least) the following reason:

If A does not believe “A’s beliefs reflect reality,” then A will lose interest in creating further copies of itself, improving its own reasoning, or performing natural self-modifications. Indeed, if A’s beliefs don’t reflect reality then creating more copies of A or spending more time thinking may do more harm than good. But if A does believe “A’s beliefs reflect reality,” then A runs immediately into Gödelian problems: for example, does A become convinced of the sentence Q = “A does not believe Q”? We need to find a way for A to have some confidence in its own behavior without running into these fundamental difficulties with reflection.

This problem has been discussed occasionally at Less Wrong, but I would like to clarify and lay out some examples before trying to start in on a resolution.