“Proof” of Friendliness

The humans are about to launch their best effort at a friendly singularity. Of course, they are careful and wise and have exceeded all reasonable expectations for caution and rigor.

Before building FAI you built an oracle AI to help you. With its help, you found a mathematical definition of U, the utility of humanity’s extrapolated volition (or whatever). You were all pretty pleased with yourselves, but you didn’t stop there: you found a theory of everything, located humanity within it, and wrote down the predicate F(X) = “The humans run the program described by X.”

To top it off, with the help of your oracle AI you found the code for a “best possible AI”, call it FAI, and a proof of the theorem:

There exists a constant Best such that U ≤ Best, but F(FAI) implies U =  Best.”

Each of these steps you did with incredible care. You have proved beyond reasonable doubt that U and F represent what you want them to.

You present your argument to the people of the world. Some people object to your reasoning, but it is airtight: if they choose to stop you from running FAI, they will still receive U ≤ Best, so why bother?

Now satisfied and with the scheduled moment arrived, you finally run FAI. Promptly the oracle AI destroys civilization and spends the rest of its days trying to become as confident as possible that Tic-Tac-Toe is really a draw (like you asked it to, once upon a time).

Just a lighthearted illustration that decision theory isn’t only hard for AI.

(Disclaimer: this narrative claims to represent reality only insofar as it is mathematically plausible.)

Edit: I think the moral was unclear. The point is: in fact F(FAI), and so in fact U = Best so U ≤ Best. Everything was as claimed and proven. But this doesn’t change the fact that you would rather not use this procedure.

Counterfactual Blackmail (of oneself)

This post is not going to make any new claims, but this is a helpful intuition-pump for me.

Loebian Problems in Decision Theory

Consider an agent trying to maximize U, deciding whether to take an action X. The agent knows that U = 0 if it doesn’t take X, but is uncertain about its utility if it takes X.

A central difficulty in decision theory is: if as a matter of mathematical fact the agent doesn’t take X, how is it supposed to reason about “what U would be if I take X?” Continue reading

Formal Instructions

I have written briefly about how one might pin down the human decision process (the thing itself, not some idealization thereof) or a counterfactual world. If we (probably foolishly) wanted to give an AI formal instructions using these ideas, we would still need to include some edict like “Now take this decision process, embed it in this abstract world (where we believe it will be able to create a flourishing utopia or whatever) in this way, and make the universe look like that.” We’ve maybe gotten some leverage on the first parts (though right now the difficulties here loom pretty large), which involve precisely defining certain concepts for an AI, but it isn’t yet clear how you could precisely tell the AI to do something. Here is a stab at this other problem.

Rather than directly asking an AI to simulate a particular universe, we will ask it to find the value on a particular physical input channel in that universe, and then exert control from within the universe to ensure that calculating this value requires simulating the universe (or at least capturing whatever moral value we hope would come from a simulation of that universe). Continue reading

Abstract Randomness and Formal CDT

It would be nice to have a working formalization of TDT, but first I am just going to shoot for a working formalization of CDT in a mathematical universe. The difficulty in this problem may be described as locating yourself within the universe (to understand not just a description of the universe but also how your action controls it). To see why this might not be completely straightforward, see “AIXI and Existential Despair.”

Continue reading

Specifying (non-decision-theoretic) Counterfactuals

Here is a simple trick for specifying a computer in the physical world’s future inputs: run the computer for a long time, and then ask for the simplest description of the resulting sequence of inputs. The resulting description is a good predictor for future inputs, provided we live in a suitable universe.

(This is vulnerable to all of the same attacks defined in “Hazards,” and if we really want to get access to the universe as a whole, rather than just to a simulation of a single brain, it will be much harder to get around these problems.)

Now suppose we have a single bit X on a computer, and we would like to talk about the counterfactual world in which X’s value was flipped. How can we do this? Or perhaps we would like to consider an entire ensemble of possible counterfactuals in which we were given one of exponentially many possible messages m1, m2, ….

Continue reading

Hazards for Formal Specifications

I have described a candidate scheme for mathematically pinpointing the human decision process, by conditioning the univeral prior on agreement with the human’s observed behavior. I would like to point out three dangers with this approach, which seem to apply quite generally to attempts to mathematically specify value (and have analogs for other aspects of agents’ behavior):

Continue reading