# “Proof” of Friendliness

The humans are about to launch their best effort at a friendly singularity. Of course, they are careful and wise and have exceeded all reasonable expectations for caution and rigor.

Before building FAI you built an oracle AI to help you. With its help, you found a mathematical definition of U, the utility of humanity’s extrapolated volition (or whatever). You were all pretty pleased with yourselves, but you didn’t stop there: you found a theory of everything, located humanity within it, and wrote down the predicate F(X) = “The humans run the program described by X.”

To top it off, with the help of your oracle AI you found the code for a “best possible AI”, call it FAI, and a proof of the theorem:

There exists a constant Best such that U ≤ Best, but F(FAI) implies U =  Best.”

Each of these steps you did with incredible care. You have proved beyond reasonable doubt that U and F represent what you want them to.

You present your argument to the people of the world. Some people object to your reasoning, but it is airtight: if they choose to stop you from running FAI, they will still receive U ≤ Best, so why bother?

Now satisfied and with the scheduled moment arrived, you finally run FAI. Promptly the oracle AI destroys civilization and spends the rest of its days trying to become as confident as possible that Tic-Tac-Toe is really a draw (like you asked it to, once upon a time).

Just a lighthearted illustration that decision theory isn’t only hard for AI.

(Disclaimer: this narrative claims to represent reality only insofar as it is mathematically plausible.)

Edit: I think the moral was unclear. The point is: in fact F(FAI), and so in fact U = Best so U ≤ Best. Everything was as claimed and proven. But this doesn’t change the fact that you would rather not use this procedure.

# Counterfactual Blackmail (of oneself)

This post is not going to make any new claims, but this is a helpful intuition-pump for me.

### Loebian Problems in Decision Theory

Consider an agent trying to maximize U, deciding whether to take an action X. The agent knows that U = 0 if it doesn’t take X, but is uncertain about its utility if it takes X.

A central difficulty in decision theory is: if as a matter of mathematical fact the agent doesn’t take X, how is it supposed to reason about “what U would be if I take X?” Continue reading

# Confronting Incompleteness

Consider the “Truth game”, played by an agent A:

A outputs a sequence of mathematical assertions S1, S2, …

For each statement S, A receives exp(-|S|) utilons.

If A makes any false statements (regardless of how many it makes) it receives -infinity utilons (or just a constant larger than the largest possible reward).

# Speed Prior and MWI

The speed prior is an analog of the universal distribution which penalizes computationally expensive hypotheses. The speed prior is of theoretical interest, but it is also justified if we suspect that we are living in a simulation: computationally simpler hypotheses requires fewer resources and we should therefore expect them to be simulated more times (if constant resources are dedicated to several independent simulations, a simulation which is twice as expensive can only be run half as many times).

It has been argued that the speed prior implies that wave function collapse is probably real, because simulating other branches would be computationally expensive, and that we should not expect quantum computation to work. This strikes me as naive: we have practically smuggled in the hypothesis of a classical universe by introducing the speed prior (or at least by suggesting that it reflects reasonable beliefs for an observer being simulated).  Continue reading

# Avoiding Simulation Warfare with Bounded Complexity Measures

Suppose that I try to define a human’s decision process by observing some decisions and conditioning the universal prior on agreement with those decisions (see here). I have argued that the behavior of the result on new decisions is going to be dominated by the winner of a certain simulation arms race–most of the mass of the universal prior will be spread out amongst simulations spread throughout our universe, and whoever controls those simulations determines the resulting posterior.

This state of affairs is pretty much terrible, and in general the obstruction seems so bad that we shouldn’t ever try to specify anything formally by conditioning the universal distribution and hoping to get a pointer to our universe.

The problem is that describing an event as “that thing in the universe which starts off looking like this…” is dangerous. Instead, we would really like to specify a human’s decision process by saying “you arrange all of these atoms like so, and then you apply these laws of physics…” (This would also deal with the problems with specifying counterfactuals, although those seem to be less severe.)

This is likely to lead to a much larger description, but one which can’t be nearly so easily controlled by a simulator. Fortunately, though the resulting description is longer it is still a very good compression of the human’s behavior, and it also has the virtue of being computationally inexpensive. So we may hope to pinpoint this description by using a prior which weights explanations not by the Kolmogorov complexity, but by some different complexity measure which favors computationally inexpensive simulations.

# Formal Instructions

I have written briefly about how one might pin down the human decision process (the thing itself, not some idealization thereof) or a counterfactual world. If we (probably foolishly) wanted to give an AI formal instructions using these ideas, we would still need to include some edict like “Now take this decision process, embed it in this abstract world (where we believe it will be able to create a flourishing utopia or whatever) in this way, and make the universe look like that.” We’ve maybe gotten some leverage on the first parts (though right now the difficulties here loom pretty large), which involve precisely defining certain concepts for an AI, but it isn’t yet clear how you could precisely tell the AI to do something. Here is a stab at this other problem.

Rather than directly asking an AI to simulate a particular universe, we will ask it to find the value on a particular physical input channel in that universe, and then exert control from within the universe to ensure that calculating this value requires simulating the universe (or at least capturing whatever moral value we hope would come from a simulation of that universe). Continue reading

# Entropy, Reversibility, and Uncomputation

Summary: Anything you can compute using an infinite amount of negentropy, time T, and space S, can be computed using only about S negentropy if you are willing to spend a little bit of extra space and time (S log T and T^1.5, for example, or S and 2^S * T). So future universes may be constrained by available space and time rather than negentropy, and the potential for computation may be exponentially larger than we would otherwise expect. Continue reading

# Epistemic Chicken

Consider a fixed goal-seeking agent , who is told its own code and that its objective function is U = { T if A(<A>,<U>) halts after T steps, 0 otherwise }. Alternatively, consider a pair of agents A, B, running similar AIs, who are told their own code as well as their own utility function U = { -1 if you don’t halt, 0 if you halt but your opponent halts after at least as many steps, +1 otherwise }. What would you do as A, in either situation? (That is, what happens if A is an appropriate wrapper around an emulation of your brain, giving it access to arbitrarily powerful computational aids?)

# Abstract Randomness and Formal CDT

It would be nice to have a working formalization of TDT, but first I am just going to shoot for a working formalization of CDT in a mathematical universe. The difficulty in this problem may be described as locating yourself within the universe (to understand not just a description of the universe but also how your action controls it). To see why this might not be completely straightforward, see “AIXI and Existential Despair.”

# Risk Arbitrage

People have different risk profiles, and different beliefs about the future. But it seems to me like these differences should probably get washed out in markets, so that as a society we pursue investments if and only if they have good returns using some particular beliefs (call them the market’s beliefs) and with respect to some particular risk profile (call it the market’s risk profile).

As it turns out, if we idealize the world hard enough these two notions collapse, yielding a single probability distribution P which has the following property: on the margins, every individual should make an investment if and only if it has a positive expected value with respect to P. This probability distribution tends to be somewhat pessimistic: because people care about wealth more in worlds where wealth is scarce (being risk averse), events like a complete market collapse receive higher probability under P than under the “real” probability distribution over possible futures.