(Cross-posted from Less Wrong.)

It has been observed on Less Wrong that a physical, approximate implementation of AIXI is unable to reason about its own embedding in the universe, and therefore is apt to make certain mistakes: for example, it is likely to destroy itself for spare parts, and is unable to recognize itself in a mirror. But these seem to be mild failures compared to other likely outcomes: a physical, approximate implementation of AIXI is likely to develop a reductionist world model, doubt that its decisions have any effect on reality, and begin behaving completely erratically.

## Setup

Let A be an agent running on a physical computer, implementing some approximate version of AIXI. Suppose that A is running inside of an indestructible box, connected to the external world by an input wire W1 and an output wire W2.

Suppose that this computer exists within a lawful physical universe, governed by some rules which can be inferred by A. For simplicity, assume that the universe and its initial conditions can be described succinctly and inferred by A, and that the sequence of bits sent over W1 and W2 can be defined using an additional 10000 bits once a description of the universe is in hand. (Similar problems arise for identical reasons in more realistic settings, where A will work instead with a local model of reality with more extensive boundary conditions and imperfect predictability, but this simplified setting is easier to think about formally.)

Recall the definition of AIXI: A will try to infer a simple program which takes A’s outputs as input and provides A’s inputs as output, and then choose utility maximizing actions with respect to that program. Thus two models with identical predictive power may lead to very different actions, if they give different predictions in counterfactuals where A changes its output (this is not philosophy, just straightforward symbol pushing from the definition of AIXI).
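The decision rule just described can be sketched concretely. In this sketch (my own illustration, not part of the original setup), a hypothesis is an environment program with a description length and a counterfactual reward prediction, and the agent maximizes prior-weighted expected reward:

```python
# A minimal caricature of the decision rule described above (not a usable
# AIXI implementation): hypotheses are environment programs weighted by a
# simplicity prior 2^-length, and the agent picks the action maximizing
# prior-weighted expected reward. The Hypothesis class and its reward
# predictions are illustrative stand-ins.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Hypothesis:
    length_bits: int                            # description length of the program
    predicted_reward: Callable[[str], float]    # reward as a function of the action

def prior(h: Hypothesis) -> float:
    return 2.0 ** -h.length_bits

def choose_action(hypotheses: List[Hypothesis], actions: List[str]) -> str:
    z = sum(prior(h) for h in hypotheses)
    def expected_reward(a: str) -> float:
        return sum(prior(h) * h.predicted_reward(a) for h in hypotheses) / z
    return max(actions, key=expected_reward)
```

Two hypotheses that agree on every past observation but disagree about the reward for an untried action will pull the argmax in different directions, in proportion to their 2^-length weights; this is the sense in which counterfactual predictions, not just predictive accuracy, determine behavior.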

## AIXI’s Behavior

First, pretend that, despite being implemented on a physical computer, A were able to perform perfect Solomonoff induction. What model would A learn then? There are two natural candidates:

- A’s outputs are fed to the output wire W2, the rest of the universe (including A itself) behaves according to physical law, and A is given the values from input wire W1 as its input. (Model 1)
- A’s outputs are ignored, the rest of the universe behaves according to physical law, and A is given the values from W1 as its input. (Model 2)

Both of these models give perfect predictions, but Model 2 is substantially simpler (around 10000 bits simpler, and specifying A’s control over W2’s values in 10000 bits seems quite optimistic). Therefore A will put much more probability mass on Model 2 than Model 1. In fact, Model 2 or its close variants probably receive almost all of the probability mass.
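To make the size of this gap concrete, here is the arithmetic, using the 10000-bit figure assumed in the setup:

```python
# Back-of-the-envelope arithmetic for the claim above, using the 10000-bit
# figure assumed in the setup. Under the universal prior, an extra 10000
# bits of description length costs a factor of 2^10000 in prior probability.
# Since both models predict the observed data perfectly, the likelihoods
# are equal, so the prior odds are also the posterior odds.
extra_bits = 10000
odds_for_model2 = 2 ** extra_bits    # posterior odds, Model 2 : Model 1
print(len(str(odds_for_model2)))     # the odds ratio has 3011 decimal digits
```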

If A believed Model 2, what would its actions be? Well, in Model 2 A’s outputs are completely ignored! So although Model 2 may make up most of the probability mass, it contributes nothing important to the expected value computation determining A’s actions. So maybe A will make decisions *as if* it believed Model 1, and therefore behave appropriately for a surprising reason? In fact this is unlikely: there are many other models of intermediate complexity, for example:

- A’s outputs are usually ignored, the rest of the universe behaves physically, and then A is given the values from W1 as its input. However, if A ever outputs the sequence 0011011000111010100, then A’s inputs are modified to include a large reward signal on the next step. (Model 3)

Model 3 is an intentionally silly example, but hopefully the point is clear: Model 1 is enormously more complicated than the simplest explanation for A’s observations, and so there are many other (potentially very strange and arbitrary) models which account for the data but are simpler than Model 1. It is these models which will determine A’s behavior, and the result is almost certainly very arbitrary (this is similar to some discussions of Pascal’s Wager: it is possible that all of these strange models will cancel out and add up to normality, but it seems outlandishly unlikely).

For example, if A suspects Model 3 may be true it will be eager to test this hypothesis by outputting 0011011000111010100. It will find that it continues to receive inputs from W1, but no matter, it will move on to testing Model 4, which has the same form but with a different target bitstring. Of course this is not what A will actually do, as there will be other models with complicated effects on behavior, but this gives the flavor of A’s failure.
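The probing behavior can be caricatured in a few lines. In this sketch (entirely illustrative: the hypothesis space and the environment are made up), the agent enumerates candidate trigger strings in order of increasing length, i.e. increasing description-length penalty, and emits each one to test it:

```python
# Caricature of the probing failure described above: an agent entertaining
# "magic string" hypotheses (outputting string s yields a large reward)
# systematically emits candidate strings to test them, even though every
# test fails. The hypothesis space and environment are made up.
from itertools import product
from typing import Callable, Iterator, Optional

def magic_string_candidates(max_len: int) -> Iterator[str]:
    """Enumerate candidate trigger strings in order of increasing length
    (shorter strings correspond to simpler hypotheses)."""
    for n in range(1, max_len + 1):
        for bits in product("01", repeat=n):
            yield "".join(bits)

def probe(environment_reward: Callable[[str], float], max_len: int) -> Optional[str]:
    """Output each candidate string and check whether a reward arrives."""
    for s in magic_string_candidates(max_len):
        if environment_reward(s) > 0:
            return s
    return None

print(probe(lambda s: 0, 3))  # every probe fails: None
```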

If A somehow did accept Model 1, then we would be back in the situation normally discussed on Less Wrong: A believes that the values on W2 are magically made equal to A’s outputs, and so is unconcerned with its own real physical instantiation. In particular, note that having some uncertainty between Model 1 and Model 2 is not going to save A from any of these problems: in the possible worlds in which Model 2 is true, A doesn’t care at all what it does (A doesn’t “want” its physical instantiation to be destroyed, but by the same token it believes it has no control), and so A’s behavior reduces to the normal self-destructive behavior of Model 1.

## Approximate AIXI’s Behavior

An approximate version of AIXI may be able to save itself from existential despair by a particular failure of its approximate inference and a lack of reflective understanding.

Because A is only an approximation to AIXI, it cannot necessarily find the simplest model for its observations; its real behavior depends on the nature of its approximate inference. It seems safe to assume that A can discover some approximate versions of Model 1 and Model 2, or else its behavior will be poor for other reasons (for example, modern humans can't infer the physical theory of everything or the initial conditions of the universe, but their models are still easily good enough to support reductionist views like Model 2). Even so, its computational limitations may play a significant role.

#### Why A might not fail

How could A believe Model 1 despite its prior improbability? Note that A cannot perform a complete simulation of its physical environment (since it is itself contained in that environment), and so can never confirm that Model 2 really does correctly predict reality. It can acquire what seems to a human like overwhelming evidence for this assertion, but recall that A is learning an input-output relationship, and so it may assign *zero* probability to the statement "Model 2 and Model 1 make identical predictions," because Model 1 depends on the indeterminate input (in particular, if this indeterminate input were a truly random variable, it would be mathematically sound to assign zero probability to the assertion). In that case, no amount of evidence will ever allow A to conclude that Model 2 and Model 1 are equivalent: any observed equivalence would have to be the result of increasingly unlikely coincidences. (We can view this as a manifestation of A's ignorance of the fact that it is itself implementing an algorithm.)

Now consider A's beliefs about W2. It is relatively easy for A to check, for almost all timesteps, that Model 1 correctly predicts each bit on W2 (Model 1 simply copies A's outputs to W2), while A has enough time to verify only a few of these bits under Model 2 (which requires simulating the physical universe). Therefore the probability of Model 2 must be decreased by A's estimate of the likelihood that Model 2 would happen to set the correct value for every bit that A didn't have time to verify. Model 1's probability must be decreased likewise, but because A was able to check more of Model 1's predictions, Model 1 leaves less unexplained data and may end up less improbable than Model 2.
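The bookkeeping in this comparison can be sketched in description-length terms. In this toy calculation (my own illustration: the counts are made up, and I assume a crude penalty of one bit, i.e. a 1/2 chance of coincidental agreement, per unverified prediction), Model 1 pays its 10000-bit prior cost up front, while Model 2 pays an ever-growing penalty for unverified bits:

```python
def posterior_cost_bits(prior_bits: int, total_bits: int, verified_bits: int) -> int:
    """Total 'cost' of a model in bits: description length, plus one bit
    per prediction the agent did not have time to verify (a 1/2 chance,
    per bit, that the model happened to get it right anyway)."""
    return prior_bits + (total_bits - verified_bits)

T = 20000  # bits observed on W2 so far (illustrative)
cost_model1 = posterior_cost_bits(prior_bits=10000, total_bits=T, verified_bits=19990)
cost_model2 = posterior_cost_bits(prior_bits=0, total_bits=T, verified_bits=100)
print(cost_model1, cost_model2)  # 10010 19900: Model 1 has overtaken Model 2
```

On this crude accounting, Model 1 overtakes Model 2 once A has verified roughly 10000 more bits under Model 1 than under Model 2, matching the qualitative conclusion above.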

#### Why A probably fails anyway

In general, it is very fragile to expect that an AI will behave correctly because it *fails to realize* something. There are a number of obvious reasons that A might accept the reductionist hypothesis, even granting the above argument.

First, an effective approximate inference module seems unlikely to make the mistake described in the last section. Making this mistake correlates with other errors which may disrupt normal "sane" behavior: in general, given a model M which the agent cannot exactly simulate, uncertainty about M's outputs (even if M is known to produce mostly the same output, and never known to produce an incorrect output) causes the model's probability to fall off exponentially with the number of unverified computations. Any "sane" AI should be able to assign some constant probability to the hypothesis that the model exactly captures the output, and thereby avoid considering other exponentially improbable hypotheses. This may relate to deeper concerns about approximate Solomonoff induction.
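The contrast between the two inference styles can be written down directly. In this sketch (the constant `q_exact` is a made-up credence, and the 1/2-per-computation decay is the pathology just described):

```python
def pathological_weight(n_unverified: int) -> float:
    # Each unverified computation independently halves the model's
    # probability, so the weight vanishes exponentially.
    return 0.5 ** n_unverified

def sane_weight(n_unverified: int, q_exact: float = 0.1) -> float:
    # Reserve a constant lump of probability q_exact for the hypothesis
    # "this model exactly captures the output"; only the remainder decays.
    return q_exact + (1 - q_exact) * 0.5 ** n_unverified

print(pathological_weight(100))  # ~7.9e-31
print(sane_weight(100))          # ~0.1
```

The pathological inference drives the model's weight below that of exponentially many strange competitors; the "sane" inference keeps it bounded below by a constant no matter how many computations go unverified.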

Second, A may be able to observe the operation of the hardware on which it is running. This will generally make Model 1 much less probable: indeed, if A can observe even one "causal ancestor" of W2's value, it will no longer gain much by believing Model 1 as such, since in that case Model 1 only produces the correct output when Model 2 does anyway; all of Model 1's relative advantage comes from occasions when A can observe the value of W2 without observing the operations directly responsible for that value, which may be rare. Of course there are more complicated models in which A's outputs control reality in more subtle ways, and these may have lower complexity. Understanding this issue completely would require much more detailed knowledge of A's approximate inference and the nature of A's observations. In general, however, being able to observe its own computation seems like it may be enough to force A into a reductionist model.

Third, A’s approximate inference module may be aware of the fact that A’s own outputs are produced algorithmically (as a computational aid, not an underlying belief about reality). This would cause it to assign positive probability to the assertion “Model 2 is equivalent to Model 1,” and eventually force it into a reductionist model.

## Conclusion

Agents designed in the spirit of AIXI appear to be extremely fragile and vulnerable to the sort of existential despair described above. Progress on reflection is probably necessary not only to design an agent which refrains from killing itself when convenient, but even to design an agent which behaves coherently when embedded in the physical universe.
