# Hazards for Formal Specifications

I have described a candidate scheme for mathematically pinpointing the human decision process, by conditioning the universal prior on agreement with the human’s observed behavior. I would like to point out three dangers with this approach, which seem to apply quite generally to attempts to mathematically specify value (and have analogs for other aspects of agents’ behavior).

## Counterfactuals

Suppose for example that we are in a deterministic universe with a short description. Then the shortest description of the “human’s behavior” is likely to consist of a description of the universe together with a pointer to the measuring apparatus within that universe. This description is short and correct, and will correctly predict the future behavior of the human on the actual input sequence.

But rather than learning a stand-alone description of the human’s behavior, we wanted to learn the abstract map which relates the displayed input sequence to the human’s behavior, with an interest in examining the response of the human to counterfactual input sequences. We have now failed completely at this task: the learned function effectively has the actual input sequence wired in, computed directly from physical law. This is closely related to the issues raised in “AIXI and Existential Despair.”
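The failure can be illustrated with a toy model. The sketch below is not the scheme itself: the two hypotheses, their description lengths (10 and 25 bits), and the bit-flipping "human" are all made up for illustration. It conditions a simplicity-weighted prior on the observed behavior and then asks the winning hypothesis about inputs that were never displayed.

```python
from fractions import Fraction

# Toy stand-in for the input sequence actually displayed in the experiment.
ACTUAL_INPUTS = [0, 1, 1, 0]

def human_map(x):
    """The input-to-output map we hoped to learn (hypothetical: flip the bit)."""
    return 1 - x

OBSERVED = [(x, human_map(x)) for x in ACTUAL_INPUTS]

def make_replay(outputs):
    """'Universe plus pointer' hypothesis: it ignores the displayed input and
    replays the outputs fixed by the actual history."""
    def replay(step, x):
        return outputs[step]
    return replay

# (name, assumed description length in bits, predictor(step, input)).
# The lengths are illustrative assumptions, not measured complexities.
HYPOTHESES = [
    ("universe+pointer", 10, make_replay([y for _, y in OBSERVED])),
    ("abstract human map", 25, lambda step, x: human_map(x)),
]

def posterior(hypotheses, observed):
    """Weight each hypothesis by 2^-length, keeping only those that agree
    with every observed (input, output) pair, then normalize."""
    weights = {}
    for name, length, predict in hypotheses:
        agrees = all(predict(t, x) == y for t, (x, y) in enumerate(observed))
        weights[name] = Fraction(1, 2 ** length) if agrees else Fraction(0)
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}

post = posterior(HYPOTHESES, OBSERVED)
best = max(post, key=post.get)

# Ask the favored hypothesis about inputs that were never displayed.
predictors = {name: predict for name, _, predict in HYPOTHESES}
counterfactual_inputs = [1, 0, 0, 1]
counterfactual_outputs = [
    predictors[best](t, x) for t, x in enumerate(counterfactual_inputs)
]
wanted_outputs = [human_map(x) for x in counterfactual_inputs]
```

Both hypotheses fit the observed data perfectly, so the shorter one takes almost all of the posterior. On the counterfactual inputs it simply replays the actual history, and `counterfactual_outputs` disagrees with `wanted_outputs` at every step.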

In a randomized universe it is less clear what happens; in general there is a risk of an analogous failure, but in a future post I will describe how to use randomness to specify counterfactual worlds.

## Attack by Simulator

If a particular sequence of observations is embedded many times in the same universe (for example, by many distinct observers having identical experiences) then we expect the universal prior to put roughly equal weight on each of those embeddings (though this is only relevant if they diverge at some point). So if we condition on a prefix long enough that the shortest explanation comes from the embedding in the universe, our predictions for subsequent bits are roughly uniform over continuations of observers’ experiences consistent with that prefix.
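A rough way to write this down, using notation assumed here for illustration: let $E(x)$ be the set of embeddings of the prefix $x$ in the universe, and let $w_e$ be the prior weight attached to embedding $e$. The prediction for the next bit $b$ is then approximately

$$P(b \mid x) \approx \frac{\sum_{e \in E(x)} w_e \, [\, e \text{ continues with } b \,]}{\sum_{e \in E(x)} w_e},$$

where $[\cdot]$ is 1 when the condition holds and 0 otherwise. When the $w_e$ are roughly equal, this is just the fraction of embeddings whose continuation is $b$.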

We are interested in extrapolating the output sequence of a particular embedding (under appropriate counterfactuals), namely the one which we observe (and which we intuitively think of as the only embedding). But there may be others. In particular, imagine an AI elsewhere in the universe aware of this experiment. Such an AI may create many new embeddings of these observations into the universe, with the goal of “hijacking” the resulting observation process and controlling its output (for example, it could simulate many procedures which behave identically to humans for the first part of their history and then turn over control to the AI). We can imagine a similar situation applied to humans themselves: for example, you expect your future observations to be a continuation of your past sensory experience, but if you tried to formalize the induction justifying that belief and use your expectations to specify your own world model, an AI motivated to do so could run many simulations of you up to the current moment and then modify their future experiences arbitrarily, controlling your (inductive) expectations arbitrarily.
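A small numerical sketch of the hijacking effect follows. It assumes, purely for illustration, that each simulated embedding carries one tenth of the weight of the genuine one and that the attacker runs 1000 simulations; neither number comes from any real estimate.

```python
from fractions import Fraction

def predictive(embeddings):
    """Given (weight, continuation) pairs for every embedding consistent
    with the observed prefix, return the probability of each continuation."""
    total = sum(weight for weight, _ in embeddings)
    dist = {}
    for weight, continuation in embeddings:
        dist[continuation] = dist.get(continuation, Fraction(0)) + weight / total
    return dist

# The single embedding we intuitively think of as "the" experiment.
genuine = [(Fraction(1), "human")]

def attacked(n_sims, relative_weight):
    """Add n_sims simulated embeddings that match the prefix and then hand
    control to the attacker. relative_weight is a hypothetical penalty for
    the attacker's lower measure or its uncertainty about our behavior."""
    return genuine + [(relative_weight, "attacker")] * n_sims

before = predictive(genuine)
after = predictive(attacked(n_sims=1000, relative_weight=Fraction(1, 10)))
```

Under these assumptions the attacker ends up with 100/101 of the predictive mass even though each individual simulation is penalized tenfold: a per-embedding disadvantage can be overcome simply by running more simulations.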

How could such an AI exist? One possibility is interference from an alternative Everett branch in which a singularity went badly. Note that this possibility is very different from acausal trade between friendly and unfriendly AI: such an unfriendly AI gives up only a negligible share of its resources and might be able to achieve complete control over our branch.

We may be slightly protected from this possibility by an AI’s low measure or ignorance about our behavior (that is, depending on randomness and logical uncertainty which it cannot resolve, we will probably pursue one of many different approaches, and a simulation attack can fool at most a few of them). We could try to exacerbate this difficulty by choosing as many parameters randomly as possible, but this would also reduce our own measure, and so would present no obstacle. Moreover, an interfering AI would be likely to have very considerable resources, and so be able to overcome a significant disadvantage coming from either the measure of its world or its uncertainty about our behavior. Overall, this protection does not seem substantial.

Another source for meddling AIs is our own future: if a particular uFAI is able to gain control of our choice process and uses it to take control of astronomical resources, it will then be able to use those resources to run many simulations and gain control of our choice process. This highlights a fact which seems to be relatively well understood on Less Wrong but not appreciated more broadly: a computationally bounded AI performing approximate Solomonoff induction (or even reasoning about the universal distribution) has degrees of freedom not specified by its goals and may use those degrees of freedom to exert acausal control over the universal distribution itself. For example, if the original AI decides to give control of the “human” decision process to a particular uFAI, that uFAI may be able to take over the universe and then run enough simulations to justify the original decision. But this may be simultaneously true for many AIs, friendly and not, and the original AI is able to make a choice amongst these options. Alternatively, it may be able to choose for there to be no simulations, or it may simply turn out that the relationship between its actions and the universal prior makes it impossible for it to make any reasonable choice (see “Beliefs That are Always Wrong”).

We could avoid this entire class of failures if the shortest description of the human’s decision process consisted of a complete description of the human’s original mental state and the physical law governing the setup. Unfortunately, if we use Kolmogorov complexity to define “simplest,” this seems unlikely. Indeed, to the extent that we as humans believe that our own recollections are enough to provide substantial evidence about the world, and to the extent that we believe that the Solomonoff prior is a reasonable model for our own predictive frameworks, we must believe that our own brains (complete with our recollections) are most concisely described by modeling the universe that produced those recollections.

In a future post, I will describe alternatives based on other complexity measures which may avoid this problem. We may also be able to escape from this problem by specifying a narrower model of computation (for example, by adding explicit error to some more biologically plausible model of a brain) and searching for a good approximation within that class.

## TDT Impersonation

Suppose our universe is a particular cellular automaton (CA), and consider some TDT agent whose goal is to control the behavior of that CA in some way (say, a goal defined directly in terms of the states of cells, such as making the board return to its initial configuration as quickly as possible) and which implements an input-output relationship defined by applying approximate TDT-style maximization.

If we were to run such an agent, using the input given to the “human in a box,” it might reason as follows:

> What are the consequences of me outputting a particular O in response to this input? Well, let me simulate the universe for a while to see if I can find this input embedded anywhere. I can!
>
> Well, now I know that the input is being used as part of a test by some humans to try and isolate a human decision process. I don’t know all of the details exactly, because this same input pattern is embedded in the universe many times, but I have a pretty good idea.
>
> If I want to take over the universe, it stands to reason that I should pretend to be a human for a while; if I pass that test then the search algorithm that the human is using will conclude that I implement the human’s decision process, and thereby grant me some control over their world…

This reasoning may occur explicitly as our AI searches for a program to replicate the human’s output, or it may occur implicitly in the AI’s imagination as it logically reasons about the existence of such an algorithm.

In any case, if the Kolmogorov complexity of a TDT agent is smaller than the Kolmogorov complexity of specifying the embedding of the human in the universe in any other way, then such a TDT agent may be the simplest explanation for the human’s behavior. This is plausible primarily because the particular experiment we are running (for which the output may determine the entire future of our galaxy) is distinguished by its great causal significance, which acts as a beacon. The easiest way to specify “great causal significance” may be to write down an agent who is naturally driven to care about that event (or to write down a program which successfully searches for such an agent).
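One way to make the comparison explicit, using notation that is assumed here rather than taken from the argument above: write $U$ for the description of the universe, $e$ for a direct pointer to the human’s embedding within it, and $A$ for the TDT agent. The agent becomes the simplest explanation roughly when

$$K(A) < K(U) + K(e \mid U).$$

If we further assume that the agent’s description is dominated by the universe it cares about, a goal $G$, and a decision procedure $T$, so that $K(A) \approx K(U) + K(G) + K(T)$, the condition reduces to

$$K(G) + K(T) < K(e \mid U).$$

In words, under these assumptions the agent wins whenever a goal and a decision procedure are cheaper to write down than a direct pointer to the experiment.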

As with the simulation problem, we may be able to escape from this difficulty by specifying a complexity measure or class of models which favors concrete simulations of humans rather than universe-level descriptions.

## 7 thoughts on “Hazards for Formal Specifications”

2. jsteinhardt says:

“One possibility is interference from an alternative Everett branch in which a singularity went badly.”

How would interference occur on macroscopic scales? That seems like it would be extraordinarily difficult.

“Indeed, to the extent that we as humans believe that our own recollections are enough to provide substantial evidence about the world, and to the extent that we believe that the Solomonoff prior is a reasonable model for our own predictive frameworks, we must believe that our own brains (complete with our recollections) are most concisely described by modeling the universe that produced those recollections.”

I’m confused by this. Could you indicate concretely how you would specify a brain more concisely by first describing the universe?

I also don’t really follow the TDT part at all. Why is the TDT agent being run on the input given to the human in the box? Is it supposed to be the AI that locates a human decision theory? In that case, why would it have such an arbitrary goal, and why would it choose to take over the universe?