Sometimes we may want to formally talk about objects that actually exist in the world, using mathematical language. One way to try to do this is by recording some sequence of observations about the world, and then applying Solomonoff induction. The hope would then be: if you apply Solomonoff induction to the sequence of things you’ve seen so far, it will correctly predict whatever you next see. In this post I’ll describe a problem with applying this approach to anything particularly important.
If we look at the simplest programs that account for your observations, once you’ve observed enough things (say, many more observations than there are bits in the description complexity of the universe), Solomonoff induction finds predictors of the form, “Simulate the universe with description D. Then look inside that universe, at the spacetime trajectory T, and return the bits encoded along that trajectory (using some encoding protocol P).” Thus, Solomonoff induction approximately reduces to what is called counting measure over observers within our own universe: the probability that you will next observe O is roughly the fraction of observers who have seen everything you’ve seen so far and next observe O. (This isn’t quite right, because observers later in time are harder to specify, since it takes more bits to write down their spacetime coordinates, and so they get slightly discounted. There are a few other issues like this, but they don’t change the basic picture.)
But if you apply Solomonoff induction, and you plan to use the result of this computation for something really important, most of the probability mass is actually controlled by sequences which have been constructed artificially–either within a simulation (see the usual simulation argument) or just as a sequence of inputs written down in really big letters over and over again, for the precise purpose of manipulating the counting measure. So, for example, if I build an AI and try to incentivize it by saying “if you do the right thing, I will provide a reward,” and if that AI uses approximate Solomonoff induction, and if the AI was positioned to steal 1% of the universe by defecting (though we’ve set up things so cleverly that it would first get punished if it tried)… then we might expect some future folk to control that AI by building simulations of it, and rewarding the simulations if and only if the AI tries to defect and give its 1% of the world to those manipulative future folk. (Who could then—perhaps?—use this 1% of the world to manipulate the original AI.)
It’s a wacky situation, but I think at the kernel is an interesting phenomenon. In this post I’m just going to try to argue that this is, in fact, what would happen. To get a handle on things, I’m going to start by looking at similar situations in cellular automata.
Deterministic Cellular Automata
Suppose that we have some (modestly complex) cellular automaton, which contains some observers doing science. The observers want to find a formal mathematical model for their physics (and, say, use it to direct an AI). So they take a sensor, put it at rest in the cellular automaton’s preferred rest frame, let it sense for a while, and then condition the universal prior on agreement with the sensor’s inputs. Then they ask: what does that distribution predict the sensor will sense next?
What do they hope for? They hope that the distribution will predict whatever the sensor “actually” senses. So that, for example, if I gave an AI the task of making sure the future inputs on the sensor recorded us pressing an APPROVE button, the AI would really do something that would cause us to press APPROVE.
Suppose the sensor’s input is X. What short programs output X? Well, if the cellular automaton is relatively simple, one simple program will be P, defined as follows. P has hard-coded into it the laws of physics, a mechanism for translating the sensor’s raw physical state into bits, the position of the sensor, and the start time. P works by running the laws of physics, looking in the spacetime locations specified, and running the specified function to transform the sensor’s physical state into bits. If the sensor is particularly simple (for example, if it runs on a regular clock, and if there is a particular cell of the cellular automaton which perfectly correlates with the sensor’s reading), then P is an extremely simple program. In general, P is much simpler than any “local” algorithm for predicting X, unless X is very short or the sensor is very well isolated from the environment.
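To make the shape of P concrete, here is a minimal sketch, assuming a one-dimensional binary automaton with periodic boundaries and the simplest possible sensor (a single cell read once per step). The names `step` and `program_P`, and the choice of an elementary rule number as "the laws of physics", are illustrative assumptions and not something the post specifies.

```python
from typing import List, Sequence


def step(cells: Sequence[int], rule: int) -> List[int]:
    """Advance a 1-D binary cellular automaton by one step (periodic boundary).

    `rule` is an elementary (Wolfram-style) rule number in 0..255.
    """
    n = len(cells)
    return [
        (rule >> (4 * cells[(i - 1) % n] + 2 * cells[i] + cells[(i + 1) % n])) & 1
        for i in range(n)
    ]


def program_P(rule: int, initial: Sequence[int], sensor_cell: int,
              start_time: int, length: int) -> List[int]:
    """Hypothetical stand-in for P: run the physics, then read the sensor.

    Everything P needs is hard-coded in its arguments: the laws (rule and
    initial state), the sensor position, and the start time. The decoding
    step is trivial here because the sensor is a single cell.
    """
    cells = list(initial)
    for _ in range(start_time):      # run the universe up to the start time
        cells = step(cells, rule)
    observed = []
    for _ in range(length):          # record one bit per step at the sensor
        observed.append(cells[sensor_cell])
        cells = step(cells, rule)
    return observed


if __name__ == "__main__":
    universe = [0] * 5 + [1] + [0] * 5
    print(program_P(90, universe, sensor_cell=4, start_time=0, length=4))
```

Changing only `sensor_cell` or `start_time` gives a different program of nearly the same length, which is the observation the next paragraph builds on.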
But for any other compactly specified sequence S of spacetime points, there is a different program Q(S) which outputs the sequence of observations embedded at those points. Each program Q(S), P included, has roughly comparable complexity; programs with larger coordinates are slightly more complicated, but not much more so. Doubling the position of a point only increases its description complexity by a constant, decreasing the relevance of that point’s contributions by only a constant factor.
So the part of the universal distribution made up of the programs Q(S) is roughly a counting measure over observers within the cellular automaton (time and space-discounted by something very slightly superlinear, just enough to make the sum converge). But most of the counting measure really does lie within simulations. And if someone has an incentive to manipulate such simulations, or to simply directly embed the sequence X in the universe, then their motivations will control the predicted continuation of X under the universal prior.
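The discounted counting measure can be sketched for a toy automaton as follows. This is an illustration under stated assumptions, not the universal prior itself: it only considers trajectories that sit at a fixed cell, and the `discount` function (roughly 1 / (n log² n), one possible "very slightly superlinear" choice) and all function names are hypothetical.

```python
import math
from typing import List, Optional, Sequence


def step(cells: Sequence[int], rule: int) -> List[int]:
    """One step of a 1-D binary cellular automaton with periodic boundary."""
    n = len(cells)
    return [
        (rule >> (4 * cells[(i - 1) % n] + 2 * cells[i] + cells[(i + 1) % n])) & 1
        for i in range(n)
    ]


def run(rule: int, initial: Sequence[int], steps: int) -> List[List[int]]:
    """Return the full spacetime history: history[t][cell]."""
    history = [list(initial)]
    for _ in range(steps):
        history.append(step(history[-1], rule))
    return history


def discount(n: int) -> float:
    """Assumed weight for coordinate n; decays just fast enough to be summable."""
    m = n + 2
    return 1.0 / (m * math.log2(m) ** 2)


def predict_next(history: List[List[int]], x: Sequence[int]) -> Optional[float]:
    """Probability that the next bit is 1, under the discounted counting measure.

    Every (start time, cell) whose readings reproduce x counts as one
    'observer who has seen x so far'; each votes for its own next bit.
    """
    mass = [0.0, 0.0]
    k = len(x)
    for start in range(len(history) - k):
        for cell in range(len(history[0])):
            if all(history[start + i][cell] == x[i] for i in range(k)):
                mass[history[start + k][cell]] += discount(start) * discount(cell)
    total = mass[0] + mass[1]
    return mass[1] / total if total > 0 else None


if __name__ == "__main__":
    hist = run(110, [0] * 20 + [1], steps=60)
    print(predict_next(hist, [1, 1, 0]))
```

Nothing in `predict_next` distinguishes the original sensor from any other location that happens to carry the same bits, so whoever writes the most (or the cheapest to locate) copies of x into the history controls the prediction.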
It seems that this may be a deal-breaker when considering the universal prior for any application of this form. If you build an AI that wants the approve button to be pressed and you use this technique to define its utility function, you aren’t going to get an agent that cares about the “real” world. Instead, you will get an agent that cares about all of the simulators which have motivations to mess with it. Maybe that leads to good outcomes–it just depends on whether values we like control most of the available computational resources (and how much of their resources would be wasted fighting these sorts of fights).
Complications
If the sensor in question is harder to specify (whether because it is traveling, or timed or interpreted inconsistently), this technically makes the above argument stronger. After all, P is then contributing less to the universal prior.
However, it may be that every embedding of X in the universe is rather hard to locate, so that even an incentivized agent cannot really write down X in a way that can be picked out by a simple program. If these difficulties were severe enough, then programs Q(S) may collectively not contribute much to the measure of X. Instead, most of the mass may come from simpler “local” prediction algorithms, like those humans use to predict/compress their environment, which are ignorant about the exact state of the rest of the universe but fill in the details by guessing.
But this seems unlikely. The basic problem is that the difficulty of picking out an input sequence seems tiny compared to the difficulty of filling in all of the unknown details of the universe. The extra complexity of specifying timing info, movements of the substrate in which X is written, and whatever messy algorithm is necessary to interpret X, seems like it must be at most a modest constant (independent of the length of X); once the length of X is significantly larger than that constant, the losses from guesswork will greatly exceed this constant, unless the sensor is very well isolated from the rest of the environment. Intuitively, the sensor is (at a minimum) well enough connected to the environment if its observations are enough to infer many properties of the outside world (for example, if the sensor saw enough to infer something about an observer’s knowledge of cosmology, this would give an advantage to models at large enough scales to account for those cosmological features).
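The comparison above can be written as a rough inequality. This is only a sketch, and the notation is introduced here for illustration rather than taken from the post: K(D) is the description length of the physics, c_loc is the one-off cost in bits of locating and decoding X inside the simulated universe, K(L) is the description length of a local predictor L, and ε is the average number of bits per observation that L loses by guessing at details it does not model.

```latex
\begin{align*}
\text{cost of } Q(S) &\approx K(D) + c_{\mathrm{loc}}, \\
\text{cost of a local predictor } L &\approx K(L) + \varepsilon\,|X|, \qquad \varepsilon > 0, \\
Q(S) \text{ is favored once } |X| &> \frac{K(D) + c_{\mathrm{loc}} - K(L)}{\varepsilon}.
\end{align*}
```

The left-hand cost is constant in |X| while the right-hand cost grows linearly, so the only way for the local predictor to stay competitive is for ε to be extremely small, which is the "very well isolated sensor" case.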
For randomized automata the case is rather different. In order to pick out a sequence of observations, generically, we need to provide not only coordinates but also the entire random seed. The necessary random seed is not just long, it’s much longer than any input sequence X which we might plausibly extract from the world (and more randomness is injected at each step, which guarantees that you always need more randomness than the information you get out). So each program Q(S) is contributing rather little to the mass of X under the universal prior: namely, the probability that the sequence of observations at S would be equal to X.
I strongly suspect that we reach the same conclusions in the randomized case as in the deterministic case, however.
What simple programs other than Q(S) explain X? As above, the natural candidates are coarse “local” explanations, which may maintain a detailed model for the local environment, but not the rest of the universe. But it seems unlikely that the randomness of the underlying cellular automaton really favors such explanations. Indeed, consider some local observations distributed under the real cellular automaton. There are two cases. (1) The distribution of observations under the local predictor is the same. In this case, the local predictor is losing out for the same reason that local predictors lose out in deterministic phenomena. (2) The distribution of observations under the local predictor is different. In this case, by definition, the average log probability assigned by the real distribution is better than that of the local predictor, so the local predictor does worse than in case (1).
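The claim in case (2) is Gibbs’ inequality: when observations are drawn from the real distribution p, no other distribution q assigns a higher average log probability than p itself, and the shortfall is the KL divergence. Here is a small numerical check; the two distributions are made-up numbers chosen only for illustration.

```python
import math
from typing import Sequence


def expected_log2_prob(p: Sequence[float], q: Sequence[float]) -> float:
    """Average log2-probability that predictor q assigns to samples drawn from p."""
    return sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)


def kl_bits(p: Sequence[float], q: Sequence[float]) -> float:
    """KL divergence D(p || q) in bits: the predictor's average per-sample loss."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


if __name__ == "__main__":
    real = [0.5, 0.25, 0.25]     # hypothetical true distribution of an observation
    local = [1 / 3, 1 / 3, 1 / 3]  # hypothetical mismatched local predictor
    print("real model :", expected_log2_prob(real, real))
    print("local model:", expected_log2_prob(real, local))
    print("gap (bits) :", kl_bits(real, local))
```

The gap is paid on every observation, so over a long sequence X a mismatched local predictor falls behind by an amount that grows linearly in the length of X.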
What other programs might there be? I don’t know; I can’t think of any. Generically it will be impossible to rule out the possibility of some other, still shorter, algorithms to account for our observations, but in this case I don’t know why we would expect one to exist.
I think the sense that something tricky is going on here (at least, I have that sense) is coming from different intuitions being applied to “coordinates” in some exponentially large probability distribution vs. uncertainty in an algorithm like the ones humans use. Of course the two are precisely equivalent, but it feels like locating myself in a universe parametrized by an astronomically large random seed is very hard, while it feels like my local prediction algorithm is pretty good.
The case for our universe seems to be broadly analogous to that of a general randomized cellular automaton. Yes, we are less sure about how we might encode a program which extracts observation sequences from the world, but it would be surprising if these obstructions were insurmountable, for the same reasons as given in ‘Complications’ above.
Avoiding the Trouble
The obvious way to avoid this sort of thing is to avoid the universal prior. I mentioned before the possibility of using a prior which penalizes algorithms which use a lot of time or a lot of space. This seems to rule out these particular problematic models, and force you to use a coarse local model. It is probably worth thinking a bit about how those coarse local models generically perform (say, in cellular automata).
It seems like there should be a more philosophically sophisticated way to point to “basements” than this, though; using space-bounded models feels like a bit of a hack, and throws out a lot of information. What we really care about, in general, is some measure of “generic” decision-theoretic relevance (or relevance to our personal values). We care about non-simulated copies of ourselves because they have much more influence on the world. Maybe there is some way to access this measure directly. Alternatively, it may be possible to bargain with agents as a way of explicitly eliciting predictions with respect to this measure: maybe you can’t define what “important” means directly, but you can define it implicitly by making sure those agents also care more about worlds which aren’t simulated. But this all seems to be surprisingly difficult. I think there is a very small probability that this captures a rather serious obstruction to defining things formally. (Perhaps it is easy to define something well enough when no one is looking, but once anyone cares, whatever definition you use is going to get manipulated. If something like this were really the case, it would create systems with very exotic failure modes.)
10 thoughts on “The motivated simulator argument”
But humans are sensors too! How do you know that you care about the real world rather than about all the simulators which have incentives to mess with you? That should be especially difficult if you’re an AI researcher 🙂 If humans can minimize the impact of meddling simulations by using some sort of magic of subjective anticipation, then maybe we should figure out how that magic works and use it in our AIs…
1. Humans should generally expect the simulations of them to continue obeying the same laws they always have (until there is some reason not to expect it any more).
2. Humans care about their ability to influence the world, which is much larger in basements (this accounts for why, e.g., I probably don’t care so much about simulations). I think this is the principled magic that allows you to not care about simulations, and you can pull it back from UDT recommendations to subjective anticipations.
3. I think insofar as humans care about the probable continuations of their experiences, they should mostly be concerned with simulations.
Though note that if there’s a transcendent creator deity then “original worlds” aren’t obviously more important than nor distinguishable from simulated worlds, depending on the preferences of the creator deity. Given that hypothesis the stellar resources might not be “out there” even in the vanishingly small fraction of worlds where humans are the first agents to gain access to the cosmic commons.
(I’ll note that one should withhold judgment on this idea until one has seriously considered the relevant theological arguments. Surprisingly few people seem to understand theology, even though it’s clearly one of humanity’s most important fields of inquiry. Politics is the mind-killer.)
I remark that if the problem can be framed such that a crisp AI, running on a logical decision theory with a solved problem of maximization vantage points with a no-blackmail equilibrium, would regard the simulations constructed as being of negative utility, then the AI will refuse to be motivated by the simulations so as not to give the simulators an incentive to construct those simulations in the first place. The problem arises when the simulators can offer the AI something, such as the certain press of a reward button, such that the AI prefers the universes in which the simulations were constructed. It should also be kept in mind that however you construct the AI so as not to be thus motivated, the simulators are trying to find a way to make it a positive-sum bargain from the AI’s perspective regardless of your intentions, so you must be very confident indeed that the AI disprefers every possible scenario in which it is simulated.
Yeah this just seems like a variant of the general FAI problem. If the AI can be motivated by future people with an agenda offering it things, it’s unsafe to start. What this says to me is just that you can’t rely on being the sole source of payoff for your AI to ensure AI safety. Your AI has to be treating the reward button as a way for you to communicate your value function to it, not a payoff in itself, in which case the future people offering to hold down the reward button are just obvious fraud. (Of course, maybe some of the future folk are instances of you…)