I want to get in the habit of sharing more of my unpolished thoughts about topics I consider important. The hope is to shift from an equilibrium where I say little (and therefore feel like I’m endorsing whatever I do say as having an unusually high quality, causing me to say even less), to an equilibrium where I say much more and feel more comfortable sharing unpolished thoughts. I think “quiet” is an OK equilibrium (most people who should read some of my thoughts shouldn’t read most of them, and it would make sense for me to try and be selective). But it seems like a suboptimal equilibrium, since there are at least a few people who do care what I think, often to better understand our disagreements.
A similar social problem is possible, where a broader intellectual community tends towards a “quiet” equilibrium in which any public speech comes with an implicit claim of interestingness and worthwhileness. I think intellectual communities vary considerably in where they sit on this spectrum. Bloggers qua bloggers often write quite a bit; academics in computer science tend to be much more guarded about what they say. I think there are some virtues to the noisier equilibrium, particularly in increasing our ability and inclination to notice, understand, and resolve disagreements.
Anyway, in that spirit, here is some of my thinking about AI—an outline of the development scenario I consider most likely, and a general discussion of the impacts of consequentialist automation. Criticism is welcome, and you can leave comments on the (quite hard-to-navigate) google doc here.
Intuitively, it seems that any simple hypothesis should receive a reasonable prior probability. It seems unreasonable to assign a simple hypothesis a ridiculously low probability. In some sense this is a matter of preference, but here I’ll briefly argue:
If H is a simple hypothesis and the prior probability of H is very small, then there is nothing you could possibly see which would convince you of H.
Which suggests that this is at a minimum a very dangerous thing to do (though I’m open to the possibility that there are things you should simply never believe, no matter what you see).
The argument is basically a formalization of the platitude that you can only ever get so much evidence, because hey, it’s not that unlikely that you are crazy. In fact that claim is false, as Jacob Steinhardt was quick to point out when I made it recently. The possibility that you are crazy or in a simulation doesn’t preclude you making arbitrarily large updates. It just precludes making arbitrarily large updates in favor of simple hypotheses.
First, by “simple hypothesis” I mean a hypothesis which has a short description in whatever internal language you use to represent hypotheses. So “God exists” and “God doesn’t exist” count, so long as your brain assigns them subjective probabilities.
Second, by “prior probability” I really mean the probability you assign at some arbitrary point in time, in particular after updating on all of your introspective access. Then the result will imply that at no future time can you come to believe H, without forgetting the evidence you have already encountered.
Third, by “very small” I mean much smaller than 2^-|H|. And by “much smaller” in that sentence I mean by a large factor (which doesn’t depend on H). 2^100 should be plenty.
Fourth, by “convince of X” I mean “cause you to believe X as the result of a correct Bayesian update.”
Finally, I’m going to make the assumption that you don’t assign negligible probability to skeptical scenarios (like the simulation hypothesis, insanity, or your experiences being an elaborate hoax). This can be justified by observing that such skeptical scenarios apply to a non-negligible fraction of observers in many reasonable worlds (for example, physical universes like ours). In fact, in many reasonable worlds skeptical scenarios apply to all observers. So in order to assign skeptical scenarios negligible probabilities, you’ve got to assign a whole lot of otherwise reasonable worlds negligible probabilities.
Having made those definitions the claim is now pretty straightforward. Consider some arbitrary evidence E you might receive, which might constitute substantial evidence in favor of H. I’ll exhibit a hypothesis Q such that P(H) << P(Q) and P(E|H) = P(E|Q), where P is your prior. This implies that P(H|E) / P(Q|E) = P(H) / P(Q) << 1, so in particular P(H|E) << 1.
Q is the hypothesis: H is false, however your experiences are manufactured according to the distribution P( * | H). Clearly then P(E|Q) = P(E|H). To see that P(Q) >> P(H), consider the claim S: a hypothesis H’ was selected uniformly at random using my internal representation of hypotheses, and then my experiences were manufactured according to the distribution P( * | H’).
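If you assign skeptical hypotheses reasonable probability, P(S) should not be too small. Moreover, P(Q|S) is at least 2^-|H|. Thus P(Q) ≥ P(S) · P(Q|S) ≥ P(S) · 2^-|H|, and since P(H) is smaller than 2^-|H| by a factor far larger than 1/P(S), we get P(Q) >> P(H), as desired.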
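To make the update concrete, here is a minimal numeric sketch. The specific priors and likelihoods are illustrative assumptions, not values from the argument; the only feature that matters is that H and Q assign the same probability to E, so conditioning on E cannot change their ratio.

```python
from fractions import Fraction


def posterior(priors, likelihoods):
    """Bayes' rule over a finite set of mutually exclusive hypotheses."""
    joint = {h: priors[h] * likelihoods[h] for h in priors}
    total = sum(joint.values())
    return {h: joint[h] / total for h in joint}


# Illustrative (assumed) numbers: H has a tiny prior, and the skeptical
# hypothesis Q ("H is false, but experiences look as if H") has a larger one.
p_H = Fraction(1, 2**60)
p_Q = Fraction(1, 2**30)
priors = {"H": p_H, "Q": p_Q, "rest": 1 - p_H - p_Q}

# E is certain under H, and equally certain under Q by construction;
# it is wildly unlikely under everything else.
likelihoods = {"H": Fraction(1), "Q": Fraction(1), "rest": Fraction(1, 2**50)}

post = posterior(priors, likelihoods)
ratio_before = p_H / p_Q
ratio_after = post["H"] / post["Q"]
```

Even though E is 2^50 times more likely under H than under the ordinary alternatives, `ratio_after` equals `ratio_before`, and almost all of the posterior mass lands on Q rather than on H.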
Here is an argument that some people might find compelling:
It may be that the world is mad, and that as the only sane person around it falls on me to make sure we don’t all kill ourselves. If that’s the case, then my impact on the world may be huge. Let’s say that in this case, I can improve the world by 10%.
Maybe the claim that I’m the most important guy around, call it proposition P, isn’t certain. But at least there’s a good chance. Subjectively it feels like about 10%, since if I looked at 10 similarly surprising facts, I would expect one of them to be true. (I wouldn’t be that surprised to discover that I’m the most important person ever…) That still leaves me with the ability to improve the world by 1% in expectation, which looks pretty good. I might as well not even worry about stuff I could do that would improve the world by a mere 0.001%, like being an extraordinarily successful entrepreneur.
What is wrong with this argument? Intuitively, the trouble is that the prior probability of P must be quite low. Out of the 7 billion people on Earth, at most a handful can be so important. So even if you discovered evidence that suggested P quite strongly, you ought to remain skeptical. Even if a magic 8-ball which lied only one time in a million told you that you were the most influential person alive, you should still bet against it—after all, 7,000 people will hear this particular lie, while only one will be right. (Setting aside the fact that your mere possession of such an 8-ball constitutes much more than million to one evidence!)
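The 8-ball arithmetic can be checked directly. This is a small sketch that assumes exactly one of the 7 billion people is "the most influential" and that each person is a priori equally likely to be that one; both are simplifying assumptions made for the calculation.

```python
from fractions import Fraction

population = 7_000_000_000
lie_rate = Fraction(1, 1_000_000)  # the 8-ball lies one time in a million

# Assumed uniform prior: one person in the population is the important one.
prior = Fraction(1, population)

# The 8-ball says "you are the most influential person alive".
p_says_yes_and_true = prior * (1 - lie_rate)
p_says_yes_and_false = (1 - prior) * lie_rate
posterior_important = p_says_yes_and_true / (
    p_says_yes_and_true + p_says_yes_and_false
)

# Expected number of unimportant people who are told the same thing.
expected_false_claims = (population - 1) * lie_rate
```

Under these assumptions roughly 7,000 people hear the lie, and the posterior probability of actually being the important one is only about 1 in 7,000.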
If you have some clever argument that you aren’t “in the same reference class” as those other 7 billion people, you need to be awfully sure that it would be difficult to manufacture that argument if you weren’t in fact the most influential person. If you had a 1 in a million chance of being able to delude yourself into thinking you were special, you’d still be wrong nearly 99.99% of the time. But, replies the skeptic…
Putting a prior probability of 1 in 7 billion on something plausible is ridiculous! After all, if the argument you just gave has even a 1% chance of being wrong, then I might have a prior probability of up to 1%! Do you think you could make 100 arguments that compelling, before you messed one of them up?
Sometimes we may want to formally talk about objects that actually exist in the world, using mathematical language. One way to try to do this is by recording some sequence of observations about the world, and then applying Solomonoff induction. The hope would then be: if you apply Solomonoff induction to the sequence of things you’ve seen so far, it will correctly predict whatever you next see.
So if we look at the simplest programs that account for your observations, once you’ve observed enough things (say, many more observations than there are bits in the description complexity of the universe), Solomonoff induction finds predictors of the form, “Simulate the universe with description D. Then look inside that universe, at the spacetime trajectory T, and return the bits encoded along that trajectory (using some encoding protocol P).” Thus, Solomonoff induction approximately reduces to what is called counting measure over observers within our own universe: the probability that you will next observe O is roughly the fraction of observers who have seen everything you’ve seen so far and next observe O. (This isn’t quite right, because observers later in time are harder to specify, since it takes more bits to write down their spacetime coordinates, and so they get slightly discounted. There are a few other issues like this, but they don’t change the basic picture.)
But if you apply Solomonoff induction, and you plan to use the result of this computation for something really important, most of the probability mass is actually controlled by sequences which have been constructed artificially, either within a simulation (see the usual simulation argument) or just as a sequence of inputs written down in really big letters over and over again, for the precise purpose of manipulating the counting measure. So, for example, if I build an AI and try to incentivize it by saying “if you do the right thing, I will provide a reward,” and if that AI uses approximate Solomonoff induction, and if the AI was positioned to steal 1% of the universe by defecting (though we’ve set up things so cleverly that it would first get punished if it tried)… then we might expect some future folk to control that AI by building simulations of it, and rewarding the simulations if and only if the AI tries to defect and give its 1% of the world to those manipulative future folk. (Who could then, perhaps, use this 1% of the world to manipulate the original AI.)
It’s a wacky situation, but I think at the kernel is an interesting phenomenon. In this post I’m just going to try to argue that this is, in fact, what would happen. To get a handle on things, I’m going to start by looking at similar situations in cellular automata.
Deterministic Cellular Automata
Suppose that we have some (modestly complex) cellular automaton, which contains some observers doing science. The observers want to find a formal mathematical model for their physics (and, say, use it to direct an AI). So they take a sensor, put it at rest in the cellular automaton’s preferred rest frame, let it sense for a while, and then condition the universal prior on agreement with the sensor’s inputs. Then they ask: what does that distribution predict the sensor will sense next?
What do they hope for? They hope that the distribution will predict whatever the sensor “actually” senses. So that, for example, if I gave an AI the task of making sure the future inputs on the sensor recorded us pressing an APPROVE button, the AI would really do something that would cause us to press APPROVE.
Suppose the sensor’s input is X. What short programs output X? Well, if the cellular automaton is relatively simple, one simple program will be P, defined as follows. P has the laws of physics, a mechanism for translating the sensor’s raw physical state into bits, and the position of the sensor and the start time, all hard-coded into it. P works by running the laws of physics, looking in the spacetime locations specified, and running the specified function to transform the sensor’s physical state into bits. If the sensor is particularly simple (for example, if it runs on a regular clock, and if there is a particular cell of the cellular automaton which perfectly correlates with the sensor’s reading), then P is an extremely simple program. In general, P is much simpler than any “local” algorithm for predicting X, unless X is very short or the sensor is very well isolated from the environment.
But for any other compactly specified sequence S of spacetime points, there is a different program Q(S) which outputs the sequence of observations embedded at those points. Each program Q(S) has roughly comparable complexity, including P; programs with larger coordinates are slightly more complicated, but not much more so: doubling the position of a point only increases its description complexity by a constant, decreasing the relevance of that point’s contributions by only a constant factor.
So the part of the universal distribution made up of the programs Q(S) is roughly a counting measure over observers within the cellular automaton (time and space-discounted by something very slightly superlinear, just enough to make the sum converge). But most of the counting measure really does lie within simulations. And if someone has an incentive to manipulate such simulations, or to simply directly embed the sequence X in the universe, then their motivations will control the predicted continuation of X under the universal prior.
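The "slightly superlinear" discount can be illustrated with a concrete prefix code for coordinates. The sketch below uses the Elias delta code as a stand-in for "the number of bits needed to write down a position"; that choice is an assumption made for illustration, not something the argument depends on. Under it, a position n costs about log2(n) + 2·log2(log2(n)) bits, so its weight falls off like 1 / (n · log²n).

```python
def position_code_length(n: int) -> int:
    """Length in bits of the Elias delta code for a positive integer n."""
    if n < 1:
        raise ValueError("positions are numbered from 1")
    magnitude = n.bit_length() - 1                      # floor(log2 n)
    magnitude_bits = (magnitude + 1).bit_length() - 1   # floor(log2(magnitude + 1))
    return magnitude + 2 * magnitude_bits + 1


def position_weight(n: int) -> float:
    """Prior weight contributed by the cost of naming position n."""
    return 2.0 ** -position_code_length(n)


# Doubling a coordinate costs only a constant number of extra bits ...
extra_bits = {
    position_code_length(2 * n) - position_code_length(n)
    for n in range(1, 10_000)
}

# ... and the weights still sum to less than 1 (Kraft inequality).
total_weight = sum(position_weight(n) for n in range(1, 2**16))
```

Here `extra_bits` contains only small constants, which is the sense in which far-away observers are discounted by a constant factor per doubling rather than exponentially in their distance.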
It seems that this may be a deal-breaker when considering the universal prior for any application of this form. If you build an AI that wants the approve button to be pressed and you use this technique to define its utility function, you aren’t going to get an agent that cares about the “real” world. Instead, you will get an agent that cares about all of the simulators which have motivations to mess with it. Maybe that leads to good outcomes–it just depends on whether values we like control most of the available computational resources (and how much of their resources would be wasted fighting these sorts of fights).
Complications
If the sensor in question is harder to specify (whether because it is traveling, or timed or interpreted inconsistently), this technically makes the above argument stronger. After all, then P is contributing less to the universal prior.
However, it may be that every embedding of X in the universe is rather hard to locate, so that even an incentivized agent cannot really write down X in a way that can be picked out by a simple program. If these difficulties were severe enough, then programs Q(S) may collectively not contribute much to the measure of X. Instead, most of the mass may come from simpler “local” prediction algorithms, like those humans use to predict/compress their environment, which are ignorant about the exact state of the rest of the universe but fill in the details by guessing.
But this seems unlikely. The basic problem is that the difficulty of picking out an input sequence seems tiny compared to the difficulty of filling in all of the unknown details of the universe. The extra complexity of specifying timing info, movements of the substrate in which X is written, and whatever messy algorithm is necessary to interpret X seems like it must be at most a modest constant (independent of the length of X); once the length of X is significantly larger than that constant, the losses from guesswork will greatly exceed it, unless the sensor is very well isolated from the rest of the environment. Intuitively, the sensor is (at a minimum) well enough connected to the environment if its observations are enough to infer many properties of the outside world (for example, if the sensor saw enough to infer something about an observer’s knowledge of cosmology, this would give an advantage to models at large enough scales to account for those cosmological features).
Randomized Automata
For randomized automata the case is rather different. In order to pick out a sequence of observations, generically, we need to provide not only coordinates but also the entire random seed. The necessary random seed is not just long, it’s much longer than any input sequence X which we might plausibly extract from the world (and more randomness is injected at each step, which guarantees that you always need more randomness than you get out information). So each program Q(S) contributes rather little to the mass of X under the universal prior: only the probability that the sequence of observations at S would be equal to X.
I strongly suspect that we attain the same conclusions in the randomized as in the deterministic case, however.
What simple programs other than Q(S) explain X? As above, the natural candidates are coarse “local” explanations, which may maintain a detailed model for the local environment, but not the rest of the universe. But it seems unlikely that the randomness of the underlying cellular automaton really favors such explanations. Indeed, consider some local observations distributed under the real cellular automaton. There are two cases. (1) The distribution of observations under the local predictor is the same. In this case, the local predictor loses out for the same reason that local predictors lose out in the deterministic case. (2) The distribution of observations under the local predictor is different. In this case, by definition, the average log probability assigned by the real distribution is better than that of the local predictor, so the local predictor does worse than in case (1).
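Case (2) relies on the standard fact that, when observations are drawn from a distribution, no other model assigns them a higher average log probability than that distribution itself. A tiny numeric sketch, with both distributions made up purely for illustration:

```python
import math


def expected_log_prob(true_dist, model):
    """Average log2-probability the model assigns to samples from true_dist."""
    return sum(
        p * math.log2(model[outcome])
        for outcome, p in true_dist.items()
        if p > 0
    )


# Assumed toy distributions over three possible local observations.
true_dist = {"a": 0.5, "b": 0.25, "c": 0.25}
local_model = {"a": 0.4, "b": 0.4, "c": 0.2}

score_true = expected_log_prob(true_dist, true_dist)    # -1.5 bits
score_local = expected_log_prob(true_dist, local_model)
```

The gap `score_true - score_local` is the number of bits per observation that the mismatched local predictor gives up, on top of whatever it already loses in case (1).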
What other programs might there be? I don’t know; I can’t think of any. Generically it will be impossible to rule out the possibility of some other, still shorter, algorithms to account for our observations, but in this case I don’t know why we would expect one to exist.
I think the sense that something tricky is going on here (at least, I have that sense) is coming from different intuitions being applied to “coordinates” in some exponentially large probability distribution vs. uncertainty in an algorithm like the ones humans use. Of course the two are precisely equivalent, but it feels like locating myself in a universe parametrized by an astronomically large random seed is very hard, while it feels like my local prediction algorithm is pretty good.
The case for our universe seems to be broadly analogous to that of a general randomized cellular automaton. Yes, we are less sure about how we might encode a program which extracts observation sequences from the world, but it would be surprising if these obstructions were insurmountable, for the same reasons as given in ‘Complications’ above.
Avoiding the Trouble
The obvious way to avoid this sort of thing is to avoid the universal prior. I mentioned before the possibility of using a prior which penalizes algorithms which use a lot of time or a lot of space. This seems to rule out these particular problematic models, and force you to use a coarse local model. It is probably worth thinking a bit about how those coarse local models generically perform (say, in cellular automata).
It seems like there should be a more philosophically sophisticated way to point to “basements” than this, though; using space-bounded models feels like a bit of a hack, and throws out a lot of information. What we really care about, in general, is some measure of “generic” decision-theoretic relevance (or relevance to our personal values). We care about non-simulated copies of ourselves because they have much more influence on the world. Maybe there is some way to access this measure directly. Alternatively, it may be possible to bargain with agents as a way of explicitly eliciting predictions with respect to this measure—maybe you can’t define what “important” means directly, but you can define it implicitly by making sure those agents also care more about worlds which aren’t simulated. But this all seems to be surprisingly difficult. I think there is a very small probability that this captures a rather serious obstruction to defining things formally. (Perhaps it is easy to define something well enough when no one is looking, but once anyone cares whatever definition you use is going to get manipulated. If something like this were really the case, it would create systems with very exotic failure modes.)
(Apologies: this post will probably be incomprehensible, at least to readers who don’t have introspective access to my thoughts.)
In my post on Indirect Normativity, I describe the definition of a very complicated utility function U. A U-maximizer has a very interesting task: it is supposed to optimize a utility function that is completely opaque, and it is supposed to care about worlds (alternatively, assign worlds prior probabilities) according to a metric which is too complicated to understand! A U-maximizer may be able to understand the implications of the definition after amassing a great deal of computational power, or by understanding the process that generated the definition, but at first it would have no idea what U was. And if you are a U-maximizer, and you don’t know anything about what U is, you might as well behave randomly. This is problematic for two reasons:
- We’d prefer to get sane behavior as soon as possible, rather than waiting until a U-maximizer amasses the computational power necessary to understand U. If U says “don’t kill anyone!” and you can’t figure out U until you’ve murdered a million people, that could be considered problematic. We’d like to build agents that can intuitively reason “killing someone might be bad” without being able to figure out their own utility function in complete detail. But in this setting, not only can the U-maximizer not figure out exactly what U is, without doing really sophisticated mathematics it can’t tell anything at all about it.
- Any given procedure for acquiring more computational resources must rely on assumptions about what world you are in. A certain strategy will work in some worlds, but not work in other worlds. And if which worlds you care about is encoded in U, and you can’t figure out U, how do you know to take actions that help you compute things in simple worlds? Why not take actions that would be useful for computing things in other, perverse worlds? Do we need to have something beyond a prior, if we want to build an agent whose prior is too complicated for it to initially understand?
Here I’d like to introduce and discuss a toy example, which I think gives a clearer view of these issues.
Encrypted Utility Function
Suppose we have a simple algorithm U which works by sampling universally distributed strings and scoring them, and a one-way function f (that is, a permutation such that it is easy to compute f(X) given X, but hard to compute X given f(X)). Let V = f(U, x) for a randomly chosen x, and define the utility function U’ as follows: “Do a brute force search to find U, x such that f(U, x) = V, then return U().”
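A toy version of this construction may help fix ideas. In the sketch below, SHA-256 stands in for the one-way function f (it is a hash, not a permutation, so this is only an approximation of the setup), the candidate utility functions are a made-up two-entry table, and the seed x is only two bytes so that the brute-force search actually finishes. All of these names and sizes are assumptions for illustration.

```python
import hashlib
import itertools

# Hypothetical candidate utility functions, keyed by their "source code".
CANDIDATE_UTILITIES = {
    b"utility_a": lambda: 1.0,
    b"utility_b": lambda: 2.0,
}


def f(u: bytes, x: bytes) -> bytes:
    """Stand-in for the one-way function: easy to compute, hard to invert."""
    return hashlib.sha256(u + b"|" + x).digest()


def recover_utility(v: bytes, seed_len: int = 2):
    """Brute-force search for (U, x) with f(U, x) == v; returns U's key."""
    for u in CANDIDATE_UTILITIES:
        for seed in itertools.product(range(256), repeat=seed_len):
            if f(u, bytes(seed)) == v:
                return u
    return None


def u_prime(v: bytes) -> float:
    """The encrypted utility U': decrypt by brute force, then evaluate U()."""
    u = recover_utility(v)
    if u is None:
        raise ValueError("v is not the encryption of any candidate utility")
    return CANDIDATE_UTILITIES[u]()


# The designers pick U and a random seed x, and publish only V.
V = f(b"utility_b", bytes([7, 42]))
```

Evaluating `u_prime(V)` costs up to 2 × 65,536 hash evaluations here; with a realistic seed length the same search becomes astronomically expensive, which is what makes U’ opaque to an agent that only has V.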
Now we record some input data S, and we run the algorithm A(S): “Output X such that, conditioned on A(S) = X, the expected value of U’ is maximized.” (In order to compute the conditional probabilities, A makes use of some sort of “math intuition module.”) We set up the computational environment such that A’s output will be sent to actuators, and then the inputs of the sensors will be appended to S to produce S’, and we will repeat the whole process with A(S’).
Suppose that Solomonoff induction on the input data S is sufficient to infer a lot about the world containing A, including the fact that the code of A was written by some humans, who produced U’ by applying f to U along with some random seed x (which can’t be inferred from S).
What will A do, in this world?
Reasoning About Incomprehensible Utility
I suspect that, if A has a reasonable (and reasonably reflective) math intuition module, it should reason as follows (the rest of this section is from A’s perspective):
Probably U’ doesn’t depend at all on A(S)—after all, most things don’t. So, given that I find myself with an incomprehensible utility function, probably it doesn’t matter at all what I do. But I might as well ignore those cases, and just assume for the moment that U’ does depend on A(S). So: what then?
Well, based on S (which encodes many observations of the environment) and cleverness, I have noticed that there is a simple process W that explains where the code of A(S) came from. Namely, W is the process in which some humans (whose existence and properties can be inferred from S) start with the string Q (which I can also infer from S), generate x at random, and then output f(Q, x). Let Property P be the assertion: “There exists some x such that f(Q, x) = V.”
So now I am tasked with explaining: “Why is it that U’( ) depends substantially on A(S)?” (I’m assuming that U’ does depend on A(S), because otherwise it doesn’t matter what I do.) We need to be a bit more careful about what we mean by “depends on”: what I mean by “U’( ) depends on A(S)” is that E[ U’( ) | A(S) = 0 ] is quite different from E[ U’( ) | A(S) = 1 ], which is itself a logical fact. Similarly, by “U’( ) depends on W( )” I mean that U’( ) depends on the computational history of W( ).
One possible explanation for why U’( ) depends on A(S) is Hypothesis H:
U’( ) depends on W( ), AND Property P is true, so U’( ) depends on A(S). [This generically implies that U'( ) depends on A(S).]
Now we need some notion of a “logical prior,” which should be clear enough intuitively. We have four facts whose probabilities we are interested in:
1. U’( ) depends on A(S).
2. U’( ) depends on W( ).
3. Property P.
4. Hypothesis H.
My claim is that the “prior probability” of a short statement should be much higher than the prior probability of a long statement, unless there is some other implication which causes the long statement to be true. I.e., the prior probability of (2) should by default be higher than the prior probability of (1). The prior probability of Property P should be about 2^(|x| - |V|), where |x| and |V| are lengths in bits: at most 2^|x| of the 2^|V| strings of V’s length can be written as f(Q, x). The prior probability of Hypothesis H should be about the prior probability of (2) times the prior probability of Property P.
So if the right relationships amongst the sizes of things hold, (1) is much more likely to be true as a consequence of (2) than “for its own sake.” (Of course there could be other, still better, hypotheses that explain why (1) was true, but let’s set those aside.)
And if we buy that, then if we condition on the worlds where (1) is true, we infer that (2) is probably true, property P is probably true, and U’( ) = U ( ).
The Analogy
In the case of the indirect definition of the human utility function, as in the case of the encrypted utility function, the mathematical definition of U is completely opaque until astronomical resources are available. However, in both cases a U-maximizer may reason about the physical process that gave rise to the existence of the U-maximizer, and in particular U, by treating the existence of such a physical process as a possible explanation for why a completely unknown U could depend on A(S). If the process carrying out that reasoning is philosophically sophisticated and conforms to intuitive notions of correctness then it is possible to learn about the utility function without having astronomical resources.
Suppose we found a black box which purported to be a halting oracle. You feed the box an input, and it outputs 0 or 1. As it turns out, we couldn’t verify such a claim. Nevertheless, we could tell that such a box was pretty magical. Exactly how much could we tell? That is, if you ask the box a question of the form “Does Turing machine M halt?” when can you verify that its answer is accurate? Note that, in order to do this verification, you may be able to ask the box some further questions. In general, you could use halting queries to make the box simulate a deterministic prover’s role in some interactive proof protocol (i.e., does the machine that finds the lexicographically first valid prover strategy and then halts iff the prover’s strategy outputs 1 at this stage, halt?)
If all you care about is computability, you can’t do anything. If you could learn something interesting from an untrusted halting oracle, you could have just looped through all strings until you found one that convinced you to trust it. The interestingness of the situation comes from a computational complexity perspective.
[Disclaimer: I don't think this is useful.]
There are a few parameters that affect the answer:
- Can we make classical or quantum queries to the box? This actually ends up not mattering, but that fact surprised me, and the reasons aren’t simple.
- Do we care about being completely certain when making claims about the box, or is pretty certain good enough?
- Do the box’s answers depend on other things that have happened in the universe / other queries made to the box, or are we guaranteed that the box would always answer the same question the same way?
Here are the main results:
1. In polynomial time, we can verify any claims which can be proven in polynomial time.
For example, if we ask the box “Does the Riemann Hypothesis have a proof of length < 1 billion?” we can verify an answer of “yes” by asking “Does it have a proof of length < 1 billion that starts with ‘a’? No? What about ‘b’? Yes? Well, does it have a proof that starts with ‘ba’? No? What about ‘bb’? What about ‘bc’?” and so on. This is pretty uninteresting–either we get out a contradiction, or we get out a proof.
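The prefix-by-prefix questioning can be written out as a short procedure. In this sketch the "proof checker" is a toy predicate on strings (length-6 palindromes over {a, b} with exactly two b's), standing in for a real proof verifier, and the honest box is simulated by brute force; both are assumptions made so the example runs.

```python
from itertools import product

ALPHABET = "ab"
MAX_LEN = 6


class InconsistentOracle(Exception):
    """Raised when the box's answers contradict each other."""


def is_valid_proof(s: str) -> bool:
    """Toy stand-in for a proof checker."""
    return len(s) == MAX_LEN and s == s[::-1] and s.count("b") == 2


def honest_oracle(prefix: str) -> bool:
    """Is there a valid proof of length <= MAX_LEN starting with prefix?"""
    for n in range(len(prefix), MAX_LEN + 1):
        for tail in product(ALPHABET, repeat=n - len(prefix)):
            if is_valid_proof(prefix + "".join(tail)):
                return True
    return False


def extract_proof(oracle, checker, max_len):
    """Turn yes/no answers into an explicit proof, or catch a contradiction."""
    if not oracle(""):
        return None  # the box claims no proof exists; nothing to verify
    prefix = ""
    while not checker(prefix):
        if len(prefix) >= max_len:
            raise InconsistentOracle(prefix)
        for ch in ALPHABET:
            if oracle(prefix + ch):
                prefix += ch
                break
        else:
            # "Yes" for the prefix but "no" for every extension.
            raise InconsistentOracle(prefix)
    return prefix


proof = extract_proof(honest_oracle, is_valid_proof, MAX_LEN)
```

The verifier asks at most `len(ALPHABET) * max_len` further questions and checks the final string itself, so a box that lies about the existence of a proof is eventually forced into an answer that can be checked and found wrong.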
2. The box’s claim about the result of any computation which takes a polynomial amount of space (for example, determining who wins from a given position in chess) can be verified in polynomial time, with only a negligible probability of erroneously believing a false claim.
This is a rather difficult proof. It follows from the fact that IP = PSPACE, a celebrated result in complexity theory (which says that it is possible to interactively prove in polynomial time anything which can be verified in polynomial space.) The key insight is that if someone claims that a sentence is true both for X=0 and X=1, I can efficiently test this claim by defining a polynomial which encodes the truth of the original sentence, asking them to tell me the value of that polynomial on some parametrized curve passing through X=0 and X=1, and then verifying the correctness of the polynomial representation at a random point along that curve. This works because, if two polynomials differ at any one point (so, if the polynomial representation is unfaithful either at 0 or 1) then those polynomials differ at almost every point (so I only have to check one point, to find the prover’s lie with high probability). Also, it’s important to note that polynomial space computation corresponds exactly to quantified boolean formulas of polynomial size.
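The fact doing the work here is that two distinct low-degree polynomials over a finite field agree at only a few points, so a single random evaluation exposes a false claim with high probability. The sketch below shows just that ingredient for univariate polynomials; it is not the full IP = PSPACE protocol, and the particular polynomials and primes are arbitrary choices for illustration.

```python
import random

LARGE_PRIME = 2**31 - 1
SMALL_PRIME = 101


def evaluate(coeffs, x, p):
    """Evaluate a polynomial mod p by Horner's rule (lowest degree first)."""
    acc = 0
    for c in reversed(coeffs):
        acc = (acc * x + c) % p
    return acc


def spot_check(claimed, reference, rng, p=LARGE_PRIME):
    """Compare two polynomials at one uniformly random point of the field."""
    r = rng.randrange(p)
    return evaluate(claimed, r, p) == evaluate(reference, r, p)


honest = [1, 2, 3, 4]   # 1 + 2x + 3x^2 + 4x^3
lie = [1, 3, 3, 4]      # differs from `honest` in one coefficient

# Over a small field we can count agreements exhaustively: two distinct
# polynomials of degree <= 3 agree on at most 3 of the 101 points.
agreements = sum(
    evaluate(honest, x, SMALL_PRIME) == evaluate(lie, x, SMALL_PRIME)
    for x in range(SMALL_PRIME)
)

rng = random.Random(0)
same_polynomial_passes = spot_check(honest, honest, rng)
```

Over the large field, a prover who substitutes `lie` for `honest` survives one call to `spot_check` with probability at most 3 / (2^31 - 1), which is why one random point is enough.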
3. Suppose the box’s response to a query depends only on that query, and not on the other queries that have been made to the box / other stuff that has happened in the universe. Then you can test the box’s claims about any computation which takes exponential time, with only a negligible probability of erroneously believing a false claim. In fact, you can test any claim which could be proven by an exponentially long proof.
This is a still more difficult proof. It follows either from the PCP theorem or from the proof that MIP = NEXP (which says that two provers who can’t communicate with each other can convince you in polynomial time of anything you could verify yourself in exponential time). In fact, the PCP theorem says that after a constant number of queries to the box, you can verify its claims with only an arbitrarily small constant probability of being mistakenly convinced by false claims. This is a very nice result, and one of the most important in theoretical computer science, but I definitely can’t do it justice here.
(Well: you can express the satisfiability of an exponentially large proof checker as an exponential system of polynomial equations. If you want to verify an arbitrary list of polynomial equations, you can verify just a constant number of random linear combinations of those equalities. By using indirection, and proving things about the behavior of proof checkers (who are themselves checking proofs about still other proof checkers…) rather than attacking the statement you care about directly, you can reduce the amount of randomness required to validate such a proof. If you iterate this amplification step, you can get down to the point where the checker sends the prover only logarithmically much stuff, and the prover sends back only constant stuff, in order to check a proof of linear size, which is pretty wild.)
4. You can verify a claim made by the box with high probability if and only if that claim has an exponentially long proof.
This is pretty straightforward. In polynomial time you can only access exponentially many of the box’s possible answers (if you only have time to write down N bits, there are only 2^N possible questions you could ask). If you can check the box’s claims about something, then the list of its answers to all exponentially many questions constitutes a proof.
Of course, for anything that can’t be proven in exponential time, a similar argument shows that you don’t care about the answer, even in a world with a halting oracle. So, the limitations to our ability to recognize halting oracles don’t seem too severe.
