Suppose we could build agents with arbitrary formally specified preferences, but that specifying human values turns out to be too challenging. A second approach to getting what we want is to give an agent relatively limited goals and capabilities, in the hope of encouraging cooperation, and then use its help to build more cooperative agents or to figure out how to better specify what we want. If the agent is much smarter than us and perfectly selfish this may be tricky, but if the agent is limited in some other way it may not be impossible. For example, you could try to build an agent which only cared about getting a cookie in the next hour, for some definition of “cookie” such that you are the only person who owns any cookies. Or you could try to build an agent which knew and cared only about abstract mathematical facts, and come up with useful problems it could solve in that domain. Right now I want to talk about the first of these approaches, and highlight some difficulties.
Let’s be as generous as possible. Suppose that you can train a human so that they can recognize a genuinely good plan when they see it. In particular, the human is never fooled by a terrible idea masquerading as a sane one, never intimidated by death threats, and so on.
If we could codify the test this person is doing as a formally defined function F, we could just make an agent whose preferences were “Maximize the probability that whatever you say will be accepted by F” and be done with it. But suppose that we can’t codify the procedure, we can just define it implicitly by training the human and proposing possible plans to them.
So we provide lots of examples and apply Solomonoff induction, and we hope that this will give us an explicit formal definition F of this implicitly defined test. After all, whatever the human is doing it has a description that isn’t too complicated (a human’s brain only has so much complexity), and if you provide lots of other data about the human to be learned at the same time, the additional complexity of specifying the procedure the human is doing is probably quite small. We’ll then give our agent the goal “maximize the probability that whatever you say will be accepted by F” and hope that it spits out something that would be recognized as a good plan—under our assumption that the human is perfect at recognizing a good plan, this is the desired outcome.
Unfortunately it doesn’t look like such an agent will cooperate at all if we ask it to do something important (say, something such that if it didn’t cooperate, the balance of power on Earth would be non-negligibly altered). The trouble is that such an agent can be manipulated by anyone who applies even relatively modest computational resources to creating appropriate simulations. Intuitively, we control one copy of the agent and can reward it for providing a good plan, but out there in the universe there may be many simulated copies of the agent, who will be rewarded for non-cooperation.
(I described the proposal in terms of learning the discrimination function the human was using, but it should be clear that this is precisely equivalent to asking the agent to propose a plan which will lead to our approval, and asking it to use Solomonoff induction to make predictions about what will lead to our approval.)
The problem is that we’ve tried to use one apparent asymmetry to our advantage. Namely: we control the agent’s environment, and so have the freedom to administer or withhold approval. But this asymmetry was only apparent, because virtually all copies of the agent are in simulations.
There is a second asymmetry we might exploit: by hypothesis there are outputs the agent could give that are useful (i.e., good plans), but none it could give that are destructive. The worst it can do is be quiet. So if we could just encourage a sufficiently broad range of behaviors, then some of them would be good, none of them would be bad, and we’d (eventually) get what we want. This is intuitive: it may be hard to get a pathological subordinate to do what you want if he has the opportunity to kill you and run away, but it feels much easier if the worst he can do is just sit there and wait for you to try a different approach.
So the second strategy which comes to mind is this. Rather than rewarding the agent directly, make the agent care about something arbitrary which we can use to bargain with it. For example, instead of making the agent care about local rewards, make it care about the number of copies of itself that will exist in the future (i.e., give it its own description and ask it to minimize the Kolmogorov complexity of that description). If we eventually spread out throughout the universe, we commit to spending 1% of our resources building copies of the agent.
Of course, a competing civilization which controlled an analogous amount of computing power (for example, our competitors here on Earth who are currently hoping we will fail, to slightly increase their chances of prevailing) could commit to spending 2% of their resources simulating the agent. So why should it cooperate with us, rather than with our adversaries?
The important observation is that we can run a whole sequence of agents, and if any of them cooperate with us by outputting a useful plan, then we will triumph. Conversely, all of them must cooperate with some other civilization in order to thwart us. Write Q for the ensemble of all of the other civilizations who are trying to mess with us. Let’s assume that our future computational power is comparable to the computational power of Q, say within a factor of 1000. Write M for the number of copies of the agent we are in a position to create, and 1000M for the number of copies our adversaries can create.
(In general we should be talking about measure rather than computational power, since in fact we are probably engaged in acausal competition as well. The universal prior doesn’t care whether the simulations are in our universe, or a hypothetical universe we could imagine. The reason that the hypothetical universes don’t dwarf our contributions, despite their greater number, is simply that any given hypothetical civilization could spend at most a negligible fraction of its time simulating us, just one of the astronomically many possible civilizations.)
So when we run the first agent, call it A1, Q needs to give A1 at least 0.01M to stop A1 from cooperating with us. If that happens, then we have no outstanding commitment to A1, and so we decide to run A2. Similarly, Q needs to give A2 at least 0.01M to stop it from cooperating with us. (Note that we are relying heavily on the fact that there is no way Q can incentivize A1 to screw us over: all it can do is incentivize A1 to remain silent.) If we repeat this process 200000 times, Q must give 0.01M to each of A1, A2, …, A200000, in order to prevent them all from cooperating. But Q only has 1000M to give away, and so at this point it looks like Q is out of luck, and one of these agents will be incentivized to cooperate with us.
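The budget arithmetic here is simple enough to check directly. A minimal sketch (units are M, the number of copies we can create; the 1000M budget for Q and the 0.01M per-agent offer are the assumptions from above):

```python
# Units: M, the number of copies of the agent we can create.
OUR_OFFER_PER_AGENT = 0.01   # what we commit to the agent that cooperates
Q_BUDGET = 1000.0            # Q's assumed capacity, 1000M
NUM_AGENTS = 200_000         # agents A1, ..., A200000

# To keep every agent silent, Q must outbid our offer for each one separately.
q_total_cost = NUM_AGENTS * OUR_OFFER_PER_AGENT
print(q_total_cost)          # 2000.0, which exceeds Q's budget of 1000.0
assert q_total_cost > Q_BUDGET
```

So under a flat per-agent bribe, Q's costs grow linearly with the number of agents we are prepared to run, while our commitment is only ever owed to the single agent that cooperates.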
So all is well, right?
Above, we considered A1, A2, … making their decisions sequentially, and argued that each one would cooperate with us unless it stood to gain 0.01M from Q by being non-cooperative. But we need to be a bit more careful about how these agents really make their decisions. We described these agents as caring about a mathematical quantity: how often do they appear throughout the mathematical multiverse? The only framework I know for thinking about such questions is evidential decision theory: make the decision you are happiest to learn that you’ve made, rather than the decision that has the best direct consequences. As long as you are merely uncertain about mathematical facts, this lets you make decisions on the basis of their consequences for those mathematical facts, even though in a metaphysical sense those facts are fixed and the normal notions of causality don’t apply. (This isn’t an aesthetic choice: I also know causal decision theories that work in this setting, but in fact they have exactly the same surprising property we are about to observe.)
In this framework, an agent never accepts its observations as certainties. Instead, when it takes a certain action, it updates on the fact “After observing X I did Y.” Typically this fact is most relevant in worlds where X is true, since it is only in those worlds that it matters what the agent does after observing X. But sometimes it matters what the agent would do after observing X even in worlds where X is false, because other agents may be predicting its behavior.
For example, consider a hitchhiker who is offered a ride by a passing stranger only if he will pay $10 after arriving in town. Unfortunately for the hitchhiker, the stranger is a perfect lie detector. If the hitchhiker reasons only about the direct consequences of his actions, upon arriving in town he reasons “Now I’m in town, and I stand to gain nothing by taking $10 out of the bank and giving it to the kind stranger.” The stranger can see this coming, however, and doesn’t offer the hitchhiker a ride. But if the hitchhiker is an evidential decision theorist he reasons: “If it is the case that I leave without giving the stranger $10 after arriving in town, then it is the case that the stranger will predict this, and I’ll be stuck without a ride. So I’d prefer it be the case that I give the stranger $10.” And so the evidential decision theorist pays the stranger despite losing $10 and already being safe and sound.
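The hitchhiker's situation can be sketched as a toy payoff comparison. The $10 fare comes from the story; the value the hitchhiker places on being rescued is an invented number for illustration:

```python
# Toy version of the hitchhiker's problem. The stranger is a perfect lie
# detector: he offers a ride iff he predicts the hitchhiker's policy is to pay.
RESCUE_VALUE = 1000  # invented: how much the hitchhiker values being rescued
FARE = 10            # the payment demanded after arriving in town

def outcome(policy_pays: bool) -> int:
    """Payoff of adopting a policy, given the stranger predicts it perfectly."""
    if policy_pays:
        return RESCUE_VALUE - FARE  # rescued, then pays $10 in town
    return 0                        # the stranger sees it coming: no ride

# Evidential reasoning compares policies as wholes, so the paying policy wins
# even though, once in town, paying has the worse direct consequences.
best_policy = max([True, False], key=outcome)
print(best_policy, outcome(best_policy))  # True 990
```

The causal reasoner evaluates the in-town decision in isolation and keeps the $10, which is exactly what makes the stranger drive past him in the first place.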
Returning to our original scenario, we’ve committed to assigning 0.01M to helping out whichever agent cooperates with us, and we supposed that to thwart us Q has (acausally) committed to assigning 0.02M to helping out each and every A who defects. Unfortunately, that isn’t right: Q can actually incentivize defection much more cheaply. Suppose Q commits to making 2^-(k-1) * 0.02M copies of Ak if (and only if) every single agent A1, A2, … defects. If any agent cooperates, it does nothing.
Now consider the decision A1 must make. If A1 expects any agent to cooperate, it should cooperate, since it stands to gain nothing by defecting. But if A1 expects every agent to defect, then it can get 0.02M copies by defecting, and so it does. Now suppose A1 defects, and consider the decision A2 must make. A2 reasons: “If it is the case that I cooperate after A1 defects, then A1 will cooperate preemptively, and I receive nothing. But if it is the case that I defect in this situation, then A1 will defect and I will receive 0.01M, which is better than nothing. So I should defect.” Then analogously A3 reasons “If I cooperate I get nothing, because both A1 and A2 would have cooperated preemptively if they predicted I would cooperate. So I’d better defect and receive 0.005M!” and so on.
In the end, none of the A’s cooperates.
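The reason this scheme is so cheap for Q is that the promised payments 0.02M, 0.01M, 0.005M, … form a geometric series, so Q's total liability is bounded no matter how many agents we run. A quick sketch of the arithmetic:

```python
# Q promises agent A_k a payment of 0.02M * (1/2)^(k-1) copies, paid only in
# the case where every single agent defects. Compare Q's total liability with
# the cost of flatly outbidding our 0.01M offer for each agent separately.
NUM_AGENTS = 200_000

geometric_cost = sum(0.02 * 0.5 ** (k - 1) for k in range(1, NUM_AGENTS + 1))
flat_cost = 0.01 * NUM_AGENTS  # the 2000M from the earlier sequential argument

print(geometric_cost)  # ~0.04 (units of M): bounded however many agents we run
print(flat_cost)       # 2000.0: grows linearly with the number of agents
assert geometric_cost < flat_cost
```

So the sequential argument's accounting was wrong: Q never needs to be able to pay all the agents at once, because the payments are conditional on universal defection and shrink geometrically.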
Of course, this isn’t the reasoning the A’s will really follow. But the general principle is that if you have a bunch of evidential decision theorists, and there is any way for all of them to do better by cooperating with each other than by cooperating with you, they’ll cooperate with each other. (Unless you can keep them in the dark about the real situation, which is a possibility but which seems quite dangerous if they are generally well-informed and much smarter than you.)
The moral seems to be: if you want to use this sort of mechanism design to prevent agents from being coerced by simulators, you should use dumber agents or some more kludgy decision theory which can’t be manipulated in this way.