Suppose I want to provide a completely precise specification of “me,” or rather of the input/output behavior that I implement. How can I do this? I might be interested in this problem, for example, because it appears to be a primary difficulty in providing a precise specification of “maximize the extent to which I would approve of your decision upon reflection.” (I have suggested that we would be happy with a powerful AI that made decisions according to this maxim.)
I have written about this issue in the past; in this post I’ll outline a slightly improved scheme (now with 100% fewer faraday cages). The technical changes are relatively modest, but I’m also taking a somewhat different approach to the issue, and overall I think it seems much more like the kind of thing that could actually be done. I also want to take the opportunity to try to clarify and expand the exposition some, since I think that the amount of discussion and thought that this idea has gotten now vastly surpasses the amount of care that went into crafting the original exposition.
I welcome additional objections to this scheme. As usual I think the literal proposal laid out here is extremely unlikely to ever be used. However, finding problems with this proposal can still be useful for shedding light on the problem, and in particular on how difficult it is and where the difficulties lie.
A note on induction
In this post I’ll talk about the universal prior and Solomonoff induction; I don’t think anything in the post really depends on how we approach induction, but having a particular model in mind will help the discussion be more concrete. If our goal is just to produce any definition then we are free to use whatever kind of induction we like so long as we can define it formally; in other contexts it might be important to use a flavor of induction which is compatible with the architecture of a machine intelligence. But again, I think that everything in this post should be very broadly applicable.
A precise enough specification for human consumption
First, the goal: I would like to define an interactive machine f : (Stream of inputs) → (Stream of outputs) which captures my input/output behavior. That is, I should be able to call f(“What is your favorite color?”) and get the output “Grey,” at least for some appropriate encoding of the question and answer. I should also be able to simulate interactions using f. For example, if I had a function f capturing my input/output behavior and a function g capturing your input/output behavior, I should be able to connect them up appropriately and simulate a conversation between us.
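To make the "connect them up" idea concrete, here is a minimal sketch in Python. Everything in it is an illustrative assumption rather than part of the proposal: the `Machine` type, the toy bodies of `f` and `g`, and the `converse` helper are hypothetical stand-ins, with plain strings in place of whatever encoding is eventually chosen.

```python
from typing import Callable, List, Tuple

# A "machine" maps the history of messages it has seen so far to its next
# message. This is a hypothetical stand-in for the stream function in the text.
Machine = Callable[[List[str]], str]


def f(seen: List[str]) -> str:
    """Toy stand-in for the author's input/output behavior."""
    if seen and seen[-1] == "What is your favorite color?":
        return "Grey"
    return "What is your favorite color?"


def g(seen: List[str]) -> str:
    """Toy stand-in for a second person's input/output behavior."""
    if seen and seen[-1] == "What is your favorite color?":
        return "Blue"
    return "Interesting."


def converse(a: Machine, b: Machine, opening: str, turns: int) -> List[Tuple[str, str]]:
    """Alternate between a and b, feeding each one the other's messages."""
    seen_by_a: List[str] = [opening]  # a starts by seeing the opening message
    seen_by_b: List[str] = []
    transcript: List[Tuple[str, str]] = []
    for _ in range(turns):
        msg_a = a(seen_by_a)
        transcript.append(("a", msg_a))
        seen_by_b.append(msg_a)
        msg_b = b(seen_by_b)
        transcript.append(("b", msg_b))
        seen_by_a.append(msg_b)
    return transcript
```

Calling `converse(f, g, "What is your favorite color?", 1)` runs one round of the simulated exchange. The real f would of course operate on raw bit streams rather than tidy strings.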
So the first step of the proposal is to isolate some input and output channels. These aren’t something that a human “comes with” naturally, and so we have to make a substantive decision about what format the input and output will take.
One choice that seems particularly attractive from an engineering perspective is to use a computer interface. That is, we can imagine a computer (which I’ll call the “terminal”) implementing the following process:
- Take some bits from the outside world. Interpret these as what to display on the screen. You can do this 60 times a second or whatever. (We could include audio as well in the same way, or other output modalities.) I’m going to call these the outputs of the terminal. I’m sorry if this is confusing, but the alternative is also confusing.
- Whenever you get input from the human, feed these back to the outside world. This input could be in the form of key presses, or recordings from a microphone, or images over a video camera, or whatever other input device we can connect to a computer. Similarly, we can do this 1000 times a second or whatever. I’m going to call these the inputs to the terminal.
Now if we imagine me sitting at a computer running this program, we have a function f: its inputs are the bits being fed to this computer by the outside world (which are then displayed on the screen), and its outputs are the bits being fed back by the computer to the outside world (which describe the things I typed). Note that the inputs of f are the outputs of the terminal, and the outputs of f are the inputs to the terminal. Sorry!
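Since the naming is admittedly confusing, a small sketch may help keep the two conventions apart. This is only an illustration under assumptions: the `Terminal` class, its `human` callback (a stand-in for the person at the keyboard), and `make_f` are hypothetical names, and a real terminal would deal in display frames and key events rather than single byte strings.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Terminal:
    # Hypothetical stand-in for the person: what they enter after seeing a frame.
    human: Callable[[bytes], bytes]
    outputs: List[bytes] = field(default_factory=list)  # what the terminal displayed
    inputs: List[bytes] = field(default_factory=list)   # what the person entered

    def step(self, frame: bytes) -> bytes:
        """Display one frame, then return whatever the person entered."""
        self.outputs.append(frame)   # terminal output == input of f
        typed = self.human(frame)
        self.inputs.append(typed)    # terminal input == output of f
        return typed


def make_f(terminal: Terminal) -> Callable[[bytes], bytes]:
    """Wrap a terminal as f: bits from the outside world in, bits back out."""
    def f(bits_from_world: bytes) -> bytes:
        return terminal.step(bits_from_world)
    return f
```

The two lists kept by `Terminal` are also the shape of record one would want when logging everything displayed and everything entered.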
Indeed, I could use this function f to ask a question like “what is your favorite color” (for example by feeding it inputs leading it to display “What is your favorite color?” on the screen) and then get back an answer. I could connect up two copies of f, together with some intermediate processing, and simulate a video call between two copies of myself. And so on.
My responses depend not only on what I have seen from the terminal, but also everything else that is happening in the world. In the original proposal I suggested encapsulating the whole system in a well-insulated box, to make the external influences as simple as possible (and make it as easy as possible to model the contents of the box without thinking of the rest of the world).
This time, I am not going to make that recommendation. Even if I allow the rest of the world to influence me, there is still some function f describing the input/output behavior of the system. It may be a more complicated system, but it is also one which better reflects the normal conditions under which humans operate. It also has other advantages: you can get food and other amenities, you can have a room bigger than a casket, and maybe you can even have periodic contact with the outside world to remain sane. That said, this concession also introduces some extra complications, which I’ll mention at the end of the post.
The timescales involved here need not be too long; I have in mind something like a day. Because we can re-run the simulation many times, and ultimately craft an improved definition, having only one day of contiguous intellectual work is not as significant a drawback as it appears. That said, it may be useful to work with much longer timescales such as weeks or months, and I don’t see immediate problems with doing so.
So far I’ve been talking about one person, but we could just as well talk about a whole group of people. We can also allow them to inhabit a larger environment rather than imagining them constrained to sitting at a computer, we could provide them with more resources in that environment, we could give them several terminals, and so on.
A much more precise specification
Having specified, to our own satisfaction, exactly what we mean by the input/output behavior that a person (or some people) implement, it remains to specify this input/output behavior in a way that can be communicated to the intended audience.
If we are talking to someone who shares a basic conceptual language with us, we might just be able to give them a description like the one above. I think that many readers will at this point have the same image in mind that I do, modulo some details. I could also imagine machines that would in a similar sense be able to communicate with me. Of course there are still many details of the setup which we would need to specify, for example by picking a particular venue, particular people and states of mind for those people, protocols for interaction, etc. However, that could all be specified extremely unambiguously by actually setting up the physical experiment: gathering the appropriate people to the appropriate place, setting out the computers that will be used, choosing a time, etc. Moreover, I think we should endeavor to use these concepts in a way which is robust to imprecision, and so I think it is plausible we could use a natural language description that didn’t involve physically setting up the experiment or specifying every detail.
More pessimistically, we might imagine trying to communicate this idea to a more alien mind, which either lacks the conceptual framework we use to communicate these ideas to each other, or which has this conceptual framework but lacks a common language to talk about it. I think that even in this case we can provide a relatively satisfactory definition. I find this particularly comforting; it suggests to me that for a very broad range of machine intelligences, we will be able to communicate this proposal.
I think that this more pessimistic assumption is more likely to be warranted if we are interested in using these concepts to define the goals of an agent, because I think that normal approaches to communication between agents may break down in this case.
In the pessimistic setting, I want to use a two-step strategy for communicating this concept:
- Acquire some training data, which I think is sufficient to pin down the concept.
- Specify how the concept relates to the training data, and how to obtain the concept using induction on the training data.
The training data
I want to include two kinds of training data.
The first is a huge stock of observations of the world, of the kind that we might want our AI to have anyway in order to build a good model of the world. These might be passive observations, or it might be important that they include interactions between the AI and the world which allow it to actively improve its model. The assumptions I want to make of this data are that (1) it is sufficient to allow the AI to build a pretty good model of the world, and (2) it includes observations of the individuals and environment which will be used in the final construction.
In practice I think it is quite possible that you would not even have to include the first set of training data explicitly, because an AI reasoning about the world will already have a wealth of such data and will reason in such a way that it automatically influences its inferences about a new batch of training data. But if we don’t have that data already or it wouldn’t automatically affect how induction about a new concept worked, we should include it explicitly.
The second class of data is used to directly define the input-output behavior f, as defined in the first section. It is defined in the same way as in my original proposal. That is, to collect it we will do the following:
- We will actually get together the people or person that we want to model, and bring them to the intended venue under the intended conditions. They will actually spend a day (or whatever) interacting with the intended terminal (which would probably be an actual computer).
- We will choose what data to provide them over that computing terminal. For example, we might choose to have a video call with them. We might choose to let them interact with a programming environment, or answer moral questions, or whatever. Ideally we would include a wide variety of kinds of interaction, and for safety’s sake we would make sure that there was sufficient stochasticity in our choice of data.
- We would record every output that was displayed on the terminal, and all of the inputs entered on the terminal. These will comprise the second set of training data.
The hope is that a learned mapping from (Terminal outputs) to (Terminal inputs) will be precisely the desired function f. That is, the learned mapping will predict what the people would have typed, if something different had been displayed on the terminal.
We might also try including some terminal outputs that are likely to rattle participants or simulate aspects of relevant counterfactuals. This allows us to review their behavior at some length (to make sure we think they would have behaved well under the appropriate counterfactual) before actually deciding to use the training data in an important application. This could be facilitated by making the counterfactuals we consider look as similar to those used in training as possible when necessary.
The purpose of including stochasticity in the choice of data is to ensure that the map f has to actually make use of the (Terminal outputs) in order to compute (Terminal inputs), rather than just figuring out the (Terminal inputs) directly from physics. This is probably not necessary, given that the universe is pretty stochastic anyway, but it seems relatively cheap. To the same end, it would be nice to collect our first training data before generating the second training data, so that (Terminal outputs) are uncertain even given the first training data.
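As a rough illustration of adding stochasticity cheaply, the sketch below draws the material to display from a pool using operating-system randomness. The function name `choose_prompts` and the idea of a fixed prompt pool are assumptions made for the example; the post does not say how the randomness should be sourced.

```python
import secrets
from typing import List, Sequence


def choose_prompts(pool: Sequence[str], count: int) -> List[str]:
    """Pick `count` prompts from `pool` (with replacement) using OS-level randomness."""
    if not pool:
        raise ValueError("prompt pool must not be empty")
    # secrets.choice draws from the operating system's randomness source, so the
    # selection is not a deterministic function of any earlier recorded seed.
    return [secrets.choice(pool) for _ in range(count)]
```

Drawing the prompts only after the first set of observations has been collected would match the ordering suggested above.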
Now we have two pieces of data: a set of observations O about the universe, and a pair of sequences (Outputs, Inputs) from the terminal. We would like to extract a map f that correctly reproduces the counterfactual responses of the people in the room, that is, what they would have typed had the terminal displayed something different.
The idea is that a parsimonious explanation of O will hopefully involve a good representation of all of the stuff happening in the environment. So if we have in hand a parsimonious explanation of O, the observations about f can be very simply described by just describing where and when the terminal is. More informally, an agent who already knows what’s going on in the world should be able to quickly notice that the second set of training data is just what would be typed into a particular terminal in response to a particular output being displayed on that terminal.
More formally, let P be the simplest program (or a simplicity-weighted mixture of programs) such that:
- On input 0, it outputs the set of observations O.
- On input 1 followed by a list of [Terminal outputs], it outputs the next [Terminal input], i.e. it outputs the thing that the humans type after observing a given sequence of screenshots.
Our implementation of f is then given by the behavior of P on input 1.
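The definition of P can be illustrated with a toy version of the simplicity-weighted mixture. This is a sketch under heavy assumptions: real Solomonoff induction ranges over all programs and is uncomputable, whereas here a hypothesis is just a hand-supplied triple of a description length, an output on input 0, and a response function for input 1. The names `Hypothesis`, `posterior`, and `predict` are hypothetical.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# (description length in bits, output on input 0, response function for input 1).
Hypothesis = Tuple[int, str, Callable[[Sequence[str]], str]]
# Each transcript entry pairs the terminal outputs shown so far with the
# terminal input that was actually recorded next.
Transcript = List[Tuple[Tuple[str, ...], str]]


def posterior(hyps: List[Hypothesis], O: str, transcript: Transcript) -> List[float]:
    """Weight each consistent hypothesis by 2**-length, then normalize."""
    weights = []
    for length, obs, respond in hyps:
        consistent = obs == O and all(
            respond(shown) == typed for shown, typed in transcript
        )
        weights.append(2.0 ** -length if consistent else 0.0)
    total = sum(weights)
    if total == 0:
        raise ValueError("no hypothesis reproduces both O and the transcript")
    return [w / total for w in weights]


def predict(hyps: List[Hypothesis], post: List[float], shown: Sequence[str]) -> Dict[str, float]:
    """Mixture prediction of the next terminal input given the outputs shown."""
    votes: Dict[str, float] = {}
    for (_, _, respond), p in zip(hyps, post):
        if p > 0:
            answer = respond(shown)
            votes[answer] = votes.get(answer, 0.0) + p
    return votes
```

In this toy setting, hypotheses that fail to reproduce O get zero weight no matter how well they fit the terminal data, which is the role the input-0 requirement plays in the definition.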
The purpose of requiring P to output O on input 0 is to force it to build up a parsimonious representation of its environment. This hopefully ensures that the second set of training data can be described by an extremely simple modification to this program, one that simply picks out the terminal from that environment.
This is intended to patch the original proposal’s difficulty with acquiring enough information about a person to uniquely determine them. Now most of the information required to specify me is actually being obtained from O, and the data from the terminal is only being used to pinpoint the input/output channel. I don’t know whether it is possible to know everything about me based on all of the recorded data about the world, but I think it is possible to know enough. First, you can infer the basic details of what I value and how I would behave, which I expect is good enough. Second, across the ensemble of all worlds like this one (i.e. with the same recorded history), each version of me gets the right total amount of influence. I think this is probably good enough, even assuming (I think somewhat implausibly) that there were big value divergences between versions of me in very similar worlds.
A significant concern with this procedure is the one I discussed before, essentially that Solomonoff induction might end up believing the simulation argument and so this definition might be manipulated by motivated simulators. I now think this is somewhat less likely than I did a year ago, and overall I think that we have a good enough suite of options to avoid this issue if we are aware of it. For example, the proposal of finding efficient, simple predictors rather than merely simple predictors seems like it might be able to avoid predictors which work by locating us in the physical universe, and instead focus on predictors which build a model of their environment.
Vladimir Slepnev’s and Wei Dai’s concerns about value changes continue to afflict this proposal. That is, it may be that someone interacting with the terminal in a counterfactual would have selfish preferences about what happens in the counterfactual rather than the real world, or that they would become quite confident that they are in a simulation and become concerned with how the simulator will respond to their decisions. I imagine that psychologically speaking, a more conventional environment would slightly alleviate these problems, but only very slightly. One consideration on our side when dealing with these difficulties is that they don’t involve dealing with unknown challenges; they just involve understanding people’s responses to an unprecedented but predictable situation.
Difficulties from interacting with the world
In contrast with my original proposal, this specification for “what I would say” actually depends on what is happening in the rest of the world. For the most part I think this is fine, and as mentioned it comes with some advantages. It introduces a few new difficulties though.
Namely, if “what I would say” is what would get typed on an actual physical terminal that exists in the world, then it can be subject to manipulation by anyone who cares about what I would say and could get access to the terminal. This is not intrinsically problematic, but it does mean that I should take care to ensure that e.g. an AI who cares about what I would say does not have any possible way to influence the experiment. As described above this is almost true for free, because the training data must be collected prior to creating an AI which cares about what I would say; as long as the AI can’t travel back in time it won’t be able to manipulate the results. The only thing to keep in mind is that there may be other agents who have an incentive to manipulate the experiment, and we should ensure that they don’t have an opportunity to do so. These issues are quite sensitive to e.g. the decision-theories used by the involved agents.
There is also a potential difficulty arising from the counterfactual in which the terminal displays unphysical messages. The consequences of this counterfactual may be significant. For example, people from the outside world who observed the phenomenon might radically change their attitude toward this experiment, and a small number of people might respond in an extreme way. So it seems sensible to limit the flow of information between the people interacting with the terminal and the rest of the world, even if we don’t go so far as to put the whole thing in a faraday cage with no connection to the outside world.