I suggest that it would be desirable if powerful agents acted in the way we would most approve of, upon reflection. I argue that this definition can be made unusually precise, would lead to positive outcomes, and is robust to conceptual imprecision, philosophical errors, and unfavorable circumstances.
I described this proposal in a previous post; however, that post focused on technical details, and presented an implausible but extremely precise version. In fact I believe that this is a particularly promising angle for thinking about what we would like powerful agents to do. Here I’d like to look at the more qualitative picture and provide some of the basis for these intuitions.
What exactly is the proposal?
I want to define a utility function such that I would be happy if it were exceptionlessly maximized, for example by a powerful AI. The proposal is intended to be straightforward rather than particularly novel or ingenious.
The schema for my proposed utility function is: “The extent to which I would approve of the AI’s decisions and their consequences after an extensive process of reflection.” Three terms in this schema require further specification: “I,” “extensive process of reflection,” and “approve of the AI’s decisions and their consequences.”
“I” refers to the decision-making process that I currently implement (rather than to a concept which could be modified by e.g. changing my physical constitution or making a persuasive argument). I mean to refer to an input/output behavior, without concern for its internal structure.
“Extensive process of reflection” refers to a situation in which a decision-making process is granted extremely extensive resources and may choose how to use them in order to arrive at a conclusion. In this case the conclusion is whether or not to approve of the AI’s behavior (see below). These resources are intended to be very extensive: they should be sufficient to implement a successor, or any number of successors, considered to better represent the original process’s “considered judgment.” Such a successor might be a smarter or more experienced version of the original, a committee of experts, a formalization of “what I would say if I didn’t make epistemic errors,” etc.
Note in particular that the original decision-maker may appoint a successor defined in terms of further stages of reflection, so that the process is effectively iterated until convergence.
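The iteration described above can be pictured as following a chain of appointed successors until some process endorses itself. The following is only a toy sketch under stated assumptions: the names `reflect`, `appoint`, and `toy_appoint` are hypothetical, processes are stood in for by simple labels, and nothing here guarantees that a real process of reflection would converge.

```python
from typing import Callable, Hashable

# Hypothetical representation: a "decision process" is any hashable label,
# and `appoint` maps a process to the successor it would endorse.
Process = Hashable


def reflect(
    initial: Process,
    appoint: Callable[[Process], Process],
    max_steps: int = 1000,
) -> Process:
    """Follow the chain of appointed successors until one appoints itself."""
    current = initial
    for _ in range(max_steps):
        successor = appoint(current)
        if successor == current:
            # Fixed point: this process endorses itself as its own successor.
            return current
        current = successor
    # Convergence is an assumption of the sketch, not a guarantee.
    raise RuntimeError("no convergence within max_steps")


def toy_appoint(level: int) -> int:
    """Toy stand-in: each 'level of reflection' endorses the next, up to 5."""
    return min(level + 1, 5)


final_process = reflect(0, toy_appoint)
```

The step cap is a design choice of the sketch: it makes failure to converge an explicit error rather than an infinite loop.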
Although this definition may appear very philosophical or vague, it can be made precise by clearly specifying “extensive resources” and the format in which the successor is described. Also note that “extensive resources” is intended to considerably exceed what is actually available to any agents existing in the real world, perhaps including access to unbounded computational resources which could be used to perform extensive searches for simple computations. The intractability of the procedure does not preclude the AI’s ability to reason about what its output would be, any more than it prevents us from reasoning about whether or not a particular action would be likely to be met with approval after massive amounts of reflection. However, it allows the evaluator to evaluate possibilities which are beyond the comprehension of any living person, and ensures that the evaluator can (predictably) see through any attempts by the AI to game the system.
“Approve of the AI’s decisions and their consequences” may mean that the evaluator has access to the literal list of decisions made by the AI, and may then use its own great resources to evaluate their consequences in detail. It then outputs a number in [0, 1] representing the extent to which it approves of this behavior. The AI’s goal is to maximize this number.
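As a rough illustration of this first interpretation, the AI’s objective can be sketched as picking, from a set of candidate decision lists, the one the evaluator scores highest. Everything named here is an assumption for illustration: `Decisions`, `best_decisions`, and `toy_evaluator` are hypothetical, and a real evaluator would be the reflective process described above rather than a few hand-written rules.

```python
from typing import Callable, Iterable, Tuple

# Hypothetical encoding of "the literal list of decisions made by the AI".
Decisions = Tuple[str, ...]


def clamp01(x: float) -> float:
    """Force a score into [0, 1], the range the proposal requires."""
    return max(0.0, min(1.0, x))


def best_decisions(
    candidates: Iterable[Decisions],
    evaluator: Callable[[Decisions], float],
) -> Decisions:
    """Return the candidate with the highest (clamped) approval score."""
    return max(candidates, key=lambda d: clamp01(evaluator(d)))


def toy_evaluator(decisions: Decisions) -> float:
    """Toy stand-in for the evaluator; the rules are purely illustrative."""
    score = 0.5
    if "ask before acting" in decisions:
        score += 0.4
    if "seize resources" in decisions:
        score -= 0.5
    return score


candidates = [
    ("seize resources",),
    ("ask before acting",),
    ("ask before acting", "seize resources"),
]
chosen = best_decisions(candidates, toy_evaluator)
```

In this toy setting the AI enumerates candidates and queries the evaluator directly; in the proposal itself the AI can only reason about what the evaluator would output.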
In a more extreme interpretation the evaluator may not have access to the outputs of the AI directly, but may be tasked with inferring them from what it already knows about the world, the code of the AI, and so on. I think this is a much more attractive, though less intuitively comprehensible, proposal: “outputs of the AI” does not seem to be a reliably meaningful concept, and in general it seems good to leave as much conceptual work as possible to the evaluator rather than to us. Under this interpretation the process of reflection would need to continue for a very long time, before a radically transformed version used a massive supply of resources to make appropriate inferences about the AI’s behavior. The AI would also need to use a decision theory such as EDT/UDT/TDT, rather than CDT, since the AI’s behavior has no causal impact on the evaluator’s evaluation; the entire impact flows through what the evaluator will infer about how the AI would behave.
Note that the concept of “approval” need not coincide with some intuitive standard of behavior like “not actively bad;” instead, the evaluator can set the line arbitrarily. I use the expression “most approve of” rather than “think best,” because I want to emphasize the bounded nature of the judgments, and stress that the evaluator is free to avoid bad outcomes by disapproving of them, even if (in the AI’s judgment) there is a small probability that those outcomes would be judged extremely good. Without the boundedness condition I would not expect the procedure to lead to good outcomes, and if the evaluator exclusively judged the AI’s behavior by how good the resulting outcome was, I would not necessarily expect good outcomes.
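The role of boundedness can be illustrated with a small calculation. This is a sketch with made-up numbers: `expected_score` and the two lotteries are hypothetical, chosen only to show how an unbounded score lets a tiny chance of an enormous payoff dominate, while a score confined to [0, 1] does not.

```python
from typing import List, Tuple

# A lottery is a list of (probability, score) pairs; values are illustrative.
Lottery = List[Tuple[float, float]]


def expected_score(lottery: Lottery) -> float:
    """Probability-weighted average of the scores in a lottery."""
    return sum(p * s for p, s in lottery)


def bound_scores(lottery: Lottery) -> Lottery:
    """Clamp every score into [0, 1], as the approval measure requires."""
    return [(p, max(0.0, min(1.0, s))) for p, s in lottery]


# A reliably good action versus a long shot with an astronomically
# large payoff in one branch and nothing otherwise.
safe = [(1.0, 0.9)]
long_shot = [(1e-6, 1e9), (1.0 - 1e-6, 0.0)]

unbounded_prefers_long_shot = expected_score(long_shot) > expected_score(safe)
bounded_prefers_long_shot = (
    expected_score(bound_scores(long_shot)) > expected_score(bound_scores(safe))
)
```

With unbounded scores the long shot wins on expectation; once scores are clamped, the same long shot is worth almost nothing and the safe action is preferred.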
Why is this desirable?
The most straightforward virtue of this proposal is that it is simple, can be made relatively precise, and appears likely to lead to good outcomes if implemented in the anticipated way. Indeed, there is some reason to think it would be as good as any proposal that we could think of (since we could arrive at such a proposal on reflection). See this post for an example of an attempt to make this proposal completely precise, and the later section of the same post for some arguments that the resulting outcome would be good. By contrast, I am not aware of many simple proposals that lead to desirable outcomes, or which have any realistic prospect of being rendered precise.
Perhaps more importantly, this proposal is robust in a number of very important respects:
Robust to error: if we attempt to specify what we value directly, we seem to run into a host of potential errors. We may be wrong about what we want, may use the wrong framework for reasoning under normative or empirical uncertainty, may have an incorrect view of epistemology or have incorrect prior beliefs, may make decisions in the wrong way, may be vulnerable to blackmail, may be incorrectly obsessed with small probabilities of big payoffs, may incorrectly ignore small probabilities of big payoffs, may place too much or too little emphasis on further philosophical inquiry, etc.
The approval-based proposal has the virtue that it can avoid essentially all of these problems, by (1) delegating the underlying philosophical questions to the process of reflection, which is in a substantially better position than we are to resolve them, and (2) allowing the evaluator to encourage the AI to revise arbitrarily fundamental aspects of its framework. For example, the evaluator could adopt a rule like “I approve of behavior if and only if it leads to the replacement of this AI by a new AI using framework Q.” This encourages an AI to reason at a very fundamental level about what kinds of design decisions its creators would have made differently given more reflection, and to make appropriate changes if it thinks the evaluator would prefer that it do so. It also allows the AI to make principled tradeoffs between course-corrections and more immediate concerns in a way which it believes best reflects the desires of its creators (since its creators can decide whether or not they feel it made such tradeoffs in an appropriate way and approve or disapprove as appropriate).
Robust to unfavorable circumstances: many proposals based on the promise of active collaboration between machine and human intelligence assume either that such collaboration is nearly as productive as unilateral action by machine intelligences, or assume that users of machine intelligences will coordinate to make (potentially modest) productivity sacrifices in exchange for greater fidelity. But in reality “keeping a human in the loop” may be difficult, especially in the long run as human labor becomes very expensive compared to computing resources. This solution does not require any assumptions about the availability of human labor, and it allows an AI to develop as rapidly as possible.
Robust to uncertainty: under realistic conditions a machine will be uncertain about what we would approve of. Because approval is bounded in [0, 1] there are limits on how much this uncertainty can distort its behavior. For example, suppose that the machine entertains a probability of 0.1 that we will behave in a very counter-intuitive way. If we behave in a counter-intuitive way, it may want to take counter-intuitive actions to maximize our approval. These might be actions which we would not approve of if we behaved in the intended way. However, because the machine can receive at most 0.1 more expected approval by satisfying our counter-intuitive judgment, and the anticipated case carries probability 0.9, this consideration will never justify sacrificing more than about 0.11 (that is, 0.1/0.9) expected approval conditioned on our behaving in the anticipated way. It seems quite likely that we will have a large enough margin of error that this will make almost no difference to the long-run outcome (see the other points in this section).
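The arithmetic behind this bound can be checked directly. The sketch below assumes a simple two-branch model that is my framing rather than anything specified in the text: with probability p we judge counter-intuitively, with probability 1 - p we judge as intended, and approval in each branch lies in [0, 1]. The function names and the example approval values are hypothetical. Under that model, a gain of at most p in expectation can offset a conditional loss of at most p / (1 - p), which is about 0.111 for p = 0.1.

```python
def expected_approval(p_counter: float, if_intended: float, if_counter: float) -> float:
    """Expected approval across the intended and counter-intuitive branches."""
    return (1.0 - p_counter) * if_intended + p_counter * if_counter


def max_justified_sacrifice(p_counter: float) -> float:
    """Largest loss of approval, conditional on the intended branch, that a
    gain of at most 1 in the counter-intuitive branch could ever offset."""
    if not 0.0 <= p_counter < 1.0:
        raise ValueError("p_counter must be in [0, 1)")
    return p_counter / (1.0 - p_counter)


P_COUNTER = 0.1
bound = max_justified_sacrifice(P_COUNTER)

# Baseline: an action scoring 0.9 if we judge as intended, 0.0 otherwise.
baseline = expected_approval(P_COUNTER, if_intended=0.9, if_counter=0.0)

# Sacrificing 0.05 (below the bound) for full counter-intuitive approval.
small_sacrifice = expected_approval(P_COUNTER, if_intended=0.85, if_counter=1.0)

# Sacrificing 0.20 (above the bound) for full counter-intuitive approval.
large_sacrifice = expected_approval(P_COUNTER, if_intended=0.70, if_counter=1.0)
```

Here the small sacrifice raises expected approval relative to the baseline and the large one lowers it, matching the bound.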
Robust to conceptual imprecision: when we use language with humans, the best guarantee we typically obtain is that our words probably mean approximately the same thing to us and to them, in the kinds of situations we have encountered so far. The meaning may not be exactly the same, and in unfamiliar environments the concepts may diverge considerably. Let’s say that a concept communicated in this way is (at best) “probably approximately correct” (PAC).
Maximizing a PAC concept of goodness is likely to be problematic; if there are any possible outcomes on which the PAC concept and our actual concept diverge, then the process of maximizing the concept is unusually likely to push us into regions of divergence. These problems become even greater if the domain in which an agent operates is quite different from the one in which the concept was trained, for example if an agent is making decisions which will have implications for the long-run future.
However, the approval-seeking proposal appears to be robust to the use of PAC concepts for a number of reasons:
- “What I would say” may vary as the concept of “me” is varied, but on most readings of “me” I will tend to say things that are similar on the axes that are most important to me. The variation due to conceptual uncertainty is channeled into dimensions where I am particularly ambivalent.
- By channeling “what I would say” through a process of reflection, we can avoid asking “what I would say” in some extreme situation. For example, we need not ask “would I approve of your decision if I somehow understood all of the long-term consequences?”; this is good, since PAC concepts of “what I would say” might diverge in such an unusual and extreme case (in which we would have to extrapolate what it means to “really understand” something far outside of the usual range of human abilities). Instead, we can ask “What kinds of decision-making procedures do you think would be slightly wiser than your current self?” This is a question which can be posed to me as I currently exist, and two people who agreed about my behavior in a broad enough range of everyday situations would be likely to agree about my answers to this question.
- The process of reflection and approval can be specified unusually precisely; for example, they can be given a purely formal definition in terms of a given input/output behavior. This ensures that only the concept of “I” is subject to significant conceptual imprecision.
- The concept of “my decision process” also admits a relatively precise definition, which minimizes the extent of imprecision. This is because the physical instantiation of a person can be relatively cleanly separated from their environment, and can be transplanted from one physical environment to another. This includes, for example, an environment in which they interact with a computer. When a person interacts with a computer, they can implement a completely precise input/output behavior (the computer provides an input on the monitor, the human provides an output by typing). Putting these ingredients together, we can relatively unambiguously understand a human as implementing a precise input/output behavior.
- The process of approval can act on behaviors in ordinary environments rather than merely acting on the long-term consequences of an AI’s decisions, and so can yield good outcomes even if no extreme situations are ever encountered. For example, we might approve of an AI’s behavior only if it avoids doing anything that unnecessarily reduces human influence, or we can approve of the AI’s behavior only if it avoids making radical changes to its environment. At the same time, the process of approval can meaningfully depend on the long-term consequences of outcomes, and not merely short-term outcomes.
Comparison with CEV
This proposal has many important similarities to Eliezer Yudkowsky’s CEV; I read that document some time ago and I expect it has had an influence on my thinking. The primary difference is that this proposal relies on iterating the single relation “A → decision-making procedure that A endorses,” defined in a relatively straightforward way (for example, by presenting A with the tools to define a new decision-making procedure), rather than relying on a more complex understanding of motivation, reflection, and idealization. One consequence is that this proposal is more amenable to formalization. Another is that this proposal leaves room for A to err (or for A’s replacement to err, and so on).
The difference seems to be due primarily to disagreement about “how good does the initial decision-making dynamic need to be, in order to avoid an unacceptable risk of significant error?” My overall view is that we can adopt a relatively straightforward rule such as this one and improve the rule with relatively straightforward modifications (such as taking a long time to think, consulting many copies of ourselves, having a discussion with a slightly enhanced version of ourselves to verify that their reasoning remains convincing to us, and a very wide range of less pithy techniques). If it is important to resolve hard philosophical problems, for example those required to define “volition,” this can then be done from a much-enhanced future state. It may be that a relatively straightforward rule is not sufficient, but in that case it seems clear that the existing human society is already doomed: after all, the process we are currently using to decide what successors to create is just one example of a particular simple rule for deciding on a decision-making procedure that we endorse.
A more charitable interpretation of this criticism is that one person operating in isolation is at a significant disadvantage, when compared to the existing community of thinkers. I find this objection relatively uncompelling, because the move from “what I would do” to “what I would do with appropriate experimental feedback” to “what a bunch of copies of me would do with appropriate experimental feedback” to “what our society would do under ideal conditions with appropriate experimental feedback” does not seem very complex or difficult. I agree that it may be more robust to work out formal versions of those decision-making procedures in advance, prior to communicating the objective to an AI, rather than trying to bake them into the reflective process. However, if this is the only sticking point with this proposal, it seems to be an engineering difficulty which should be addressed in the future as it arises.
Another possible complaint is that CEV is handicapped in a way that is advantageous, by preventing the individuals owning the AI from securing a personal advantage. By contrast, I am mostly interested in the transition to AI because I think it presents a unique risk that the influence of humane values will be diminished. From this perspective, I am happy with proposals which simply allow individuals to maintain control over AIs of their own construction, and I expect this would lead to a world with broadly similar prospects to our own. It may further be the case that a transition to AI could lead to massive inequality; I would prefer to tackle this as a separate (probably primarily political) problem. I am open to the possibility that technical work could help address that problem, but I don’t see a comparably good argument that we should be doing that technical work now. (If I wanted to work on that problem now, I would probably pursue policies aimed at promoting more equal ownership of capital, or policies aimed particularly at ensuring that the fruits of an AI-based productivity boost were broadly shared.) It may also be the case that one could use a transition to AI as an opportunity to achieve a substantial change in social values, but in general I would be quite hesitant about such proposals.