I’ve recently spent some more time thinking about speculative issues in AI safety:
- Ideas for building useful agents without goals: approval-directed agents, approval-directed bootstrapping, and optimization and goals. I think this line of reasoning is very promising.
- A formalization of one piece of the AI safety challenge: the steering problem. I am eager to see more precise, high-level discussion of AI safety, and I think this article is a helpful step in that direction. Since articulating the steering problem I have become much more optimistic about versions of it being solved in the near term. This mostly means that the steering problem fails to capture the hardest parts of AI safety. But it’s still good news, and I think it may eventually cause some people to revise their understanding of AI safety.
- Some ideas for getting useful work out of self-interested agents, based on arguments: of arguments and wagers, adversarial collaboration [older], and delegating to a mixed crowd. I think these are interesting ideas in an interesting area, but they have a long way to go before they could be useful.
I’m excited about a few possible next steps:
My current attitude towards the Löbian obstacle is “just live with it.” This post outlines that view and some of the underlying intuitions.
To show I’m being a good sport, I’ll also propose a new reflection principle. But my best guess for the right answer remains “live with it.”
My current preferred formalization of extrapolation of an agent’s preferences rests on imagining what would happen if that agent were provided with an idealized environment in which it could undergo an extensive process of reflection. It is clear that this is not a completely satisfactory account, though there is uncertainty about whether it is “good enough” for the intended use case.
One crisp difficulty is the following: this approach relies completely on the agent wanting you to know its extrapolated preferences.
Suppose that I have in hand a perfect model of my decision-making process, and I am interested in using this to define what I would believe, want, or do “upon reflection.” That is, in general I can use this model to define my current best guess as to the answer, but I might also be interested in talking about my “enlightened judgment”: what I would conclude if I knew all of the facts, considered all of the arguments, were more the person I wish I were, and so on. Can we give a satisfactory formal definition of my enlightened judgment in terms of this literal model of my decision-making process?
Suppose I want to provide a completely precise specification of “me,” or rather of the input/output behavior that I implement. How can I do this? I might be interested in this problem, for example, because it appears to be a primary difficulty in providing a precise specification of “maximize the extent to which I would approve of your decision upon reflection.” (I have suggested that we would be happy with a powerful AI that made decisions according to this maxim.)
I have written about this issue in the past; in this post I’ll outline a slightly improved scheme (now with 100% fewer Faraday cages). The technical changes are relatively modest, but I’m also taking a somewhat different approach to the issue, and overall I think it seems much more like the kind of thing that could actually be done. I also want to take the opportunity to clarify and expand the exposition somewhat, since I think that the amount of discussion and thought that this idea has gotten now vastly surpasses the amount of care that went into crafting the original exposition.
I welcome additional objections to this scheme. As usual I think the literal proposal laid out here is extremely unlikely to ever be used. However, finding problems with this proposal can still be useful for shedding light on the problem, and in particular on how difficult it is and where the difficulties lie.
Will machine intelligences communicate with humans by directly exposing or reporting properties of their internal state, or will they tend to communicate by strategically choosing utterances that they think will have the intended effect on the listener? In this post I try to lay out the distinction more clearly and describe some differences.
I suggest that it would be desirable if powerful agents acted in the way we would most approve of, upon reflection. I argue that this definition can be made unusually precise, would lead to positive outcomes, and is robust to conceptual imprecision, philosophical errors, and unfavorable circumstances.
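As a rough illustration only, the maxim "act in the way we would most approve of" can be read as choosing, among candidate actions, the one with the highest estimated approval. The sketch below assumes a hypothetical `approval` function and made-up toy ratings; it says nothing about how approval "upon reflection" would actually be defined or estimated, which is the hard part.

```python
from typing import Callable, Iterable, TypeVar

Action = TypeVar("Action")


def choose_action(
    actions: Iterable[Action],
    approval: Callable[[Action], float],
) -> Action:
    """Return the candidate action with the highest estimated approval.

    `approval` is a hypothetical stand-in for "how much we would approve
    of this action upon reflection"; defining it is not addressed here.
    """
    candidates = list(actions)
    if not candidates:
        raise ValueError("no candidate actions to choose from")
    return max(candidates, key=approval)


# Toy, made-up ratings used purely for illustration.
toy_ratings = {
    "ask for clarification": 0.9,
    "act unilaterally": 0.2,
    "do nothing": 0.5,
}

best = choose_action(toy_ratings, lambda action: toy_ratings[action])
```

The selection step itself is trivial; all of the difficulty is hidden inside `approval`.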