Suppose that I have in hand a perfect model of my decision-making process, and I am interested in using this to define what I would believe, want, or do “upon reflection.” That is, in general I can use this model to define my current best guess as to the answer, but I might also be interested in talking about my “enlightened judgment,” if I knew all of the facts and considered all of the arguments and were more the person I wish I were and so on. Can we give a satisfactory formal definition of my enlightened judgment in terms of this literal model of my decision-making process?
Suppose I want to provide a completely precise specification of “me,” or rather of the input/output behavior that I implement. How can I do this? I might be interested in this problem, for example, because it appears to be a primary difficulty in providing a precise specification of “maximize the extent to which I would approve of your decision upon reflection.” (I have suggested that we would be happy with a powerful AI that made decisions according to this maxim.)
I have written about this issue in the past; in this post I’ll outline a slightly improved scheme (now with 100% fewer Faraday cages). The technical changes are relatively modest, but I’m also taking a somewhat different approach to the issue, and overall I think it seems much more like the kind of thing that could actually be done. I also want to take the opportunity to clarify and expand the exposition somewhat, since I think that the amount of discussion and thought that this idea has received now vastly surpasses the amount of care that went into crafting the original exposition.
I welcome additional objections to this scheme. As usual I think the literal proposal laid out here is extremely unlikely to ever be used. However, finding problems with this proposal can still be useful for shedding light on the problem, and in particular on how difficult it is and where the difficulties lie.
Will machine intelligences communicate with humans by directly exposing or reporting properties of their internal state, or will they tend to communicate by strategically choosing utterances that they think will have the intended effect on the listener? In this post I try to lay out the distinction more clearly and describe some differences.
I suggest that it would be desirable if powerful agents acted in the way we would most approve of, upon reflection. I argue that this definition can be made unusually precise, would lead to positive outcomes, and is robust to conceptual imprecision, philosophical errors, and unfavorable circumstances.
Suppose that I have hired a group of employees who are much smarter than I am. For some tasks it’s easy to get useful work out of them. For example, suppose I am interested in finding a good layout for the components on a chip and can easily evaluate the quality of a proposed layout. Then I can solicit proposals, test the proposals, and reward the employees according to the results.
However, for some tasks there may be fundamental problems with such an incentive-based policy. For example, suppose I am interested in hiring these employees to design an AI which is smarter than any of us and will act autonomously in the world. In cases like this, I will have to try something different.
I want to get in the habit of sharing more of my unpolished thoughts about topics I consider important. The hope is to shift from an equilibrium where I say little (and therefore feel like I’m endorsing whatever I do say as being of unusually high quality, causing me to say even less) to an equilibrium where I say much more and feel more comfortable sharing unpolished thoughts. I think “quiet” is an OK equilibrium (most people who should read some of my thoughts shouldn’t read most of them, and it would make sense for me to try to be selective). But it seems like a suboptimal equilibrium, since there are at least a few people who do care what I think, often to better understand our disagreements.
A similar social problem is possible, where a broader intellectual community tends towards a “quiet” equilibrium in which any public speech comes with an implicit claim of interestingness and worthwhileness. I think intellectual communities vary considerably in where they sit on this spectrum. Bloggers qua bloggers often write quite a bit; academics in computer science tend to be much more guarded about what they say. I think there are some virtues to the noisier equilibrium, particularly in increasing our ability and inclination to notice, understand, and resolve disagreements.
Anyway, in that spirit, here is some of my thinking about AI: an outline of the development scenario I consider most likely, and a general discussion of the impacts of consequentialist automation. Criticism is welcome, and you can leave comments on the (quite hard-to-navigate) Google Doc here.
Intuitively, it seems that any simple hypothesis should receive a reasonable prior probability. It seems unreasonable to assign a simple hypothesis a ridiculously low probability. In some sense this is a matter of preference, but here I’ll briefly argue:
If H is a simple hypothesis and the prior probability of H is very small, then there is nothing you could possibly see which would convince you of H.
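One way to see the arithmetic behind this claim is Bayes' rule in odds form: posterior odds equal prior odds times the likelihood ratio of the evidence. The sketch below is only an illustration of that standard identity, not a reconstruction of the full argument; the function names and the prior of 1e-30 are assumptions chosen for the example.

```python
def posterior(prior: float, likelihood_ratio: float) -> float:
    """Posterior P(H | E) from prior P(H) and likelihood ratio P(E | H) / P(E | not H)."""
    prior_odds = prior / (1.0 - prior)
    posterior_odds = prior_odds * likelihood_ratio
    return posterior_odds / (1.0 + posterior_odds)


def required_likelihood_ratio(prior: float, target: float) -> float:
    """Likelihood ratio needed to move P(H) from `prior` up to `target`."""
    return (target / (1.0 - target)) / (prior / (1.0 - prior))


tiny_prior = 1e-30  # assumed "ridiculously low" prior, chosen only for illustration

# Evidence a million times likelier under H than under not-H still leaves H very improbable.
after_strong_evidence = posterior(tiny_prior, 1e6)

# To reach even odds, the evidence must be roughly 1/prior times likelier under H.
needed = required_likelihood_ratio(tiny_prior, 0.5)

print(after_strong_evidence)
print(needed)
```

On this reading, if every observation you could make has some alternative explanation (error, hallucination, deception) whose probability is far above the prior of H, then no available evidence supplies a likelihood ratio that large, which is the sense in which nothing you could see would convince you.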