I suggest that it would be desirable if powerful agents acted in the way we would most approve of, upon reflection. I argue that this definition can be made unusually precise, would lead to positive outcomes, and is robust to conceptual imprecision, philosophical errors, and unfavorable circumstances.
Suppose that I have hired a group of employees who are much smarter than I am. For some tasks it may be easy to get useful work out of them: for example, if I am interested in finding a good layout for the components on a chip and can easily evaluate the quality of a proposed layout, I can simply solicit proposals, test the proposals, and reward the employees according to the quality of their proposals.
However, for some kinds of tasks there may be fundamental problems with adopting this kind of incentive-based policy. For example, suppose I am interested in working with these employees to build an AI which is not only much smarter than any of us, but which will act autonomously in support of my values even when I can’t monitor its behavior. In cases like this, I will have to try something different.
I want to get in the habit of sharing more of my unpolished thoughts about topics I consider important. The hope is to shift from an equilibrium where I say little (and therefore feel like I’m endorsing whatever I do say as having an unusually high quality, causing me to say even less), to an equilibrium where I say much more and feel more comfortable sharing unpolished thoughts. I think “quiet” is an OK equilibrium (most people who should read some of my thoughts shouldn’t read most of them, and it would make sense for me to try and be selective). But it seems like a suboptimal equilibrium, since there are at least a few people who do care what I think, often to better understand our disagreements.
A similar social problem is possible, where a broader intellectual community tends towards a “quiet” equilibrium in which any public speech comes with an implicit claim of interestingness and worthwhileness. I think intellectual communities vary considerably in where they sit on this spectrum. Bloggers qua bloggers often write quite a bit; academics in computer science tend to be much more guarded about what they say. I think there are some virtues to the noisier equilibrium, particularly in increasing our ability and inclination to notice, understand, and resolve disagreements.
Anyway, in that spirit, here is some of my thinking about AI—an outline of the development scenario I consider most likely, and a general discussion of the impacts of consequentialist automation. Criticism is welcome, and you can leave comments on the (quite hard-to-navigate) google doc here.
Intuitively, it seems that any simple hypothesis should receive a reasonable prior probability. It seems unreasonable to assign a simple hypothesis a ridiculously low probability. In some sense this is a matter of preference, but here I’ll briefly argue:
If H is a simple hypothesis and the prior probability of H is very small, then there is nothing you could possibly see which would convince you of H.
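To get a feel for the quantities involved, here is a minimal sketch of the odds form of Bayes' rule. It is not the argument of the post itself, only an illustration of how strong the evidence must be to overcome a very small prior. The function names and the example prior of 10^-30 are assumptions chosen for illustration.

```python
from fractions import Fraction


def posterior(prior: Fraction, likelihood_ratio: Fraction) -> Fraction:
    """P(H | E), given P(H) and the likelihood ratio P(E | H) / P(E | not H)."""
    prior_odds = prior / (1 - prior)
    posterior_odds = prior_odds * likelihood_ratio
    return posterior_odds / (1 + posterior_odds)


def required_likelihood_ratio(prior: Fraction, target: Fraction) -> Fraction:
    """Smallest likelihood ratio that moves `prior` up to `target`."""
    target_odds = target / (1 - target)
    prior_odds = prior / (1 - prior)
    return target_odds / prior_odds


# Hypothetical "ridiculously low" prior for a simple hypothesis H.
tiny_prior = Fraction(1, 10**30)

# Evidence needed before H becomes as likely as not.
needed = required_likelihood_ratio(tiny_prior, Fraction(1, 2))

# Evidence that is 1000 times likelier under H barely moves the posterior.
after_strong_evidence = posterior(tiny_prior, Fraction(1000))
```

With these numbers, `needed` is 10^30 - 1: an observation would have to be roughly 10^30 times more likely under H than under its negation before H reached even odds.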
Here is an argument that some people might find compelling:
It may be that the world is mad, and that as the only sane person around it falls on me to make sure we don’t all kill ourselves. If that’s the case, then my impact on the world may be huge. Let’s say that in this case, I can improve the world by 1%.
Maybe the claim that I’m particularly influential, call it proposition P, isn’t certain. But at least there’s a good chance. Subjectively it feels like about 1%, since if I looked at 100 similarly surprising facts, I would expect one of them to be true. (I wouldn’t be that surprised to discover that I’m the most important person ever…) That still leaves me with the ability to improve the world by 0.01% in expectation, which looks pretty good. I might as well not even worry about stuff I could do that would improve the world by a mere 0.001%, like being an extraordinarily successful entrepreneur.
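The arithmetic in this argument can be written out explicitly. The sketch below only reproduces the numbers stated above (a 1% improvement, a 1% credence in P, and a 0.001% alternative); the variable names and the break-even calculation are additions for illustration, not part of the original argument.

```python
from fractions import Fraction

impact_if_p_true = Fraction(1, 100)          # "improve the world by 1%"
credence_in_p = Fraction(1, 100)             # subjective probability of P, "about 1%"
entrepreneur_impact = Fraction(1, 100_000)   # "a mere 0.001%"

# Expected improvement from acting on P.
expected_impact = impact_if_p_true * credence_in_p

# How many times larger this is than the entrepreneurial alternative.
ratio = expected_impact / entrepreneur_impact

# Credence in P at which the two options would be equally good in expectation.
break_even_credence = entrepreneur_impact / impact_if_p_true


def as_percent(x: Fraction) -> str:
    """Format a fraction of the world's value as a percentage string."""
    return f"{float(x * 100):g}%"
```

Here `expected_impact` is 1/10000 (0.01%), ten times the alternative, and the comparison only flips if the credence in P falls below 0.1%. The conclusion therefore rests heavily on how the 1% credence was arrived at.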
Sometimes we may want to formally talk about objects that actually exist in the world, using mathematical language. One way to try to do this is by recording some sequence of observations about the world, and then applying Solomonoff induction. The hope would then be: if you apply Solomonoff induction to the sequence of things you’ve seen so far, it will correctly predict whatever you next see. In this post I’ll describe a problem with applying this approach to anything particularly important.
(Apologies: this post will probably be incomprehensible.)
In my post on Indirect Normativity, I describe the definition of a very complicated utility function U. A U-maximizer has a very interesting task: it is supposed to optimize a utility function that is completely opaque, and it is supposed to care about worlds (alternatively, assign worlds prior probabilities) according to a metric which is too complicated to understand! A U-maximizer may be able to understand the implications of the definition after amassing a great deal of computational power, or by understanding the process that generated the definition, but at first it would have no idea what U was. And if you are a U-maximizer, and you don’t know anything about what U is, you might as well behave randomly. This is problematic for two reasons: …