Approval-seeking

I suggest that it would be desirable if powerful agents acted in the way we would most approve of, upon reflection. I argue that this definition can be made unusually precise, would lead to positive outcomes, and is robust to conceptual imprecision, philosophical errors, and unfavorable circumstances.
Continue reading

Some rambling AI prognostication

I want to get in the habit of sharing more of my unpolished thoughts about topics I consider important. The hope is to shift from an equilibrium where I say little (and therefore feel like I'm endorsing whatever I do say as unusually high quality, which causes me to say even less) to an equilibrium where I say much more and feel comfortable sharing unpolished thoughts. I think "quiet" is an OK equilibrium (most people who should read some of my thoughts shouldn't read most of them, so it would make sense for me to try to be selective). But it seems like a suboptimal equilibrium, since there are at least a few people who do care what I think, often because they want to better understand our disagreements.

A similar social problem is possible, where a broader intellectual community tends towards a “quiet” equilibrium in which any public speech comes with an implicit claim of interestingness and worthwhileness. I think intellectual communities vary considerably in where they sit on this spectrum. Bloggers qua bloggers often write quite a bit; academics in computer science tend to be much more guarded about what they say. I think there are some virtues to the noisier equilibrium, particularly in increasing our ability and inclination to notice, understand, and resolve disagreements.

Anyway, in that spirit, here is some of my thinking about AI: an outline of the development scenario I consider most likely, and a general discussion of the impacts of consequentialist automation. Criticism is welcome, and you can leave comments on the (quite hard-to-navigate) Google doc here.

Improbable simple hypotheses are unbelievable

Intuitively, any simple hypothesis should receive a reasonable prior probability; it seems unreasonable to assign a simple hypothesis a ridiculously low one. In some sense this is a matter of preference, but here I'll briefly argue:

If H is a simple hypothesis and the prior probability of H is very small, then there is nothing you could possibly see which would convince you of H.
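To see why quantitatively, note that by Bayes' rule the posterior odds of H are its prior odds multiplied by the likelihood ratio of the evidence, so no observation can raise an astronomically small prior very far. A minimal sketch (the function name and the particular numbers are my own, chosen for illustration):

```python
def posterior(prior, likelihood_ratio):
    """Bayes' rule in odds form: posterior odds = prior odds * likelihood ratio."""
    prior_odds = prior / (1 - prior)
    post_odds = prior_odds * likelihood_ratio
    return post_odds / (1 + post_odds)

# Suppose a simple hypothesis H is assigned an absurdly low prior, say 2^-1000.
prior = 2.0 ** -1000

# Even evidence that is 10^18 times likelier under H than under not-H
# barely moves the posterior:
p = posterior(prior, 1e18)
print(p)  # still astronomically small, on the order of 10^-283
```

Any evidence you could actually collect has a bounded likelihood ratio, so a sufficiently extreme prior can never be overcome.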

Continue reading

Are you special? Pascal’s wager, anthropic reasoning, and decision theory

Here is an argument that some people might find compelling:

It may be that the world is mad, and that as the only sane person around it falls on me to make sure we don’t all kill ourselves. If that’s the case, then my impact on the world may be huge. Let’s say that in this case, I can improve the world by 1%.

Maybe the claim that I’m particularly influential, call it proposition P, isn’t certain. But at least there’s a good chance.  Subjectively it feels like about 1%, since if I looked at 100 similarly surprising facts, I would expect one of them to be true. (I wouldn’t be that surprised to discover that I’m the most important person ever…) That still leaves me with the ability to improve the world by 0.01% in expectation, which looks pretty good. I might as well not even worry about stuff I could do that would improve the world by a mere 0.001%, like being an extraordinarily successful entrepreneur.

Continue reading

Incomprehensible utility functions

(Apologies: this post will probably be incomprehensible.)

In my post on Indirect Normativity, I describe the definition of a very complicated utility function U. A U-maximizer has a very interesting task: it is supposed to optimize a utility function that is completely opaque, and it is supposed to care about worlds (alternatively, assign worlds prior probabilities) according to a metric which is too complicated to understand! A U-maximizer may be able to understand the implications of the definition after amassing a great deal of computational power, or by understanding the process that generated the definition, but at first it would have no idea what U was. And if you are a U-maximizer, and you don’t know anything about what U is, you might as well behave randomly. This is problematic for two reasons: Continue reading

Recognizing a Halting Oracle

Suppose we found a black box which purported to be a halting oracle. You feed the box an input, and it outputs 0 or 1. As it turns out, we couldn't verify such a claim. Nevertheless, we could tell that such a box was pretty magical. Exactly how much could we tell? That is, if you ask the box a question of the form "Does Turing machine M halt?", when can you verify that its answer is accurate? Note that, in order to do this verification, you may be able to ask the box some further questions. In general, you can use halting queries to make the box simulate a deterministic prover's role in an interactive proof protocol (e.g., ask it whether the machine that computes the lexicographically first valid prover strategy, and then halts iff that strategy outputs 1 at the current stage, halts).

From a pure computability perspective, nothing interesting is going on: if you could learn something interesting from an untrusted halting oracle, you could have dispensed with the box and simply looped through all strings until you found one that convinced you to trust it. The situation only becomes interesting from a computational complexity perspective.

[Disclaimer: I don’t think this is useful.] Continue reading

Generalized abstract incentives problems, and simulation arms races

Suppose we could build agents with arbitrary formally specified preferences, but that specifying human values turns out to be too challenging. A second approach to getting what we want is to give an agent relatively limited goals and capabilities, hopefully encouraging cooperation while we build more cooperative agents, or figure out how to better specify what exactly we want. If the agent is much smarter than us and perfectly selfish, this may be tricky; but if the agent is limited in some other way, it may not be impossible. For example, you could try to build an agent which only cared about getting a cookie in the next hour, for some definition of "cookie" such that you are the only person who owns any cookies. Or you could try to build an agent which knew and cared only about abstract mathematical facts, and come up with useful problems it could solve in that domain. Right now I want to talk about the first of these approaches, and highlight some difficulties. Continue reading

Speed Prior and MWI

The speed prior is an analog of the universal distribution which penalizes computationally expensive hypotheses. The speed prior is of theoretical interest, but it is also justified if we suspect that we are living in a simulation: computationally cheaper hypotheses require fewer resources, so we should expect them to be simulated more times (if constant resources are dedicated to several independent simulations, a simulation which is twice as expensive can only be run half as many times).
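As a toy illustration of the weighting involved (this sketch is my own simplification; in a Levin-style speed prior a program of length l bits running for t steps contributes weight proportional to 2^-(l + log2 t), so doubling the running time costs as much as one extra bit of description):

```python
from math import log2

def speed_prior_weight(program_length_bits, running_time):
    """Toy speed-prior weight for a single program: 2^-(length + log2(time)).
    Doubling the running time halves the weight, just like adding one bit."""
    return 2.0 ** -(program_length_bits + log2(running_time))

# Two hypotheses with the same 100-bit description, but the second is
# 1000x more expensive to simulate:
cheap = speed_prior_weight(100, 10**6)
expensive = speed_prior_weight(100, 10**9)
print(cheap / expensive)  # the cheap hypothesis gets 1000x the weight
```

This matches the simulation argument above: a simulation that is 1000x as expensive can only be run 1/1000th as many times on a fixed resource budget.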

It has been argued that the speed prior implies that wave function collapse is probably real, because simulating other branches would be computationally expensive, and that we should not expect quantum computation to work. This strikes me as naive: we have practically smuggled in the hypothesis of a classical universe by introducing the speed prior (or at least by suggesting that it reflects reasonable beliefs for an observer being simulated).  Continue reading

Avoiding Simulation Warfare with Bounded Complexity Measures

Suppose that I try to define a human's decision process by observing some decisions and conditioning the universal prior on agreement with those decisions (see here). I have argued that the behavior of the result on new decisions is going to be dominated by the winner of a certain simulation arms race: most of the mass of the universal prior will be distributed amongst simulations run throughout our universe, and whoever controls those simulations determines the resulting posterior.

This state of affairs is pretty much terrible, and in general the obstruction seems so bad that we shouldn’t ever try to specify anything formally by conditioning the universal distribution and hoping to get a pointer to our universe.

The problem is that describing an event as “that thing in the universe which starts off looking like this…” is dangerous. Instead, we would really like to specify a human’s decision process by saying “you arrange all of these atoms like so, and then you apply these laws of physics…” (This would also deal with the problems with specifying counterfactuals, although those seem to be less severe.)

This is likely to lead to a much larger description, but one which can't be nearly so easily controlled by a simulator. Fortunately, though the resulting description is longer, it is still a very good compression of the human's behavior, and it also has the virtue of being computationally inexpensive. So we may hope to pinpoint this description by using a prior which weights explanations not by their Kolmogorov complexity, but by some different complexity measure which favors computationally inexpensive simulations.

Continue reading

Entropy, Reversibility, and Uncomputation

Summary: Anything you can compute using an infinite amount of negentropy, time T, and space S, can be computed using only about S negentropy if you are willing to spend a little bit of extra space and time (S log T and T^1.5, for example, or S and 2^S * T). So future universes may be constrained by available space and time rather than negentropy, and the potential for computation may be exponentially larger than we would otherwise expect. Continue reading
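The trick behind these tradeoffs is uncomputation: run a reversible computation forward, copy out the answer, then run it backward to return every ancilla bit to zero, so that only the copied-out answer ever needs to be irreversibly erased. A minimal sketch in terms of classical reversible gates (the bit-level simulation and gate helpers are my own illustration):

```python
def toffoli(state, a, b, c):
    """Reversible AND: flip bit c iff bits a and b are both 1."""
    if state[a] and state[b]:
        state[c] ^= 1

def cnot(state, a, b):
    """Reversible copy: flip bit b iff bit a is 1."""
    if state[a]:
        state[b] ^= 1

# Compute x AND y while leaving the ancilla clean:
# compute into an ancilla, copy the answer out, then uncompute the ancilla.
x, y = 1, 1
state = [x, y, 0, 0]  # two inputs, one ancilla bit, one output bit

toffoli(state, 0, 1, 2)   # step 1: ancilla := x AND y
cnot(state, 2, 3)         # step 2: copy the answer into the output bit
toffoli(state, 0, 1, 2)   # step 3: uncompute, restoring the ancilla to 0

print(state)  # [1, 1, 0, 1]: ancilla back to 0, answer held in the output bit
```

Since every step is its own inverse (or has an obvious inverse), the whole computation dissipates no negentropy beyond what is needed to record the final answer; the space-time overheads in the summary come from managing the intermediate state needed to run long computations backward.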