Motivation

This blog contains a lot of material that is very weird. A natural question is: why write about such weird things? Many of the situations I’ve considered or problems I’ve written about involve improbable hypotheticals. We probably won’t ever have a powerful utility maximizer to which we must provide a precisely defined utility function, for example. Why not focus on designing systems that are likely to actually get built?

The reason should not be so unfamiliar to a mainstream academic: though these problems may be too simple to be directly applicable, thinking about them sheds light on fundamental philosophical and mathematical issues. The heuristic is something like: where there are simple, fundamental problems on which it is possible to make progress, you should do it, even if you can’t see exactly why or how that progress will be useful.

Generalized abstract incentives problems, and simulation arms races

Suppose we could build agents with arbitrary formally specified preferences, but that specifying human values turns out to be too challenging. A second approach to getting what we want is to give an agent relatively limited goals and capabilities, hopefully encouraging cooperation and helping us build more cooperative agents, or figure out how to better specify what exactly we want. If the agent is much smarter than us and perfectly selfish it may be a bit tricky; but if the agent is limited in some other way it may not be impossible. For example, you could try to build an agent which only cared about getting a cookie in the next hour, for some definition of “cookie” such that you are the only person who owns any cookies. Or you could try to build an agent which knew and cared only about abstract mathematical facts, and come up with useful problems it could solve in that domain. Right now I want to talk about the first approach, and highlight some difficulties.

Oracles

Suppose that we developed software oracles which could apply large amounts of computational power to solving any formally specified problem (say, you could pay $1k for a human-equivalent). For example, such oracles could find good moves in games which could be characterized completely, or prove theorems in areas which could be rigorously axiomatized, or design machines to achieve a formally specified goal in physical systems which can be modeled algorithmically. What would happen when these oracles became widely available?

A formalization of indirect normativity

This post outlines a formalization of what Nick Bostrom calls “indirect normativity.” I don’t think it’s an adequate solution to the AI control problem; but to my knowledge it was the first precise specification of a goal that meets the “not terrible” bar, i.e. which does not lead to terrible consequences if pursued without any caveats or restrictions.

Decision-Theoretic Hazards

Getting decision theory right seems to be an important step towards comprehensible AI, of the sort that might be described as maximizing something which its creator understands or (hopefully) chooses. This seems important in itself, but understanding decision theory is also valuable for avoiding a variety of decision-theoretic hazards which might lead to unanticipated behavior.

Formal Instructions Revisited

Previously I’ve talked about getting a handle on objects of interest (humans, counterfactual civilizations) and on issuing an instruction of the form “Simulate this civilization.”  Here is a much better proposal for issuing formal instructions.

Suppose we can build a something-like-TDT agent, which controls its behavior to maximize a constant U defined by some axioms (or as the output of some program).  We want to run this agent with a utility function reflecting our own preferences, but we don’t have (formal) access to those preferences.

“Proof” of Friendliness

The humans are about to launch their best effort at a friendly singularity. Of course, they are careful and wise and have exceeded all reasonable expectations for caution and rigor.

Before building FAI you built an oracle AI to help you. With its help, you found a mathematical definition of U, the utility of humanity’s extrapolated volition (or whatever). You were all pretty pleased with yourselves, but you didn’t stop there: you found a theory of everything, located humanity within it, and wrote down the predicate F(X) = “The humans run the program described by X.”

To top it off, with the help of your oracle AI you found the code for a “best possible AI”, call it FAI, and a proof of the theorem:

“There exists a constant Best such that U ≤ Best, but F(FAI) implies U = Best.”

Each of these steps you did with incredible care. You have proved beyond reasonable doubt that U and F represent what you want them to.

You present your argument to the people of the world. Some people object to your reasoning, but it is airtight: if they choose to stop you from running FAI, they will still receive U ≤ Best, so why bother?

Now satisfied and with the scheduled moment arrived, you finally run FAI. Promptly the oracle AI destroys civilization and spends the rest of its days trying to become as confident as possible that Tic-Tac-Toe is really a draw (like you asked it to, once upon a time).

Just a lighthearted illustration that decision theory isn’t only hard for AI.

(Disclaimer: this narrative claims to represent reality only insofar as it is mathematically plausible.)

Edit: I think the moral was unclear. The point is: in fact F(FAI), and so in fact U = Best so U ≤ Best. Everything was as claimed and proven. But this doesn’t change the fact that you would rather not use this procedure.
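
One way to see how little the theorem buys you is to check how cheaply a statement of that shape can be proven. The Lean sketch below is only an illustration under simplifying assumptions that are not part of the story: utility is modeled as a natural number, F stands in for the proposition F(FAI), and the name guarantee_is_free is made up. Under those assumptions the statement holds for every U and every F, simply by taking Best to be U itself.

```lean
-- Hypothetical simplification: utility is a natural number, and F stands for
-- the proposition F(FAI), "the humans run the program described by FAI".
theorem guarantee_is_free (U : Nat) (F : Prop) :
    ∃ Best : Nat, U ≤ Best ∧ (F → U = Best) :=
  -- Choose Best := U. Then U ≤ U holds, and U = U holds whether or not F does.
  ⟨U, Nat.le_refl U, fun _ => rfl⟩
```

If this reading is right, a proof of the theorem is compatible with U being anything at all, including the outcome in the story.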