I suggest that it would be desirable if powerful agents acted in the way we would most approve of, upon reflection. I argue that this definition can be made unusually precise, would lead to positive outcomes, and is robust to conceptual imprecision, philosophical errors, and unfavorable circumstances.
Continue reading


Adversarial collaboration

Suppose that I have hired a group of employees who are much smarter than I am. For some tasks it’s easy to get useful work out of them. For example, suppose I am interested in finding a good layout for the components on a chip and can easily evaluate the quality of a proposed layout. Then I can solicit proposals, test the proposals, and award the employees according to the results.

However, for some tasks there may be fundamental problems with such an incentive-based policy. For example, suppose I am interested in hiring these employees to design an AI which is smarter than any of us and will act autonomously in the world. In cases like this, I will have to try something different.

Continue reading

The motivated simulator argument

Sometimes we may want to formally talk about objects that actually exist in the world, using mathematical language. One way to try to do this is by recording some sequence of observations about the world, and then applying Solomonoff induction. The hope would then be: if you apply Solomonoff induction to the sequence of things you’ve seen so far, it will correctly predict whatever you next see. In this post I’ll describe a problem with applying this approach to anything particularly important.

Continue reading


Suppose that we developed software oracles which could apply large amounts of computational power to solving any formally specified problem (say, you could pay $1k for a human-equivalent). For example, such oracles could find good moves in games which could be characterized completely, or prove theorems in areas which could be rigorously axiomatized, or design machines to achieve a formally specified goal in physical systems which can be modeled algorithmically. What would happen, when these oracles became widely available? Continue reading

A formalization of indirect normativity

This post outlines a formalization of what Nick Bostrom calls “indirect normativity.” I don’t think it’s an adequate solution to the AI control problem; but to my knowledge it was the first precise specification of a goal that meets the “not terrible” bar, i.e. which does not obviously lead to terrible consequences if pursued without any caveats or restrictions. Continue reading

Decision-Theoretic Hazards

Getting decision theory right seems to be an important step towards comprehensible AI, of the sort that might be described as maximizing something which its creator understands or (hopefully) chooses. This seems important in itself, but understanding decision theory is also valuable for avoiding a variety of decision-theoretic hazards which might lead to unanticipated behavior. Continue reading

Formal Instructions Revisited

Previously I’ve talked about getting a handle on objects of interest (humans, counterfactual civilizations) and on issuing an instruction of the form “Simulate this civilization.”  Here is a much better proposal for issuing formal instructions.

Suppose we can build a something-like-TDT agent, which controls its behavior to maximize a constant U defined by some axioms (or as the output of some program).  We want to run this agent with a utility function reflecting our own preferences, but we don’t have (formal) access to those preferences. Continue reading