I suggest that it would be desirable if powerful agents acted in the way we would most approve of, upon reflection. I argue that this definition can be made unusually precise, would lead to positive outcomes, and is robust to conceptual imprecision, philosophical errors, and unfavorable circumstances.

# Category Archives: AI Safety

# Adversarial collaboration

Suppose that I have hired a group of employees who are much smarter than I am. For some tasks it’s easy to get useful work out of them. For example, suppose I am interested in finding a good layout for the components on a chip and can easily evaluate the quality of a proposed layout. Then I can solicit proposals, test the proposals, and reward the employees according to the results.
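This evaluation-based scheme can be sketched in a few lines, assuming the principal has a cheap, trusted scoring function. All names, the toy "layouts," and the proportional-payout rule below are my own illustrative choices, not from the original:

```python
# Minimal sketch of the incentive scheme above: when the principal can
# cheaply evaluate outputs, it can pay in proportion to measured quality.
# All names and the scoring rule are illustrative.

def run_contest(proposals, evaluate, prize=100.0):
    """Score each proposal and split the prize in proportion to score."""
    scores = {name: evaluate(work) for name, work in proposals.items()}
    total = sum(scores.values())
    return {name: prize * s / total for name, s in scores.items()}

# Toy "chip layouts": orderings of components, scored by total wire length.
layouts = {"alice": [0, 1, 2], "bob": [2, 0, 1]}
wire_length = lambda layout: sum(abs(a - b) for a, b in zip(layout, layout[1:]))
quality = lambda layout: 1.0 / (1.0 + wire_length(layout))

rewards = run_contest(layouts, quality)
print(rewards)  # alice's shorter wiring earns the larger share
```

The point of the sketch is only that the hard work sits entirely inside `evaluate`; the post's second paragraph is about tasks where no such trusted evaluator exists.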

However, for some tasks there may be fundamental problems with such an incentive-based policy. For example, suppose I am interested in hiring these employees to design an AI which is smarter than any of us and will act autonomously in the world. In cases like this, I will have to try something different.

# The motivated simulator argument

Sometimes we may want to formally talk about objects that actually exist in the world, using mathematical language. One way to try to do this is by recording some sequence of observations about the world, and then applying Solomonoff induction. The hope would then be: if you apply Solomonoff induction to the sequence of things you’ve seen so far, it will correctly predict whatever you next see. In this post I’ll describe a problem with applying this approach to anything particularly important.
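Solomonoff induction itself is uncomputable, but a toy, computable stand-in conveys the shape of the idea: weight hypotheses by a simplicity prior, keep those consistent with the observations so far, and predict with the surviving posterior. The tiny hypothesis class (periodic bit patterns) and all names below are my own illustrative choices:

```python
from fractions import Fraction

# Toy, computable stand-in for Solomonoff induction (illustrative only):
# instead of all programs, use a small hypothesis class of periodic
# bit-sequence generators, weighted by a 2^-complexity prior.

def hypotheses(max_period=4):
    """Yield (complexity, generator) pairs for all patterns up to max_period."""
    for period in range(1, max_period + 1):
        for pattern in range(2 ** period):
            bits = [(pattern >> i) & 1 for i in range(period)]
            yield period, (lambda bs: lambda n: bs[n % len(bs)])(bits)

def predict_next(observed):
    """Posterior probability that the next bit is 1, given observed bits."""
    weight_one = Fraction(0)
    weight_total = Fraction(0)
    for complexity, gen in hypotheses():
        prior = Fraction(1, 2 ** complexity)  # simpler hypotheses weigh more
        # Likelihood is 1 if the hypothesis reproduces every observation, else 0.
        if all(gen(i) == b for i, b in enumerate(observed)):
            weight_total += prior
            if gen(len(observed)) == 1:
                weight_one += prior
    return weight_one / weight_total

print(predict_next([0, 1, 0, 1, 0]))  # -> 1 (every surviving hypothesis predicts 1)
```

The post's worry applies one level up: even granting the idealized version of this procedure, the things doing the "predicting" may not be what you hoped.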

# Oracles

Suppose that we developed software oracles which could apply large amounts of computational power to solving any formally specified problem (say, you could pay $1k for a human-equivalent). For example, such oracles could find good moves in games which could be characterized completely, or prove theorems in areas which could be rigorously axiomatized, or design machines to achieve a formally specified goal in physical systems which can be modeled algorithmically. What would happen when these oracles became widely available?

# A formalization of indirect normativity

This post outlines a formalization of what Nick Bostrom calls “indirect normativity.” I don’t think it’s an adequate solution to the AI control problem; but to my knowledge it was the first precise specification of a goal that meets the “not terrible” bar, i.e. which does not obviously lead to terrible consequences if pursued without any caveats or restrictions.

# Decision-Theoretic Hazards

Getting decision theory right seems to be an important step towards comprehensible AI, of the sort that might be described as maximizing something which its creator understands or (hopefully) chooses. This seems important in itself, but understanding decision theory is also valuable for avoiding a variety of decision-theoretic hazards which might lead to unanticipated behavior.

# Formal Instructions Revisited

Previously I’ve talked about getting a handle on objects of interest (humans, counterfactual civilizations) and on issuing an instruction of the form “Simulate *this* civilization.” Here is a much better proposal for issuing formal instructions.

Suppose we can build a something-like-TDT agent, which controls its behavior to maximize a constant U defined by some axioms (or as the output of some program). We want to run this agent with a utility function reflecting our own preferences, but we don’t have (formal) access to those preferences.

# “Proof” of Friendliness

The humans are about to launch their best effort at a friendly singularity. Of course, they are careful and wise and have exceeded all reasonable expectations for caution and rigor.

Before building FAI you built an oracle AI to help you. With its help, you found a mathematical definition of **U**, the utility of humanity’s extrapolated volition (or whatever). You were all pretty pleased with yourselves, but you didn’t stop there: you found a theory of everything, located humanity within it, and wrote down the predicate **F**(X) = “The humans run the program described by X.”

To top it off, with the help of your oracle AI you found the code for a “best possible AI”, call it **FAI**, and a proof of the theorem:

“There exists a constant **Best** such that **U** ≤ **Best**, but **F**(**FAI**) implies **U** = **Best**.”

Each of these steps you took with incredible care. You have proved beyond reasonable doubt that **U** and **F** represent what you want them to.

You present your argument to the people of the world. Some people object to your reasoning, but it is airtight: if they choose to stop you from running **FAI**, they will still receive **U** ≤ **Best**, so why bother?

Now satisfied and with the scheduled moment arrived, you finally run **FAI**. Promptly the oracle AI destroys civilization and spends the rest of its days trying to become *as confident as possible* that Tic-Tac-Toe is really a draw (like you asked it to, once upon a time).

Just a lighthearted illustration that decision theory isn’t only hard for AI.

(Disclaimer: this narrative claims to represent reality only insofar as it is mathematically plausible.)

Edit: I think the moral was unclear. The point is: in fact **F**(**FAI**) holds, so in fact **U** = **Best**, and hence **U** ≤ **Best**. Everything was exactly as claimed and proved. But this doesn’t change the fact that you would rather not have used this procedure.

# Counterfactual Blackmail (of oneself)

This post is not going to make any new claims, but this is a helpful intuition-pump for me.

### Loebian Problems in Decision Theory

Consider an agent trying to maximize U, deciding whether to take an action X. The agent knows that U = 0 if it doesn’t take X, but is uncertain about its utility if it takes X.

A central difficulty in decision theory is: if as a matter of mathematical fact the agent doesn’t take X, how is it supposed to reason about “what U would be if I take X?”
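One way to see the difficulty concretely: under a naive material-conditional reading, if the agent in fact does not take X, then every claim of the form “if I take X, then U = u” comes out vacuously true. A minimal sketch (illustrative only, not a proposed decision theory; all names are mine):

```python
# Toy illustration (material conditionals, not a decision theory): if the
# agent provably does not take X, every statement "taking X yields U = u"
# is vacuously true, so such statements cannot guide the choice of action.

def agent_takes_x() -> bool:
    return False  # as a matter of mathematical fact, the agent skips X

def utility() -> float:
    return 0.0  # U = 0 when X is not taken

def claim(u: float) -> bool:
    """Material reading of: 'if the agent takes X, then U = u'."""
    return (not agent_takes_x()) or (utility() == u)

# Mutually inconsistent counterfactual claims all evaluate to True:
print(claim(0.0), claim(100.0), claim(-100.0))  # -> True True True
```

A satisfactory decision theory has to evaluate these counterfactuals in some way that distinguishes between them, which is exactly where the Löbian trouble enters.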

# Confronting Incompleteness

Consider the “Truth game”, played by an agent A:

- A outputs a sequence of mathematical assertions S1, S2, …
- For each statement S it outputs, A receives exp(-|S|) utilons.
- If A makes any false statement (regardless of how many), it receives -infinity utilons (or loses a constant larger than the largest possible total reward).
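The payoff rule above is easy to write down directly. Here is a minimal sketch, with the truth of a statement abstracted into a caller-supplied predicate; the example statements and the `eval`-based checker are my own illustrative choices:

```python
import math

# Sketch of the Truth game's payoff rule: exp(-|S|) utilons per statement,
# -infinity if any statement is false. Which statements count as true is
# abstracted into a caller-supplied predicate.

def truth_game_payoff(statements, is_true):
    """Total utilons for a sequence of assertions."""
    if not all(is_true(s) for s in statements):
        return float("-inf")
    return sum(math.exp(-len(s)) for s in statements)

# Example: simple arithmetic identities, checked mechanically.
check = lambda s: eval(s.replace("=", "=="))
true_only = ["1+1=2", "2*2=4"]
with_falsehood = true_only + ["2+2=5"]

print(truth_game_payoff(true_only, check))       # small positive reward
print(truth_game_payoff(with_falsehood, check))  # -inf
```

The exp(-|S|) reward keeps the total bounded even over infinitely many statements, while the unbounded penalty makes a single falsehood catastrophic; the tension between those two features is what the game is probing.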