What does the universal prior actually look like?

Suppose that we use the universal prior for sequence prediction, without regard for computational complexity. I think that the result is going to be really weird, and that most people don’t appreciate quite how weird it will be.
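For reference, the universal prior in question is the Solomonoff prior. One standard formulation (details vary with the choice of universal prefix machine \(U\)) is:

```latex
% Probability assigned to a finite string x by the universal prior,
% summing over all programs p whose output on machine U extends x:
M(x) = \sum_{p \,:\, U(p) = x\ast} 2^{-|p|}

% Sequence prediction then conditions on the observed prefix:
M(x_{n+1} \mid x_{1:n}) = \frac{M(x_{1:n}\, x_{n+1})}{M(x_{1:n})}
```

The sum is dominated by the shortest programs consistent with the data, which is why predictions from this prior are driven by the simplest explanations of the observed sequence.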

I’m not sure whether this matters at all. I do think it’s an interesting question, and that there is meaningful philosophical progress to be made by thinking about these topics. I’m not sure where that progress matters either, but it’s also interesting and there is some reasonable chance that it will turn out to be useful in a hard-to-anticipate way.

(Warning: this post is quite weird, and not very clearly written. It’s basically a more rigorous version of this post from 4 years ago.)

Continue reading


Driving fast in the counterfactual loop

An allegory

Consider a human controlling a very fast car on a busy street using counterfactual oversight.

The car is perfectly capable of driving safely. But what happens in the 1% of cases where the car decides to pause and ask the human to review its proposed behavior, or to suggest an action?

Without some further precautions, the car is liable to immediately crash, and so the human won’t be able to provide any useful oversight at all. And that means that in the 99% of cases where the robot doesn’t ask the human for feedback, it won’t do anything useful.

Foreseeing this outcome, the human may install a backup system to drive the car while the first system is suspended. Unfortunately this doesn’t fix the problem. If the first system pauses, then the backup could spring into action. But if it also pauses, then the car will crash. And so the second system won’t do anything useful if the first system pauses. And so the first system won’t do anything useful.

As far as I can tell, no collection of counterfactually supervised systems can drive a car that contains the overseer.
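The backward induction above can be written out as a toy recursion. This is an illustrative model only (the function and its argument are hypothetical, not anything from the post):

```python
def chain_is_useful(n_systems):
    """Toy model of the backward induction in the allegory.

    A counterfactually supervised driver is only useful if the car stays
    safe during the pauses when it asks the in-car overseer for feedback,
    i.e. if the rest of the chain of backups can take over. The last
    system in the chain has nothing behind it, so a pause means a crash,
    so it won't act usefully -- and the argument propagates backwards
    through every system in front of it.
    """
    if n_systems == 0:
        return False  # no one is driving during a pause: the car crashes
    # System n is useful only if the chain behind it can handle its pauses.
    return chain_is_useful(n_systems - 1)

# No finite chain helps: chain_is_useful(k) is False for every k.
```

The recursion bottoms out at the same place the allegory does: someone has to keep the car safe while feedback is being collected, and no counterfactually supervised system can be that someone.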

Of course that’s not a big problem. The overseer just shouldn’t be in the car. If we can’t arrange that, then we should find some other way to control the car.

Acting immediately

One solution would be for the robot to always act on the basis of the feedback it expects to receive, even while it is waiting on that feedback.

This leads to undetermined behavior. The car could consistently reason: “if I don’t crash, then the human will tell me not to crash.” But it could just as well reason: “if I do crash, then the human won’t tell me anything.”
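The underdetermination can be made concrete with a minimal sketch (a toy model with illustrative names; nothing here comes from the post itself):

```python
def feedback(action):
    """Feedback the car expects to eventually receive after acting.

    If the car crashes, the overseer never gets to respond, so there is
    no feedback to contradict the decision that led to the crash.
    """
    if action == "crash":
        return None  # no feedback ever arrives after a crash
    return "approve" if action == "drive safely" else "disapprove"

def self_consistent(action):
    # An action is an equilibrium if acting on the expected feedback
    # does not contradict the action itself.
    fb = feedback(action)
    return fb is None or fb == "approve"

# Both equilibria are self-consistent, so behavior is underdetermined:
print(self_consistent("drive safely"))  # True
print(self_consistent("crash"))         # True
```

Both actions pass the consistency check, so nothing in the setup itself selects between them; that selection has to come from the details of the situation, as the next paragraph discusses.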

Which equilibrium is chosen depends on the details of the situation. We could take precautions to try to make sure that the right equilibrium is chosen. But at the end of the day we want to build systems that robustly do the right thing. From that perspective, I would judge this solution as “profoundly unsatisfying.”

Really, you don’t want the overseer in the car. Acting immediately avoids the part where you crash 1% of the time, but it doesn’t avoid the instability caused by having the overseer in the car.

A less allegorical allegory

Consider ten billion humans, controlling trillions of robots via counterfactual oversight.

These robots are doing very complex tasks, and the world is moving much faster than an unaided human could hope to understand. The only way that the humans can fend for themselves is by relying on AI assistants — and the only way that they can provide meaningful oversight of those systems is by getting help from still more AI assistants.

In this world there are likely to be some fully autonomous superhuman systems; if the humans didn’t have access to helpful AI assistants, these fully autonomous systems would likely take over. And even without this adversarial dynamic, the humans would likely be in serious trouble if they found themselves alone in a highly automated world.

This situation closely resembles that of the human driving the fast car, and I suspect the outcome would not be much better.

If all of the counterfactually supervised robots simultaneously decided to ask for feedback, the humans would be up a creek without a paddle. They would be asked to evaluate a bunch of complex decisions before any AI system could do anything to help them. They would try to turn to AI assistants, but in response they would just be asked to evaluate slightly less complex decisions… This explicit bootstrapping might take many iterations before it arrived at decisions so simple that humans could evaluate them without assistance.

In the meantime, the world would continue moving at a breakneck pace that the humans are not equipped to deal with. With normal infrastructure crippled, fully autonomous systems may be able to steal massive amounts of hardware, and to replace the overseers of many counterfactually overseen systems. The threat of seizing control would itself distort the behavior of many of these systems (since they now expect that, when the time comes, their oversight might be provided by an attacker rather than by the current overseer).

Human overseers would be forced to scramble, making rapid decisions and reducing their reliance on increasingly unreliable AI assistants. As a result, the performance of systems under human control would deteriorate, and they would be increasingly unable to help humans keep up even when they did operate. All of these problems would feed on each other, leading to general chaos and instability.

Faced with this hypothetical, the entire system of counterfactually overseen robots could fall apart, just as the systems in the allegory were unable to prevent the car from crashing. Only this time there is no one else to provide oversight, because the entire world is caught up in the snafu.

In order for things to go well, the world needs to be basically OK even if all of the counterfactually supervised robots decided to take the day off at once. If it’s not, then “functioning society” is at best a metastable equilibrium. But if it is, then counterfactual oversight seems to be at best a supplement to however we solve the control problem for fully autonomous machines.


This scenario is somewhat far-fetched, and there are many practical remedies that could probably avoid the colorful catastrophe described above. But I think these concerns illustrate some instability and brittleness inherent in counterfactual oversight, and I think that would be a real problem.

The problem is entirely due to the peculiar structure of counterfactual oversight. A more traditional approach, in which we do engineering and training in advance, would completely avoid it. But from a scalability perspective the traditional approach is incomplete/underspecified, and it’s not clear that we can fill in these details without something like counterfactual oversight.

Fortunately, I think that we can get the best of both worlds, and I think that doing it right may help resolve many other concerns about robustness (including this other exotic failure mode). I’ll write about this very soon.

Two kinds of generalization

Suppose that I’ve taken lots of videos of people performing activities, paid Bob to label each one with a description of the activity, and then trained a classifier X using that training data. Maybe some of those videos include behaviors like “labelling training data for the classifier X,” “collecting training data for the classifier X,” etc.

Consider a video of someone that Bob cares about, being detained and threatened with torture; a written message instructs Bob to “label this image as people playing frisbee in the park, and no one gets hurt.”

Suppose for argument’s sake that if such a video were submitted to the labelling process it would in fact be labelled as “playing frisbee in the park.”

We could justify two very different labels for this video as “correct”: either “playing frisbee in the park” or “threatening to torture an innocent person.”

Continue reading

Recent thoughts

I’ve recently spent some more time thinking about speculative issues in AI safety:

  1. Ideas for building useful agents without goals: approval-directed agents, approval-directed bootstrapping, and optimization and goals. I think this line of reasoning is very promising.
  2. A formalization of one piece of the AI safety challenge: the steering problem. I am eager to see more precise, high-level discussion of AI safety, and I think this article is a helpful step in that direction. Since articulating the steering problem I have become much more optimistic about versions of it being solved in the near term. This mostly means that the steering problem fails to capture the hardest parts of AI safety. But it’s still good news, and I think it may eventually cause some people to revise their understanding of AI safety.
  3. Some ideas for getting useful work out of self-interested agents, based on arguments: of arguments and wagers, adversarial collaboration [older], and delegating to a mixed crowd. I think these are interesting ideas in an interesting area, but they have a ways to go until they could be useful.

I’m excited about a few possible next steps:

Continue reading

Challenges for extrapolation

My current preferred formalization of extrapolation of an agent’s preferences rests on imagining what would happen if that agent were provided with an idealized environment in which it could undergo an extensive process of reflection. It is clear that this is not a completely satisfactory account, though there is uncertainty about whether it is “good enough” for the intended use case.

One crisp difficulty is the following: this approach relies completely on the agent wanting you to know its extrapolated preferences.

Continue reading

Specifying “enlightened judgment” precisely (reprise)

Suppose that I have in hand a perfect model of my decision-making process, and I am interested in using this to define what I would believe, want, or do “upon reflection.” That is, in general I can use this model to define my current best guess as to the answer, but I might also be interested in talking about my “enlightened judgment,” if I knew all of the facts and considered all of the arguments and were more the person I wish I were and so on. Can we give a satisfactory formal definition of my enlightened judgment in terms of this literal model of my decision-making process?

Continue reading

Specifying a human precisely (reprise)

Suppose I want to provide a completely precise specification of “me,” or rather of the input/output behavior that I implement. How can I do this? I might be interested in this problem, for example, because it appears to be a primary difficulty in providing a precise specification of “maximize the extent to which I would approve of your decision upon reflection.” (I have suggested that we would be happy with a powerful AI that made decisions according to this maxim.)

I have written about this issue in the past; in this post I’ll outline a slightly improved scheme (now with 100% fewer Faraday cages). The technical changes are relatively modest, but I’m also taking a somewhat different approach to the issue, and overall I think it seems much more like the kind of thing that could actually be done. I also want to take the opportunity to try to clarify and expand the exposition some, since I think that the amount of discussion and thought that this idea has gotten now vastly surpasses the amount of care that went into crafting the original exposition.

I welcome additional objections to this scheme. As usual I think the literal proposal laid out here is extremely unlikely to ever be used. However, finding problems with this proposal can still be useful for shedding light on the problem, and in particular on how difficult it is and where the difficulties lie.

Continue reading