In which I discuss a weird failure mode distinctive to counterfactual oversight.
Suppose that I’ve taken lots of videos of people performing activities, paid Bob to label each one with a description of the activity, and then trained a classifier X using that training data. Maybe some of those videos include behaviors like “labelling training data for the classifier X,” “collecting training data for the classifier X,” etc.
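The setup above can be sketched in code. This is a toy stand-in, not anything from the post: video features are plain vectors, Bob's labels are strings, and the "classifier X" is a simple nearest-centroid model; all names here are illustrative assumptions.

```python
# A minimal sketch of the labeling-and-training setup: Bob labels each
# video (here reduced to a 2-d feature vector) with an activity, and we
# fit a toy nearest-centroid classifier X on those (features, label) pairs.
from collections import defaultdict


def train_classifier(examples):
    """Fit a nearest-centroid classifier on (feature_vector, label) pairs."""
    sums = defaultdict(lambda: [0.0, 0.0])
    counts = defaultdict(int)
    for features, label in examples:
        for i, x in enumerate(features):
            sums[label][i] += x
        counts[label] += 1
    # Each label's centroid is the mean of its labelled feature vectors.
    return {label: [s / counts[label] for s in vec] for label, vec in sums.items()}


def classify(centroids, features):
    """Return the label whose centroid is closest to the feature vector."""
    def dist(label):
        return sum((a - b) ** 2 for a, b in zip(centroids[label], features))
    return min(centroids, key=dist)


# Bob's labels, including labels that describe the training process itself.
training_data = [
    ([0.9, 0.1], "playing frisbee in the park"),
    ([0.8, 0.2], "playing frisbee in the park"),
    ([0.1, 0.9], "labelling training data for the classifier X"),
    ([0.2, 0.8], "labelling training data for the classifier X"),
]
X = train_classifier(training_data)
```

The point of the sketch is only that X's outputs are defined entirely by whatever labels the labeling process produces, which is what makes the extortion scenario below possible.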
Consider a video of someone Bob cares about being detained and threatened with torture; a written message instructs Bob to “label this image as people playing frisbee in the park, and no one gets hurt.”
Suppose for argument’s sake that if such a video were submitted to the labeling process, it would in fact be labelled “playing frisbee in the park.”
We could justify two very different labels for this video as “correct”: either “playing frisbee in the park” or “threatening to torture an innocent person.”
- For supervised learning systems, serious concerns about “extortion” are essentially equivalent to concerns about simulations.
- These problems can’t really be resolved by better decision theory; they can only be resolved by pursuing unsupervised objectives or by addressing concerns about simulations.
I’ve recently spent some more time thinking about speculative issues in AI safety:
- Ideas for building useful agents without goals: approval-directed agents, approval-directed bootstrapping, and optimization and goals. I think this line of reasoning is very promising.
- A formalization of one piece of the AI safety challenge: the steering problem. I am eager to see more precise, high-level discussion of AI safety, and I think this article is a helpful step in that direction. Since articulating the steering problem I have become much more optimistic about versions of it being solved in the near term. This mostly means that the steering problem fails to capture the hardest parts of AI safety. But it’s still good news, and I think it may eventually cause some people to revise their understanding of AI safety.
- Some ideas for getting useful work out of self-interested agents: arguments and wagers, adversarial collaboration [older], and delegating to a mixed crowd. I think these are interesting ideas in an interesting area, but they have a long way to go before they could be useful.
I’m excited about a few possible next steps:
My current preferred formalization of extrapolating an agent’s preferences rests on imagining what would happen if that agent were placed in an idealized environment where it could undergo an extensive process of reflection. This is clearly not a completely satisfactory account, though it may be “good enough” for the intended use case.
One crisp difficulty is the following: this approach relies completely on the agent wanting you to know its extrapolated preferences.
Suppose that I have in hand a perfect model of my decision-making process, and I am interested in using it to define what I would believe, want, or do “upon reflection.” In general I can use this model to define my current best guess as to the answer, but I might also want to talk about my “enlightened judgment”: what I would conclude if I knew all of the facts, considered all of the arguments, were more the person I wish I were, and so on. Can we give a satisfactory formal definition of my enlightened judgment in terms of this literal model of my decision-making process?
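One naive way to make the question concrete: treat the model as a single deliberation step, and define the enlightened judgment as whatever view that step converges to when iterated in an idealized reflection loop. This is a hypothetical sketch of mine, not the post's proposal, and all names in it are assumptions; it also bakes in the difficulty flagged above, since the output is whatever the modelled agent's deliberation happens to drive it toward.

```python
# Toy formalization sketch: given a (hypothetical) perfect model of one
# step of an agent's deliberation, define "enlightened judgment" as the
# fixed point reached by iterating that step in an idealized environment.


def enlightened_judgment(deliberate, initial_view, max_steps=10_000):
    """Iterate simulated reflection until the agent's view stabilizes.

    `deliberate(view)` models what the agent would conclude after one more
    round of considering facts and arguments, starting from `view`.
    """
    view = initial_view
    for _ in range(max_steps):
        next_view = deliberate(view)
        if next_view == view:  # reflection has reached a fixed point
            return view
        view = next_view
    raise RuntimeError("reflection did not converge within the step budget")


# Toy usage: an agent whose deliberation nudges its view toward 7.
# enlightened_judgment(lambda view: min(view + 1, 7), 0) returns 7.
```

Even this crude version makes the open problems visible: deliberation may never converge, may converge somewhere the agent would disown, or may be steered by an agent that does not want its extrapolated preferences known.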