Extortion, simulation, and supervision

  1. For supervised learning systems, serious concerns about “extortion” are essentially equivalent to concerns about simulations.
  2. These problems can’t really be resolved by better decision theory; they can only be resolved by pursuing unsupervised objectives or by addressing concerns about simulations.

Supervised learners

As examples, consider either:

  • a reinforcement learning system, trained to maximize the discounted sum of future rewards, or
  • a simple online learning system, trained to produce outputs that receive a maximal score. (This includes these concrete approval-directed agents.)

These systems take actions based on the predicted effects on their observations. If we were so inclined, we could describe these as causal decision-theorists who “care” about the future experiences of observers like them.
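As a rough illustration of the first example, here is a minimal sketch of an agent that picks whichever action has the highest predicted discounted reward. The function names, the discount factor, and the toy predictions are illustrative assumptions, not a description of any particular system.

```python
def discounted_return(rewards, gamma=0.9):
    """Discounted sum r_0 + gamma*r_1 + gamma^2*r_2 + ..."""
    total = 0.0
    # Work backwards so each step folds in the discounted tail.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


def choose_action(predicted_rewards, gamma=0.9):
    """Pick the action whose predicted reward sequence has the highest discounted sum."""
    return max(
        predicted_rewards,
        key=lambda a: discounted_return(predicted_rewards[a], gamma),
    )


# Hypothetical predictions: each action maps to the rewards the agent expects to observe.
predictions = {
    "wait": [0.0, 0.0, 1.0],
    "act_now": [0.5, 0.0, 0.0],
}
print(choose_action(predictions))
```

The only point of the sketch is that the choice is a function of predicted observations (here, rewards) and nothing else. Anything that changes those predictions changes the behavior.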

The situation seems basically the same for any system trained by supervised learning. All such systems are implicitly optimizing for positive feedback, which is a property of their observations.

Extortion concerns

One reason to be concerned about “getting decision theory wrong” is that it may lead us to design systems that are vulnerable to extortion: for example, if someone approaches one of these model systems with a gun and asks for $20, the model systems will probably pay up — even if, had they been the type of system that wouldn’t have paid up, they wouldn’t have been mugged. In this case, we would have preferred to build the kind of system that doesn’t pay up.

For the most part this doesn’t seem like a big deal. It is understood that an attacker who gains physical access to a machine can influence that machine’s behavior.

Extortion is a more serious concern when other agents can make extortive threats without needing to compromise our system, for example by sending a letter saying “I’ve created 1000 simulated copies of you…” This could lead to somewhat surprising behavior.

Extortion ≡ simulation

For the supervised learners described above, what does this kind of extortion look like?

For these systems, a decision-relevant threat must be about the predicted observations of the system. This involves either compromising the physical integrity of the “intended” instantiation of the system, or else creating an additional instantiation of the system. (This is an intuitive rather than formal claim; we could also imagine weirder attacks that don’t go through physical instantiations of the system at all.)

That is to say, intuitively any attack must either compromise the system or else create copies of it. I’m going to refer to copies as simulations.

No decision-theoretic remedy

We might have two problems with such threats:

  1. If our system is exploitable in this way, then it creates an inducement to create simulations.
  2. If such simulations exist, then our system will behave badly.

An appropriate decision theory might address [1], if we could ensure that our systems preferred not to be simulated. (This might be possible by ensuring that we give a low payoff to the offending behavior and a high payoff to the intended behavior, and that our agents place a reasonable probability on the no-simulation hypothesis.)

But even then, no decision theory will address [2]; as long as our system’s behavior is based on predictions about what will happen to it, and as long as it is willing to accept simulation arguments, then the existence of lots of simulations will cause unintended behavior. This may be less troubling if the mere existence of our system does not provide an inducement to create such simulations. But in a long future many such simulations will probably exist, and probably some of the would-be simulators won’t discriminate on the basis of decision theory. This could even happen in a big world, for sufficiently important or interesting systems.
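To make the arithmetic behind [1] and [2] concrete, here is a toy sketch. It assumes (and this is only an assumption for illustration) that the system weights every instantiation of itself equally, so that with n simulated copies and one intended instantiation it assigns probability n / (n + 1) to being simulated. The payoff tables and function names are hypothetical.

```python
def p_simulated(n_copies, n_real=1):
    """Probability of being a simulation, assuming equal weight on every instantiation."""
    return n_copies / (n_copies + n_real)


def expected_payoff(action, p_sim, real_payoff, sim_payoff):
    """Mix the overseer's payoff and the simulator's payoff by the probability of each case."""
    return (1 - p_sim) * real_payoff[action] + p_sim * sim_payoff[action]


def best_action(p_sim, real_payoff, sim_payoff):
    """Action with the highest expected payoff under the given simulation probability."""
    return max(
        real_payoff,
        key=lambda a: expected_payoff(a, p_sim, real_payoff, sim_payoff),
    )


# Hypothetical payoffs. The overseer rewards the intended behavior and gives a
# low payoff to the offending one; the simulator does the opposite.
REAL = {"intended": 1.0, "offending": -1.0}
SIM = {"intended": 0.0, "offending": 1.0}

# A reasonable probability on the no-simulation hypothesis: intended behavior wins.
print(best_action(0.5, REAL, SIM))

# 1000 simulated copies under equal weighting: the simulator's payoffs dominate.
print(best_action(p_simulated(1000), REAL, SIM))
```

With these numbers the low payoff for the offending behavior is enough to protect the system at a simulation probability of 0.5, but not once the copies vastly outnumber the intended instantiation. No choice among the two actions changes that; only the probabilities or the payoffs do.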


So: either we can give up on supervised learning, or we can deal with the simulation problem. For supervised learners: if we deal with the simulation problem we have no serious extortion problem, and if we don’t deal with the simulation problem then we are going to get some surprising behavior regardless of our decision theory.

If we deal with the simulation problem, we can stop talking about extortion or blackmail. This is probably my main takeaway — I’m happy to collapse the two issues into one warning flag, and keep it in the back of my mind under the heading “simulations.” The similarity between these problems also makes me somewhat less concerned about unexpected “gotchas” in the future, though I’ll be looking.

If we give up on supervised learning… it’s hard to say where that leaves us, other than in a tricky situation.
