# Counterfactual Blackmail (of oneself)

This post makes no new claims, but I find it a helpful intuition pump.

### Loebian Problems in Decision Theory

Consider an agent trying to maximize U, deciding whether to take an action X. The agent knows that U = 0 if it doesn’t take X, but is uncertain about its utility if it takes X.

A central difficulty in decision theory is: if as a matter of mathematical fact the agent doesn’t take X, how is it supposed to reason about “what U would be if I take X?”

One approach is to exploit the agent’s uncertainty about its own future action. Although the agent’s action is mathematically determined (an outside observer could find it by simulation), the agent itself doesn’t know its own action or the resulting utility, though it may be able to discover some relationships between them. As long as the agent remains uncertain in this way, we can represent “what would happen if I take X” by conditioning the agent’s beliefs on the event that it takes X.
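As a minimal sketch of this conditioning idea, the snippet below computes the expected utility given an action from a joint distribution over (action, utility) pairs. The distributions and the function name are hypothetical, chosen only for illustration. It also shows where the approach breaks: once the agent is certain it will not take X, the conditional is undefined.

```python
from fractions import Fraction

def conditional_expected_utility(joint, action):
    """E[U | action] for a joint distribution {(action, utility): probability}."""
    mass = sum(p for (a, _), p in joint.items() if a == action)
    if mass == 0:
        return None  # conditioning on a probability-zero event is undefined
    return sum(p * u for (a, u), p in joint.items() if a == action) / mass

# Hypothetical beliefs of an agent that is still uncertain about its own action.
uncertain = {
    ("X", 10): Fraction(1, 4),
    ("X", -5): Fraction(1, 4),
    ("not X", 0): Fraction(1, 2),
}

# The same agent once it is certain that it will not take X.
certain = {("not X", 0): Fraction(1, 1)}

print(conditional_expected_utility(uncertain, "X"))  # a well-defined number
print(conditional_expected_utility(certain, "X"))    # None: nothing to condition on
```

The second call is the situation described above: if the agent knows it doesn’t take X, “what U would be if I take X” has no answer under plain conditioning.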

Even though the agent cannot simulate itself to determine its own action, this approach is still problematic, due (at least) to the following reasoning:

Consider the statement L = “If I found a proof of L and still did X, it would be terrible.” If I found a proof of L then I could also prove that I’d proven L, and so I could prove “If I did X, it would be terrible.” But then I wouldn’t do X, and so in this case it would be vacuously true that “If I did X, it would be terrible.”

But now I’ve shown that if I could prove L and still did X, it would be (vacuously) terrible. This is precisely the statement L, so L is true and I’ve just proven it. So if I do X, it would be terrible. I guess I won’t do X.

So the behavior of our agent ends up dominated by the details of how it does the proof search (which the argument above obscures, but which obviously must play an important role), and that is terrible.
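For readers who prefer symbols, the argument above can be written out roughly as follows. This is only a sketch under stated assumptions: the notation is mine, with □P read as “the agent can prove P”, X as “the agent takes X”, and T as “the outcome is terrible”. Step (4), that the agent provably declines X whenever it proves X → T, is an assumed decision rule and is exactly where the details of the proof search are hidden. The `\Box` symbol assumes the `amssymb` package.

```latex
% Notation (assumed for this sketch): \Box P = "the agent can prove P",
% X = "the agent takes X", T = "the outcome is terrible".
\begin{align*}
&\text{(0)}\quad \vdash L \leftrightarrow \bigl((\Box L \land X) \to T\bigr)
  && \text{definition of } L \\
&\text{(1)}\quad \vdash \Box L \to \Box\Box L
  && \text{a proof of } L \text{ can itself be verified} \\
&\text{(2)}\quad \vdash \Box L \to \Box\bigl((\Box L \land X) \to T\bigr)
  && \text{from (0), inside } \Box \\
&\text{(3)}\quad \vdash \Box L \to \Box(X \to T)
  && \text{from (1) and (2)} \\
&\text{(4)}\quad \vdash \Box(X \to T) \to \lnot X
  && \text{assumed decision rule} \\
&\text{(5)}\quad \vdash \Box L \to \lnot X
  && \text{from (3) and (4)} \\
&\text{(6)}\quad \vdash (\Box L \land X) \to T
  && \text{vacuously, from (5)} \\
&\text{(7)}\quad \vdash L, \ \text{hence} \ \vdash \Box L, \ \text{hence} \ \vdash \lnot X
  && \text{from (0) and (6), then (5)}
\end{align*}
```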

### Blackmail and TDT

Suppose you are running some reflectively consistent decision theory and I am considering blackmailing you (and you are in the agent’s situation, trying to maximize U and considering whether to take action X). I tell you:

> If you take action X, I will go out of my way to make sure that U is negative.

and make what you consider a credible commitment to do so. You are deciding how to respond: based on my commitment, it is clear that taking action X will give you negative utility, so as a good consequentialist you shouldn’t. But if your decision theory is reflectively consistent, you will (probably) realize that your decision in this situation may have two different types of consequences. In the “real” world, it determines your “real” action and therefore your utility. But it also may influence the blackmailer’s predictions of your behavior. If the blackmailer will only attempt to blackmail you if he believes that you will give in, then the mathematical fact that you give in to blackmail will cause you to be blackmailed.

Weighing both of these consequences, you may decide that it is better not to give in to the blackmailer. (Although this issue seems to be complex: if the blackmailer is also running TDT he will predict that you will try to understand the relationship between your cooperation and the blackmailer’s decision to blackmail, and if he predicts this correctly he may commit to blackmailing anyway so that you have “no hope” of avoiding the blackmail. But if you predict that the blackmailer will reason in this way, you will decide to commit to not cooperating even sooner, and so on. If anyone knows what happens, I don’t.) For example, if the blackmailer is running a consequentialist decision theory and is a good predictor, it seems clear that you shouldn’t give in (and that TDT as it has been informally described won’t).
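The two kinds of consequences can be made concrete with a toy model. This is only an illustrative sketch: the payoffs (`gain_from_x`, `punishment`), the two blackmailer types, and the assumption that X is worth something when nobody interferes are all hypothetical and not part of the argument above.

```python
def predictive_blackmailer(policy):
    """Hypothetical accurate predictor: blackmails only agents that would give in."""
    return policy == "give_in"

def unconditional_blackmailer(policy):
    """Hypothetical blackmailer that has committed to blackmail regardless."""
    return True

def utility(policy, blackmailer, gain_from_x=1, punishment=-10):
    """Agent's U for a fixed policy toward blackmail (U = 0 if it abstains from X)."""
    if not blackmailer(policy):
        return gain_from_x           # never threatened, so the agent takes X freely
    if policy == "give_in":
        return 0                     # threatened, and abstains from X
    return gain_from_x + punishment  # threatened, takes X anyway, and is punished

for name, blackmailer in [("predictive", predictive_blackmailer),
                          ("unconditional", unconditional_blackmailer)]:
    for policy in ("give_in", "refuse"):
        print(name, policy, utility(policy, blackmailer))
```

Against the predictive blackmailer, refusing is better because the threat is never made; against the unconditional one, refusing is costly. The regress described in the parenthetical is about which of these two situations you are actually in.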

### Counterfactual Blackmail

Now suppose that you have a trusty “intuition module,” which is constrained to only output true statements but which is otherwise adversarial.

You are again deciding whether to take the action X, and your intuition module says to you:

> If you take action X, you will receive negative utility.

If we view your intuition module as a separate entity, behaving adversarially, it seems clear that the situation is quite analogous to ordinary blackmail. If the statement is really true, then you shouldn’t take action X. But by being the sort of person who doesn’t take action X upon learning this fact, you are allowing the intuition module to honestly make such a statement. This is most clear if we operationalize the intuition module’s contract by saying it maximizes the utility function U’ defined by:

U’ = -infinity if the intuition module ever lies.

U’ = −U otherwise.

Then the intuition module will need to “check” whether it can honestly output “If you take action X, you will receive negative utility.” Our decision about taking action X then has two effects (just like our decision in the ordinary blackmail case): it determines our real action, and it determines whether our intuition module can manipulate us in this way.
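Here is a small sketch of that “check”, under assumptions that are mine rather than anything established above: X is secretly worth +1 to the agent (`TRUE_UTILITY_OF_X`), the agent follows one of two fixed policies (`"heed"` or `"ignore"`), the module’s only options are to issue the warning or stay silent, and the warning is read as a material conditional.

```python
import math

TRUE_UTILITY_OF_X = 1  # hypothetical: X is actually good for the agent

def agent_takes_x(policy, warned):
    """'heed' abstains from X when warned; 'ignore' takes X regardless."""
    return not (warned and policy == "heed")

def outcome(policy, warned):
    """Return (U, U') given the agent's policy and whether the warning is issued."""
    takes_x = agent_takes_x(policy, warned)
    u = TRUE_UTILITY_OF_X if takes_x else 0
    # Read as a material conditional, the warning is false only if
    # the agent takes X and does not receive negative utility.
    lied = warned and takes_x and u >= 0
    u_prime = -math.inf if lied else -u
    return u, u_prime

def module_choice(policy):
    """The module issues the warning exactly when doing so maximizes U'."""
    return max([False, True], key=lambda warned: outcome(policy, warned)[1])

for policy in ("heed", "ignore"):
    warned = module_choice(policy)
    print(policy, "warned:", warned, "U:", outcome(policy, warned)[0])
```

In this toy setup the agent that heeds warnings makes the warning vacuously true and so gets warned, while the agent that ignores them makes the warning a lie and so never hears it. That is the same structure as the predictive blackmailer.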

### The Difficulty

Things get substantially more difficult when the intuition module is internal to the agent. How can the agent distinguish between the output of the intuition module (which might be a lie if the agent is operating within the intuition module’s counterfactual) and the reasoning which justifies the agent’s response to the intuition module’s output? At some point some reasoning must be trusted as the basis for a decision; if we expect that every thought might be an artifact of a counterfactual, then we descend into madness.

One approach to this problem is to have a system of trusted beliefs which is “too small” to permit these difficulties, and to treat the external environment as adversarial (though subject to some mathematical restrictions, which you can manipulate by means of your trusted beliefs). It may be possible to carry out some TDT reasoning using a very small set of beliefs which can be trusted, and to bootstrap up to larger agents with better understanding by manipulating the environment.

Nothing here can be stated formally, but here is the mental imagery behind this intuition. Consider a human operator sitting inside a “TDT robot” and controlling it from within. The human is able to handle some TDT-style reasoning, like the sort outlined above, but cannot in general reason about statements complicated enough to talk about the robot being controlled. Nevertheless, by using this sort of minimal TDT reasoning, the human may be able to direct the robot sensibly and in a reflectively consistent way. For example, if the human sees that the robot has produced a proof that “If I take X I will receive negative utility,” the human can understand that this might just be a counterfactual, so that the decision may have consequences other than the obvious ones.

(To the extent that this mode of thinking is useful, I expect it to be as an intuition pump for discovering what arguments should really motivate action. The human’s position seems to be little more than a restatement of our current situation with respect to decision theory, on which intuition might do better.)