My current attitude towards the Löbian obstacle is “just live with it.” This post outlines that view and some of the underlying intuitions.

To show I’m being a good sport, I’ll also propose a new reflection principle. But my best guess for the right answer remains “live with it.”

# Gödelian arguments for probabilistic reasoners

## The problem

Write P(x) for the probability I assign to a proposition x. Suppose that I am the student of a wise and careful teacher. Write Q(x) for the probability the teacher assigns to x. We’re going to talk about diagonalization, so we’ll need to be very careful about the interpretation of Q. For concreteness, let’s say that Q(x) is the probability that my teacher would have assigned to x at 2:14 PM this Tuesday.

Now suppose this teacher is not just wise, but *very* wise: I believe that for everything I know, they know it, and for everything that I have thought of, they’ve thought of it. If I learn that this teacher assigns x a probability of 70%, then I ignore my other evidence and assign x a probability of 70%.

In symbols, we might formalize this as follows: P( x | Q(x) = p ) = p, or perhaps P( x | Q(x) = p ) ≈ p.

The Gödelian counterexample is the sentence G := Q(G) < 0.5. In words, G asserts “The wise teacher assigns G a probability less than 0.5 at 2:14 PM on Tuesday.”

Suppose that my teacher assigns G a probability of 0.499 at 2:14:00 PM on Tuesday. Of course, being a perfect introspector, by 2:14:01 PM they will have revised this probability up to 1. But at 2:14:00 PM there was no way that the teacher could have known *exactly* what they would believe (for if they had known exactly, then they would necessarily have been exactly wrong).

But I shouldn’t agree with my teacher in this case. It is crazy to have P( G | Q(G) = 0.499 ) ≈ 0.499. After all, Q(G) = 0.499 implies G, so we should have P( G | Q(G) = 0.499 ) = 1.
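To make the mismatch concrete, here is a minimal sketch in Python. The function names are my own, and the only modeling assumption is the one stated above: once Q(G) is known, the truth value of G is settled. It compares the posterior forced by the meaning of G against the value that naive trust would assign.

```python
def truth_of_G(q: float) -> bool:
    """G says 'Q(G) < 0.5', so G's truth value is fixed once Q(G) = q is known."""
    return q < 0.5


def forced_posterior(q: float) -> float:
    """P(G | Q(G) = q) for a student who understands what G asserts."""
    return 1.0 if truth_of_G(q) else 0.0


def trust_gap(q: float) -> float:
    """Distance between the forced posterior and the naive trust value q."""
    return abs(forced_posterior(q) - q)


# Scan a grid of candidate values for Q(G) (the grid itself is illustrative).
candidates = [i / 1000 for i in range(1001)]
smallest_gap = min(trust_gap(q) for q in candidates)

print(forced_posterior(0.499))
print(smallest_gap)
```

On this grid the gap never drops below 0.5: whatever value the teacher assigns to G, the student's forced posterior is far from it. So for G the trouble is not limited to the exact equation; allowing a small constant error does not help either.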

This leads to a problem. How do we formalize my trust for this teacher, if not with the equation P( x | Q(x) = p ) ≈ p?

## Some responses I don’t like

There are many standard answers, ported to the probabilistic setting from mathematical logic:

- Maybe we revise our epistemic theory, so that the teacher can be agnostic rather than assigning a probability to sentences like G. I find this proposal pretty unsatisfying. The teacher believes that G is a coin flip, they don’t assign it some special epistemic status. They would be happy to take a bet at favorable odds, if their counterparty were equally ignorant. This response is especially unhelpful if we were considering actually *delegating* a task to the teacher. Reality won’t generally give them the option to abstain.
- Maybe we limit our language to eliminate self-referential sentences. But self-referential sentences are just like any other sentences (if I give you a diophantine equation, can you even tell whether it describes your own beliefs?) so this seems doomed. It is even more doomed if we are specifically interested in reflective reasoning (for example an agent reasoning about its own future behavior, in order to estimate the value of information).
- Maybe we weaken our trust, and assert “I trust the teacher up to infinitesimal error” or a variation on this theme. This approach seems unsuitable when Q is just a Turing machine, however. What does it mean to be “infinitesimally” wrong about the output of a Turing machine? Even a sufficiently small constant error ε seems to cause trouble.

It feels to me that we should be trying to reproduce human judgments about these cases (at least as one task; we might have further goals). That is, I actually have some intuitions about how I would update on the judgments of a very wise teacher. So why not try to make a machine do the same thing? I think this is the knee-jerk response of most people who hear about the problem, and one reason it has a hard time getting traction with non-mathematicians.

I don’t think this is a trivial issue at all; I think that most humans’ intuitions about how their brains handle this problem are wrong, and I have a lot of sympathy for people who work on this problem and grow impatient with the knee-jerk response. But I do think that there is a lot to be said for this intuition, of trying to reproduce our actual beliefs about cases like this, at least as a first step before judging them to be unsatisfactory.

# My favored response

## What humans do

What do I really believe about the teacher’s judgment? If I were to be more careful I might say: “if the teacher knows everything that I do about a situation, *then* I should trust their judgment in that situation.”

For example, suppose I flip a coin and look at it. I see it’s heads, but don’t show my teacher. I ask my teacher what they think, and they guess that it’s 50-50. Of course I’m not going to change my belief back to 50-50, no matter how wise I think they are.
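The coin example can be checked by brute-force enumeration. The sketch below is a toy model of my own, not something from the argument itself: it assumes my glimpse of the coin is correct with a hypothetical probability `ACCURACY` (setting it to 1 recovers the case where I simply see the coin), and that the teacher, who sees nothing, always reports 1/2.

```python
from fractions import Fraction
from itertools import product

ACCURACY = Fraction(9, 10)  # hypothetical: chance that my glimpse matches the coin


def joint():
    """Yield (coin, my_glimpse, teacher_report, probability) for the toy model."""
    for coin, glimpse in product(["H", "T"], repeat=2):
        p_coin = Fraction(1, 2)
        p_glimpse = ACCURACY if glimpse == coin else 1 - ACCURACY
        # The teacher never sees the coin, so their report is always 1/2.
        yield coin, glimpse, Fraction(1, 2), p_coin * p_glimpse


def prob_heads(condition):
    """P(coin = H | condition), by enumeration over the joint distribution."""
    num = sum(p for c, g, q, p in joint() if condition(c, g, q) and c == "H")
    den = sum(p for c, g, q, p in joint() if condition(c, g, q))
    return num / den


# Conditioning only on the teacher's report: I should agree with the teacher.
without_glimpse = prob_heads(lambda c, g, q: q == Fraction(1, 2))
# Conditioning on the report AND on my own glimpse of heads: I should not.
with_glimpse = prob_heads(lambda c, g, q: q == Fraction(1, 2) and g == "H")

print(without_glimpse, with_glimpse)
```

In this model, trusting the teacher is correct exactly when I condition on nothing the teacher lacks; once my own observation enters the conditioning, my posterior moves away from the teacher's 50-50.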

The formulation P( x | Q(x) = p ) ≈ p bakes in the claim that there isn’t any information I have which the teacher lacks. That sounds OK so far. This is a perfectly legitimate thing to assume, though it’s almost never going to be satisfied in a meaningful case, including cases to do with self-improvement. But let’s set that aside for the moment.

The reason this assumption actually leads to trouble is that P *is* conditioning on a further fact, namely that Q(x) = p, and this further fact is most definitely *not* something that Q knows–that’s the whole point of the Gödelian sentence under consideration, that it exploits Q’s imperfect self-knowledge. In symbols, P( Q(x) = p | Q(x) = p ) = 1, even if x implies Q(x) ≠ p. But this really doesn’t seem surprising or objectionable; P is just conditioning on a further fact that wasn’t available to Q, and we already know that P *should* disagree with Q when it has extra information.

So the principle I typically recommend is P( x | Q(x) = p ) ≈ p, unless there is a good reason that it shouldn’t be, such as P knowing that Q’s judgment doesn’t reflect some particular datum available to P. This is not a rule of inference which we should bake into P if we want to build an absolutely-Q-trusting agent, this is just a fact which we think ought to be true about P and Q, if in fact P has adequate reason to judge Q to be trustworthy.

This entire discussion also applies if Q = P, or if Q is a modified version of P, or a version that knows some extra facts, or whatever. And the issue is identical: generally the version of me tomorrow knows everything I know, but if I condition on the version of me tomorrow believing X, now I know a whole extra fact (and of course there are typically many other pragmatic reasons that I might know something that I won’t reliably know tomorrow).

## So now what?

In the context of self-modification, this doesn’t resolve the problem of how an agent comes to trust itself. It’s just telling us what the goal is. That is, we should endeavor to build an agent P such that P( x | P(x) = p ) ≈ p in general, except when it shouldn’t be.

I admit, that’s a pretty unsatisfying answer to the question “What should an agent believe about itself?” But it is a philosophically *acceptable* answer to that question. If I ask “what should an agent believe about a wise and reliable teacher?” it seems clear that the right answer is messy. There’s not a simple formula expressing “is a wise and reliable teacher.” They should believe the things that lead them to make productive use of the teacher’s advice, in precisely those cases where the teacher’s advice is actually useful.

This is also a fine answer from an engineering perspective. An agent reasoning about its own survival, or about building a near-copy of itself, can do expected utility maximization just like the rest of us. It need not have perfect trust in its own reliability, and there need not be any universal argument that trying to survive is good. We just hope that *if* survival is actually good, and *if* the agent we build is a good reasoner, then it will reason correctly and arrive at the correct conclusion. This is exactly the same thing we hope our agents do in other cases.

In fact in cases where the agent justifiably deviates from self-trust, it is right to do so. In the face of an appropriate Gödelian difficulty, a situation in which the agent’s values are satisfied by doing whatever the agent wouldn’t, it can be rational for the agent to behave surprisingly.

In addition to being an unsatisfying desideratum, this answer also leads to a new question: *how* does an agent actually arrive at this kind of self-trust? If we can’t engineer self-trust in as a rule of inference, then where will it come from?

Unsurprisingly, my answer is: it comes from the same mix of inductive and deductive reasoning that leads to most of our beliefs. If I see that the algorithm I’m running works well in a bunch of cases, I suspect it will work well in new cases. If I can reason about the algorithm and see what it’s doing and why it’s working, I can be more confident. I can write proofs about a system, and those proofs relate my views about some propositions (like axioms) to other propositions (like the system’s trustworthiness). None of this is mysterious.

## Reasoning about yourself

This raises yet another question: how can we know that the systems we build will actually successfully carry out this reasoning?

My answer to this is:

- How do we know it will successfully carry out any reasoning? Probably this is going to be on the same footing. You have to think about why the agent reasons well in general, and try to build agents that reason well. You can’t answer this question without knowing more about why the agent’s reasoning works in general, since it’s exactly the same question.
- I don’t really see how we end up with the situation where an agent is a great general-purpose reasoner, able to make human-level inferences (both inductively and deductively) but completely unable to reason about itself. In fact, the relevant threshold seems to be whether it has human-level ability to reason about the fidelity of its own successors. If it had human-level ability, there is no way that we would be able to say “This process will reliably lead to catastrophic value drift” but the agent itself wouldn’t recognize that! And we can all grant that there *do* exist modifications for which normal human-level reasoning gives adequate confidence, for example no-ops, observations, and minimal capacity improvements. So an agent with merely human-level reasoning about this issue wouldn’t be forced to do something low-fidelity, it would just have to be cautious, in the same kind of ways that we might be cautious.
- We can, in advance, try to study the kind of reflective reasoning that an agent would be carrying out in these cases. I don’t know if this is at the top of the priority list of things to do, but it is certainly an interesting exercise that I would expect to substantially improve our understanding of the situation.

The third point requires more elaboration, because I think this really is the answer to “what now?” for people who are interested in reflection *per se* or who prefer it to other problems for various pragmatic reasons.

I see an analogy with the kind of reflection that an automated mathematician is capable of. That is, *if* an automated mathematician has a truth predicate which it can apply to its axioms, *and* it believes its axioms are true, *then* it infers that everything it proves is also true. And that seems great. This allows the automated mathematician to develop the abstraction of truth, and the truth of the underlying axioms, by inductive reasoning about the world, and then to infer from that its own correctness (and moreover to reason about what changes to its own algorithm would be productive). It also gives us a lot of confidence that “proof” is a useful notion.

In the case of probabilistic reasoning we don’t really have analogous arguments, which an agent might use to better understand its own reasoning. We could imagine finding a prior which “thought itself reasonable,” such that an agent using that prior could look at its own reasoning process and conclude something about its own reliability. I don’t think that this would be done by having a prior P with “P is reasonable” baked in; instead it would arise in the same way that an agent that trusts proofs will naturally realize that its own deductive process is reasonable, because there *is* an objective sense in which proofs preserve truth. I can see various lines of attack on this problem, though they also seem quite difficult.

I think that this is a realistic objective, and one that motivated some of my own work on probabilistic reasoning about mathematics. Certainly a success would be interesting. It also seems conceivable that sorting through these questions will involve significant shifts in our understanding, or that this kind of reflective self-certification will be a useful tool in helping us understand what kind of belief formation process is reasonable in the probabilistic setting (in the same way that the “proof” that proofs preserve truth is an important part of the modern view that proof is a useful way to reason).

At the same time, it’s totally conceivable that there just won’t be a clear story. There isn’t any particular reason that there needs to be a further story, and it doesn’t seem like it would be too terrifying if there weren’t. The situation seems similar to the one humans have been in for most of history: we had no particular reason to expect there to be a nice framework like probability for reasoning under uncertainty, but there was. We knew no particular reason for there to exist this beautiful complete and sound deduction procedure for first order logic, or even for there to exist a system as awesome as first order logic, but there was. I don’t think this is a law of nature, but I do think that the historical record suggests that it’s often worthwhile to be optimistic and try to come up with clean formal frameworks.

## A note on non-monotonic reasoning

One reason that the intuitive response outlined above is unsatisfying to many people is that it has the form “You should think X unless there is a reason to think not X.” Of course this is an exceptionally common pattern of reasoning in ordinary life. That said, it makes it very difficult to prove that your system will reason in the intended way. After all, what if there *was* a good reason for not X, and you just didn’t notice?

Proofs, on the other hand, have this desirable characteristic that if you prove X, then you’ve proved X for good, and under plausible conditions you shouldn’t prove any false X. You can’t have overlooked anything, you won’t revise the view later. This property is closely related to what is called *monotonicity*. As you learn more facts, you are able to prove more things.

I can see the attraction of this property. But it’s obviously unattainable for almost all inferences of interest, and I haven’t really seen good arguments for demanding it in this case. Indeed, we *know* that there are cases where if you think a bit longer you will realize that you shouldn’t trust yourself in some particular future case. That’s just how it goes; we have to live with it.

If you really want to prove theorems about the good behavior of a proposed system, rather than using assumptions like “Assuming the agent doesn’t believe anything false,” I think you should be using assumptions like “Assuming that we accept the agent’s assessments of probabilities.”

Amusingly, articulating this assumption formally runs into exactly the same Gödelian obstructions. My response? Probably we just have to live with it.

# A reflection principle

Despite all that, I *do* think it would be convenient if we could have a clean, formal statement of what it means to regard someone’s judgment as perfectly trustworthy (including assuming that “they know everything I do”). So let’s say we want to write down a formal rendition of “P trusts Q completely,” a replacement for the inconsistent P( x | Q(x) = p ) = p.

It seems like in order to construct such a definition, we need to have P condition on Q’s probability assessments. So if Q isn’t good at introspection the whole project seems sunk, since by conditioning on Q’s probability assessments P will actually be learning something that Q doesn’t know. This is a further restriction on Q, but I’m not too troubled by that; I think it’s reasonable for agents to have a pretty accurate view of their own internal state.

So let’s require that if Q(x) = p, then Q( Q(x) ≈ p ) ≈ 1. That is, let’s assume that Q knows its own probability assessments up to some small error. It’s inconsistent to assume that Q knows its own assessments exactly, but knowing its own assessments approximately doesn’t seem inherently problematic.

OK, so now we’d like to have P condition on Q’s judgments without in the process learning something that Q doesn’t know. The obvious approach is to consider P( x | Q(x) ≈ p ), but this turns out to be problematic whenever Q(x) ≈ p is replaced by *any* predicate relating Q(x) and p. The basic problem is that P( Q(x) ≈ p | Q(x) ≈ p) = 1, regardless of the interpretation of ≈.

Once we’re thinking about things in this way, a natural approach is to have P condition on a noisy observation of Q(x). That is, introduce a new error term which is close to 0 and which doesn’t appear in the expression x. Concretely, let η be a symbol representing a noise variable that is uniform in [-ε, +ε] and which doesn’t appear in x, and let p’ be uniform in [p – ε, p + ε]. Then we should have E[P(x | Q(x) + η = p’)] ≈ p, where the expectation E is taken over p’. In fact we can define the distribution P( • | Q(x) + η = p’ ) using Bayes’ rule without even introducing the symbol η into the language.

This is a tool for capturing the intuition that we should cause P to condition on something, but simultaneously shouldn’t tell it anything that Q doesn’t know about x. P has indeed learned something that Q didn’t know, namely about the error term η. But since Q already mostly knew its own beliefs, the extra data P learned was (mostly) unrelated to the question x.
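The remark that this conditional distribution can be defined by Bayes’ rule, without η in the language, can be spelled out as follows. This is only a sketch, under the assumption that P treats η as independent of both x and Q(x) and that the denominator is positive; the density notation is mine.

```latex
\begin{align*}
P\bigl(x \mid Q(x) + \eta = p'\bigr)
  &= \frac{\int P\bigl(x \wedge Q(x) \in dq\bigr)\,\tfrac{1}{2\varepsilon}\,
           \mathbf{1}\bigl[\,|p' - q| \le \varepsilon\,\bigr]}
          {\int P\bigl(Q(x) \in dq\bigr)\,\tfrac{1}{2\varepsilon}\,
           \mathbf{1}\bigl[\,|p' - q| \le \varepsilon\,\bigr]} \\[4pt]
  &= \frac{P\bigl(x \wedge |Q(x) - p'| \le \varepsilon\bigr)}
          {P\bigl(|Q(x) - p'| \le \varepsilon\bigr)}
   = P\bigl(x \mid |Q(x) - p'| \le \varepsilon\bigr).
\end{align*}
```

The right-hand side mentions only x, Q(x), and p’, so under these assumptions η never has to be added to the language. For a fixed p’ this is ordinary conditioning on an interval; what separates the noisy version from sharp conditioning is that p’ is itself random and the quantity of interest is the average over p’.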

Of course such an observation will still shift P’s beliefs about x: it now knows (a bit) more about Q(x) than is reflected in Q(x) itself. But now this change should be reflected in a small shift in P’s beliefs, because Q already *almost* knew the value of Q(x). We can now formulate an approximate reflection principle, parametrized by a small ε > 0:

For every x, E[P(x | Q(x) + η = p’)] ∈ [p – ε, p + ε], where η is a symbol representing a U(-ε, +ε) random variable that doesn’t appear in x and p’ is distributed according to U(p – ε, p + ε).
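As a rough numerical illustration, the sketch below evaluates the left-hand side of this principle for the Gödelian sentence G := Q(G) < 0.5. It rests on a strong simplifying assumption of my own: P’s prior over the value of Q(G) is taken to be uniform on [0, 1], so that after the noisy observation P’s posterior over Q(G) is uniform on [p’ – ε, p’ + ε]. All function names are hypothetical.

```python
def posterior_G(p_obs: float, eps: float) -> float:
    """P(G | Q(G) + eta = p_obs), assuming a uniform prior on Q(G).

    The posterior on Q(G) is then uniform on [p_obs - eps, p_obs + eps]
    (valid while that interval lies inside [0, 1]), so the answer is the
    fraction of the interval lying below 0.5.
    """
    frac = (0.5 - (p_obs - eps)) / (2 * eps)
    return min(1.0, max(0.0, frac))


def expected_posterior(p: float, eps: float, steps: int = 20000) -> float:
    """Midpoint-rule estimate of E[P(G | Q(G) + eta = p')], p' ~ U(p - eps, p + eps)."""
    width = 2 * eps / steps
    total = 0.0
    for i in range(steps):
        p_obs = p - eps + (i + 0.5) * width
        total += posterior_G(p_obs, eps)
    return total / steps


def satisfies_principle(p: float, eps: float) -> bool:
    """Check whether the expected posterior lands in [p - eps, p + eps]."""
    return abs(expected_posterior(p, eps) - p) <= eps


eps = 0.01
for p in (0.3, 0.499, 0.5, 0.5 + eps**2, 0.7):
    print(p, round(expected_posterior(p, eps), 4), satisfies_principle(p, eps))
```

In this toy model the expected posterior varies continuously with p instead of jumping from 1 to 0 at p = 0.5, and the check passes only for candidate values of p in a narrow window around 0.5 whose half-width is on the order of ε². A different prior for P would change the details.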

I think this principle dodges the immediate Gödelian problems and is consistent, by the same argument as given in this preprint. I don’t know if it satisfies various stronger criteria, like being satisfied by an actual pair of mathematical definitions P and Q.

There are various other subtleties in the definition. For example, note that in order for this to be satisfied, if G := Q(G) < 0.5 then Q(G) ∈ [0.5 – ε², 0.5 + ε²], which is perhaps not what you’d expect from the definitions. Nothing looks like an obvious problem, but on the whole it seems unclear whether all the pieces will fit together. But the overall idea seems to me like an attractive one.