(Apologies: this post will probably be incomprehensible.)

In my post on Indirect Normativity, I describe the definition of a very complicated utility function U. A U-maximizer has a very interesting task: it is supposed to optimize a utility function that is completely opaque, and it is supposed to care about worlds (alternatively, assign worlds prior probabilities) according to a metric which is too complicated to understand! A U-maximizer may be able to understand the implications of the definition after amassing a great deal of computational power, or by understanding the process that generated the definition, but at first it would have no idea what U was. And if you are a U-maximizer, and you don’t know anything about what U is, you might as well behave randomly. This is problematic for two reasons:

- We’d prefer to get sane behavior as soon as possible, rather than waiting until a U-maximizer amasses the computational power necessary to understand U. If U says “don’t kill anyone!” and you can’t figure out U until you’ve murdered a million people, that could be considered problematic. We’d like to build agents that can intuitively reason “killing someone might be bad” without being able to figure out their own utility function in complete detail. But in this setting, not only can the U-maximizer not figure out exactly what U is; without doing really sophisticated mathematics, it can’t tell *anything at all* about it.
- Any given procedure for *acquiring* more computational resources must rely on assumptions about what world you are in. A certain strategy will work in some worlds, but not work in other worlds. And if which worlds you care about is encoded in U, and you can’t figure out U, how do you know to take actions that help you compute things in simple worlds? Why not take actions that would be useful for computing things in other, perverse worlds? Do we need to have something beyond a prior, if we want to build an agent whose prior is too complicated for it to initially understand?

Here I’d like to introduce and discuss a toy example, which I think gives a clearer view of these issues.

### Encrypted Utility Function

Suppose we have a simple algorithm U which works by sampling universally distributed strings and scoring them, and a one-way function f (that is, a permutation such that it is easy to compute f(X) given X, but hard to compute X given f(X)). Let V = f(U, x) for a randomly chosen x, and define the utility function U’ as follows: “Do a brute-force search to find **U, x** such that f(**U, x**) = V, then return **U**().”
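
A minimal Python sketch of this construction may help fix ideas. Everything in it is a stand-in chosen for illustration: the three candidate "programs" are hypothetical, the seed is only two bytes so that the brute-force search finishes quickly, and SHA-256 plays the role of f even though it is a hash rather than a permutation.

```python
import hashlib
import secrets

# Hypothetical toy "programs": each name maps to a zero-argument scoring function.
CANDIDATES = {
    "always_zero": lambda: 0.0,
    "always_one": lambda: 1.0,
    "half": lambda: 0.5,
}
SEED_BYTES = 2  # deliberately tiny so the search below terminates quickly


def f(name: str, x: bytes) -> bytes:
    """Stand-in for the one-way function f(U, x); SHA-256 is NOT a permutation."""
    return hashlib.sha256(name.encode() + b"|" + x).digest()


def encrypt(name: str) -> bytes:
    """Compute V = f(U, x) for a randomly chosen seed x, which is then discarded."""
    x = secrets.token_bytes(SEED_BYTES)
    return f(name, x)


def u_prime(v: bytes) -> float:
    """U': brute-force search for (U, x) with f(U, x) = V, then return U()."""
    for name, program in CANDIDATES.items():
        for i in range(256 ** SEED_BYTES):
            x = i.to_bytes(SEED_BYTES, "big")
            if f(name, x) == v:
                return program()
    raise ValueError("no preimage found")


v = encrypt("always_one")
print(u_prime(v))  # prints 1.0
```

In the toy version the search is cheap, so V is not really opaque; the point of the real construction is that the same search is astronomically expensive while the definition of U’ stays just as short.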

Now we record some input data S, and we run the algorithm A(S): “Output X such that, conditioned on A(S) = X, the expected value of U’ is maximized.” (In order to compute the conditional probabilities, A makes use of some sort of “math intuition module.”) We set up the computational environment such that A’s output will be sent to actuators, and then the inputs of the sensors will be appended to S to produce S’, and we will repeat the whole process with A(S’).
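
The control loop can be sketched roughly as follows. This is only an illustration under assumptions: `expected_utility` is a hypothetical placeholder for the math intuition module's estimate of E[U’ | A(S) = X], `act_and_sense` is a hypothetical placeholder for the actuators and sensors, and the set of candidate outputs is assumed to be small and finite.

```python
from typing import Callable, Sequence


def choose_output(
    history: str,
    candidate_outputs: Sequence[str],
    expected_utility: Callable[[str, str], float],
) -> str:
    """A(S): return the X with the highest estimated E[U' | A(S) = X]."""
    return max(candidate_outputs, key=lambda x: expected_utility(history, x))


def run(
    history: str,
    candidate_outputs: Sequence[str],
    expected_utility: Callable[[str, str], float],
    act_and_sense: Callable[[str], str],
    steps: int,
) -> str:
    """Repeatedly act on A(S), then append the new sensor input to S."""
    for _ in range(steps):
        x = choose_output(history, candidate_outputs, expected_utility)
        history = history + act_and_sense(x)  # S' = S + new observations
    return history


# Toy stand-ins, purely for demonstration.
toy_estimate = lambda s, x: 1.0 if x == "b" else 0.0
toy_world = lambda x: x.upper()
print(run("", ["a", "b"], toy_estimate, toy_world, steps=3))  # prints BBB
```

All of the difficulty discussed in this post is hidden inside `expected_utility`: the loop itself is trivial once something can estimate those conditional expectations.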

Suppose that Solomonoff induction on the input data S is sufficient to infer a lot about the world containing A, including the fact that the code of A was written by some humans, who produced U’ by applying f to U along with some random seed x (which can’t be inferred from S).

What will A do, in this world?

### Reasoning About Incomprehensible Utility

I suspect that, if A has a reasonable (and reasonably reflective) math intuition module, it should reason as follows (the rest of this section is from A’s perspective):

Probably U’ doesn’t depend at all on A(S)—after all, most things don’t. So, given that I find myself with an incomprehensible utility function, probably it doesn’t matter at all what I do. But I might as well ignore those cases, and just assume for the moment that U’ *does* depend on A(S). So: what then?

Well, based on S (which encodes many observations of the environment) and cleverness, I have noticed that there is a simple process W that explains where the code of A(S) came from. Namely, W is the process in which some humans (whose existence and properties can be inferred from S) start with the string Q (which I can also infer from S), generate x at random, and then output f(**Q, x**). Let Property P be the assertion: “There exists some **x** such that f(**Q, x**) = V.”

So now I am tasked with explaining: “Why is it that U'( ) depends substantially on A(S)?” (I’m assuming that U’ does depend on A(S), because otherwise it doesn’t matter what I do.) We need to be a bit more careful about what we mean by “depends on”: what I mean by “U'( ) depends on A(S)” is that E[ U'() | A(S) = 0 ] is quite different from E[ U'() | A(S) = 1 ], which is itself a logical fact. Similarly, by “U'( ) depends on W( )” I mean that U’ depends on the computational history of W( ).

One possible explanation for why U'( ) depends on A(S) is Hypothesis H:

U'( ) depends on W( ), AND Property P is true, so U'( ) depends on A(S). [This generically implies that U'( ) depends on A(S).]

Now we need some notion of a “logical prior,” which should be clear enough intuitively. We have four facts whose probabilities we are interested in:

1. U'( ) depends on A(S).
2. U'( ) depends on W( ).
3. Property P.
4. Hypothesis H.

My claim is that the “prior probability” of a short statement should be much higher than the prior probability of a long statement, unless there is some other implication which causes the long statement to be true. That is, the prior probability of (2) should by default be higher than the prior probability of (1). The prior probability of Property P should be roughly the fraction of possible values of V that equal f(**Q, x**) for some **x**: about 2^|x| / 2^|V|, where |x| and |V| are lengths in bits. The prior probability of Hypothesis H should be about the prior probability of (2) times the prior probability of Property P.

So if the right relationships amongst the sizes of things hold, (1) is much more likely to be true as a consequence of (2) than “for its own sake.” (Of course there could be other, still better, hypotheses that explain why (1) was true, but let’s set those aside.)

And if we buy that, then if we condition on the worlds where (1) is true, we infer that (2) is probably true, Property P is probably true, and U'( ) = U( ).
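
The conditioning step can be illustrated with a small numerical sketch. The numbers are invented purely for illustration and are not derived from anything above. The sketch assumes a logical prior that assigns weight 2^-bits to a statement costing that many bits, and a simple model in which (1) holds either because Hypothesis H holds or, failing that, "for its own sake" with some small direct prior.

```python
def prior_from_bits(bits: float) -> float:
    """Assumed logical prior: a statement costing `bits` bits gets weight 2**-bits."""
    return 2.0 ** -bits


def posterior_of_h(p_dep_on_w: float, p_property_p: float, p_direct: float) -> float:
    """P(H | (1)), assuming H implies (1) and (1) otherwise holds with prob p_direct."""
    p_h = p_dep_on_w * p_property_p          # prior of Hypothesis H
    p_one = p_h + (1.0 - p_h) * p_direct     # total prior of statement (1)
    return p_h / p_one


# Hypothetical sizes, chosen only so that "the right relationships" hold.
p2 = prior_from_bits(20)        # (2): U'( ) depends on W( )
p_p = prior_from_bits(10)       # Property P
p_direct = prior_from_bits(60)  # (1) true for its own sake

posterior = posterior_of_h(p2, p_p, p_direct)
print(posterior)
```

With these made-up sizes the posterior on H is extremely close to 1. If the direct explanation were instead cheaper than (2) and Property P combined, the conclusion would reverse, which is why the relative sizes matter.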

### The Analogy

In the case of the indirect definition of the human utility functions, as in the case of the encrypted utility function, the mathematical definition of U is completely opaque until astronomical resources are available. However, in both cases a U-maximizer may reason about the physical process that gave rise to the existence of the U-maximizer, and in particular of U, by treating the existence of such a physical process as a possible explanation for why a completely unknown U could depend on A(S). If the process carrying out that reasoning is philosophically sophisticated and conforms to intuitive notions of correctness, then it is possible to learn about the utility function without having astronomical resources.