# Formal Instructions Revisited

Previously I’ve talked about getting a handle on objects of interest (humans, counterfactual civilizations) and on issuing an instruction of the form “Simulate this civilization.”  Here is a much better proposal for issuing formal instructions.

Suppose we can build a something-like-TDT agent, which controls its behavior to maximize a constant U defined by some axioms (or as the output of some program).  We want to run this agent with a utility function reflecting our own preferences, but we don’t have (formal) access to those preferences.

Suppose further that we can somehow define mathematically a counterfactual utopia: a possible world in which humans are likely to successfully complete whatever moral project we might optimistically hope for and build an FAI they are happy with. For example, perhaps we can mathematically pin down a brain emulation and describe algorithmically a (computationally unrestricted) environment in which the brain emulation has free rein, and will have the ability to safely carry out extensive projects and instantiate other uploads. We can provide the resulting civilization with a distinguished input channel and describe mathematically the distribution over possible messages they send on that channel.

Let P be the distribution over messages sent on this channel. Some messages can be interpreted as axioms describing a potential utility function U (or as programs outputting real numbers). Since we can describe P mathematically, we can define a distribution over utility functions U by conditioning P on being a valid description of a utility function and interpreting it in that way. Once we’ve described this probability distribution, we can specify our original agent’s utility as the expected utility under a definition drawn from this distribution.
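The construction above can be sketched concretely. This is a minimal toy model, not the actual proposal: `P` is modeled as a finite list of (probability, message) pairs, and `parse_utility` stands in for the (unspecified) interpreter that decides which messages count as valid utility descriptions. All names here are hypothetical.

```python
def parse_utility(message):
    """Toy interpreter: only messages of the form ('prefer', x) are valid,
    denoting the utility that is 1 on outcome x and 0 elsewhere."""
    if isinstance(message, tuple) and len(message) == 2 and message[0] == "prefer":
        target = message[1]
        return lambda outcome: 1.0 if outcome == target else 0.0
    return None  # not a valid description of a utility function

def conditioned_mixture(P):
    """Condition P on being a valid utility description: drop invalid
    messages and renormalize, yielding (probability, utility) pairs."""
    valid = [(p, parse_utility(m)) for p, m in P]
    valid = [(p, u) for p, u in valid if u is not None]
    total = sum(p for p, _ in valid)
    return [(p / total, u) for p, u in valid]

def expected_utility(P, outcome):
    """The agent's utility of an outcome: the expectation of U(outcome)
    over the conditioned distribution of utility functions."""
    return sum(p * u(outcome) for p, u in conditioned_mixture(P))

# A toy message distribution: one invalid message, two valid ones.
P = [(0.5, ("prefer", "flourishing")),
     (0.3, "not a utility description"),
     (0.2, ("prefer", "paperclips"))]

# After conditioning, "flourishing" carries weight 0.5/0.7 ≈ 0.714.
print(round(expected_utility(P, "flourishing"), 4))  # → 0.7143
```

The point of the sketch is only the shape of the definition: conditioning on validity, then taking an expectation, turns an uncertain description channel into a single well-defined utility function for the agent.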

With luck, the resulting utility function is as good a choice as we could have come up with ourselves, even if we had been given an eternity with no risk of catastrophe. The civilization could produce a utility function expressing ideas like “do no harm, but do this instrumentally useful thing which will allow the original humans to build a real FAI,” for example. More optimistically, they could directly find an appropriate dynamic which expresses human preferences (probably defined in terms of a mathematical description of human brains, or perhaps a catalog of all humans existing at the time the AI is run, or perhaps a description of agents from within this counterfactual utopia, or whatever).

There are a few obvious problems:

• With some probability the civilization will fail (e.g. be taken over by uFAI) and will output some utility function designed to take over our universe. To prevent this sort of failure, we should require each utility function to be bounded, say to have range [0, 1]. If we make this restriction and 99% of the mass of P lies on “good” utility functions, then we are guaranteed to get an outcome which maximizes a “good” utility function to within 0.01. (Of course, now we are stuck with a bounded utility function, but this is not such a serious issue, since satisfying a bounded utility function may create an agent which maximizes an unbounded utility function. Here the bounded utility acts as a sort of “majority rules” vote, rather than an inherent obstruction.)
• There may be no utility function which we want to plug into such an AI. This framework may just be a bad one, in which it is not possible to even express an idea like “do no harm,” much less human value (or, our decision theory may cause the AI to behave badly regardless of utility function for other reasons). In this case we have more serious trouble, but at least we would have had serious trouble anyway.
• Simulating the counterfactual utopia is likely to be much too computationally expensive for the AI, given available resources. This isn’t really a problem: the AI is encouraged to just take its “best guess,” reasoning about the mathematical/algorithmic definition of utility even if it cannot simulate the world in the definition. To the extent that human values cohere (in whatever abstract sense we might hope for), the AI can predict that some actions will be good without obtaining an exact statement of human utility.
• Defining a good utility U may depend on understanding facts about our universe. This seems unlikely, as the counterfactual utopia may run huge simulations of varied laws of physics and search for copies of their remembered history within those simulations (or output a utility function that does the same). So if this consideration bites us (say because our laws of physics are radically different from what we expect) we will probably first run into the next problem:
• Our formal definitions of brain emulations, counterfactuals, moral bootstrapping, or whatever, may break down, for any one of many possible reasons.
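The robustness claim in the first bullet above can be checked numerically. This is an illustrative sketch under toy assumptions: if all utilities have range [0, 1] and a fraction (1 − ε) of the mass of P lies on a single “good” utility, then any action optimal for the mixture loses at most ε of attainable good utility, no matter what the adversarial remainder does. The specific utility functions here are made up for the demonstration.

```python
def mixture(good, bad, eps):
    """Mixture utility: (1 - eps) weight on the good utility, eps on the bad."""
    return lambda a: (1 - eps) * good(a) + eps * bad(a)

actions = range(100)
good = lambda a: a / 99.0        # "good" utility, range [0, 1]
bad = lambda a: 1.0 - a / 99.0   # adversarial utility, also bounded in [0, 1]
eps = 0.01                       # 1% of the mass of P is on the adversary

U = mixture(good, bad, eps)
a_star = max(actions, key=U)     # the agent maximizes the mixture

# Gap between the best attainable good utility and what the agent achieves.
loss = max(good(a) for a in actions) - good(a_star)
assert loss <= eps               # the guaranteed bound from the bullet above
```

Without the [0, 1] bound this guarantee evaporates: an adversarial utility of unbounded magnitude could dominate the mixture with arbitrarily small probability mass, which is exactly why the boundedness restriction is imposed.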