Ordinary Ideas

Formal Instructions Revisited

Previously I’ve talked about getting a handle on objects of interest (humans, counterfactual civilizations) and about issuing instructions of the form “Simulate this civilization.” Here is a much better proposal for issuing formal instructions.

Suppose we can build a something-like-TDT agent, which controls its behavior to maximize a constant U defined by some axioms (or as the output of some program).  We want to run this agent with a utility function reflecting our own preferences, but we don’t have (formal) access to those preferences.

Suppose further that we can somehow define mathematically a counterfactual utopia: a possible world in which humans are likely to successfully complete whatever moral project we might optimistically hope for, and to build an FAI they are happy with. For example, perhaps we can mathematically pin down a brain emulation and describe algorithmically a (computationally unrestricted) environment in which the brain emulation has free rein, with the ability to safely carry out extensive projects and instantiate other uploads. We can provide the resulting civilization with a distinguished output channel and describe mathematically the distribution over possible messages they send on that channel.

Let P be the distribution over messages sent on this channel. Some messages can be interpreted as axioms describing a potential utility U (or as programs outputting real numbers). Since we can describe P mathematically, we can define a distribution over utilities U by conditioning P on the message being a valid description of a utility and interpreting it in that way. Once we’ve described this probability distribution, we can specify our original agent’s utility as the expected utility under a definition drawn from this distribution.
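To make the construction concrete, here is a minimal toy sketch in Python. Everything in it is an illustrative assumption: the finite message distribution, the `parse_utility` interpreter, and the toy outcomes stand in for the (vastly harder) mathematical definitions the post envisions. It shows only the final step: condition P on being a valid utility description, renormalize, and take the expected utility under the posterior.

```python
# Toy sketch: condition a message distribution on validity and define
# the agent's utility as the resulting mixture of utilities.
# All names here (parse_utility, the example messages) are hypothetical.

def parse_utility(message):
    """Interpret a message as a utility function over outcomes,
    or return None if it is not a valid description."""
    table = {
        "prefer_a": lambda outcome: 1.0 if outcome == "a" else 0.0,
        "prefer_b": lambda outcome: 1.0 if outcome == "b" else 0.0,
    }
    return table.get(message)

def mixture_utility(message_dist):
    """Condition message_dist on parsing to a valid utility, then
    return the expected utility under the renormalized posterior."""
    valid = {m: p for m, p in message_dist.items()
             if parse_utility(m) is not None}
    total = sum(valid.values())
    if total == 0:
        raise ValueError("no valid utility descriptions in support")

    def U(outcome):
        return sum((p / total) * parse_utility(m)(outcome)
                   for m, p in valid.items())
    return U

# P: a stand-in for the civilization's output-channel distribution.
P = {"prefer_a": 0.3, "prefer_b": 0.1, "garbage": 0.6}
U = mixture_utility(P)
# After conditioning, "prefer_a" has posterior weight 0.75 and
# "prefer_b" has 0.25, so U("a") = 0.75 and U("b") = 0.25.
print(U("a"), U("b"))
```

The invalid message simply drops out under conditioning; the agent maximizes the mixture U without ever needing to know which description the civilization "really" sent.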

With luck, the resulting utility function is as good a choice as we could have come up with ourselves, even given an eternity to deliberate with no risk of catastrophe. The civilization could produce a utility function expressing ideas like “do no harm, but do this instrumentally useful thing which will allow the original humans to build a real FAI,” for example. More optimistically, they could directly find an appropriate dynamic expressing human preferences (probably defined in terms of a mathematical description of human brains, or perhaps a catalog of all humans existing at the time the AI is run, or perhaps a description of agents from within this counterfactual utopia, or whatever).

There are a few obvious problems:
