I have written briefly about how one might pin down the human decision process (the thing itself, not some idealization thereof) or a counterfactual world. If we (probably foolishly) wanted to give an AI formal instructions using these ideas, we would still need to include some edict like “Now take this decision process, embed it in this abstract world (where we believe it will be able to create a flourishing utopia or whatever) in this way, and make the universe look like that.” We’ve maybe gotten some leverage on the first parts (though right now the difficulties here loom pretty large), which involve precisely defining certain concepts for an AI, but it isn’t yet clear how you could precisely tell the AI to do something. Here is a stab at this other problem.
Rather than directly asking an AI to simulate a particular universe, we will ask it to find the value on a particular physical input channel in that universe, and then exert control from within the universe to ensure that calculating this value requires simulating the universe (or at least capturing whatever moral value we hope would come from a simulation of that universe).
For example, suppose that we have an AI which somehow uses a human’s decision process to define a counterfactual world W (in which some specified quantum coins happened to fall in a particular way controlled by [an extrapolation of] the human’s decision process; or perhaps in which the laws of physics are completely synthetic and the human decision process has the freedom to construct a desirable world), and we would like the AI to make the universe “look like W.” We imagine that the world W has a distinguished input channel I within it (defined as the simplest program which agrees with some initial values on a particular physical input channel). Moreover, the real universe also contains an input channel I’, defined in the same way.
In order to instruct the AI to make the world look like W, we tell it to value similarity between the values on I and I’, say to have utility equal to the length of the longest common prefix between I and any suffix of I’. If we (living within W) control the value on I appropriately, this may lead the AI to simulate W, determine what values we would have put on I, and then put the same values on I’.
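As a minimal sketch, assuming the channel values are represented as byte streams (the post leaves the representation abstract), this utility is just the longest match between I and any suffix of I’:

```python
def utility(i_values: bytes, i_prime_values: bytes) -> int:
    """Length of the longest common prefix between I and any suffix of I'.

    i_values / i_prime_values are hypothetical byte-stream encodings of the
    values observed so far on the two channels.
    """
    best = 0
    for start in range(len(i_prime_values)):
        suffix = i_prime_values[start:]
        match = 0
        while match < min(len(i_values), len(suffix)) and i_values[match] == suffix[match]:
            match += 1
        best = max(best, match)
    return best
```

Allowing any suffix of I’ (rather than requiring alignment from the start) means the AI is not penalized for values already written to I’ before it adopted this goal.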
How would we want to set the values of I? We want to ensure that, in order to make I and I’ equal, the AI will simulate W and then change the values on I’, rather than running some other process to guess the values on I or trying to acausally control the value on I itself. This requires first that friendly processes control I (for example, if a predictable process or the AI itself gains control of I, the AI no longer needs to simulate W to make I and I’ agree). But it also requires that there be no computational shortcuts to computing the values on I, so that the AI is required to simulate W in order to compute them. This actually seems rather straightforward: by making the values of I depend in a sufficiently chaotic way on the rest of the universe, we ensure that an exact simulation of W is necessary to make I’ even remotely similar to I. We also need to ensure that it is possible for the AI to make I’ and I similar, or else it will resort to more desperate measures. This will hold if we ensure that the world W is sufficiently simple that it can be simulated in the real world, say by creating a giant computer within W and simulating our civilization on that computer, then making I depend only on the behavior of that simulation.
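One simple way to get this chaotic, shortcut-free dependence is to fold W’s entire state history into a hash chain and emit digests periodically. This is only a toy illustration, with SHA-256 standing in for whatever hash W’s inhabitants would choose and a hypothetical byte-string encoding of world states; the point is that (absent a break of the hash) computing any emitted value requires processing every prior state:

```python
import hashlib

def channel_values(history, period=100):
    """Emit a digest on I every `period` steps of W's history.

    `history` is an iterable of byte-string state snapshots (a hypothetical
    encoding; the post leaves this abstract). Each digest depends on all
    earlier states, so there is no shortcut past simulating them.
    """
    digest = hashlib.sha256(b"genesis").digest()
    emitted = []
    for step, state in enumerate(history, start=1):
        digest = hashlib.sha256(digest + state).digest()
        if step % period == 0:
            emitted.append(digest)
    return emitted
```

Because each output is a hash of the full prefix of the history, perturbing any single early state changes every later emitted value, which is exactly the sensitivity the argument needs.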
The following approach seems to be very conservative, supposing that consciousness is a property of computation and that we put moral value primarily on the content of conscious experience (it should be clear how to generalize to other settings, and these choices can all be made after much deliberation within W). Each conscious process that wants to ensure it gets simulated computes a cryptographic hash of some large subset of its own computation history. So, e.g., I record all of the heat coming off of my brain and hash it to obtain a 10000-bit summary of my conscious experience. We then hash all of these histories together (using a Merkle tree if you like) to produce a 10000-bit summary of the universe’s moral value, and send this over the channel I. We repeat this process periodically, encouraging an AI to try to simulate as much of our collective history as possible before exhausting available resources. If the cryptographic hash function is collision-resistant against the universe (based on our current understanding of complexity theory, we suspect such functions exist), this has the property that no single agent in W can substantially influence the value of I, and moreover that an AI trying to make I’ similar to I can only do so by computing all of the inputs to this process, i.e., by replicating the entire history of all conscious processes which we value.
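The aggregation step above can be sketched as follows, again with SHA-256 (256 bits) standing in for the post’s hypothetical 10000-bit collision-resistant hash, and with each process history given as a byte string:

```python
import hashlib

def h(data: bytes) -> bytes:
    # SHA-256 as an illustrative stand-in for a collision-resistant hash.
    return hashlib.sha256(data).digest()

def summarize_universe(process_histories):
    """Hash each conscious process's recorded history, then combine the
    per-process digests with a Merkle tree into one summary for I."""
    layer = [h(history) for history in process_histories]
    if not layer:
        return h(b"")
    while len(layer) > 1:
        if len(layer) % 2:  # duplicate the last node on odd-sized layers
            layer.append(layer[-1])
        layer = [h(layer[i] + layer[i + 1]) for i in range(0, len(layer), 2)]
    return layer[0]  # the root: a single digest to send over I
```

The Merkle structure is optional (a flat hash of the concatenation would do for the argument), but it lets any individual process later prove its history was included in the summary with only logarithmically many digests.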