Although I don’t yet have any idea how to build an AI which pursues a goal I give it, I am optimistic that one day humans might. Even with a seed AI in hand, however, writing down any understandable goal at all, much less one which humans approve of, looks like it might be quite hard. The issue is complicated by our complete ignorance of the hypothetical AI’s ontology and of the mechanisms by which its creators might access that ontology.
I do have some fixed points: I believe that any AI will probably, at a minimum, be able to reason about logical and mathematical truth, and I believe that many natural goals will want to use the subexpression “a human’s decision process” (for example, by appealing to a human’s decision process to make some judgment about input sequences).
This leads to a natural question: if all I can talk to the AI about is math, how do I tell it “here is a human”?
Here is an attempt, though I will later give some reasons it may break down (for now the failures are left as an exercise for the counterfactual inquisitive reader).
Take a box with a hole in it. Inside the box, put a human together with an MRI scanner, a monitor, and a keyboard, with the input and output channels of these devices wired through the hole. Let the human interact with the monitor and keyboard for a while, providing appropriate input; for example, hold a video conference between the person in the box and some people outside of the box, have the person in the box play a video game, etc. Let I be the input to the monitor, let O be the output from the keyboard, and let S be the MRI scan data.
Once we have (I, O, S) in hand, we can try to formally specify the human’s decision process as follows. Pick a function F from the universal distribution, which takes as input a stream of bits I and returns a pair of output streams which we will interpret as (O, S). Restrict attention to functions such that the kth bit of O and S depends only on the first k bits of I. Now condition this distribution on agreement with observation, namely, that when applied to the available prefix of I, the function outputs the observed prefixes of O and S. The resulting probability distribution allows us to estimate how the human in the box would respond if provided with the input string I.
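To make the conditioning step concrete, here is a toy sketch, not the construction itself. The universal distribution is uncomputable, so this swaps in a tiny hand-written hypothesis class whose description lengths are made up, and it assumes one bit of O and one bit of S per input bit. Everything named here (`Hypothesis`, `posterior`, `predict_next_output`, and the four hypotheses) is hypothetical. Each hypothesis is an online step function, so causality (the kth output bit sees only the first k input bits) holds by construction; the prior weight is 2^-length, and conditioning keeps only hypotheses that exactly reproduce the observed prefixes of O and S.

```python
from dataclasses import dataclass
from fractions import Fraction
from typing import Callable, Dict, Sequence, Tuple

Bits = Tuple[int, ...]
# A causal step: given the input prefix I[0..k], return the k-th bits of (O, S).
Step = Callable[[Bits], Tuple[int, int]]


@dataclass(frozen=True)
class Hypothesis:
    name: str
    length: int  # hypothetical description length in bits (stand-in for program length)
    step: Step

    def run(self, inputs: Sequence[int]) -> Tuple[Bits, Bits]:
        # Output bit k is computed from inputs[:k+1] only, so causality holds by construction.
        pairs = [self.step(tuple(inputs[: k + 1])) for k in range(len(inputs))]
        return tuple(o for o, _ in pairs), tuple(s for _, s in pairs)


def posterior(hyps: Sequence[Hypothesis], I: Sequence[int],
              O: Sequence[int], S: Sequence[int]) -> Dict[str, Fraction]:
    # Prior weight 2^-length, then condition on exactly reproducing the observed O and S.
    weights = {}
    for h in hyps:
        o, s = h.run(I)
        if o == tuple(O) and s == tuple(S):
            weights[h.name] = Fraction(1, 2 ** h.length)
    total = sum(weights.values(), Fraction(0))
    if total == 0:
        raise ValueError("no hypothesis agrees with the observations")
    return {name: w / total for name, w in weights.items()}


def predict_next_output(hyps: Sequence[Hypothesis], post: Dict[str, Fraction],
                        I_extended: Sequence[int]) -> Fraction:
    # Posterior probability that the final O bit is 1 when the box receives I_extended.
    by_name = {h.name: h for h in hyps}
    return sum((p for name, p in post.items()
                if by_name[name].run(I_extended)[0][-1] == 1), Fraction(0))


# Toy hypotheses (all hypothetical); the observed prefix has length 4.
HYPS = [
    Hypothesis("echo", 5, lambda p: (p[-1], p[-1])),
    Hypothesis("negate", 5, lambda p: (1 - p[-1], p[-1])),
    Hypothesis("parity_scan", 9, lambda p: (p[-1], sum(p) % 2)),
    Hypothesis("echo_then_zero", 12, lambda p: (p[-1] if len(p) <= 4 else 0, p[-1])),
]

I_OBS, O_OBS, S_OBS = (1, 0, 1, 1), (1, 0, 1, 1), (1, 0, 1, 1)
post = posterior(HYPS, I_OBS, O_OBS, S_OBS)
p_one = predict_next_output(HYPS, post, I_OBS + (1,))
```

In this toy run, `parity_scan` reproduces O exactly but disagrees with S, so only the scan channel rules it out. `echo_then_zero` fits every observation but costs 7 more bits than `echo`, so it keeps just 1/129 of the posterior mass, and the predicted probability that the next O bit is 1 comes out to 128/129.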
Hopefully, the simplest function capturing the behavior of the human in the box is a physical simulation of that human. We include the MRI scan to ensure that there is enough data that the parameters for the physical simulation are less complex than a cruder specification (of course the model also has to describe how the MRI works, and it has to spend many bits describing errors and deviations from its physical simulation, but the high bandwidth of the MRI means that the physical simulation rapidly gains probability over a slightly less accurate model). We include the keyboard as an output channel as a way of pinpointing the human’s “intention,” without having to solve arcane technical problems (relying on whatever internal mechanism our brains use for giving our intentions control over motor function).
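The bit-counting argument above can be written out under an idealized two-part-code accounting. This accounting, and the notation K, E_n and δ, are my own assumptions for illustration: K(F) is the number of bits describing the model F itself (including how the MRI works), and E_n(F) is the number of bits F spends describing its errors and deviations over the first n observed steps, so that a model consistent with the data gets posterior weight proportional to 2^{-(K(F) + E_n(F))}.

```latex
% Idealized accounting (assumption): P(F \mid I, O, S) \propto 2^{-(K(F) + E_n(F))}
\log_2 \frac{P(F_{\mathrm{sim}} \mid I, O, S)}{P(F_{\mathrm{crude}} \mid I, O, S)}
  = \bigl[K(F_{\mathrm{crude}}) - K(F_{\mathrm{sim}})\bigr]
  + \bigl[E_n(F_{\mathrm{crude}}) - E_n(F_{\mathrm{sim}})\bigr]
  = n\delta - \bigl[K(F_{\mathrm{sim}}) - K(F_{\mathrm{crude}})\bigr],
\qquad
\delta := \frac{E_n(F_{\mathrm{crude}}) - E_n(F_{\mathrm{sim}})}{n}.
```

So the log-odds favor the physical simulation exactly when nδ exceeds the extra model complexity K(F_sim) - K(F_crude). Because the MRI plausibly supplies many more bits per step than the keyboard alone, the per-step saving δ can be large, which is why the crossover can arrive quickly even though the simulation itself is expensive to describe.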
This gives us more or less the most general sort of access to a human decision process that we would want (we can ask questions and elicit responses with about as high a bandwidth as a human can support, and we can run the resulting simulation as many times as we like from an identical starting state). Best of all, it was specified purely in math. As long as we have access to the mathematical parts of the AI’s ontology, this trick shows that we can specify a human decision process with little additional sweat (modulo the issues I will describe in subsequent posts).