# Specifying Humans Formally (Using an Oracle for Physics)

Although I don’t yet have any idea how to build an AI which pursues a goal I give it, I am optimistic that one day humans might. However, writing down any understandable goal at all, much less one which humans approve of, looks like it might be quite hard even with a seed AI in hand. The issue is complicated by our complete ignorance of the hypothetical AI’s ontology and of the mechanisms by which its creators might have access to that ontology.

I do have some fixed points: I believe that any AI will probably at a minimum be able to reason about logical and mathematical truth, and I believe that many natural goals will want to use the subexpression “a human’s decision process” (for example appealing to a human’s decision process to make some judgment about input sequences).

This leads to a natural question: if all I can talk to the AI about is math, how do I tell it “here is a human”?

Here is an attempt, though I will later give some reasons it may break down (for now the failures are left as an exercise for the counterfactually inquisitive reader).

Take a box with a hole in it. Inside the box, put a human together with an MRI scanner, a monitor, and a keyboard, with their input and output channels wired through the hole. Let the human interact with the monitor and keyboard for a while, providing appropriate input; for example, hold a video conference between the person in the box and some people outside of the box, have the person in the box play a video game, etc. Let I be the input to the monitor, let O be the output from the keyboard, and let S be the MRI scan data.

Once we have (I, O, S) in hand, we can try to formally specify the human’s decision process as follows. Pick a function F from the universal distribution, which takes as input a stream of bits I and returns a pair of output streams which we will interpret as (O, S). Restrict attention to functions such that the kth bit of O and S depends only on the first k bits of I. Now condition this distribution on agreement with observation, namely, that when applied to the available prefix of I, the function outputs the observed prefixes of O and S. The resulting probability distribution allows us to estimate how the human in the box would respond to any input stream, including inputs that were never actually provided.
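To make the conditioning step concrete, here is a toy sketch in Python. It is not the construction above: the universal distribution is uncomputable, so the sketch swaps in three hand-written candidate functions with made-up description lengths. The names `echo`, `parity`, `delayed`, `CANDIDATES`, `posterior`, and `predict` are all hypothetical. Each candidate only ever sees the first k input bits when producing the kth pair of output bits, which is the causality restriction.

```python
from fractions import Fraction

# Each candidate maps an input prefix to the k-th pair (o_k, s_k).
# A candidate is only ever called on inputs[:k], so the k-th output
# depends only on the first k input bits.
def echo(prefix):      # O repeats the latest input bit, S is always 0
    return (prefix[-1], 0)

def parity(prefix):    # O is the parity of the input so far, S echoes the input
    return (sum(prefix) % 2, prefix[-1])

def delayed(prefix):   # O repeats the previous input bit, S echoes the input
    return (prefix[-2] if len(prefix) > 1 else 0, prefix[-1])

# (name, ASSUMED description length in bits, function); the lengths are invented.
CANDIDATES = [("echo", 3, echo), ("parity", 5, parity), ("delayed", 6, delayed)]

def run(f, inputs):
    """Apply a candidate to every prefix of the input stream."""
    return [f(inputs[:k]) for k in range(1, len(inputs) + 1)]

def posterior(inputs, observed):
    """Weight each candidate by 2^-length, keep those that reproduce the data, renormalize."""
    weights = {name: Fraction(1, 2 ** length)
               for name, length, f in CANDIDATES
               if run(f, inputs) == observed}
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}

def predict(post, new_inputs):
    """Posterior distribution over the (O, S) streams produced on a new input."""
    by_name = {name: f for name, _, f in CANDIDATES}
    dist = {}
    for name, p in post.items():
        out = tuple(run(by_name[name], new_inputs))
        dist[out] = dist.get(out, 0) + p
    return dist

# Two observed steps do not separate the candidates; a third one does.
print(posterior([0, 0], [(0, 0), (0, 0)]))
print(posterior([0, 0, 1], [(0, 0), (0, 0), (1, 1)]))
```

In the second call, `echo` and `parity` produce the same O stream and are separated only by S, a small-scale version of the job the MRI channel is meant to do.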

Hopefully, the simplest function capturing the behavior of the human in the box is a physical simulation of that human. We include the MRI scan to ensure that there is enough data that the parameters for the physical simulation are less complex than a cruder specification (of course the model also has to describe how the MRI works, and it has to spend many bits describing errors and deviations from its physical simulation, but the high bandwidth of the MRI means that the physical simulation rapidly gains probability over a slightly less accurate model). We include the keyboard as an output channel as a way of pinpointing the human’s “intention,” without having to solve arcane technical problems (relying on whatever internal mechanism our brains use for giving our intentions control over motor function).
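The claim that a high-bandwidth scan lets the physical simulation overtake a cruder model can be illustrated with some back-of-the-envelope arithmetic. The sketch below assumes each hypothesis costs a fixed number of bits to describe plus a constant number of extra bits per scan bit to encode its errors. That linear cost model, the function names, and every number are illustrative assumptions rather than estimates of anything real.

```python
from fractions import Fraction

def total_bits(model_bits, error_bits_per_scan_bit, scan_bits):
    """Code length of a hypothesis: its description plus the bits spent on its errors."""
    return model_bits + error_bits_per_scan_bit * scan_bits

def log2_odds_for_simulation(scan_bits, sim, crude):
    """log2 of P(simulation) / P(crude model) after conditioning on scan_bits bits of S."""
    return total_bits(*crude, scan_bits) - total_bits(*sim, scan_bits)

def crossover_scan_bits(sim, crude):
    """Number of scan bits at which the two hypotheses are equally probable."""
    (k_sim, e_sim), (k_crude, e_crude) = sim, crude
    return (k_sim - k_crude) / (e_crude - e_sim)

# HYPOTHETICAL numbers: (description length in bits, error bits per scan bit).
SIM = (1_000_000, Fraction(100, 1000))    # expensive to state, slightly more accurate
CRUDE = (100_000, Fraction(110, 1000))    # cheap to state, slightly less accurate

n_star = crossover_scan_bits(SIM, CRUDE)
print(n_star)
print(log2_odds_for_simulation(2 * n_star, SIM, CRUDE))
```

Under these assumptions the simulation starts out heavily penalized for its longer description, but every additional scan bit shifts the odds in its favor by the difference in per-bit error cost, so a large enough S eventually dominates.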

This gives us more or less the most general sort of access to a human decision process that we would want (we can ask questions and elicit responses with about as high a bandwidth as a human can support, and we can run the resulting simulation as many times as we like from an identical starting state). Best of all, it was specified purely in math. As long as we have access to the mathful parts of the AI ontology, this trick shows that we can specify a human decision process with little additional sweat (modulo the issues I will describe in subsequent posts).

## 6 thoughts on “Specifying Humans Formally (Using an Oracle for Physics)”

2. Mitchell Porter says:

This is an important and interesting topic (for people concerned with Friendly AI). However, there are definitely issues with the procedure you suggest.

A human in a box, communicating through a keyboard, is going to offer a strongly biased picture of how the human brain works. The use of heuristics by the human decision procedure is undoubtedly context-dependent, and this is a highly artificial context. Important chunks of the human utility function may never be called upon, and so may be overlooked by a simulation.

Also, if we are trying to extract a generic human utility function, relying on one person may lead to overfitting. What we get here is surely much, much more complicated than what we ultimately want, if the aim is to extract a human-friendly value system. We don’t want an AI to believe that all its choices need to be measured against the preferences of someone who was stuck in a box once, possibly having a bad day, and whose brain was full of all sorts of idiosyncrasies and irrelevant microphysical complexities.

So ultimately you *do* have to solve the arcane problems like identifying the “internal mechanism our brains use for giving our intentions control over motor function”. Or perhaps you personally don’t have to solve them, but your algorithm for determining the nature of the human decision procedure will be performing an equivalent analysis. “Overfitting” is a nice familiar way to pose the question: In inferring the human decision procedure, by observing and simulating human beings, how do we avoid overfitting? Answering *that* question should take you a long way.

If the objective is just to simulate a particular human being… again, a microphysically exact physical simulation would not be the simplest way to simulate a person. It might be one of the simplest functions to ostensively *specify* (“simulate what’s happening in that box”), but it would be full of unnecessary details about ions.

On a practical level, an efficient way to develop a whole-brain model may involve repeated high-resolution fMRI, and the progressive development of finite-state machine models for the voxels, in the context of interactive experiments. That is, you’ll be modeling the brain as a lattice of finite state machines coupled to their neighbors. You get the data from the fMRI, and there will be a protocol of interaction with the subject designed to reveal the dynamics in ever-greater detail, until diminishing returns set in.

• The goal is to get access to the function “What this human in this box would say, if you presented it with stimulus X.” No idealization, and no concern for how the function is implemented. In particular, the AI may not even have to use the function itself; it may reason about the function’s behavior, perhaps even by observing the behavior of humans. In this case the mathematical definition is just a handle to get access to a particular concept in the AI’s ontology.

A separate issue is “Given a thing that responds to stimuli in this way, try to extract some abstract features of interest.”