The motivated simulator argument

Sometimes we want to talk formally, in mathematical language, about objects that actually exist in the world. One way to try to do this is to record some sequence of observations about the world and then apply Solomonoff induction. The hope would then be: if you apply Solomonoff induction to the sequence of things you’ve seen so far, it will correctly predict whatever you see next. In this post I’ll describe a problem with applying this approach to anything particularly important.

Continue reading
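
As a reminder of the construction being invoked (my notation, not anything taken from the post itself): Solomonoff induction predicts by conditioning the universal prior M on the observations recorded so far,

    M(x) = \sum_{p \;:\; U(p) \text{ begins with } x} 2^{-|p|},
    \qquad
    M(b \mid x) = \frac{M(xb)}{M(x)}

where U is a fixed universal prefix machine, the sum ranges over programs p whose output begins with x, x is the sequence of observations so far, and b is a candidate next observation. The hope described above is that M(b | x) concentrates on the observation that actually comes next.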

A formalization of indirect normativity

This post outlines a formalization of what Nick Bostrom calls “indirect normativity.” I don’t think it’s an adequate solution to the AI control problem, but to my knowledge it was the first precise specification of a goal that meets the “not terrible” bar, i.e., one that does not lead to terrible consequences if pursued without any caveats or restrictions. Continue reading

Formal Instructions Revisited

Previously I’ve talked about getting a handle on objects of interest (humans, counterfactual civilizations) and on issuing an instruction of the form “Simulate this civilization.”  Here is a much better proposal for issuing formal instructions.

Suppose we can build a something-like-TDT agent, which controls its behavior to maximize a constant U defined by some axioms (or as the output of some program).  We want to run this agent with a utility function reflecting our own preferences, but we don’t have (formal) access to those preferences. Continue reading
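
To make the setup concrete, here is a minimal toy sketch (my own illustration, not the proposal from the post): U is defined as the output of a fixed program, and the agent simply chooses whichever available action that program scores highest. Everything named below (utility_program, ACTIONS, choose_action) is a hypothetical stand-in; a real agent of the kind described would have to reason about U from its definition rather than evaluate it directly.

    # Toy sketch: an agent whose goal is a constant U defined as the output of a
    # fixed program, here `utility_program`. In the post U would be defined
    # axiomatically; here we just evaluate it to show the shape of the setup.
    from typing import List

    def utility_program(history: List[str]) -> float:
        """Stand-in for the program whose output defines the constant U.
        Here U simply rewards histories containing more 'cooperate' actions."""
        return sum(1.0 for action in history if action == "cooperate")

    ACTIONS = ["cooperate", "defect", "wait"]  # hypothetical action set

    def choose_action(history: List[str]) -> str:
        """Pick the action whose resulting history the utility program scores highest."""
        return max(ACTIONS, key=lambda a: utility_program(history + [a]))

    if __name__ == "__main__":
        history: List[str] = []
        for _ in range(3):
            history.append(choose_action(history))
        print(history)  # ['cooperate', 'cooperate', 'cooperate']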

Formal Instructions

I have written briefly about how one might pin down the human decision process (the thing itself, not some idealization thereof) or a counterfactual world. If we (probably foolishly) wanted to give an AI formal instructions using these ideas, we would still need to include some edict like “Now take this decision process, embed it in this abstract world (where we believe it will be able to create a flourishing utopia or whatever) in this way, and make the universe look like that.” We’ve maybe gotten some leverage on the first parts (though right now the difficulties here loom pretty large), which involve precisely defining certain concepts for an AI, but it isn’t yet clear how you could precisely tell the AI to do something. Here is a stab at this other problem.

Rather than directly asking an AI to simulate a particular universe, we will ask it to find the value on a particular physical input channel in that universe, and then exert control from within the universe to ensure that calculating this value requires simulating the universe (or at least capturing whatever moral value we hope would come from a simulation of that universe). Continue reading
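
Here is a minimal toy sketch of that instruction, using my own made-up names (universe_step, channel_value_at) and assuming a deterministic, formally specified physics: the AI is asked only for the value on a designated channel at time T, but that value is arranged from inside the universe to be a digest of the entire history, so there is no obvious shortcut around running the whole simulation.

    # Toy sketch of the "read a channel" instruction. The AI is asked only for
    # the value appearing on a designated input channel at time T; inhabitants of
    # the simulated world arrange for that value to depend on the entire history,
    # so computing it forces (something like) a full simulation.
    import hashlib

    def universe_step(state: int) -> int:
        """Stand-in for the formally specified physics of the target universe."""
        return (1103515245 * state + 12345) % (2 ** 31)

    def channel_value_at(T: int, initial_state: int = 42) -> str:
        """The quantity the AI is instructed to compute: the channel's value at time T.
        Here the 'civilization' writes a hash of its full history onto the channel."""
        state = initial_state
        digest = hashlib.sha256()
        for _ in range(T):
            state = universe_step(state)
            digest.update(state.to_bytes(4, "big"))
        return digest.hexdigest()

    if __name__ == "__main__":
        print(channel_value_at(10_000))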

Specifying (non-decision-theoretic) Counterfactuals

Here is a simple trick for specifying the future inputs of a computer in the physical world: run the computer for a long time, and then ask for the simplest description of the resulting sequence of inputs. The resulting description is a good predictor of future inputs, provided we live in a suitable universe.

(This is vulnerable to all of the same attacks defined in “Hazards,” and if we really want to get access to the universe as a whole, rather than just to a simulation of a single brain, it will be much harder to get around these problems.)

Now suppose we have a single bit X on a computer, and we would like to talk about the counterfactual world in which X’s value was flipped. How can we do this? Or perhaps we would like to consider an entire ensemble of possible counterfactuals in which we were given one of exponentially many possible messages m1, m2, ….

Continue reading
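
As a toy illustration of the first step (my own sketch, using a drastically restricted description language rather than arbitrary programs, so shortest_repeating_description and predict_next are stand-ins): search descriptions in order of length, keep the shortest one consistent with the inputs observed so far, and use it to predict future inputs.

    # Toy sketch: approximate "the simplest description of the observed inputs"
    # within a tiny description language (just "repeat this bit pattern"), then
    # use the shortest consistent description to predict the next inputs.
    from itertools import product
    from typing import List, Optional, Tuple

    def shortest_repeating_description(observed: List[int]) -> Optional[Tuple[int, ...]]:
        """Search descriptions in order of length; return the shortest bit pattern
        whose repetition reproduces the observed sequence."""
        for length in range(1, len(observed) + 1):
            for pattern in product([0, 1], repeat=length):
                generated = (pattern * (len(observed) // length + 1))[: len(observed)]
                if list(generated) == observed:
                    return pattern
        return None

    def predict_next(observed: List[int], horizon: int) -> List[int]:
        """Extend the shortest description found and read off the predicted inputs."""
        pattern = shortest_repeating_description(observed)
        if pattern is None:
            return []
        extended = pattern * ((len(observed) + horizon) // len(pattern) + 1)
        return list(extended[len(observed): len(observed) + horizon])

    if __name__ == "__main__":
        inputs_so_far = [0, 1, 1, 0, 1, 1, 0, 1]
        print(shortest_repeating_description(inputs_so_far))  # (0, 1, 1)
        print(predict_next(inputs_so_far, 4))                 # [1, 0, 1, 1]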

Hazards for Formal Specifications

I have described a candidate scheme for mathematically pinpointing the human decision process, by conditioning the universal prior on agreement with the human’s observed behavior. I would like to point out three dangers with this approach, which seem to apply quite generally to attempts to mathematically specify value (and have analogs for other aspects of agents’ behavior):

Continue reading
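
For concreteness, one way to write down the scheme being examined (my notation, not necessarily the post’s): condition a universal prior over programs on exact agreement with the human’s recorded behavior,

    P(p \mid \text{record}) \;\propto\; 2^{-|p|} \prod_{t=1}^{T} \mathbf{1}\left[\, p(o_1, \dots, o_t) = a_t \,\right]

where p ranges over programs mapping observation histories to actions and (o_1, a_1), …, (o_T, a_T) is the human’s recorded sequence of observations and actions; “the human decision process” is then read off from this posterior over programs.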

Specifying Humans Formally (Using an Oracle for Physics)

Although I don’t yet have any idea how to build an AI which pursues a goal I give it, I am optimistic that one day humans might. However, writing down any understandable goal at all, much less one which humans approve of, looks like it might be quite hard even with a seed AI in hand. The issue is complicated by our complete ignorance of the hypothetical AI’s ontology and of the mechanisms by which its creators might have access to that ontology.

I do have some fixed points: I believe that any AI will probably at a minimum be able to reason about logical and mathematical truth, and I believe that many natural goals will want to use the subexpression “a human’s decision process” (for example appealing to a human’s decision process to make some judgment about input sequences).

This leads to a natural question: if all I can talk to the AI about is math, how do I tell it “here is a human”?

Here is an attempt, though I will later give some reasons it may break down (for now the failures are left as an exercise for the counterfactually inquisitive reader).

Continue reading