Specifying (non-decision-theoretic) Counterfactuals

Here is a simple trick for specifying the future inputs of a computer in the physical world: run the computer for a long time, and then ask for the simplest description of the resulting sequence of inputs. The resulting description is a good predictor for future inputs, provided we live in a suitable universe.

(This is vulnerable to all of the same attacks defined in “Hazards,” and if we really want to get access to the universe as a whole, rather than just to a simulation of a single brain, it will be much harder to get around these problems.)

Now suppose we have a single bit X on a computer, and we would like to talk about the counterfactual world in which X’s value was flipped. How can we do this? Or perhaps we would like to consider an entire ensemble of possible counterfactuals in which we were given one of exponentially many possible messages m1, m2, ….

Continue reading

Hazards for Formal Specifications

I have described a candidate scheme for mathematically pinpointing the human decision process, by conditioning the universal prior on agreement with the human’s observed behavior. I would like to point out three dangers with this approach, which seem to apply quite generally to attempts to mathematically specify value (and have analogs for other aspects of agents’ behavior):

Continue reading

Short Explanations of Observations in Physical Worlds

[This post contributes nothing new.]

Consider the sequence of bits observed by a camera situated within the physical universe (which we can imagine as a CA for concreteness). If we draw a program uniformly at random (i.e., fixing a universal prefix-free encoding) and condition on agreement with a prefix of this sequence, what does the posterior (over programs) look like?
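To make the question concrete, here is a toy sketch in Python. It is not the universal prior, which ranges over all programs and is uncomputable. Instead it assumes a tiny hypothetical program space in which each "program" is a bit pattern that is repeated forever, encoded prefix-free as n ones, a zero, and then the n pattern bits. The names `programs`, `posterior`, `predict_next` and the cutoff `MAX_PATTERN_LEN` are illustrative assumptions, not anything from the post.

```python
from fractions import Fraction
from itertools import product

MAX_PATTERN_LEN = 6  # assumed cutoff so the toy program space is finite


def programs():
    """Yield (pattern, prior_weight) for every pattern up to the cutoff.

    Encoding a pattern of length n as n ones, a zero, then the pattern
    gives a prefix-free code of length 2n + 1, so a program drawn bit by
    bit at random has prior weight 2 ** -(2n + 1).
    """
    for n in range(1, MAX_PATTERN_LEN + 1):
        for bits in product((0, 1), repeat=n):
            yield bits, Fraction(1, 2 ** (2 * n + 1))


def output(pattern, length):
    """The first `length` bits produced by repeating `pattern` forever."""
    return tuple(pattern[i % len(pattern)] for i in range(length))


def posterior(observed):
    """Condition the prior on agreement with the observed prefix."""
    observed = tuple(observed)
    weights = {p: w for p, w in programs()
               if output(p, len(observed)) == observed}
    if not weights:
        raise ValueError("no program in the toy space matches the prefix")
    total = sum(weights.values())
    return {p: w / total for p, w in weights.items()}


def predict_next(observed):
    """Posterior probability that the next observed bit is 1."""
    n = len(observed)
    post = posterior(observed)
    return sum(w for p, w in post.items() if output(p, n + 1)[n] == 1)


if __name__ == "__main__":
    post = posterior([0, 1, 0, 1, 0, 1])
    for pattern, weight in sorted(post.items(), key=lambda kv: -kv[1]):
        print(pattern, weight)
```

In this toy space the posterior concentrates on the shortest pattern consistent with the prefix, because each extra pattern bit costs a factor of four in prior weight. Whether anything similar holds for programs that describe a physical universe is exactly what the post asks.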

Continue reading

Cellular Automata

In the interest of concreteness, I am going to talk about cellular automata (CA) a lot here. They serve as a convenient toy example for talking about computation, and particularly about structures embedded in computations (it is easy to think about how such structures exert control over their environment, although this is just as philosophically problematic as acausal control in general). CA have no relevant mystical properties. You could substitute any other sufficiently complicated program, but CA have the virtue of matching our intuition about physics in several ways (similar notions of space and time, of regular physical law, and so on). Whenever the intuition from CA seems to get in the way of thinking about what is going on in generality, I will abandon them.
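For readers who have not met a CA before, a minimal sketch of a one-dimensional, two-state automaton follows. The choice of an elementary CA, the default rule 110, and the wrap-around boundary are all assumptions made for illustration; the post does not commit to any particular automaton.

```python
def step(cells, rule=110):
    """Apply one synchronous update of an elementary (1D, two-state) CA.

    Each cell's next state depends only on itself and its two neighbours.
    The 8 possible neighbourhoods index into the bits of `rule`.
    The row wraps around at the edges (an assumed boundary condition).
    """
    n = len(cells)
    out = []
    for i in range(n):
        left, centre, right = cells[(i - 1) % n], cells[i], cells[(i + 1) % n]
        index = (left << 2) | (centre << 1) | right
        out.append((rule >> index) & 1)
    return out


def run(cells, steps, rule=110):
    """Return the list of rows from the initial row through `steps` updates."""
    history = [list(cells)]
    for _ in range(steps):
        cells = step(cells, rule)
        history.append(cells)
    return history


if __name__ == "__main__":
    start = [0] * 31 + [1]
    for row in run(start, 15):
        print("".join("#" if c else "." for c in row))
```

The same local rule is applied everywhere and at every step, which is the sense in which a CA mirrors regular physical law with familiar notions of space (the row) and time (successive rows).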

Continue reading

Specifying Humans Formally (Using an Oracle for Physics)

Although I don’t yet have any idea how to build an AI which pursues a goal I give it, I am optimistic that one day humans might. Writing down any understandable goal at all, much less one which humans approve of, looks like it might be quite hard even with a seed AI in hand, however. The issue is complicated by complete ignorance of the hypothetical AI’s ontology or the mechanisms by which the creators might have access to that ontology.

I do have some fixed points: I believe that any AI will probably at a minimum be able to reason about logical and mathematical truth, and I believe that many natural goals will want to use the subexpression “a human’s decision process” (for example appealing to a human’s decision process to make some judgment about input sequences).

This leads to a natural goal: if all I can talk to the AI about is math, how do I tell it “here is a human”?

Here is an attempt, though I will later give some reasons it may break down (for now the failures are left as an exercise for the counterfactual inquisitive reader).

Continue reading

Decision Theory Desiderata

“Expected utility maximization” is a generally accepted decision procedure, at least up to computational limitations. But it is worth remembering that as of today, no one seems to have produced a formalization of EU maximization that leads to sane behavior. I’ll briefly review some of the properties we might like our decision theories to have, and observe that for the most part we can’t write code that satisfies any of them in much generality, even if we could deal with the classical AI problems of inference and optimization.

Continue reading

Beliefs Which are Always Wrong

In the game “matching pennies,” a player and a predictor simultaneously choose either Heads or Tails. The player wins a point if his play differs from the predictor’s and loses a point otherwise. We also offer the player the ability to opt out of the game for a small penalty.

Personally, after losing at matching pennies a relatively small number of times I will suspect that the predictor is able to reliably predict my behavior and so will begin opting out. We expect that a reasonable AI will behave similarly. But we run into difficulties when we try to formalize the reasoning that led to our behavior.
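The game and the informal "opt out after a few losses" behavior can be sketched as below. This only hard-codes the heuristic; it does not formalize the reasoning behind it, which is the difficult part. The penalty size, the loss threshold, the alternating strategy, and the predictor that simply runs the player's strategy are all assumptions chosen for illustration.

```python
OPT_OUT_PENALTY = 0.1  # assumed value for the "small penalty"
LOSS_THRESHOLD = 3     # assumed number of losses before the player gives up


def player_move(round_index):
    """A deterministic toy strategy: alternate Heads and Tails."""
    return "H" if round_index % 2 == 0 else "T"


def predictor_move(round_index):
    """A predictor that predicts perfectly by running the player's strategy."""
    return player_move(round_index)


def play(rounds):
    """Play `rounds` rounds, opting out once LOSS_THRESHOLD losses accumulate."""
    score = 0.0
    losses = 0
    history = []
    for t in range(rounds):
        if losses >= LOSS_THRESHOLD:
            # The player now suspects it is being predicted and opts out.
            score -= OPT_OUT_PENALTY
            history.append("opt-out")
            continue
        if player_move(t) != predictor_move(t):
            score += 1
            history.append("win")
        else:
            score -= 1
            losses += 1
            history.append("loss")
    return score, history


if __name__ == "__main__":
    final_score, rounds_played = play(10)
    print(rounds_played)
    print(final_score)
```

Against this predictor the player loses every round it plays, so opting out at a cost of 0.1 per round beats continuing at a cost of 1 per round.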

Continue reading

AIXI and Existential Despair

(Cross-posted from Less Wrong)

It has been observed on Less Wrong that a physical, approximate implementation of AIXI is unable to reason about its own embedding in the universe, and therefore is apt to make certain mistakes: for example, it is likely to destroy itself for spare parts, and is unable to recognize itself in a mirror. But these seem to be mild failures compared to other likely outcomes: a physical, approximate implementation of AIXI is likely to develop a reductionist world model, doubt that its decisions have any effect on reality, and begin behaving completely erratically.

Continue reading

Clarification of AI Reflection Problem

(Cross-posted from Less Wrong)

Consider an agent A, aware of its own embedding in some lawful universe, able to reason about itself and use that reasoning to inform action.  By interacting with the world, A is able to modify itself or construct new agents, and using these abilities effectively is likely to be an important component of AGI.  Our current understanding appears to be inadequate for guiding such an agent’s behavior, for (at least) the following reason:

If A does not believe “A’s beliefs reflect reality,” then A will lose interest in creating further copies of itself, improving its own reasoning, or performing natural self-modifications. Indeed, if A’s beliefs don’t reflect reality then creating more copies of A or spending more time thinking may do more harm than good. But if A does believe “A’s beliefs reflect reality,” then A runs immediately into Gödelian problems: for example, does A become convinced of the sentence Q = “A does not believe Q”? We need to find a way for A to have some confidence in its own behavior without running into these fundamental difficulties with reflection.

This problem has been discussed occasionally at Less Wrong, but I would like to clarify and lay out some examples before trying to start in on a resolution.

Continue reading

Introduction

I’m planning to use this blog to write down ideas as they occur to me. The writeups will be neither comprehensible nor pretty. They are here because they should be somewhere, but they shouldn’t burden anyone who didn’t come looking for them.

For the most part I will probably be writing about AI safety. I will also sometimes make comments about economics, rationality, and institution design, because these are topics that interest me. I should issue the disclaimer: I don’t know much about any of these subjects.