Suppose we found a black box which purported to be a halting oracle. You feed the box an input, and it outputs 0 or 1. As it turns out, we couldn’t verify such a claim. Nevertheless, we could tell that such a box was pretty magical. Exactly how much could we tell? That is, if you ask the box a question of the form “Does Turing machine M halt?” when can you verify that its answer is accurate? Note that, in order to do this verification, you may be able to ask the box some further questions. In general, you could use halting queries to make the box simulate a deterministic prover’s role in some interactive proof protocol (i.e., does the machine that finds the lexicographically first valid prover strategy and then halts iff the prover’s strategy outputs 1 at this stage, halt?)
If all you care about is computability, you can’t do anything. If you could learn something interesting from an untrusted halting oracle, you could have just looped through all strings until you found one that convinced you to trust it. The interestingness of the situation comes from a computational complexity perspective.
[Disclaimer: I don’t think this is useful.] Continue reading
This blog contains a lot of material that is very weird. A natural question is: why write about such weird things? Many of the situations I’ve considered or problems I’ve written about involve improbable hypotheticals. We probably won’t ever have a powerful utility maximizer which we must provide a precisely defined utility function, for example. Why not focus on designing systems that are likely to actually get built?
The reason should not be so unfamiliar to a mainstream academic: though these problems may be too simple to be directly applicable, thinking about them sheds light on fundamental philosophical and mathematical issues. The heuristic is something like: where there are simple, fundamental problems on which it is possible to make progress, you should do it, even if you can’t see exactly why or how that progress will be useful. Continue reading
Suppose we could build agents with arbitrary formally specified preferences, but that specifying human values turns out to be too challenging. A second approach to getting what we want is to give an agent relatively limited goals and capabilities, hopefully encouraging cooperation and helping us build more cooperative agents, or figure out how to better specify what exactly we want. If the agent is much smarter than us and perfectly selfish it may be a bit tricky; but if the agent is limited in some other way it may not be impossible. For example, you could try to build an agent which only cared about getting a cookie in the next hour, for some definition of “cookie” such that you are the only person who owns any cookies. Or you could try to build an agent which knew and cared only about abstract mathematical facts, and come up with useful problems it could solve in that domain. Right now I want to talk about the first approach, and highlight some difficulties. Continue reading
Suppose that we developed software oracles which could apply large amounts of computational power to solving any formally specified problem (say, you could pay $1k for a human-equivalent). For example, such oracles could find good moves in games which could be characterized completely, or prove theorems in areas which could be rigorously axiomatized, or design machines to achieve a formally specified goal in physical systems which can be modeled algorithmically. What would happen, when these oracles became widely available? Continue reading
This post outlines a formalization of what Nick Bostrom calls “indirect normativity.” I don’t think it’s an adequate solution to the AI control problem; but to my knowledge it was the first precise specification of a goal that meets the “not terrible” bar, i.e. which does not obviously lead to terrible consequences if pursued without any caveats or restrictions. Continue reading