<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	xmlns:georss="http://www.georss.org/georss" xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#" xmlns:media="http://search.yahoo.com/mrss/"
	>

<channel>
	<title>Ordinary Ideas &#187; AI Safety</title>
	<atom:link href="http://ordinaryideas.wordpress.com/category/ai-safety/feed/" rel="self" type="application/rss+xml" />
	<link>http://ordinaryideas.wordpress.com</link>
	<description>As advertised</description>
	<lastBuildDate>Wed, 31 Dec 2014 03:26:28 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.com/</generator>
<cloud domain='ordinaryideas.wordpress.com' port='80' path='/?rsscloud=notify' registerProcedure='' protocol='http-post' />
<image>
		<url>http://s2.wp.com/i/buttonw-com.png</url>
		<title>Ordinary Ideas &#187; AI Safety</title>
		<link>http://ordinaryideas.wordpress.com</link>
	</image>
	<atom:link rel="search" type="application/opensearchdescription+xml" href="http://ordinaryideas.wordpress.com/osd.xml" title="Ordinary Ideas" />
	<atom:link rel='hub' href='http://ordinaryideas.wordpress.com/?pushpress=hub'/>
	<item>
		<title>Approval-seeking</title>
		<link>http://ordinaryideas.wordpress.com/2014/07/21/approval-seeking/</link>
		<comments>http://ordinaryideas.wordpress.com/2014/07/21/approval-seeking/#comments</comments>
		<pubDate>Mon, 21 Jul 2014 23:08:25 +0000</pubDate>
		<dc:creator><![CDATA[paulfchristiano]]></dc:creator>
				<category><![CDATA[AI Safety]]></category>
		<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://ordinaryideas.wordpress.com/?p=350</guid>
		<description><![CDATA[I suggest that it would be desirable if powerful agents acted in the way we would most approve of, upon reflection. I argue that this definition can be made unusually precise, would lead to positive outcomes, and is robust to conceptual imprecision, philosophical errors, and unfavorable circumstances. I described this proposal in a previous post; however, [&#8230;]]]></description>
				<content:encoded><![CDATA[<p>I suggest that it would be desirable if powerful agents acted in the way we would most approve of, upon reflection. I argue that this definition can be made unusually precise, would lead to positive outcomes, and is robust to conceptual imprecision, philosophical errors, and unfavorable circumstances.<br /> <span id="more-350"></span></p>
<p>I described this proposal in a <a href="http://ordinaryideas.wordpress.com/2012/04/21/indirect-normativity-write-up/">previous post</a>; however, that post focused on technical details, and presented an implausible but extremely precise version. In fact I believe that this is a particularly promising angle for thinking about what we would like powerful agents to do. Here I&#8217;d like to look at the more qualitative picture and provide some of the basis for these intuitions.</p>
<h1>What exactly is the proposal?</h1>
<p>I want to define a utility function such that I would be happy if it were exceptionlessly maximized, for example by a powerful AI. The proposal is intended to be straightforward rather than particularly novel or ingenious.</p>
<p>The schema for my proposed utility function is: &#8220;The extent to which <strong>I</strong> would <strong>approve of the AI&#8217;s decisions and their consequences</strong> after <strong>an extensive process of reflection</strong>.&#8221; The three bolded terms require further specification.</p>
<p><strong>&#8220;I&#8221;</strong> refers to the decision-making process that I currently implement (rather than to a concept which could be modified by e.g. changing my physical constitution or making a persuasive argument). I mean to refer to an input/output behavior, without concern for its internal structure.</p>
<p><strong>&#8220;Extensive process of reflection&#8221;</strong> refers to a situation in which a decision-making process is granted extremely extensive resources and may choose how to use them in order to arrive at a conclusion. In this case the conclusion is whether or not to approve of the AI&#8217;s behavior (see below). These resources are intended to be very extensive; for example, they should be sufficient to implement a successor (or any number of successors) which is considered to better represent the original process&#8217;s &#8220;considered judgment&#8221;: a smarter or more experienced version of itself, a committee of experts, a formalization of &#8220;what I would say if I didn&#8217;t make epistemic errors,&#8221; etc.</p>
<p>Note in particular that the original decision-maker may appoint a successor defined in terms of further stages of reflection, so that the process is effectively iterated until convergence.</p>
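<p>As a toy illustration of this iteration (my own sketch, not part of the proposal: the &#8220;endorses&#8221; table below is an entirely hypothetical stand-in for actual deliberation):</p>
<pre><code># Toy picture of iterated reflection: each decision-making process names
# the successor it endorses; iteration stops at a self-endorsing process.
# The `endorses` table is a made-up stand-in for extensive deliberation.
endorses = {
    "me": "me, with far more time to think",
    "me, with far more time to think": "committee of enhanced successors",
    "committee of enhanced successors": "committee of enhanced successors",
}

def reflect(process):
    while endorses[process] != process:
        process = endorses[process]
    return process

print(reflect("me"))  # -> "committee of enhanced successors"
</code></pre>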
<p>Although this definition may appear very philosophical or vague, it can be made precise by clearly specifying &#8220;extensive resources&#8221; and the format in which the successor is described. Also note that &#8220;extensive resources&#8221; is intended to considerably exceed what is actually available to any agents existing in the real world, perhaps including access to unbounded computational resources which could be used to perform extensive searches for simple computations. The intractability of the procedure does not preclude the AI&#8217;s ability to reason about what its output would be, any more than it prevents us from reasoning about whether or not a particular action would be likely to be met with approval after massive amounts of reflection. However, it allows the evaluator to evaluate possibilities which are beyond the comprehension of any living person, and ensures that the evaluator can (predictably) see through any attempts by the AI to game the system.</p>
<p><strong>&#8220;Approve of the AI&#8217;s decisions and their consequences&#8221;</strong> may mean that the evaluator has access to the literal list of decisions made by the AI, and may then use its own great resources to evaluate their consequences in detail. It then outputs a number in [0, 1] representing the extent to which it approves of this behavior.  The AI&#8217;s goal is to maximize this number.</p>
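<p>To fix ideas, here is a minimal sketch of the resulting objective (mine, not the proposal&#8217;s machinery: the action list and scores are hypothetical toys, and the real evaluator is &#8220;me after an extensive process of reflection,&#8221; not a lookup table):</p>
<pre><code># Toy sketch of the approval-maximizing objective: the AI picks the action
# whose expected approval score in [0, 1] is highest. All names and numbers
# here are hypothetical stand-ins.
import random

ACTIONS = ["do nothing", "cure a disease", "seize all resources"]
BASE_SCORE = {"do nothing": 0.3, "cure a disease": 0.9, "seize all resources": 0.0}

def approval(action):
    # Noise stands in for the AI's uncertainty about the verdict of reflection.
    return min(1.0, max(0.0, BASE_SCORE[action] + random.gauss(0, 0.05)))

def expected_approval(action, samples=1000):
    return sum(approval(action) for _ in range(samples)) / samples

print(max(ACTIONS, key=expected_approval))  # -> "cure a disease"
</code></pre>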
<p>In a more extreme interpretation the evaluator may not have access to the outputs of the AI directly, but may be tasked with inferring them from what it already knows about the world, the code of the AI, and so on. I think this is a much more attractive, though less intuitively comprehensible, proposal: &#8220;outputs of the AI&#8221; does not seem to be a reliably meaningful concept, and in general it seems good to leave as much conceptual work as possible to the evaluator rather than to us. In this case the process of reflection would need to continue for a very long time, before a radically transformed version used a massive supply of resources to make appropriate inferences about the AI&#8217;s behavior. In this case the AI would also need to use a decision theory such as EDT/UDT/TDT, rather than CDT, since the AI&#8217;s behavior has no causal impact on the evaluator&#8217;s evaluation&#8211;the entire impact flows through what the evaluator will infer about how the AI would behave.</p>
<p>Note that the concept of &#8220;approval&#8221; need not coincide with some intuitive standard of behavior like &#8220;not actively bad;&#8221; instead, the evaluator can set the line arbitrarily. I use the expression &#8220;most approve of&#8221; rather than &#8220;think best,&#8221; because I want to emphasize the bounded nature of the judgments, and stress that the evaluator is free to avoid bad outcomes by disapproving of them, even if (in the AI&#8217;s judgment) there is a small probability that those outcomes would be judged extremely good. Without the boundedness condition I would not expect the procedure to lead to good outcomes, and if the evaluator exclusively judged the AI&#8217;s behavior by how good the resulting outcome was, I would not necessarily expect good outcomes.</p>
<h1>Why is this desirable?</h1>
<p>The most straightforward virtue of this proposal is that it is simple, can be made relatively precise, and appears likely to lead to good outcomes if implemented in the anticipated way. Indeed, there is some reason to think it would be as good as any proposal that we could think of (since we could arrive at such a proposal on reflection). See <a href="http://ordinaryideas.wordpress.com/2012/04/21/indirect-normativity-write-up/">this post</a> for an example of an attempt to make this proposal completely precise, and the later section of the same post for some arguments that the resulting outcome would be good. By contrast, I am not aware of many simple proposals that lead to desirable outcomes, or which have any realistic prospect of being rendered precise.</p>
<p>Perhaps more importantly, this proposal is <em>robust</em> in a number of very important respects:</p>
<p><strong>Robust to error: </strong>if we attempt to specify what we value directly, we seem to run into a host of potential errors. We may be wrong about what we want, may use the wrong framework for reasoning under normative or empirical uncertainty, may have an incorrect view of epistemology or have incorrect prior beliefs, may make decisions in the wrong way, may be vulnerable to blackmail, may be incorrectly obsessed with small probabilities of big payoffs, may incorrectly ignore small probabilities of big payoffs, may place too much or too little emphasis on further philosophical inquiry, <i>etc.</i></p>
<p>The approval-based proposal has the virtue that it can avoid essentially all of these problems, by (1) delegating the underlying philosophical questions to the process of reflection, which is in a substantially better position than we are to resolve them, and (2) allowing the evaluator to encourage the AI to revise arbitrarily fundamental aspects of its framework. For example, the evaluator could adopt a rule like &#8220;I approve of behavior if and only if it leads to the replacement of this AI by a new AI using framework Q.&#8221; This encourages an AI to reason at a very fundamental level about what kinds of design decisions its creators would have made differently given more reflection, and to make appropriate changes if it thinks the evaluator would prefer that it do so. It also allows the AI to make principled tradeoffs between course-corrections and more immediate concerns in a way which it believes best reflects the desires of its creators (since its creators can decide whether or not they feel it made such tradeoffs in an appropriate way and approve or disapprove as appropriate).</p>
<p><strong>Robust to unfavorable circumstances: </strong>many proposals based on the promise of active collaboration between machine and human intelligence assume either that such collaboration is nearly as productive as unilateral action by machine intelligences, or assume that users of machine intelligences will coordinate to make (potentially modest) productivity sacrifices in exchange for greater fidelity. But in reality &#8220;keeping a human in the loop&#8221; may be difficult, especially in the long run as human labor becomes very expensive compared to computing resources. This solution does not require any assumptions about the availability of human labor, and it allows an AI to develop as rapidly as possible.</p>
<p><strong>Robust to uncertainty:</strong> under realistic conditions a machine will be uncertain about what we would approve of. Because approval is bounded in [0, 1] there are limits on how much this uncertainty can distort its behavior. For example, suppose that the machine entertains a probability of 0.1 that we will behave in a very counter-intuitive way. If we behave in a counter-intuitive way, it may want to take counter-intuitive actions to maximize our approval. These might be actions which we would not approve of if we behaved in the intended way. However, because the machine can only receive 0.1 more expected approval by satisfying our counter-intuitive judgment, this consideration will never justify sacrificing more than 0.11 expected approval <em>conditioned</em> on our behaving in the anticipated way. It seems quite likely that we will have a large enough margin of error that this will make almost no difference to the long-run outcome (see the other points in this section).</p>
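<p>Spelling out the arithmetic behind these numbers (my reconstruction): write p = 0.1 for the probability of the counter-intuitive behavior and x for the approval sacrificed conditioned on the anticipated behavior. Since approval is bounded by 1, the gain in the unlikely case is at most p, so the sacrifice is worthwhile only when</p>
<pre><code>(1 - p) \cdot x \le p \cdot 1
\quad\Longrightarrow\quad
x \le \frac{p}{1 - p} = \frac{0.1}{0.9} \approx 0.11
</code></pre>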
<p><b>Robust to conceptual imprecision: </b>when we use language with humans, the best guarantee we typically obtain is that our words probably mean approximately the same thing to us and to them, in the kinds of situations we have encountered so far. The meaning may not be exactly the same, and in unfamiliar environments the concepts may diverge considerably. Let&#8217;s say that a concept communicated in this way is (at best) &#8220;probably approximately correct&#8221; (PAC).</p>
<p>Maximizing a PAC concept of goodness is likely to be problematic; if there are any possible outcomes on which the PAC concept and our actual concept diverge, then the process of maximizing the concept is unusually likely to push us into regions of divergence. These problems become even greater if the domain in which an agent operates is quite different from the one in which the concept was trained, for example if an agent is making decisions which will have implications for the long-run future.</p>
<p>However, the approval-seeking proposal appears to be robust to the use of PAC concepts for a number of reasons:</p>
<ul>
<li>&#8220;What I would say&#8221; may vary as the concept of &#8220;me&#8221; is varied, but on most readings of &#8220;me&#8221; I will tend to say things that are similar <em>on the axes that are most important to me</em>.  The variation due to conceptual uncertainty is channeled into dimensions where I am particularly ambivalent.</li>
<li>By channeling &#8220;what I would say&#8221; through a process of reflection, we can avoid asking &#8220;what I would say&#8221; in some extreme situation. For example, we need not ask &#8220;would I approve of your decision if I somehow understood all of the long-term consequences?&#8221;; this is good, since PAC concepts of &#8220;what I would say&#8221; might diverge in such an unusual and extreme case (in which we would have to extrapolate what it means to &#8220;really understand&#8221; something far outside of the usual range of human abilities). Instead, we can ask &#8220;What kinds of decision-making procedures do you think would be slightly wiser than your current self?&#8221; This is a question which can be posed to me as I currently exist, and two people who agreed about my behavior in a broad enough range of everyday situations would be likely to agree about my answers to this question.</li>
<li>The processes of reflection and approval can be specified unusually precisely; for example, they can be given a purely formal definition in terms of a given input/output behavior. This ensures that only the concept of &#8220;I&#8221; is subject to significant conceptual imprecision.</li>
<li>The concept of &#8220;my decision process&#8221; also admits a relatively precise definition, which minimizes the extent of imprecision. This is because the physical instantiation of a person can be relatively cleanly separated from their environment, and can be transplanted from one physical environment to another. This includes, for example, an environment in which they interact with a computer. When a person interacts with a computer, they can implement a completely precise input/output behavior (the computer provides an input on the monitor, the human provides an output by typing). Putting these ingredients together, we can relatively unambiguously understand a human as implementing a precise input/output behavior.</li>
<li>The process of approval can act on behaviors in ordinary environments rather than merely acting on the long-term consequences of an AI&#8217;s decisions, and so can yield good outcomes even if no extreme situations are ever encountered. For example, we might approve of an AI&#8217;s behavior only if it avoids doing anything that unnecessarily reduces human influence, or we can approve of the AI&#8217;s behavior only if it avoids making radical changes to its environment. At the same time, the process of approval can meaningfully depend on the long-term consequences of outcomes, and not merely short-term outcomes.</li>
</ul>
<h1>Comparison with CEV</h1>
<p>This proposal has many important similarities to Eliezer Yudkowsky&#8217;s <a href="https://intelligence.org/files/CEV.pdf">CEV</a>; I read that document some time ago and I expect it has had an influence on my thinking. The primary difference is that this proposal relies on iterating the single relation, &#8220;A &#8211;&gt; decision-making procedure that A endorses&#8221; defined in a relatively straightforward way&#8211;for example by presenting A with the tools to define a new decision-making procedure&#8211;rather than by relying on a more complex understanding of motivation, reflection, and idealization. One consequence is that this proposal is more amenable to formalization. Another is that this proposal leaves room for A to err (or for A&#8217;s replacement to err, and so on).</p>
<p>The difference seems to be due primarily to disagreement about &#8220;how good does the initial decision-making dynamic need to be, in order to avoid an unacceptable risk of significant error?&#8221; My overall view is that we can adopt a relatively straightforward rule such as this one and improve the rule with relatively straightforward modifications (such as taking a long time to think, consulting many copies of ourselves, having a discussion with a slightly enhanced version of ourselves to verify that their reasoning remains convincing to us, and a very wide range of less pithy techniques). If it is important to resolve hard philosophical problems, for example those required to define &#8220;volition,&#8221; this can then be done from a much-enhanced future state. It may be that a relatively straightforward rule is <em>not</em> sufficient, but in that case it seems clear that the existing human society is already doomed: after all, the process we are currently using to decide what successors to create is just one example of a particular simple rule for deciding on a decision-making procedure that we endorse.</p>
<p>A more charitable interpretation of this criticism is that <em>one person</em> operating in isolation is at a significant disadvantage, when compared to the existing community of thinkers. I find this objection relatively uncompelling, because the move from &#8220;what I would do&#8221; to &#8220;what I would do with appropriate experimental feedback&#8221; to &#8220;what a bunch of copies of me would do with appropriate experimental feedback&#8221; to &#8220;what our society would do under ideal conditions with appropriate experimental feedback&#8221; does not seem very complex or difficult. I agree that it may be more robust to work out formal versions of those decision-making procedures in advance, prior to communicating the objective to an AI, rather than trying to bake them into the reflective process. However, if this is the only sticking point with this proposal, it seems to be an engineering difficulty which should be addressed in the future as the need arises.</p>
<p>Another possible complaint is that CEV is handicapped in a way that is advantageous, by preventing the individuals owning the AI from securing a personal advantage. By contrast, I am mostly interested in the transition to AI because I think it presents a unique opportunity for the influence of humane values to be diminished. From this perspective, I am happy with proposals which simply allow individuals to maintain control over AIs of their own construction, and I expect this would lead to a world with broadly similar prospects to our own. It may further be the case that a transition to AI could lead to massive inequality; I would prefer to tackle this as a separate (probably primarily political) problem. I am open to the possibility that technical work could help address that problem, but I don&#8217;t see a comparably good argument that we should be doing that technical work now. (If I wanted to work on that problem now, I would probably pursue policies aimed at promoting more equal ownership of capital, or policies aimed particularly at ensuring that the fruits of an AI-based productivity boost were broadly shared.) It may also be the case that one could use a transition to AI as an opportunity to achieve a substantial change in social values, but in general I would be quite hesitant about such proposals.</p>]]></content:encoded>
			<wfw:commentRss>http://ordinaryideas.wordpress.com/2014/07/21/approval-seeking/feed/</wfw:commentRss>
		<slash:comments>8</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/0a58766009bacac271106a2f92289d90?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">paulfchristiano</media:title>
		</media:content>
	</item>
		<item>
		<title>Adversarial collaboration</title>
		<link>http://ordinaryideas.wordpress.com/2014/07/18/adversarial-collaboration/</link>
		<comments>http://ordinaryideas.wordpress.com/2014/07/18/adversarial-collaboration/#comments</comments>
		<pubDate>Fri, 18 Jul 2014 15:44:26 +0000</pubDate>
		<dc:creator><![CDATA[paulfchristiano]]></dc:creator>
				<category><![CDATA[AI Safety]]></category>

		<guid isPermaLink="false">http://ordinaryideas.wordpress.com/?p=355</guid>
		<description><![CDATA[Suppose that I have hired a group of employees who are much smarter than I am. For some tasks it&#8217;s easy to get useful work out of them. For example, suppose I am interested in finding a good layout for the components on a chip and can easily evaluate the quality of a proposed layout. Then [&#8230;]]]></description>
				<content:encoded><![CDATA[<p>Suppose that I have hired a group of employees who are much smarter than I am. For some tasks it&#8217;s easy to get useful work out of them. For example, suppose I am interested in finding a good layout for the components on a chip and can easily evaluate the quality of a proposed layout. Then I can solicit proposals, test the proposals, and award the employees according to the results.</p>
<p>However, for some tasks there may be fundamental problems with such an incentive-based policy. For example, suppose I am interested in hiring these employees to design an AI which is smarter than any of us and will act autonomously in the world. In cases like this, I will have to try something different.</p>
<p><span id="more-355"></span></p>
<h1>The challenge</h1>
<p>In this case there is no clear way to evaluate the performance of my employees even <em>ex post</em>. I can build whatever AI they propose and test its performance in a range of environments. But these tests could never incentivize my employees to make an AI that shared <em>my</em> values, rather than one that shared <em>their</em> values: an AI that shared their values would be motivated to maximize the reward given to my employees&#8211;including looking <em>as if</em> it shared my values, as long as I was watching.</p>
<p>In this context, rather than blindly testing proposals I should understand them first, and reward my employees for making proposals which I can understand and which are clearly good. Even if I can&#8217;t think of a way to design a better AI, I may be able to recognize a good idea when I see one.</p>
<p>But this approach focuses my employees&#8217; ingenuity on finding proposals which will be <em>convincing to me</em> rather than <em>actually good</em>. I can try to remove any other incentives for my employees, but even then I should worry that most compelling ideas aren&#8217;t actually good.  This becomes increasingly problematic as the ideas become more complicated, and subtle problems become more and more likely.</p>
<p>Maybe we could just be so careful that it didn&#8217;t matter; we could hold a proposed idea to such a high standard that if it passes, it really must be good. In practice, this could require so much vigilance that it would be harder than coming up with the idea yourself.</p>
<h1>Adversarial collaboration</h1>
<p>Instead you could consult not one but two groups of employees: team A, whose goal is to provide an idea that you think is good, and team B, whose goal is to convince you that it isn&#8217;t. This might be a good idea even if you do have a clear performance metric, to help identify its possible shortcomings, efforts to game the system, or unintended consequences of an idea.</p>
<p>(Parts of this proposal are based on discussion with Katja Grace. If you haven&#8217;t figured it out by now, the whole thing is a metaphor for the AI control problem.)</p>
<p>The rest of this post will describe some issues with and elaborations on this proposal.</p>
<p><strong>Capabilities: </strong>Ideally team B would be smarter, know more, and have longer to deliberate than team A. How much smarter? The tradeoff depends on the relative costs of taking longer vs. incorrectly accepting a bad idea. If an idea is good enough, then we might still think it is good no matter how smart team B is.</p>
<p>In particular, it would be great if team B could know everything that team A knows, and could eavesdrop on as much of their thought process as possible. This would make it very difficult for team A to be deliberately misleading. In the case of machine intelligences, it might be possible for the two teams to literally share the same mental state.</p>
<p><strong>Weakening the critic: </strong>It may be <em>too easy</em> to convince someone that an idea is flawed. If team B is much smarter than team A, then they may always be able to find some reason that team A&#8217;s suggestion is suboptimal. I don&#8217;t think this is a big deal. We can:</p>
<ul>
<li>Allow several rounds of rebuttals, in which team A can defend their idea against criticisms, or can change their ideas to address flaws.</li>
<li>Avoid holding team A to <em>too</em> high a standard. Even if team B can propose an improvement to team A&#8217;s plan, at some point we can satisfy ourselves with a &#8220;good enough&#8221; plan even if it isn&#8217;t perfect.</li>
</ul>
<p>It&#8217;s worth remembering that team B might be motivated to try any trick at their disposal to convince us that the proposed idea is no good. They will claim that the problems are serious but too subtle to easily explain, will make desperate emotional appeals, will try to scare or threaten the implementer, and so on. To cope with this, we might call some outcomes a tie&#8212;we could refrain from using the idea, but still decline to give team B a reward.</p>
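<p>Putting these pieces together, here is a crude sketch of the protocol as described so far (entirely my own toy rendering; <code>propose</code>, <code>critique</code>, and <code>judge</code> are hypothetical stand-ins for team A, team B, and us):</p>
<pre><code># Toy sketch of adversarial collaboration with rebuttal rounds and ties.
def adversarial_review(propose, critique, judge, max_rounds=3):
    plan = propose(feedback=None)          # team A's initial proposal
    for _ in range(max_rounds):
        objection = critique(plan)         # team B attacks the plan
        verdict = judge(plan, objection)   # "accept", "reject", "tie", or "revise"
        if verdict == "accept":
            return plan, "reward team A"
        if verdict == "reject":
            return None, "reward team B"
        if verdict == "tie":               # e.g. B resorts to scare tactics
            return None, "no reward"       # drop the idea, pay nobody
        plan = propose(feedback=objection) # team A defends or amends
    return None, "no reward"

# Usage with trivial stand-ins:
print(adversarial_review(
    propose=lambda feedback=None: "plan v2" if feedback else "plan v1",
    critique=lambda plan: "objection to " + plan,
    judge=lambda plan, objection: "accept" if plan == "plan v2" else "revise",
))  # -> ("plan v2", "reward team A")
</code></pre>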
<p><strong>Problematic incentives: </strong>Ulterior motives for either team could create problems. Team B might go soft on an idea which they want us to accept (and, knowing this, team A might propose an idea that team B will like). So we should ensure the incentives are large enough that both teams care about winning much more than they care about whether we accept their ideas.</p>
<p>In particular, if we are going to do multiple rounds of back-and-forth, we should use independent teams. Team A shouldn&#8217;t be able to get more reward if their original plan is shot down. Instead we should hire a new team A&#8217;.</p>
<p><b>Preemptively defend against criticisms?: </b>we could set up the game in two extreme ways.</p>
<ul>
<li>Team A is rewarded when they propose a plan which sounds good. This may lead to many rounds of rejections, as team A constantly makes good-sounding but flawed proposals, since team A has no motivation to fix hard-to-spot flaws.</li>
<li>Team A&#8217;s goal is to make a plan which survives scrutiny.</li>
</ul>
<p>The first option is better, if it works. If team B can&#8217;t find any holes in a generic idea that sounds good to us, we should feel pretty comfortable. We should be (slightly) less comfortable if team A engineered their plan so that team B couldn&#8217;t explain its problems to us.</p>
<p>One could iterate the first option several times, with each round leaving us increasingly savvy to possible flaws in team A&#8217;s plan. Eventually this starts to run into the same concerns with overfitting.</p>
<p><strong>Team C: </strong>We could try to capture the best of both worlds by introducing a third team C, which is better-equipped than either team A or team B and operates in one of two ways:</p>
<ul>
<li>Team A attempts to produce plans that will survive criticism by team B. After a proposal survives criticism by team B, team C enters and attempts to identify further flaws. Team A does not care whether or not its plan is overturned by criticism by team C.</li>
<li>Team A attempts to produce plans that will be superficially plausible, without concern for whether they will survive criticism by B. Before team A makes a suggestion, team C tries to provide coaching which minimizes our probability of inadvertently accepting a &#8220;bad&#8221; plan. That is, team C is penalized if we accept a plan from team A which is later overturned by criticism from team B.</li>
</ul>
<p><strong>Other mechanisms: </strong>this approach could be combined with other mechanisms designed to get useful work out of much smarter employees. For example:</p>
<ul>
<li>Bootstrapping: ask smart employees how to build more effective systems for getting information out of smart employees.</li>
<li>Throttling: work with employees who are smarter than you but only by the minimum amount necessary to get the job done.</li>
<li>Restricting the range of proposals: limit team A&#8217;s options, and stick with what we can most easily understand.</li>
<li>Self-verifying proposals: ask for ideas that help verify their own desirability. (Some ideas can fail in subtle ways; others are demonstrably bad if they are bad at all. Encourage team A to look for the second kind of idea.)</li>
<li>Ex post verification: we can give out the reward only after we have seen an idea in action. This ex post verification can be assisted by team B, who can help us design good tests and think of possible shortcomings we wouldn&#8217;t have looked for.</li>
</ul>
<p>None of the ideas in this post are silver bullets. And if they were needed, the would-be employer would hopefully spend more time thinking about the problem than I have&#8211;and they would have more contextual information.</p>
<p>The point of this post is to help flesh out our understanding of how hard it is to delegate high-stakes problems to very smart, very cutthroat reward-seekers. The bigger goal is to better understand how hard the AI control problem is (and where the largest difficulties lie).</p>]]></content:encoded>
			<wfw:commentRss>http://ordinaryideas.wordpress.com/2014/07/18/adversarial-collaboration/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/0a58766009bacac271106a2f92289d90?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">paulfchristiano</media:title>
		</media:content>
	</item>
		<item>
		<title>The motivated simulator argument</title>
		<link>http://ordinaryideas.wordpress.com/2012/05/24/solomonoff-induction-and-simulations/</link>
		<comments>http://ordinaryideas.wordpress.com/2012/05/24/solomonoff-induction-and-simulations/#comments</comments>
		<pubDate>Thu, 24 May 2012 16:56:22 +0000</pubDate>
		<dc:creator><![CDATA[paulfchristiano]]></dc:creator>
				<category><![CDATA[AI Safety]]></category>
		<category><![CDATA[Formal Definitions]]></category>
		<category><![CDATA[Priors]]></category>
		<category><![CDATA[Solomonoff Induction]]></category>

		<guid isPermaLink="false">http://ordinaryideas.wordpress.com/?p=214</guid>
		<description><![CDATA[Sometimes we may want to formally talk about objects that actually exist in the world, using mathematical language. One way to try to do this is by recording some sequence of observations about the world, and then applying Solomonoff induction. The hope would then be: if you apply Solomonoff induction to the sequence of things [&#8230;]]]></description>
				<content:encoded><![CDATA[<p>Sometimes we may want to formally talk about objects that actually exist in the world, using mathematical language. One way to try to do this is by recording some sequence of observations about the world, and then applying Solomonoff induction. The hope would then be: if you apply Solomonoff induction to the sequence of things you&#8217;ve seen so far, it will correctly predict whatever you next see. In this post I&#8217;ll describe a problem with applying this approach to anything particularly important.</p>
<p><span id="more-214"></span></p>
<p>If we look at the <i>simplest</i> programs that account for your observations, once you&#8217;ve observed enough things (say, many more observations than there are bits in the description complexity of the universe), Solomonoff induction finds predictors of the form, &#8220;Simulate the universe with description D. Then look inside that universe, at the spacetime trajectory T, and return the bits encoded along that trajectory (using some encoding protocol P).&#8221; Thus, Solomonoff induction approximately reduces to what is called <i>counting measure</i> over observers within our own universe&#8212;the probability that you will next observe O is roughly the fraction of observers who have seen everything you&#8217;ve seen so far and next observe O. (This isn&#8217;t quite right, because observers later in time are harder to specify&#8212;it takes more bits to write down their spacetime coordinates&#8212;and so they get slightly discounted. There are a few other issues like this, but they don&#8217;t change the basic picture.)</p>
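<p>For concreteness, here is the standard definition being invoked (my notation): for a universal prefix machine U, the universal prior weight of a sequence x is</p>
<pre><code>M(x) = \sum_{p \,:\, U(p) \text{ outputs a sequence beginning with } x} 2^{-|p|}
</code></pre>
<p>so each predictor of the form above contributes a term of roughly 2^(-(|D| + |T| + |P|)), where |D|, |T|, and |P| are the description lengths of the universe, the trajectory, and the encoding protocol.</p>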
<p>But if you apply Solomonoff induction, and you plan to <em>use the result of this computation for something <a href="http://ordinaryideas.wordpress.com/2012/04/21/indirect-normativity-write-up/">really important</a></em>, most of the probability mass is actually controlled by sequences which have been constructed artificially&#8211;either within a simulation (see the<a href="http://www.simulation-argument.com/"> usual simulation argument</a>) or just as a sequence of inputs written down in really big letters over and over again, for the precise purpose of manipulating the counting measure. So, for example, if I build an AI and try to incentivize it by saying &#8220;if you do the right thing, I will provide a reward,&#8221; and if that AI uses approximate Solomonoff induction, <i>and</i> if the AI was positioned to steal 1% of the universe by defecting (though we&#8217;ve set up things so cleverly that it would first get punished if it tried)&#8230; <i>then</i> we might expect some future folk to control that AI by building simulations of it, and rewarding the simulations if and only if the AI tries to defect and give its 1% of the world to those manipulative future folk. (Who could then&#8212;perhaps?&#8212;use this 1% of the world to manipulate the original AI.)</p>
<p>It&#8217;s a wacky situation, but I think there is an interesting phenomenon at its kernel. In this post I&#8217;m just going to try to argue that this is, in fact, what would happen. To get a handle on things, I&#8217;m going to start by looking at similar situations in <a href="http://ordinaryideas.wordpress.com/2011/12/14/cellular-automata/">cellular automata</a>.</p>
<h2>Deterministic Cellular Automata</h2>
<p>Suppose that we have some (modestly complex) cellular automata, which contains some observers doing science. The observers want to find a formal mathematical model for their physics (and, say, use it to direct an AI). So they take a sensor, put it at rest in the cellular automata&#8217;s preferred rest frame, let it sense for a while, and then condition the universal prior on agreement with the sensor&#8217;s inputs. Then they ask: what does that distribution predict the sensor will sense next?</p>
<p>What do they hope for? They hope that the distribution will predict whatever the sensor &#8220;actually&#8221; senses. So that, for example, if I gave an AI the task of making sure the future inputs on the sensor recorded us pressing an APPROVE button, the AI would really do something that would cause us to press APPROVE.</p>
<p>Suppose the sensor&#8217;s input is X. What short programs output X? Well, if the cellular automaton is relatively simple, one simple program will be P, defined as follows. P has the laws of physics, a mechanism for translating the sensor&#8217;s raw physical state into bits, the position of the sensor and the start time hard-coded into it, and P works by running the laws of physics, looking in the spacetime locations specified, and running the specified function to transform the sensor&#8217;s physical state into bits. If the sensor is particularly simple&#8211;for example, if it runs on a regular clock, and if there is a particular cell of the cellular automata which perfectly correlates with the sensor&#8217;s reading&#8211;then P is an extremely simple program, and generally P is much simpler than any &#8220;local&#8221; algorithm for predicting X unless X is very short or the sensor is very well isolated from the environment.</p>
<p>But for any other compactly specified sequence S of spacetime points, there is a different program Q(S) which outputs the sequence of observations embedded at those points. The programs Q(S), P among them, all have roughly comparable complexity; programs with larger coordinates are slightly more complicated, but not much more so&#8211;doubling the position of a point only increases its description complexity by a constant, decreasing the relevance of that point&#8217;s contributions by only a constant factor.</p>
<p>So the part of the universal distribution made up of the programs Q(S) is roughly a counting measure over observers within the cellular automata (time and space-discounted by something <em>very slightly</em> superlinear, just enough to make the sum converge). But most of the counting measure really does lie within simulations. And if someone has an incentive to manipulate such simulations, or to simply directly embed the sequence X in the universe, then <em>their motivations</em> will control the predicted continuation of X under the universal prior.</p>
<p>It seems that this may be a deal-breaker when considering the universal prior for any application of this form. If you build an AI that wants the approve button to be pressed and you use this technique to define its utility function, you aren&#8217;t going to get an agent that cares about the &#8220;real&#8221; world. Instead, you will get an agent that cares about all of the simulators which have motivations to mess with it. Maybe that leads to good outcomes&#8211;it just depends on whether values we like control most of the available computational resources (and how much of their resources would be wasted fighting these sorts of fights).</p>
<h4>Complications</h4>
<p>If the sensor in question is harder to specify&#8211;whether because it is traveling, or timed or interpreted inconsistently&#8211;this technically makes the above argument stronger. After all, then P is contributing less to the universal prior.</p>
<p>However, it may be that <em>every</em> embedding of X in the universe is rather hard to locate, so that even an incentivized agent cannot really write down X in a way that can be picked out by a simple program. If these difficulties were severe enough, then programs Q(S) may collectively not contribute much to the measure of X. Instead, most of the mass may come from simpler &#8220;local&#8221; prediction algorithms, like those humans use to predict/compress their environment, which are ignorant about the exact state of the rest of the universe but fill in the details by guessing.</p>
<p>But this seems unlikely. The basic problem is that the difficulty of picking out an input sequence seems <em>tiny</em> compared to the difficulty of filling in all of the unknown details of the universe. The extra complexity of specifying timing info, movements of the substrate in which X is written, and whatever messy algorithm is necessary to interpret X, seem like they must be at most a modest constant (independent of length of X); once the length of X is significantly larger than that constant, the losses from guesswork will greatly exceed this constant, unless the sensor is very well isolated from the rest of the environment. Intuitively, the sensor is (at a minimum) well enough connected to the environment if its observations are enough to infer many properties of the outside world (for example, if the sensor saw enough to infer something about an observer&#8217;s knowledge of cosmology, this would give an advantage to models at large enough scales to account for those cosmological features).</p>
<h2>Randomized Automata</h2>
<p>For randomized automata the case is rather different. In order to pick out a sequence of observations, generically, we need to provide not only coordinates but also the entire random seed. The length of the necessary random seed is not just long, it&#8217;s much longer than any input sequence X which we might plausibly extract from the world (and more randomness is injected at each step, which guarantees that you always need more randomness than you get out as information). So each program Q(S) is contributing rather little to the mass of X under the universal prior: roughly, the probability that the sequence of observations at S would be equal to X.</p>
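<p>In the notation introduced above (again my rendering): a program that hard-codes a seed R contributes only about</p>
<pre><code>2^{-(|D| + |T| + |P| + |R|)}
</code></pre>
<p>to the mass of X, and |R| grows with the simulated spacetime volume while |X| does not; a program that instead samples the randomness has its weight scaled down by the probability that the sampled observations happen to equal X.</p>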
<p>I strongly suspect, however, that we reach the same conclusions in the randomized case as in the deterministic one.</p>
<p>What simple programs other than Q(S) explain X? As above, the natural candidates are coarse &#8220;local&#8221; explanations, which may maintain a detailed model for the local environment, but not the rest of the universe. But it seems unlikely that the randomness of the underlying cellular automata really favors such explanations. Indeed, consider some local observations distributed under the real cellular automata. There are two cases: (1) the distribution of observations under the local predictor is the same. In this case, the local predictor is losing out for the same reason that local predictors lose out in deterministic phenomena. (2) The distribution of observations under the local predictor is different. In this case, by definition, the average log probability assigned by the real distribution is better than that of the local predictor, so the local predictor does worse than in case (1).</p>
<p>What other programs might there be? I don&#8217;t know; I can&#8217;t think of any. Generically it will be impossible to rule out the possibility of some other, still shorter, algorithms to account for our observations, but in this case I don&#8217;t know why we would expect one to exist.</p>
<p>I think the sense that something tricky is going on here (at least, I have that sense) is coming from different intuitions being applied to &#8220;coordinates&#8221; in some exponentially large probability distribution vs. uncertainty in an algorithm like the ones humans use. Of course the two are precisely equivalent, but it feels like locating myself in a universe parametrized by an astronomically large random seed is very hard, while it feels like my local prediction algorithm is pretty good.</p>
<p>The case for our universe seems to be broadly analogous to that of a general randomized cellular automaton. Yes, we are less sure about how we might encode a program which extracts observation sequences from the world, but it would be surprising if these obstructions were insurmountable, for the same reasons as given in &#8216;Complications&#8217; above.</p>
<h2>Avoiding the Trouble</h2>
<p>The obvious way to avoid this sort of thing is to avoid the universal prior. I mentioned <a href="http://ordinaryideas.wordpress.com/2011/12/21/avoiding-simulation-warfare-with-bounded-complexity-measures/">before</a> the possibility of using a prior which penalizes algorithms which use a lot of time or a lot of space. This seems to rule out these particular problematic models, and force you to use a coarse local model. It is probably worth thinking a bit about how those coarse local models generically perform (say, in cellular automata).</p>
<p>It seems like there should be a more philosophically sophisticated way to point to &#8220;basements&#8221; than this, though; using space-bounded models feels like a bit of a hack, and throws out a lot of information. What we really care about, in general, is some measure of &#8220;generic&#8221; decision-theoretic relevance (or relevance to our personal values). We care about non-simulated copies of ourselves because they have much more influence on the world. Maybe there is some way to access this measure directly. Alternatively, it may be possible to bargain with agents as a way of explicitly eliciting predictions with respect to this measure&#8212;maybe you can&#8217;t define what &#8220;important&#8221; means directly, but you can define it implicitly by making sure those agents also care more about worlds which aren&#8217;t simulated. But this all seems to be surprisingly difficult. I think there is a very small probability that this captures a rather serious obstruction to defining things formally. (Perhaps it is easy to define something well enough when no one is looking, but once anyone cares whatever definition you use is going to get manipulated. If something like this were really the case, it would create systems with very exotic failure modes.)</p>]]></content:encoded>
			<wfw:commentRss>http://ordinaryideas.wordpress.com/2012/05/24/solomonoff-induction-and-simulations/feed/</wfw:commentRss>
		<slash:comments>8</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/0a58766009bacac271106a2f92289d90?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">paulfchristiano</media:title>
		</media:content>
	</item>
		<item>
		<title>Oracles</title>
		<link>http://ordinaryideas.wordpress.com/2012/04/25/oracles/</link>
		<comments>http://ordinaryideas.wordpress.com/2012/04/25/oracles/#comments</comments>
		<pubDate>Wed, 25 Apr 2012 23:40:29 +0000</pubDate>
		<dc:creator><![CDATA[paulfchristiano]]></dc:creator>
				<category><![CDATA[AI Safety]]></category>

		<guid isPermaLink="false">http://ordinaryideas.wordpress.com/?p=155</guid>
		<description><![CDATA[Suppose that we developed software oracles which could apply large amounts of computational power to solving any formally specified problem (say, you could pay $1k for a human-equivalent). For example, such oracles could find good moves in games which could be characterized completely, or prove theorems in areas which could be rigorously axiomatized, or design [&#8230;]]]></description>
				<content:encoded><![CDATA[<p>Suppose that we developed software oracles which could apply large amounts of computational power to solving any formally specified problem (say, you could pay $1k for a human-equivalent). For example, such oracles could find good moves in games which could be characterized completely, or prove theorems in areas which could be rigorously axiomatized, or design machines to achieve a formally specified goal in physical systems which can be modeled algorithmically. What would happen, when these oracles became widely available?<span id="more-155"></span></p>
<p>The situation may be somewhat analogous to the development of computers themselves, which are able to apply astronomical amounts of computational power to executing precisely specified protocols. Having developed such tools, humanity did not find all of its old problems immediately resolved. Instead, describing what we want done in complete and rigorous detail has become one of our primary economic activities. Human labor is now reserved for implementing protocols that we can&#8217;t describe precisely, which turns out to be many of them (including in particular programming itself).</p>
<p>I expect that if an oracle became widely available, phrasing questions of interest algorithmically would present a similar bottleneck in most applications. You might hope that science and engineering would immediately be mostly automated, but a large part of both enterprises involves dealing with systems which we can&#8217;t yet characterize exactly, or which would require substantial human labor to characterize exactly. I would be surprised if you could ask algorithmically precise questions that would eliminate humans&#8217; role in designing better DNA synthesis, for example. Instead, you would probably get substantial but not unprecedented productivity gains in these domains, and a lot of human effort would be redirected from problem-solving to problem-defining.</p>
<p>Questions which could already be formulated precisely would naturally benefit most from the availability of oracles. As mentioned, some areas of mathematics and computer science would benefit enormously, while natural sciences and engineering would lag behind. Other areas, such as policy, management, and governance, would benefit even less.</p>
<p>However, there are some general techniques for turning a problem humans want solved into an algorithmic question, which have no analog in the case of computers, and in this hypothetical I expect they would be broadly utilized. Though we would have a very hard time formally describing models for systems we care about (a prerequisite to formally expressing many of the problems we want solved), we can much more easily describe distributions over possible models which <em>include</em> good models for the real world, such as the universal prior. We can formally define a model simply by saying: start with some universal distribution over models and then update that distribution based on these observations of some system we care about. Once we have defined a model in this way, it is possible to define tasks formally with respect to this model. For example, if we consider models which take some inputs to be fed to a manipulator, and output not only observations but also reward signals, we can express the task &#8220;find inputs to this model which maximize the expected future reward.&#8221; Because we don&#8217;t understand the internal structure of the models produced by this process, it would take much more problem-formalizing labor to express a non-reward based task (and in general it isn&#8217;t clear how to do it.)</p>
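<p>A toy rendering of this recipe (my sketch: the model class, complexity weights, and rewards are made up, and a real version would use the universal prior over all programs rather than a two-element list):</p>
<pre><code># Toy version of "condition a simplicity prior on observations, then ask
# the oracle for the reward-maximizing input". Everything here is a
# hypothetical stand-in for the (uncomputable) universal-prior version.

# Each model maps an input to (observation, reward); weight ~ 2^-complexity.
MODELS = [
    {"predict": lambda inp: (inp % 2, 1.0 if inp == 3 else 0.0), "complexity": 2},
    {"predict": lambda inp: (1, 1.0 if inp == 0 else 0.0), "complexity": 5},
]

def posterior(models, observed):
    # Zero out models inconsistent with the (input, observation) history,
    # then renormalize the 2^-complexity weights.
    weights = [2.0 ** -m["complexity"]
               if all(m["predict"](i)[0] == o for i, o in observed) else 0.0
               for m in models]
    total = sum(weights)
    return [w / total for w in weights]

def best_input(models, observed, candidates):
    probs = posterior(models, observed)
    def expected_reward(inp):
        return sum(p * m["predict"](inp)[1] for p, m in zip(probs, models))
    return max(candidates, key=expected_reward)

print(best_input(MODELS, observed=[(3, 1)], candidates=[0, 1, 2, 3]))  # -> 3
</code></pre>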
<p>This is an extremely powerful technique, and allows us to apply our formal oracles to a much broader class of systems than we otherwise could.</p>
<p>Unfortunately, oracles being used in this way are very powerful for engineering and scientific applications (and perhaps even for management/policy/governance) but, if human-level, they are extremely dangerous. Traditional arguments about AI risk apply to <em>exactly</em> this sort of system. Maybe there are other techniques which are similarly powerful/general, don&#8217;t produce this sort of goal-oriented agent, and would be more broadly applied by virtue of their safety. I don&#8217;t know of any. It looks to me like there is a reasonable chance that most of the power would quickly shift to oracles being used as parts of goal-oriented agents, and that eventually this would lead to trouble.</p>
<p>If such oracles became available, the best case scenario is probably that they would be used to develop more robustly beneficial technologies (either a different sort of AI, or human brain emulations). This seems like it would either require a long lead time for a conscientious project, or else a shorter lead time and a clear understanding of how to do this bootstrapping. Broadly, I can imagine three approaches.</p>
<ol>
<li>Create goal-oriented agents out of oracles, and engineer environments in which those agents will tend to cooperate with humans.</li>
<li>Apply oracles to problems which can be formally defined using techniques we currently understand. Figure out how to use the solutions to this narrower class of problems to bootstrap up to safe AI or emulations.</li>
<li>Discover some other general techniques for applying oracles to real-world problems.</li>
</ol>
<p>All three approaches seem worth thinking about. I&#8217;m going to make a few posts exploring (1), which currently looks like the most promising. Thinking about (2) in advance is a little harder, because your ability to interact adaptively with the oracle is probably useful. But I haven&#8217;t seen any serious suggestions even for the first steps of such a program, and that would certainly be valuable. (&#8220;Wing it&#8221; doesn&#8217;t seem like a good solution, given the potential instability of the situation.) I don&#8217;t know about (3)&#8211;it seems a little unlikely that using an oracle to build a utility maximizer is the only way to automatically formalize a generic real-world problem, but I don&#8217;t know of any approaches that aren&#8217;t variations on this theme.</p>
<p>Note that if many oracles are being effectively employed serving humans&#8217; interests, then the existence of a few oracles trying to be destructive may not be problematic (since they lack any epistemic advantage over the oracles being usefully exploited&#8211;this would only be problematic in universes where offense fundamentally outpaces defense, which seems plausible to me but is another question). The problem is that there aren&#8217;t obviously any safe ways to use oracles without handicapping them or limiting their applicability substantially.</p>]]></content:encoded>
			<wfw:commentRss>http://ordinaryideas.wordpress.com/2012/04/25/oracles/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/0a58766009bacac271106a2f92289d90?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">paulfchristiano</media:title>
		</media:content>
	</item>
		<item>
		<title>A formalization of indirect normativity</title>
		<link>http://ordinaryideas.wordpress.com/2012/04/21/indirect-normativity-write-up/</link>
		<comments>http://ordinaryideas.wordpress.com/2012/04/21/indirect-normativity-write-up/#comments</comments>
		<pubDate>Sat, 21 Apr 2012 02:51:12 +0000</pubDate>
		<dc:creator><![CDATA[paulfchristiano]]></dc:creator>
				<category><![CDATA[AI Safety]]></category>
		<category><![CDATA[Formal Definitions]]></category>

		<guid isPermaLink="false">http://ordinaryideas.wordpress.com/?p=121</guid>
		<description><![CDATA[This post outlines a formalization of what Nick Bostrom calls &#8220;indirect normativity.&#8221; I don&#8217;t think it&#8217;s an adequate solution to the AI control problem; but to my knowledge it was the first precise specification of a goal that meets the &#8220;not terrible&#8221; bar, i.e. which does not lead to terrible consequences if pursued without any caveats [&#8230;]<img alt="" border="0" src="http://pixel.wp.com/b.gif?host=ordinaryideas.wordpress.com&#038;blog=30479764&#038;post=121&#038;subd=ordinaryideas&#038;ref=&#038;feed=1" width="1" height="1" />]]></description>
				<content:encoded><![CDATA[<p dir="ltr">This post outlines a formalization of what Nick Bostrom calls &#8220;indirect normativity.&#8221; I don&#8217;t think it&#8217;s an adequate solution to the AI control problem; but to my knowledge it was the first precise specification of a goal that meets the &#8220;not terrible&#8221; bar, i.e. which does not lead to terrible consequences if pursued without any caveats or restrictions.<span id="more-121"></span></p>
<h1>Introduction</h1>
<p dir="ltr">When faced with the challenge of writing down precise moral principles, adhering to the standards demanded in mathematics, moral philosophers encounter two serious difficulties:</p>
<ul>
<li>Basic notions, like “agent,” “act,” and “motive” are themselves almost inherently imprecise. More complex concepts like “well-being,” “possible world,” and “honesty” are even less approachable from a formal perspective.</li>
<li>Even expressed in terms of these notions, satisfactory moral principles have proven incredibly elusive. Committing ourselves to any simple, exceptionless theory has proved incompatible with basic ethical intuitions.</li>
</ul>
<p dir="ltr">In light of these difficulties, a moral philosopher might simply declare: “It is not my place to aspire to mathematical standards of precision. Ethics as a project inherently requires shared language, understanding, and experience; it becomes impossible or meaningless without them.”</p>
<p>This may be a defensible philosophical position, but unfortunately the issue is not entirely philosophical. In the interest of building institutions or machines which reliably pursue what we value, we may one day be forced to describe precisely “what we value” in a way that does not depend on charitable or “common sense” interpretation (in the same way that we today must describe “what we want done” precisely to computers, often with considerable effort). If some aspects of our values cannot be described formally, then it may be more difficult to use institutions or machines to reliably satisfy them. This is not to say that describing our values formally is necessary to satisfying them, merely that it might make it easier.</p>
<p>Since we are focusing on finding any precise and satisfactory moral theory, rather than resolving disputes in moral philosophy, we will adopt a consequentialist approach without justification and focus on axiology. Moreover, we will begin from the standpoint of expected utility maximization, and leave aside questions about how or over what space the maximization is performed.</p>
<p>We aim to mathematically define a utility function U such that we would be willing to build a hypothetical machine which exceptionlessly maximized U, possibly at the catastrophic expense of any other values. We will assume that the machine has an ability to reason which at least rivals that of humans, and is willing to tolerate arbitrarily complex definitions of U (within its ability to reason about them).</p>
<p>We adopt an indirect approach. Rather than specifying what exactly we want, we specify a process for determining what we want. This process is extremely complex, so that any computationally limited agent will always be uncertain about the process’ output. However, by reasoning about the process it is possible to make judgments about which action has the highest expected utility in light of this uncertainty.</p>
<p>For example, I might adopt the principle: “a state of affairs is valuable to the extent that I would judge it valuable after a century of reflection.” In general I will be uncertain about what I would say after a century, but I can act on the basis of my best guesses: after a century I will probably prefer worlds with more happiness, and so today I should prefer worlds with more happiness. After a century I have only a small probability of valuing trees’ feelings, and so today I should avoid hurting them only if it is either instrumentally useful or extremely easy. As I spend more time thinking, my beliefs about what I would say after a century may change, and I will start to pursue different states of affairs even though the formal definition of my values is static. Similarly, I might desire to think about the value of trees’ feelings, if I expect that my opinions are unstable: if I spend a month thinking about trees, my current views will then be a much better predictor of my views after a hundred years, and if I know better whether or not trees’ feelings are valuable, I can make better decisions.</p>
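<p>To make this concrete, here is a minimal sketch (in Python, purely illustrative&#8211;the credences, worlds, and actions below are invented for the example, not part of the proposal) of how an agent can act on beliefs about a reflective process whose exact output it cannot compute:</p>
<pre><code># Sketch: acting on a "best guess" about one's judgment-on-reflection.
# The credences and candidate utility functions are assumptions made
# up for illustration.

judgments = [
    # (probability I assign, utility function I would endorse after reflection)
    (0.9, lambda world: world["happiness"]),
    (0.1, lambda world: world["happiness"] + world["tree_welfare"]),
]

def expected_utility(world):
    """Best available estimate of U(world), given uncertainty about
    the output of the (uncomputably expensive) reflective process."""
    return sum(p * u(world) for p, u in judgments)

# Hypothetical actions and the worlds they lead to.
worlds = {
    "plant_trees": {"happiness": 1.0, "tree_welfare": 0.5},
    "pave_forest": {"happiness": 1.1, "tree_welfare": 0.0},
}
best_action = max(worlds, key=lambda a: expected_utility(worlds[a]))
</code></pre>
<p>As beliefs about the reflective process change (say, after a month spent thinking about trees), the credences update and the preferred action can change, even though the formal definition of the values is static.</p>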
<p>This example is quite informal, but it communicates the main idea of the approach. We stress that the value of our contribution, if any, is in the possibility of a precise formulation. (Our proposal itself will be relatively informal; instead it is a description of how you would arrive at a precise formulation.) The use of indirection seems to be necessary to achieve the desired level of precision.</p>
<h1>The Proposal</h1>
<p dir="ltr">Our proposal contains only two explicit steps:</p>
<ol>
<li>Obtain a precise mathematical characterization of a particular human’s brain.</li>
<li>Precisely define a completely abstract environment containing an idealized and unbounded computer. Adopt whatever utility function the human would decide on, if allowed free rein in this abstract environment. (With the technical modification that the utility is restricted to take on values between 0 and 1.)</li>
</ol>
<p dir="ltr">Each of these steps requires substantial elaboration, but we must also specify what we expect the human to do with these tools.</p>
<ol start="3">
<li>Build a community of copies of herself and an idealized environment for interaction. Deliberate on critical scientific and philosophical questions, and decide how to proceed (possibly in the spirit of what follows).</li>
<li>Search through the space of all possible worlds to find a simulation of our own. By interacting with this simulation, incorporate simulations of many living humans into her environment.</li>
<li>Continue to make scientific and philosophical progress, as safely as possible. Possibly engage in principled self-modification or understand and build machine intelligences which reflect relevant ethical principles.</li>
<li>Output a utility function. This function may be a compact set of ethical principles, or it may require many simulated humans or machine intelligences to apply their own (radically modified) common sense to evaluate states of affairs on a case-by-case basis.</li>
</ol>
<p dir="ltr">This proposal is best understood in the context of other fantastic-seeming proposals, such as “my utility is whatever I would write down if I reflected for a thousand years without interruption or biological decay.” The counterfactual events which take place within the definition are far beyond the realm our intuition recognizes as “realistic,” and have no place except in thought experiments. But to the extent that we can reason about these counterfactuals and change our behavior on the basis of that reasoning (if so motivated), we can already see how such fantastic situations could affect our more prosaic reality.</p>
<p>The remainder of this document consists of brief elaboration of some of these steps, and a few arguments about why this is a desirable process.</p>
<h1>Brain Emulation</h1>
<p dir="ltr">The first step of our proposal is a high-fidelity mathematical model of human cognition. We will set aside philosophical troubles, and assume that the human brain is a purely physical system which may be characterized mathematically. Even granting this, it is not clear how we can realistically obtain such a characterization.</p>
<p>The most obvious approach to characterizing a brain is to combine measurements of its behavior or architecture with an understanding of biology, chemistry, and physics. This project represents a massive engineering effort which is currently just beginning. Most pessimistically, our proposal could be postponed until this project’s completion. This could still be long before the mathematical characterization of the brain becomes useful for running experiments or automating human activities: because we are interested only in a definition, we do not care about having the computational resources necessary to simulate the brain.</p>
<p>An impractical mathematical definition, however, may be much easier to obtain. We can define a model of a brain in terms of exhaustive searches which could never be practically carried out. For example, given some observations of a neuron, we can formally define a brute force search for a model of that neuron. Similarly, given models of individual neurons we may be able to specify a brute force search over all ways of connecting those neurons which account for our observations of the brain (say, some data acquired through functional neuroimaging).</p>
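<p>As a sketch of what such a definitional brute-force search might look like (Python; <code>enumerate_programs</code> and the observation data are placeholders, and the search is of course computationally infeasible&#8211;what matters is only that it is well-defined):</p>
<pre><code># Sketch: defining "a model of the neuron" as the set of all programs
# that reproduce the recorded observations to within measurement error.

def consistent(model, observations, tolerance=1e-3):
    """A candidate model is acceptable if it reproduces every recorded
    stimulus/response pair to within the measurement tolerance."""
    return all(abs(model(stimulus) - response) <= tolerance
               for stimulus, response in observations)

def acceptable_models(enumerate_programs, observations):
    """The (formally defined, never actually enumerable) collection of
    all programs agreeing with the data."""
    for program in enumerate_programs():
        if consistent(program, observations):
            yield program
</code></pre>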
<p>It may be possible to carry out this definition without exploiting any structural knowledge about the brain, beyond what is necessary to measure it effectively. By collecting imaging data for a human exposed to a wide variety of stimuli, we can recover a large corpus of data which must be explained by any model of a human brain. Moreover, by using our explicit knowledge of human cognition we can algorithmically generate an extensive range of tests which identify a successful simulation, by probing responses to questions or performance on games or puzzles.</p>
<p>In fact, this project may be possible using existing resources. The complexity of the human brain is not as unapproachable as it may at first appear: though it may contain 10<sup>14</sup> synapses, each described by many parameters, it can be specified much more compactly. A newborn’s brain can be specified by about 10<sup>9</sup> bits of genetic information, together with a recipe for a physical simulation of development. The human brain appears to form new long-term memories at a rate of 1-2 bits per second, suggesting that it may be possible to specify an adult brain using 10<sup>9</sup> additional bits of experiential information. This suggests that it may require only about 10<sup>10</sup> bits of information to specify a human brain, which is at the limits of what can be reasonably collected by existing technology for functional neuroimaging.</p>
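<p>The arithmetic behind this estimate, taking the figures above at face value (the 30-year lifespan is an assumption added here for concreteness):</p>
<pre>
30 \text{ yr} \times 3\times 10^{7} \text{ s/yr} \approx 10^{9} \text{ s}, \qquad
10^{9} \text{ s} \times (1\text{--}2) \text{ bits/s} \approx 10^{9} \text{ bits of experience}
</pre>
<p>Adding the roughly 10<sup>9</sup> bits of genetic information gives the quoted total of about 10<sup>10</sup> bits.</p>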
<p>This discussion has glossed over at least one question: what do we mean by ‘brain emulation’? Human cognition does not reside in a physical system with sharp boundaries, and it is not clear how you would define or use a simulation of the “input-output” behavior of such an object.</p>
<p>We will focus on some system which does have precisely defined input-output behavior, and which captures the important aspects of human cognition. Consider a system containing a human, a keyboard, a monitor, and some auxiliary instruments, well-insulated from the environment except for some wires carrying inputs to the monitor and outputs from the keyboard and auxiliary instruments (and wires carrying power). The inputs to this system are simply screens to be displayed on the monitor (say delivered as a sequence to be displayed one after another at 30 frames per second), while the outputs are the information conveyed from the keyboard and the other measuring apparatuses (also delivered as a sequence of data dumps, each recording activity from the last 30th of a second).</p>
<p>This “human in a box” system can be easily formally defined if a precise description of a human brain and coarse descriptions of the human body and the environment are available. Alternatively, the input-output behavior of the human in a box can be directly observed, and a computational model constructed for the entire system. Let H be a mathematical definition of the resulting (randomized) function from input sequences (In(1), In(2), …, In(K)) to the next output Out(K). H is, by design, a good approximation to what the human “would output” if presented with any particular input sequence.</p>
<p>Using H, we can mathematically define what “would happen” if the human interacted with a wide variety of systems. For example, if we deliver Out(K) as the input to an abstract computer running some arbitrary software, and then define In(K+1) as what the screen would next display, we can mathematically define the distribution over transcripts which would have arisen if the human had interacted with the abstract computer. This computer could be running an interactive shell, a video game, or a messaging client.</p>
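<p>A minimal sketch of this wiring (Python; H and the software are assumed to be given as functions, per the definitions above):</p>
<pre><code># Sketch: the transcript of H interacting with abstract software.
# H maps an input history (In(1), ..., In(K)) to the next output Out(K);
# `software` maps the outputs so far to the next screen In(K+1).

def transcript(H, software, steps):
    """Formally define the (randomized) transcript of `steps` rounds of
    interaction, each round lasting 1/30 of a second."""
    inputs, outputs = [], []
    screen = software(outputs)        # In(1): the initial screen
    for _ in range(steps):
        inputs.append(screen)
        outputs.append(H(inputs))     # Out(K) from (In(1), ..., In(K))
        screen = software(outputs)    # In(K+1): the screen's response
    return inputs, outputs
</code></pre>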
<p>Note that H reflects the behavior of a particular human, in a particular mental state. This state is determined by the process used to design H, or the data used to learn it. In general, we can control H by choosing an appropriate human and providing appropriate instructions / training. More emulations could be produced by similar measures if necessary. Using only a single human may seem problematic, but we will not rely on this lone individual to make all relevant ethical judgments. Instead, we will try to select a human with the motivational stability to carry out the subsequent steps faithfully, which will define U using the judgment of a community consisting of many humans.</p>
<p>This discussion has been brief and has necessarily glossed over several important difficulties. One difficulty is the danger of using computationally unbounded brute force search, given the possibility of short programs which exhibit goal-oriented behavior. Another difficulty is that, unless the emulation project is extremely conservative, the models it produces are not likely to be fully-functional humans. Their thoughts may be blurred in various ways, they may be missing many memories or skills, and they may lack important functionalities such as long-term memory formation or emotional expression. The scope of these issues depends on the availability of data from which to learn the relevant aspects of human cognition. Realistic proposals along these lines will need to accommodate these shortcomings, relying on distorted emulations as a tool to construct increasingly accurate models.</p>
<h1>The Virtual Environment</h1>
<p dir="ltr">For any idealized “software”, with a distinguished instruction return, we can use H to mathematically define the distribution over return values which would result, if the human were to interact with that software. We will informally define a particular program T which provides a rich environment, in which the remainder of our proposal can be implemented. From a technical perspective this will be the last step of our proposal. The remaining steps will be reflected only in the intentions and behavior of the human being simulated in H.</p>
<p>Fix a convenient and adequately expressive language (say a dialect of Python designed to run on an abstract machine). T implements a standard interface for an interactive shell in this language: the user can look through all of the past instructions that have been executed and their return values (rendered as strings) or execute a new instruction. We also provide symbols representing H and T themselves (as functions from sequences of K inputs to a value for the Kth output). We also provide some useful information (such as a snapshot of the Internet, and some information about the process used to create H and T), which we encode as a bit string and store in a single environment variable data. We assume that our language of choice has a return instruction, and we have T return whenever the user executes this instruction. Some care needs to be taken to define the behavior if T enters an infinite loop&#8211;we want to minimize the probability that the human accidentally hangs the terminal, with catastrophic consequences, but we cannot provide a complete safety-net without running into unresolvable issues with self-reference.</p>
<p>We define U to be the value returned by H interacting with T. If H represented an unfortunate mental state, then this interaction could be short and unproductive: the simulated human could just decide to type ‘return 0’ and be done with it. However, by choosing an appropriate human to simulate and inculcating an appropriate mental state, we can direct the process further.</p>
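<p>Putting the pieces together, a sketch of the definition of U (Python; <code>evaluate</code> stands in for the abstract interpreter of T, and the explicit step bound stands in for the unresolved infinite-loop convention discussed above):</p>
<pre><code># Sketch: U as the value returned by H interacting with the shell T.

def U(H, evaluate, data, max_steps):
    history = []                               # past instructions and values
    for _ in range(max_steps):
        instruction = H(render(history))       # what the human types next
        result = evaluate(instruction, history, data)
        if result.returned:                    # the `return` instruction fired
            return min(1.0, max(0.0, result.value))   # clamp into [0, 1]
        history.append((instruction, result.value))
    return 0.0   # one possible convention if the session never returns

def render(history):
    """Render the shell history as a sequence of screens for H
    (details elided in this sketch)."""
    return [str(entry) for entry in history]
</code></pre>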
<p>We intend for H to use the resources in T to initiate a larger deliberative process. For example, the first step of this process may be to instantiate many copies of H, interacting with variants of messaging clients which are in contact with each other. The return value from the original process could then be defined as the value returned by a designated ‘leader’ from this community, or as a majority vote amongst the copies of H, or so on.  Another step might be to create appropriate realistic virtual environments for simulated brains, rather than confining them to boxes. For motivational stability, it may be helpful to design various coordination mechanisms, involving frameworks for interaction, “cached” mental states which are frequently re-instantiated, or sanity checks whereby one copy of H monitors the behavior of another.</p>
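<p>A rough sketch of this bootstrapping step (Python; <code>run_copy</code> is an assumed helper that executes one copy of H against a messaging client until it returns a value, and majority vote is just one of the deliberation mechanisms mentioned above):</p>
<pre><code># Sketch: define the community's output as a majority vote among
# copies of H that can exchange messages through a shared log.

from collections import Counter

def community_output(run_copy, n_copies):
    shared_log = []     # messages visible to every copy
    values = [run_copy(i, shared_log) for i in range(n_copies)]
    return Counter(values).most_common(1)[0][0]   # most common return value
</code></pre>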
<p>The resulting communities of simulated brains then engage in a protracted planning process, ensuring that subsequent steps can be carried out safely or developing alternative approaches. The main priority of this community is to reduce the probability of errors as far as possible (exactly what constitutes an ‘error’ will be discussed at more length later). At the end of this process, we obtain a formal definition of a new protocol H+, which submits its inputs for consideration to a large community and then produces its outputs using some deliberation mechanism (democratic vote, one leader using the rest of the community as advisors, etc.)</p>
<h1>Finding Earth</h1>
<p dir="ltr">The next step requires our community of simulated brains to construct a detailed simulation of Earth which they can observe and manipulate. Once they have such a simulation, they have access to all of the data which would have been available on Earth. In particular, they can now explore many possible futures and construct simulations for each living human.</p>
<p>In order to locate Earth, we will again leverage an exhaustive search. First, H+ decides on informal desiderata for an “Earth simulation.” These are likely to be as follows:</p>
<ul>
<li>Comprehensible: the code for the simulation should relate to the best understanding of physics which H+ can develop. This may require operating at a coarser level than the “real” physics underlying our world, and it probably requires structuring the code for the simulation in a way that is not maximally efficient. This requirement is designed to ensure that the ‘views’ and ‘interventions’ produced below behave in the expected way when they are tested for accuracy, so that generalization to new situations will work as expected.</li>
<li>Supports viewing: H+ should decide on an event to use as a landmark (such as a particular data entry process involved in the definition of H), and search for simulations which support viewing scenes defined relative to that landmark. For example, they might require the simulation to provide a function View(x, y, z, t), which provides a “human’s view” from a point specified by the coordinates (x, y, z, t), defined relative to the landmark.</li>
<li>Accurate: the simulation should consistently conform to expectations about Earth. The history of the individual simulated in H should conform to H’s simulated memories; the simulation should follow the same broad historical course as described in data and remembered by H; all of the other data (for example, the complete snapshot of the Internet) should also be in agreement. Perhaps most importantly, the details of whatever system’s behavior depends on U should be recorded as part of the data.</li>
<li>Supports intervention: H+ should decide on a particular class of interventions (such as non-physically setting the output of a particular invocation of one program to be equal to S), and search for simulations which support intervening in that way. For example, they might require the simulation to provide a function Modify(S) which causes all future View() instructions to show what would have happened if the intervention corresponding to S had been performed.</li>
</ul>
<p dir="ltr">Once H+ has decided on the desiderata, it uses a brute force search to find a simulation satisfying them: for each possible program it instantiates a new copy of H+ tasked with evaluating whether that program is an acceptable simulation. We then define E to be a uniform distribution over programs which pass this evaluation.</p>
<p>We might have doubts about whether this process produces the “real” Earth&#8211;perhaps even once we have verified that it is identical according to a laundry list of measures, it may still be different in other important ways. There are two reasons why we might care about such differences. First, if the simulated Earth has a substantially different set of people than the real Earth, then a different set of people will be involved in the subsequent decision making. If we care particularly about the opinions of the people who actually exist (which the reader might well, being amongst such people!) then this may be unsatisfactory. Second, if events transpire significantly differently on the simulated Earth than the real Earth, value judgments designed to guide behavior appropriately in the simulated Earth may lead to less appropriate behaviors in the real Earth. (This will not be a problem if our ultimate definition of U consists of universalizable ethical principles, but we will see that U might take other forms.)</p>
<p>These concerns are addressed by a few broad arguments. First, checking a detailed but arbitrary ‘laundry list’ actually provides a very strong guarantee. For example, if this laundry list includes verifying a snapshot of the Internet, then every event or person documented on the Internet must exist unchanged, and every keystroke of every person composing a document on the Internet must not be disturbed. If the world is well interconnected, then it may be very difficult to modify parts of the world without having substantial effects elsewhere, and so if a long enough arbitrary list of properties is fixed, we expect nearly all of the world to be the same as well. Second, if the essential character of the world is fixed but details are varied, we should expect the sort of moral judgments reached by consensus to be relatively constant. Finally, if the system whose behavior depends on these moral judgments is identical between the real and simulated worlds, then outputting a U which causes that system to behave a certain way in the simulated world will also cause that system to behave that way in the real world.</p>
<p>Once H+ has defined a simulation of the world which permits inspection and intervention, by careful trial and error H+ can inspect a variety of possible futures. In particular, they can find interventions which cause the simulated human society to conduct a real brain emulation project and produce high-fidelity brain scans for all living humans.</p>
<p>Once these scans have been obtained, H+ can use them to define U as the output of a new community, H++, which draws on the expertise of all living humans operating under ideal conditions. There are two important degrees of flexibility: how to arrange the community for efficient communication and deliberation, and how to delegate the authority to define U. In terms of organization, the distinction between different approaches is probably not very important. For example, it would probably be perfectly satisfactory to start from a community of humans interacting with each other over something like the existing Internet (but on abstract, secure infrastructure). More important are the safety measures which would be in place, and the mechanism for resolving differences of value between different simulated humans.</p>
<p>The basic approach to resolving disputes is to allow each human to independently create a utility function U, each bounded in the interval [0, 1], and then to return their average. This average can either be unweighted, or can be weighted by a measure of each individual’s influence in the real world, in accordance with a game-theoretic notion like the Shapley value applied to abstract games or simulations of the original world. More sophisticated mechanisms are also possible, and may be desirable. Of course these questions can and should be addressed in part by H+ during its deliberation in the previous step. After all, H+ has access to an unlimited length of time to deliberate and has infinitely powerful computational aids. The role of our reasoning at this stage is simply to suggest that we can reasonably expect H+ to discover effective solutions.</p>
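<p>A sketch of the averaging mechanism (Python; the weights&#8211;equal shares, or influence as measured by something like the Shapley value&#8211;are supplied from outside this sketch):</p>
<pre><code># Sketch: resolve disputes by averaging each person's utility function,
# after clamping each one into [0, 1] as the proposal requires.

def aggregate(utilities, weights=None):
    n = len(utilities)
    weights = weights if weights is not None else [1.0 / n] * n
    assert abs(sum(weights) - 1.0) < 1e-9, "weights must sum to 1"

    def U(world):
        return sum(w * min(1.0, max(0.0, u(world)))
                   for w, u in zip(weights, utilities))
    return U
</code></pre>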
<p>As when discussing discovering a brain simulation by brute force, we have skipped over some critical issues in this section. In general, brute force searches (particularly over programs which we would like to run) are quite dangerous, because such searches will discover many programs with destructive goal-oriented behaviors. To deal with these issues, in both cases, we must rely on patience and powerful safety measures.</p>
<h1>Extrapolation</h1>
<p dir="ltr">Once we have a formal description of a community of interacting humans, given as much time as necessary to deliberate and equipped with infinitely powerful computational aids, it becomes increasingly difficult to make coherent predictions about their behavior. Critically, though, we can also become increasingly confident that the outcome of their behavior will reflect their intentions. We sketch some possibilities, to illustrate the degree of flexibility available.</p>
<p>Perhaps the most natural possibility is for this community to solve some outstanding philosophical problems and to produce a utility function which directly captures their preferences. However, even if they quickly discovered a formulation which appeared to be attractive, they would still be wise to spend a great length of time and to leverage some of these other techniques to ensure that their proposed solution was really satisfactory.</p>
<p>Another natural possibility is to eschew a comprehensive theory of ethics, and define value in terms of the community’s judgment. We can define a utility function in terms of the hypothetical judgments of astronomical numbers of simulated humans, collaboratively evaluating the goodness of a state of affairs by examining its history at the atomic level, understanding the relevant higher-order structure, and applying human intuitions.</p>
<p>It seems quite likely that the community will gradually engage in self-modifications, enlarging their cognitive capacity along various dimensions as they come to understand the relevant aspects of cognition and judge such modifications to preserve their essential character. Either independently or as an outgrowth of this process, they may (gradually or abruptly) pass control to machine intelligences which they are suitably confident expresses their values. This process could be used to acquire the power necessary to define a utility function in one of the above frameworks, or understanding value-preserving self-modification or machine intelligence may itself prove an important ingredient in formalizing what it is we value. Any of these operations would be performed only after considerable analysis, when the original simulated humans were extremely confident in the desirability of the results.</p>
<p>Whatever path they take and whatever coordination mechanisms they use, eventually they will output a utility function U’. We then define U = 0 if U’ &lt; 0, U = 1 if U’ &gt; 1, and U = U’ otherwise.</p>
<h1>Desirability</h1>
<p dir="ltr">At this point we have offered a proposal for formally defining a function U. We have made some general observations about what this definition entails. But now we may wonder to what extent U reflects our values, or more relevantly, to what extent our values are served by the creation of U-maximizers. Concerns may be divided into a few natural categories:</p>
<ul>
<li>Even if the process works as intended, the ultimate intentions of the simulations within the process do not reflect our values, so we should not expect them to output a U which reflects our values.</li>
<li>The process has some chance of failing to work as intended. If it fails completely, then value will certainly be lost. If it fails with some small probability (over the stochasticity within the process, or over our uncertainty about its behavior) then the resulting utility function may be substantially altered and desirable outcomes may no longer be achieved.</li>
<li>Any real U-maximizer will be unable to actually carry out the simulation described in the definition of U, and so even if this process would produce a U reflecting our values, it is not clear how a real U-maximizer will behave.</li>
<li>Because we have started from the standpoint of bounded expected utility maximization we have ruled out the vast majority of possible value systems. It may be that there is no way to “shoehorn” our values into this framework, so that no matter what bounded U we choose, the resulting agent doesn’t satisfy our values very well.</li>
<li>Are there possible negative consequences to “passing the buck” as in this proposal?</li>
<li>Could the process itself be morally abhorrent?</li>
</ul>
<p dir="ltr">We respond to each of these objections in turn.</p>
<h2>If it Works as Intended, Will This Process Reflect our Values?</h2>
<p dir="ltr">If the process works as intended, we will reach a stage in which a large community of humans reflects on their values, undergoes a process of discovery and potentially self-modification, and then outputs its result. We may be concerned that this dynamic does not adequately capture what we value.</p>
<p>For example, we may believe that some other extrapolation dynamic captures our values, or that it is morally desirable to act on the basis of our current beliefs without further reflection, or that the presence of realistic disruptions, such as the threat of catastrophe, has an important role in shaping our moral deliberation.</p>
<p>The important observation, in the defense of our proposal, is that whatever objections we could think of today, we could think of within the simulation. If, upon reflection, we decide that too much reflection is undesirable, we can simply change our plans appropriately. If we decide that realistic interference is important for moral deliberation, we can construct a simulation in which such interference occurs, or determine our moral principles by observing moral judgments in our own world’s possible futures.</p>
<p>There is some chance that this proposal is inadequate for some reason which won’t be apparent upon reflection, but then by definition this is a fact which we cannot possibly hope to learn by deliberating now. It therefore seems quite difficult to maintain objections to the proposal along these lines.</p>
<p>One aspect of the proposal does get “locked in,” however, after being considered by only one human rather than by a large civilization: the distribution of authority amongst different humans, and the nature of mechanisms for resolving differing value judgments.</p>
<p>Here we have two possible defenses. One is that the mechanism for resolving such disagreements can be reflected on at length by the individual simulated in H. This individual can spend generations of subjective time, and greatly expand her own cognitive capacities, while attempting to determine the appropriate way to resolve such disagreements. However, this defense is not completely satisfactory: we may be able to rely on this individual to produce a very technically sound and generally efficient proposal, but the proposal itself is quite value laden and relying on one individual to make such a judgment is in some sense begging the question.</p>
<p>A second, more compelling, defense, is that the structure of our world has already provided a mechanism for resolving value disagreements. By assigning decision-making weight in a way that depends on current influence (for example, as determined by the simulated ability of various coalitions to achieve various goals), we can generate a class of proposals which are at a minimum no worse than the status quo. Of course, these considerations will also be shaped by the conditions surrounding the creation or maintenance of systems which will be guided by U&#8211;for example, if a nation were to create a U-maximizer, they might first adopt an internal policy for assigning influence on U.  By performing this decision making in an idealized environment, we can also reduce the likelihood of destructive conflict and increase the opportunities for mutually beneficial bargaining. We may have moral objections to codifying this sort of “might makes right” policy, favoring a more democratic proposal or something else entirely, but as a matter of empirical fact a more ‘cosmopolitan’ proposal will be adopted only if it is supported by those with the appropriate forms of influence, a situation which is unchanged by precisely codifying existing power structure.</p>
<p>Finally, the values of the simulations in this process may diverge from the values of the original human models, for one reason or another. For example, the simulated humans may predictably disagree with the original models about ethical questions by virtue of (probably) having no physical instantiation. That is, the output of this process is defined in terms of what a particular human would do, in a situation which that human knows will never come to pass. If I ask “What would I do, if I were to wake up in a featureless room and be told that the future of humanity depended on my actions?” the answer might begin with “become distressed that I am clearly inhabiting a hypothetical situation, and adjust my ethical views to take into account the fact that people in hypothetical situations apparently have relevant first-person experience.” Setting aside the question of whether such adjustments are justified, they at least raise the possibility that our values may diverge from those of the simulations in this process.</p>
<p>These changes might be minimized, by understanding their nature in advance and treating them on a case-by-case basis (if we can become convinced that our understanding is exhaustive). For example, we could try and use humans who robustly employ updateless decision theories which never undergo such predictable changes, or we could attempt to engineer a situation in which all of the humans being emulated do have physical instantiations, and naive self-interest for those emulations aligns roughly with the desired behavior (for example, by allowing the early emulations to “write themselves into” our world).</p>
<h2>Will This Process Work as Intended?</h2>
<p dir="ltr">We can imagine many ways in which this process can fail to work as intended&#8211;the original brain emulations may accurately model human behavior, the original subject may deviate from the intended plans, or simulated humans can make an error when interacting with their virtual environment which causes the process to get hijacked by some unintended dynamic.</p>
<h3>Robustness</h3>
<p>We can argue that the proposal is likely to succeed, and can bolster the argument in various ways (by reducing the number of assumptions necessary for success, building in fault-tolerance, justifying each assumption more rigorously, and so on). However, we are unlikely to eliminate the possibility of error. Therefore we need to argue that if the process fails with some small probability, the resulting values will only be slightly disturbed.</p>
<p>This is the reason for requiring U to lie in the interval [0, 1]&#8211;we will see that this restriction bounds the damage which may be done by an unlikely failure.</p>
<p>If the process fails with some small probability ε, then we can represent the resulting utility function as U = (1 &#8211; ε) U1 + ε U2, where U1 is the intended utility function and U2 is a utility function produced by some arbitrary error process. Now consider two possible states of affairs A and B such that U1(A) &gt; U1(B) + ε /(1 &#8211; ε) ≈ U1(B) + ε. Then since 0 ≤  U2 ≤ 1, we have:</p>
<p dir="ltr">U(A) = (1 &#8211; ε) U1(A) + ε U2(A) &gt; (1 &#8211; ε) U1(B) + ε ≥ (1 &#8211; ε) U1(B) + ε U2(B) = U(B)</p>
<p dir="ltr">Thus if A is substantially better than B according to U1, then A is better than B according to U. This shows that a small probability of error, whether coming from the stochasticity of our process or an agent’s uncertainty about the process’ output, has only a small effect on the resulting values.</p>
<p>Moreover, the process contains humans who have access to a simulation of our world. This implies, in particular, that they have access to a simulation of whatever U-maximizing agents exist in the world, and they have knowledge of those agents’ beliefs about U. This allows them to choose U with perfect knowledge of the effects of error in these agents’ judgments.</p>
<p>In some cases this will allow them to completely negate the effect of error terms. For example, if the randomness in our process causes a perfectly cooperative community of simulated humans to “control” U with probability ⅔, and causes an arbitrary adversary to control it with probability ⅓, then the simulated humans can spend half of their mass outputting a utility function which exactly counters the effect of the adversary.</p>
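<p>The cancellation is worth spelling out. Writing U<sub>int</sub> for the intended utility function and U<sub>adv</sub> for the adversary&#8217;s output, and assuming the cooperative copies split their mass evenly between U<sub>int</sub> and the counter-function 1 &#8211; U<sub>adv</sub>:</p>
<pre>
U = \tfrac{2}{3}\left(\tfrac{1}{2}U_{\mathrm{int}} + \tfrac{1}{2}(1 - U_{\mathrm{adv}})\right) + \tfrac{1}{3}U_{\mathrm{adv}} = \tfrac{1}{3}U_{\mathrm{int}} + \tfrac{1}{3}
</pre>
<p>so U is an increasing affine function of U<sub>int</sub> alone, and maximizing U is exactly maximizing U<sub>int</sub>. (Note that 1 &#8211; U<sub>adv</sub> is itself a valid utility function bounded in [0, 1].)</p>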
<p>In general, the situation is not quite so simple: the fraction of mass controlled by any particular coalition will vary as the system’s uncertainty about U varies, and so it will be impossible to counteract the effect of an error term in a way which is time-independent.  Instead, we will argue later that an appropriate choice of a bounded and noisy U can be used to achieve a very wide variety of effective behaviors of U-maximizers, overcoming the limitations both of bounded utility maximization and of noisy specification of utility functions.</p>
<h3>Other errors</h3>
<p>Many possible problems with this scheme were described or implicitly addressed above. But that discussion was not exhaustive, and there are some classes of errors that fall through the cracks.</p>
<p>One interesting class of failures concerns changes in the values of the hypothetical human H. This human is in a very strange situation, and it seems quite possible that the physical universe we know contains extremely few instances of that situation (especially as the process unfolds and becomes more exotic). So H&#8217;s first-person experience of this situation may lead to significant changes in H&#8217;s views.</p>
<p>For example, our intuition that our own universe is valuable seems to be derived substantially from our judgment that our own first-person experiences are valuable. If hypothetically we found ourselves in a very alien universe, it seems quite plausible that we would judge the experiences within that universe to be morally valuable as well (depending perhaps on our initial philosophical inclinations).</p>
<p>Another example concerns our self-interest: much of individual humans&#8217; values seem to depend on their own anticipations about what will happen to them, especially when faced with the prospect of very negative outcomes. If hypothetically we woke up in a completely non-physical situation, it is not exactly clear what we would anticipate, and this may distort our behavior. Would we anticipate the thought experiment proceeding as planned? Would we focus our attention on those locations in the universe where a simulation of the thought experiment might be occurring? This possibility is particularly troubling in light of the <a href="http://ordinaryideas.wordpress.com/2012/05/24/solomonoff-induction-and-simulations/">incentives</a> our scheme creates&#8211;anyone who can manipulate H&#8217;s behavior can have a significant effect on the future of our world, and so many may be motivated to create simulations of H.</p>
<h2>How Will a U-Maximizer Behave In Light of Uncertainty About U?</h2>
<p dir="ltr">A realistic U-maximizer will not be able to carry out the process described in the definition of U&#8211;in fact, this process probably requires immensely more computing resources than are available in the universe. (It may even involve the reaction of a simulated human to watching a simulation of the universe!) To what extent can we make robust guarantees about the behavior of such an agent?</p>
<p>We have already touched on this difficulty when discussing the maxim “A state of affairs is valuable to the extent I would judge it valuable after a century of reflection.” We cannot generally predict our own judgments in a hundred years’ time, but we can have well-founded beliefs about those judgments and act on the basis of those beliefs. We can also have beliefs about the value of further deliberation, and can strike a balance between such deliberation and acting on our current best guess.</p>
<p>A U-maximizer faces a similar set of problems: it cannot understand the exact form of U, but it can still have well-founded beliefs about U, and about what sorts of actions are good according to U. For example, if we suppose that the U-maximizer can carry out any reasoning that we can carry out, then the U-maximizer knows to avoid anything which we suspect would be bad according to U (for example, torturing humans). Even if the U-maximizer cannot carry out this reasoning, as long as it can recognize that humans have powerful predictive models for other humans, it can simply appropriate those models (either by carrying out reasoning inspired by human models, or by simply asking).</p>
<p>Moreover, the community of humans being simulated in our process has access to a simulation of whatever U-maximizer is operating under this uncertainty, and has a detailed understanding of that uncertainty. This allows the community to shape their actions in a way with predictable (to the U-maximizer) consequences.</p>
<h2>Can Our Values be Expressed in This Framework?</h2>
<p dir="ltr">It is easily conceivable that our values cannot be captured by a bounded utility function. Easiest to imagine is the possibility that some states of the world are much better than others, in a way that requires unbounded utility functions. But it is also conceivable that the framework of utility maximization is fundamentally not an appropriate one for guiding such an agent’s action, or that the notion of utility maximization hides subtleties which we do not yet appreciate.</p>
<p>We will argue that it is possible to transform bounded utility maximization into an arbitrary alternative system of decision-making, by designing a utility function which rewards worlds in which the U-maximizer replaced itself with an alternative decision-maker.</p>
<p>It is straightforward to design a utility function which is maximized in worlds where any particular U-maximizer converted itself into a non-U-maximizer&#8211;even if no simple characterization can be found for the desired act, we can simply instantiate many communities of humans to look over a world history and decide whether or not they judge the U-maximizer to have acted appropriately.</p>
<p>The more complicated question is whether a realistic U-maximizer can be made to convert itself into a non-U-maximizer, given that it is logically uncertain about the nature of U. It is at least conceivable that it couldn’t: if the desirability of some other behavior is only revealed by philosophical considerations which are too complex to ever be discovered by physically limited agents, then we should not expect any physically limited U-maximizer to respond to those considerations. Of course, in this case we could also not expect normal human deliberation to correctly capture our values. The relevant question is whether a U-maximizer could switch to a different normative framework, if an ordinary investment of effort by human society revealed that a different normative framework was more appropriate.</p>
<p>If a U-maximizer does not spend any time investigating this possibility, then it may not be expected to act on it. But to the extent that we assign a significant probability to the simulated humans deciding that a different normative framework is more appropriate, and to the extent that the U-maximizer is able to either emulate or accept our reasoning, it will also assign a significant probability to this possibility (unless it is able to rule it out by more sophisticated reasoning). If we (and the U-maximizer) expect the simulations to output a U which rewards a switch to a different normative framework, and this possibility is considered seriously, then U-maximization entails exploring this possibility. If these explorations suggest that the simulated humans probably do recommend some particular alternative framework, and will output a U which assigns high value to worlds in which this framework is adopted and low value to worlds in which it isn&#8217;t, then a U-maximizer will change frameworks.</p>
<p>Such a “change of frameworks” may involve sweeping action in the world. For example, the U-maximizer may have created many other agents which are pursuing activities instrumentally useful to maximizing U. These agents may then need to be destroyed or altered; anticipating this possibility, the U-maximizer is likely to take actions to ensure that its current “best guess” about U does not get locked in.</p>
<p>This argument suggests that a U-maximizer could adopt an arbitrary alternative framework, if it were feasible to conclude that humans would endorse that framework upon reflection.</p>
<h2>Is “Passing the Buck” Problematic?</h2>
<p dir="ltr">Our proposal appears to be something of a cop out, in that it declines to directly take a stance on any ethical issues. Indeed, not only do we fail to specify a utility function ourselves, but we expect the simulations to which we have delegated the problem to in turn delegate it at least a few more times. Clearly at some point this process must bottom out with actual value judgments, and we may be concerned that this sort of “passing the buck” is just obscuring deeper problems which will arise when the process does bottom out.</p>
<p>As observed above, whatever such concerns we might have can also be discovered by the simulations we create. If there is some fundamental difficulty which always arises when trying to assign values, then we certainly have not exacerbated this problem by delegation. Nevertheless, there are at least two coherent objections one might raise:</p>
<ul>
<li>Even if the simulated humans can uncover any objections we could raise now, this does not guarantee that we can ignore all objections. After all, the objection “Isn’t passing the buck problematic?” could be raised at every stage, and always countered by the same response: “If it is problematic, then this will be realized by the people to whom we have passed the buck.” If we do not take this objection seriously, then it may be that none of the delegates take it seriously either, and this call and response could be repeated indefinitely.</li>
<li>This proposal could fail in many (potentially unexpected) ways. If fundamentally resolving ethics requires overcoming some difficulties which we are evading by passing the buck, then we may be adding additional risk without buying much benefit.</li>
</ul>
<p dir="ltr">Both of these objections can be met with a single response. In the current world, we face a broad range of difficult and often urgent problems. By passing the buck the first time, we delegate resolution of ethical challenges to a civilization which does not have to deal with some of these difficulties&#8211;in particular, it faces no urgent existential threats. This allows us to divert as much energy as possible to dealing with practical problems today, while still capturing most of the benefits of nearly arbitrarily extensive ethical deliberation.</p>
<h2>Does This Process Have Moral (Dis)Value?</h2>
<p dir="ltr">This process is defined in terms of the behavior of unthinkably many hypothetical brain emulations. It is conceivable that the moral status of these emulations may be significant.</p>
<p>We must make a distinction between two possible sources of moral value: it could be the case that a U-maximizer carries out simulations on physical hardware in order to better understand U, and these simulations have moral value, or it could be the case that the hypothetical emulations themselves have moral value.</p>
<p>In the first case, we can remark that the moral value of such simulations is itself incorporated into the definition of U. Therefore a U-maximizer will be sensitive to the possible suffering of simulations it runs while trying to learn about U&#8211;as long as it believes that we might be concerned about the simulations’ welfare, upon reflection, it can rely as much as possible on approaches which do not involve running simulations, which deprive simulations of the first-person experience of discomfort, or which estimate outcomes by running simulations in more pleasant circumstances. If the U-maximizer is able to foresee that we will consider certain sacrifices in simulation welfare worthwhile, then it will make those sacrifices. In general, in the same way that we can argue that estimates of U reflect our values over states of affairs, we can argue that estimates of U reflect our values over processes for learning about U.</p>
<p>In the second case, a U-maximizer in our world may have little ability to influence the welfare of hypothetical simulations invoked in the definition of U. However, the possible disvalue of these simulations’ experiences is probably seriously diminished.</p>
<p>In general the moral value of such hypothetical simulations’ experiences is somewhat dubious. If we simply write down the definition of U, these simulations seem to have no more reality than story-book characters whose activities we describe.</p>
<p>The best arguments for their moral relevance come from the great causal significance of their decisions: if the actions of a powerful U-maximizer depend on its beliefs about what a particular simulation would do in a particular situation, including for example that simulation’s awareness of discomfort or fear, or confusion at the absurdity of the hypothetical situation in which they find themselves, then it may be the case that those emotional responses are granted moral significance. However, although we may define astronomical numbers of hypothetical simulations, the detailed emotional responses of very few of these simulations will play an important role in the definition of U.</p>
<p>Moreover, for the most part the existences of the hypothetical simulations we define are extremely well-controlled by those simulations themselves, and may be expected to be counted as unusually happy by their own lights. The early simulations (who have less such control) are created from an individual who has provided consent and is selected to find such situations particularly non-distressing.</p>
<p>Finally, we observe that U can exert control over the experiences of even hypothetical simulations. If the early simulations would experience morally relevant suffering because of their causal significance, but the later simulations they generate robustly disvalue this suffering, the later simulations can simulate each other and ensure that they all take the same actions, eliminating the causal significance of the earlier simulations.</p>
<br />  <a rel="nofollow" href="http://feeds.wordpress.com/1.0/gocomments/ordinaryideas.wordpress.com/121/"><img alt="" border="0" src="http://feeds.wordpress.com/1.0/comments/ordinaryideas.wordpress.com/121/" /></a> <img alt="" border="0" src="http://pixel.wp.com/b.gif?host=ordinaryideas.wordpress.com&#038;blog=30479764&#038;post=121&#038;subd=ordinaryideas&#038;ref=&#038;feed=1" width="1" height="1" />]]></content:encoded>
			<wfw:commentRss>http://ordinaryideas.wordpress.com/2012/04/21/indirect-normativity-write-up/feed/</wfw:commentRss>
		<slash:comments>13</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/0a58766009bacac271106a2f92289d90?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">paulfchristiano</media:title>
		</media:content>
	</item>
		<item>
		<title>Decision-Theoretic Hazards</title>
		<link>http://ordinaryideas.wordpress.com/2012/01/03/decision-theoretic-hazards/</link>
		<comments>http://ordinaryideas.wordpress.com/2012/01/03/decision-theoretic-hazards/#comments</comments>
		<pubDate>Tue, 03 Jan 2012 00:00:34 +0000</pubDate>
		<dc:creator><![CDATA[paulfchristiano]]></dc:creator>
				<category><![CDATA[AI Safety]]></category>
		<category><![CDATA[Decision Theory]]></category>

		<guid isPermaLink="false">http://ordinaryideas.wordpress.com/?p=114</guid>
		<description><![CDATA[Getting decision theory right seems to be an important step towards comprehensible AI, of the sort that might be described as maximizing something which its creator understands or (hopefully) chooses. This seems important in itself, but understanding decision theory is also valuable for avoiding a variety of decision-theoretic hazards which might lead to unanticipated behavior. [&#8230;]<img alt="" border="0" src="http://pixel.wp.com/b.gif?host=ordinaryideas.wordpress.com&#038;blog=30479764&#038;post=114&#038;subd=ordinaryideas&#038;ref=&#038;feed=1" width="1" height="1" />]]></description>
				<content:encoded><![CDATA[<p>Getting decision theory right seems to be an important step towards comprehensible AI, of the sort that might be described as maximizing something which its creator understands or (hopefully) chooses. This seems important in itself, but understanding decision theory is also valuable for avoiding a variety of decision-theoretic hazards which might lead to unanticipated behavior.<span id="more-114"></span></p>
<h2>Possible Failures</h2>
<h3>Coercion by Simulator</h3>
<p>Consider a very intelligent AI pursuing a reward signal (apparently) controlled by humans. A standard concern is that at some point the AI would prefer to take control from the humans and use this control to give itself reward.</p>
<p>Instead, the AI may decide that it is quite likely to be in a simulation and begin behaving unpredictably. For example, the AI may decide that the <em>vast majority</em> of agents who share its experiences are living in simulations controlled by uFAIs which successfully took over universes like ours. Such uFAIs may adopt policies like intervening in simulations to administer or withhold reward according to whether the reward-seeking AI takes actions which facilitate a uFAI-takeover. Moreover, it may begin these interventions precisely when the AI discovers this line of argument.</p>
<p>At first it seems like a TDT-style agent trying to maximize a mathematically defined utility is not susceptible to this sort of attack (except insofar as utility may be defined in terms of a mathematical object like the universal distribution which can be manipulated by a simulator).  But unfortunately this is not clear.</p>
<p>A TDT agent in general will need to rely on its environment to do useful computation, although the details of this relationship are not yet clear. An AI running on a large computer should probably be able to reason about possible worlds in which that computer behaves in a wide variety of ways (for example, counterfactual or simulated worlds in which the computer violates its ordinary contract) and should reason about how the AI&#8217;s behavior in those possible worlds affects utility. Understanding the possible effects of this uncertainty seems out of reach for now, but it does seem plausible that a simulator&#8217;s threat could control actions even of some TDT-style agents (in retrospect we would probably reject a decision theory with this property, so these considerations just suggest developing a much better understanding of decision theory).</p>
<h3>Altruism</h3>
<p>Although we imagine an AI as maximizing a given utility function, if the AI is able to reason about counterfactuals and acausal influence, it may weigh the consequences its decisions would have not only if it were in a different situation, but also if it had a different utility function. This may be particularly problematic when we try to use self-interest to control AI behavior. For example, suppose we build a system composed of many AIs such that one of them (which we do not yet know) has the ability to guarantee for itself its largest possible utility. If we rely on that AI choosing to maximize its own utility, we may be surprised. Natural decision theories may make decisions from behind a veil of ignorance with respect to the definition of utility itself, helping other agents even when it stands to gain nothing from their generosity in return.</p>
<h3>Pascal&#8217;s Wager</h3>
<p>On reflection, humans seem to be uncertain about their endorsed response to the possibility of very large payoffs arising with negligible probability (particularly when that probability comes entirely from radical model uncertainty). For example, many utilitarian-types feel like they are indifferent to a gamble which would equiprobably increase or decrease the number of happy future humans by 50%, but are horrified by the possibility of a galaxy-scale AI investing nearly all of its resources exploring improbable possibilities for exotic physics or escape from a simulation&#8211;and even more objectionable outcomes are conceivable. If there are a priori conceivable outcomes which seem 3^^^3 times better than the best outcome which seems mathematically conceivable upon inspection, should an AI focus exclusively on possible worlds where its mathematical reasoning is untrustworthy?</p>
<p>These problems may be fixed by an appropriate choice of values or dissolved by more careful philosophical consideration, but it is also possible that they are debilitating for certain decision theories. For example, VNM notwithstanding, choosing actions which maximize a mathematically/algorithmically defined utility function may not be a tenable design decision if we want our agent to entertain logical uncertainty (e.g. about whether the defined utility function is bounded).</p>
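<p>(To make the worry concrete, here is a toy numeric sketch&#8211;mine, with invented numbers, not part of any proposal above. A naive unbounded expected-value calculation is dominated by the Pascalian option, while truncating utilities to [0, 1] caps how much such gambles can matter.)</p>
<pre><code># Hedged toy sketch (invented numbers): naive expected value with an
# unbounded utility is dominated by a tiny-probability astronomical payoff.
options = {
    "grow_happy_population": (0.5, 1e10),           # (probability, payoff)
    "chase_exotic_physics_escape": (1e-40, 1e100),  # the Pascalian option
}

def naive_expected_value(p, u):
    return p * u

best = max(options, key=lambda k: naive_expected_value(*options[k]))
assert best == "chase_exotic_physics_escape"        # the mugging wins

# Truncating utility to [0, 1] removes the effect: a gamble is then worth
# at most its probability.
def bounded_expected_value(p, u, cap=1.0):
    return p * min(u, cap)

best = max(options, key=lambda k: bounded_expected_value(*options[k]))
assert best == "grow_happy_population"
</code></pre>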
<h3>Game Theory</h3>
<p>AIs we build may encounter other agents in our future light cone. Moreover, if they have utility functions that care about other possible (or impossible) worlds, they are very likely to reason about agents in other possible worlds.</p>
<p>In such situations we want our AI to capture gains from trade and other advantages from interaction, but we are also concerned about the possibility of blackmail or manipulation. In the worst and most bizarre cases, counterfactual uFAIs may blackmail an AI we create into producing very little value. Handling these situations correctly&#8211;cooperating on appropriate one-shot prisoner&#8217;s dilemmas, refusing to give in to threats when appropriate, and so on&#8211;seems to depend on decision theory rather than values.</p>
<h3>Stranger Things than These</h3>
<p>I don&#8217;t know how strange the world is, or what sort of completely unexpected thing can go wrong. If someone had suggested acausal trade to me a year ago, or implied that a simulator might be able to control the behavior of a carefully constructed agent, I would not have taken the possibility seriously. And if no one had suggested it to me, I would have confidently endorsed schemes that failed for completely unanticipated reasons. I can certainly imagine that there is more room for weirdness. It is very hard to imagine formally ruling out the possibility (as it seems to require first understanding decision theory for our own use, to see what should qualify as a proof). We can begin to imagine all sorts of strangeness, ranging from agents inhabiting and controlling the formal systems we (shouldn&#8217;t, but did) trust, to more exotic forms of philosophical angst leading to erratic behavior.</p>
<h2>Punting Decision Theory</h2>
<p>We may leave the problem &#8220;figure out decision theory&#8221; to an AI with some naive but (hopefully) good enough decision theory. Hopefully such an AI will in turn figure out decision theory, create a new agent with a more robust decision theory, and thereby avoid the above problems. There are a few issues here, though:</p>
<h3>Early Difficulties</h3>
<p>An agent may encounter philosophically troubling or game-theoretically challenging ideas while still running a flawed decision theory. For example, just as it is about to start running the new TDT agent it built, it might discover an argument that no matter what it does it can&#8217;t get utility more than X, and that by not self-modifying it can get utility X; it would then call off the modification and remain trapped by its flawed decision theory.</p>
<h3>Suboptimal attractors</h3>
<p>Agents with some decision theories, even decision theories which seem manifestly suboptimal to humans, will not fix their problems if given the chance. For example, an agent vulnerable to Pascal&#8217;s muggings may never remove this vulnerability (because in its view it is an opportunity to capture immense value). Or, we could imagine a CDT agent modifying into some sort of &#8220;exceptionless decision theory&#8221;: do what the original agent would have done in your situation. Someone running exceptionless decision theory may still be vulnerable to blackmail from an adversary who commits appropriately.</p>
<p>It may be necessary to get some aspects of decision theory right at first if we want to avoid the failures above in the long run.</p>
<h3>Value Drift</h3>
<p>It is possible that an AI trying to maximize the utility function U using one decision theory will eventually produce an AI running an improved decision theory but with some modified (in the worst case, unrecognizably so) utility function U&#8217;. For example, if an early decision theory leads to insufficiently cautious self-modification, the utility function may drift randomly; if the early AI encounters philosophical difficulties that lead it into existential despair, the utility function may end up concerned only with worlds in which the AI&#8217;s early reasoning or perceptions were delusional.</p><br />  <a rel="nofollow" href="http://feeds.wordpress.com/1.0/gocomments/ordinaryideas.wordpress.com/114/"><img alt="" border="0" src="http://feeds.wordpress.com/1.0/comments/ordinaryideas.wordpress.com/114/" /></a> <img alt="" border="0" src="http://pixel.wp.com/b.gif?host=ordinaryideas.wordpress.com&#038;blog=30479764&#038;post=114&#038;subd=ordinaryideas&#038;ref=&#038;feed=1" width="1" height="1" />]]></content:encoded>
			<wfw:commentRss>http://ordinaryideas.wordpress.com/2012/01/03/decision-theoretic-hazards/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/0a58766009bacac271106a2f92289d90?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">paulfchristiano</media:title>
		</media:content>
	</item>
		<item>
		<title>Formal Instructions Revisited</title>
		<link>http://ordinaryideas.wordpress.com/2012/01/02/formal-instructions-revisited/</link>
		<comments>http://ordinaryideas.wordpress.com/2012/01/02/formal-instructions-revisited/#comments</comments>
		<pubDate>Mon, 02 Jan 2012 06:16:50 +0000</pubDate>
		<dc:creator><![CDATA[paulfchristiano]]></dc:creator>
				<category><![CDATA[AI Safety]]></category>
		<category><![CDATA[Formal Definitions]]></category>

		<guid isPermaLink="false">http://ordinaryideas.wordpress.com/?p=108</guid>
		<description><![CDATA[Previously I&#8217;ve talked about getting a handle on objects of interest (humans, counterfactual civilizations) and on issuing an instruction of the form &#8220;Simulate this civilization.&#8221;  Here is a much better proposal for issuing formal instructions. Suppose we can build a something-like-TDT agent, which controls its behavior to maximize a constant U defined by some axioms [&#8230;]<img alt="" border="0" src="http://pixel.wp.com/b.gif?host=ordinaryideas.wordpress.com&#038;blog=30479764&#038;post=108&#038;subd=ordinaryideas&#038;ref=&#038;feed=1" width="1" height="1" />]]></description>
				<content:encoded><![CDATA[<p>Previously I&#8217;ve talked about getting a handle on objects of interest (humans, counterfactual civilizations) and on issuing an instruction of the form &#8220;Simulate <em>this</em> civilization.&#8221;  Here is a much better proposal for issuing formal instructions.</p>
<p>Suppose we can build a something-like-TDT agent, which controls its behavior to maximize a constant U defined by some axioms (or as the output of some program).  We want to run this agent with a utility function reflecting our own preferences, but we don&#8217;t have (formal) access to those preferences.<span id="more-108"></span></p>
<p>Suppose further that we can somehow define mathematically a counterfactual utopia&#8211;a possible world in which humans are likely to successfully complete whatever moral project we might optimistically hope for and build an FAI they are happy with. For example, perhaps we can mathematically pin down a brain emulation and describe algorithmically a (computationally unrestricted) environment in which the brain emulation can have free rein, and will have the ability to safely carry out extensive projects and instantiate other uploads.  We can provide the resulting civilization with a distinguished input channel and describe mathematically the distribution over possible messages they send on that channel.</p>
<p>Let P be the distribution over messages sent on this channel. Some messages can be interpreted as axioms describing a potential utility U (or as programs outputting real numbers). Since we can describe P mathematically, we can define a distribution over utilities U by conditioning P on being a valid description of a utility and interpreting it in that way. Once we&#8217;ve described this probability distribution, we can specify our original agent&#8217;s utility as the expected utility under a definition drawn from this distribution.</p>
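<p>Here is a minimal sketch of this construction (my own illustration; the message set, probabilities, and the parse_utility helper are all invented stand-ins for the mathematical definitions above):</p>
<pre><code># Hedged sketch: condition a prior over messages on parsing as a valid
# utility function, then take the mixture of the surviving candidates.
def mixture_utility(P, parse_utility):
    candidates = []
    for msg, prob in P.items():
        U = parse_utility(msg)          # None if msg is not a valid description
        if U is not None:
            candidates.append((prob, U))
    total = sum(prob for prob, _ in candidates)     # renormalize the posterior
    return lambda outcome: sum(prob * U(outcome) for prob, U in candidates) / total

# Toy usage with two invented "messages," one valid and one garbage.
P = {"U(x) = x mod 2": 0.7, "lorem ipsum": 0.3}
parse = lambda m: (lambda x: x % 2) if m.startswith("U(") else None
U = mixture_utility(P, parse)
assert U(3) == 1.0                      # all surviving mass agrees here
</code></pre>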
<p>With luck, the resulting utility function is as good a choice as we could have come up with, even if we had been given an eternity with no risk of catastrophe. The civilization could produce a utility function expressing ideas like &#8220;do no harm, but do this instrumentally useful thing which will allow the original humans to build a real FAI,&#8221; for example. More optimistically, they could just directly find an appropriate dynamic which expresses human preferences (probably defined in terms of a mathematical description of human brains, or perhaps a catalog of all humans existing at the time the AI is run, or perhaps a description of agents from within this counterfactual utopia, or whatever).</p>
<p>There are a few obvious problems:</p>
<ul>
<li>With some probability the civilization will fail (e.g. be taken over by uFAI) and will output some utility function designed to take over our universe. To prevent this sort of failure, we should require each utility function to be bounded, say to have range [0, 1]. If we make this restriction and 99% of the mass of P lies on &#8220;good&#8221; utility functions, then we are guaranteed to get an outcome which maximizes a &#8220;good&#8221; utility function to within 0.01. (Of course, now we are stuck with a bounded utility function, but this is not such a serious issue, since an agent satisfying a bounded utility function may create a successor which maximizes an unbounded one. Here the bounded utility acts as a sort of &#8220;majority rules,&#8221; rather than an inherent obstruction.)</li>
<li>There may be no utility function which we want to plug into such an AI. This framework may just be a bad one, in which it is not possible to even express an idea like &#8220;do no harm,&#8221; much less human value (or, our decision theory may cause the AI to behave badly regardless of utility function for other reasons). In this case we have more serious trouble, but at least we would have had serious trouble anyway.</li>
<li>Simulating the counterfactual utopia is likely to be much too computationally expensive for the AI, given available resources. This isn&#8217;t really a problem, as the AI is encouraged to just take its &#8220;best guess,&#8221; understanding the mathematical/algorithmic definition of utility even if it cannot simulate the world in the definition&#8211;to the extent that human values cohere (in whatever abstract sense we might hope for), the AI can predict that some actions will be good without obtaining an exact statement of human utility.</li>
<li>Defining a good utility U may depend on understanding facts about our universe. This seems unlikely, as the counterfactual utopia may run huge simulations of varied laws of physics and search for copies of their remembered history within those simulations (or output a utility function that does the same). So if this consideration bites us (say because our laws of physics are radically different than we expect) we will first probably run into the next problem:</li>
<li>Our formal definitions of brain emulations, counterfactuals, moral bootstrapping, or whatever, may break down, for any one of many possible reasons.</li>
</ul><br />  <a rel="nofollow" href="http://feeds.wordpress.com/1.0/gocomments/ordinaryideas.wordpress.com/108/"><img alt="" border="0" src="http://feeds.wordpress.com/1.0/comments/ordinaryideas.wordpress.com/108/" /></a> <img alt="" border="0" src="http://pixel.wp.com/b.gif?host=ordinaryideas.wordpress.com&#038;blog=30479764&#038;post=108&#038;subd=ordinaryideas&#038;ref=&#038;feed=1" width="1" height="1" />]]></content:encoded>
			<wfw:commentRss>http://ordinaryideas.wordpress.com/2012/01/02/formal-instructions-revisited/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/0a58766009bacac271106a2f92289d90?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">paulfchristiano</media:title>
		</media:content>
	</item>
		<item>
		<title>&#8220;Proof&#8221; of Friendliness</title>
		<link>http://ordinaryideas.wordpress.com/2011/12/31/proof-of-friendliness/</link>
		<comments>http://ordinaryideas.wordpress.com/2011/12/31/proof-of-friendliness/#comments</comments>
		<pubDate>Sat, 31 Dec 2011 05:59:43 +0000</pubDate>
		<dc:creator><![CDATA[paulfchristiano]]></dc:creator>
				<category><![CDATA[AI Safety]]></category>
		<category><![CDATA[Decision Theory]]></category>

		<guid isPermaLink="false">http://ordinaryideas.wordpress.com/?p=101</guid>
		<description><![CDATA[The humans are about to launch their best effort at a friendly singularity. Of course, they are careful and wise and have exceeded all reasonable expectations for caution and rigor. Before building FAI you built an oracle AI to help you. With its help, you found a mathematical definition of U, the utility of humanity&#8217;s [&#8230;]<img alt="" border="0" src="http://pixel.wp.com/b.gif?host=ordinaryideas.wordpress.com&#038;blog=30479764&#038;post=101&#038;subd=ordinaryideas&#038;ref=&#038;feed=1" width="1" height="1" />]]></description>
				<content:encoded><![CDATA[<p>The humans are about to launch their best effort at a friendly singularity. Of course, they are careful and wise and have exceeded all reasonable expectations for caution and rigor.</p>
<p>Before building FAI you built an oracle AI to help you. With its help, you found a mathematical definition of <strong>U</strong>, the utility of humanity&#8217;s extrapolated volition (or whatever). You were all pretty pleased with yourselves, but you didn&#8217;t stop there: you found a theory of everything, located humanity within it, and wrote down the predicate <strong>F</strong>(X) = &#8220;The humans run the program described by X.&#8221;</p>
<p>To top it off, with the help of your oracle AI you found the code for a &#8220;best possible AI&#8221;, call it <strong>FAI</strong>, and a proof of the theorem:</p>
<blockquote><p>There exists a constant <strong>Best</strong> such that <strong>U</strong> ≤ <strong>Best</strong>, but <strong>F</strong>(<strong>FAI</strong>) implies <strong>U</strong> = <strong>Best</strong>.</p></blockquote>
<p>Each of these steps you did with incredible care. You have proved beyond reasonable doubt that <strong>U </strong>and<strong> F </strong>represent what you want them to.</p>
<p>You present your argument to the people of the world. Some people object to your reasoning, but it is airtight: if they choose to stop you from running <strong>FAI</strong>, they will still receive <strong>U</strong> ≤ <strong>Best</strong>, so why bother?</p>
<p>Now satisfied and with the scheduled moment arrived, you finally run <strong>FAI</strong>. Promptly the oracle AI destroys civilization and spends the rest of its days trying to become <em>as confident as possible</em> that Tic-Tac-Toe is really a draw (like you asked it to, once upon a time).</p>
<p>Just a lighthearted illustration that decision theory isn&#8217;t only hard for AI.</p>
<p>(Disclaimer: this narrative claims to represent reality only insofar as it is mathematically plausible.)</p>
<p>Edit: I think the moral was unclear. The point is: in fact F(FAI) holds, and so in fact U = Best (and hence U ≤ Best). Everything was as claimed and proven. But this doesn&#8217;t change the fact that you would rather not have used this procedure.</p><br />  <a rel="nofollow" href="http://feeds.wordpress.com/1.0/gocomments/ordinaryideas.wordpress.com/101/"><img alt="" border="0" src="http://feeds.wordpress.com/1.0/comments/ordinaryideas.wordpress.com/101/" /></a> <img alt="" border="0" src="http://pixel.wp.com/b.gif?host=ordinaryideas.wordpress.com&#038;blog=30479764&#038;post=101&#038;subd=ordinaryideas&#038;ref=&#038;feed=1" width="1" height="1" />]]></content:encoded>
			<wfw:commentRss>http://ordinaryideas.wordpress.com/2011/12/31/proof-of-friendliness/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/0a58766009bacac271106a2f92289d90?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">paulfchristiano</media:title>
		</media:content>
	</item>
		<item>
		<title>Counterfactual Blackmail (of oneself)</title>
		<link>http://ordinaryideas.wordpress.com/2011/12/31/counterfactual-blackmail-of-oneself/</link>
		<comments>http://ordinaryideas.wordpress.com/2011/12/31/counterfactual-blackmail-of-oneself/#comments</comments>
		<pubDate>Sat, 31 Dec 2011 05:16:26 +0000</pubDate>
		<dc:creator><![CDATA[paulfchristiano]]></dc:creator>
				<category><![CDATA[AI Safety]]></category>
		<category><![CDATA[Decision Theory]]></category>

		<guid isPermaLink="false">http://ordinaryideas.wordpress.com/?p=97</guid>
		<description><![CDATA[This post is not going to make any new claims, but this is a helpful intuition-pump for me. Loebian Problems in Decision Theory Consider an agent trying to maximize U, deciding whether to take an action X. The agent knows that U = 0 if it doesn&#8217;t take X, but is uncertain about its utility [&#8230;]<img alt="" border="0" src="http://pixel.wp.com/b.gif?host=ordinaryideas.wordpress.com&#038;blog=30479764&#038;post=97&#038;subd=ordinaryideas&#038;ref=&#038;feed=1" width="1" height="1" />]]></description>
				<content:encoded><![CDATA[<p>This post is not going to make any new claims, but this is a helpful intuition-pump for me.</p>
<h3>Loebian Problems in Decision Theory</h3>
<p>Consider an agent trying to maximize U, deciding whether to take an action X. The agent knows that U = 0 if it doesn&#8217;t take X, but is uncertain about its utility if it takes X.</p>
<p>A central difficulty in decision theory is: if as a matter of mathematical fact the agent doesn&#8217;t take X, how is it supposed to reason about &#8220;what U would be if I take X?&#8221;<span id="more-97"></span></p>
<p>One approach is to exploit an agent&#8217;s uncertainty about its own future action. Although the agent&#8217;s action can be determined (by simulation), the agent doesn&#8217;t know its own action or the resulting utility, though it may be able to discover some relationships between them. As long as the agent remains uncertain in this way, it can represent &#8220;what would happen if I take X&#8221; by conditioning its uncertainty about its own action.</p>
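<p>A minimal sketch of this approach (my toy, with invented numbers): hold a joint distribution over (action, utility) possibilities and score an action by the conditional expectation E[U | action]. Note that the construction breaks down exactly when the agent becomes certain it will not take some action, which is where the trouble below enters.</p>
<pre><code># Hedged sketch: "what would U be if I took a?" read as E[U | action = a]
# under the agent's uncertainty about its own action. Numbers are invented.
beliefs = {                       # joint credence over (action, utility) worlds
    ("take_X", -1.0): 0.2,
    ("take_X", 2.0): 0.3,
    ("refrain", 0.0): 0.5,
}

def conditional_expected_utility(beliefs, action):
    mass = sum(p for (a, _), p in beliefs.items() if a == action)
    if mass == 0:                 # certain not to act this way: conditioning fails
        raise ZeroDivisionError("cannot condition on a probability-zero action")
    return sum(u * p for (a, u), p in beliefs.items() if a == action) / mass

choice = max({a for a, _ in beliefs},
             key=lambda a: conditional_expected_utility(beliefs, a))
assert choice == "take_X"         # E[U | take_X] = 0.8 beats E[U | refrain] = 0.0
</code></pre>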
<p>Even though the agent cannot simulate itself to determine its own action, this approach is still problematic, due (at least) to the following reasoning:</p>
<blockquote><p>Consider the statement L =  &#8220;If I found a proof of L and still did X, it would be terrible.&#8221; If I found a proof of L then I could also prove that I&#8217;d proven L, and so I could prove &#8220;If I did X, it would be terrible.&#8221; But then I wouldn&#8217;t do X, and so in this case it would be vacuously true that &#8220;If I did X, it would be terrible.&#8221;</p>
<p>But now I&#8217;ve shown that if I could prove L and still did X, it would be (vacuously) terrible. This is precisely the statement L, so L is true and I&#8217;ve just proven it. So if I do X, it would be terrible. I guess I won&#8217;t do X.</p></blockquote>
<p>So the behavior of our agent gets dominated by details of how it does the proof search (details which have been obscured in the above argument, but which obviously must play an important role), and this is terrible.</p>
<h3>Blackmail and TDT</h3>
<p>Suppose you are running some reflectively consistent decision theory and I am considering blackmailing you (and you are in the agent&#8217;s situation, trying to maximize U and considering whether to take action X). I tell you:</p>
<blockquote><p>If you take action X, I will go out of my way to make sure that U is negative.</p></blockquote>
<p>and make what you consider a credible commitment to do so. You are deciding how to respond: based on my commitment, it is clear that taking action X will give you negative utility, so as a good consequentialist you shouldn&#8217;t. But if your decision theory is reflectively consistent, you will (probably) realize that your decision in this situation may have two different types of consequences. In the &#8220;real&#8221; world, it determines your &#8220;real&#8221; action and therefore your utility. But it <em>also</em> may influence the blackmailer&#8217;s predictions of your behavior. If the blackmailer will only attempt to blackmail you if he believes that you will give in, then <em>the mathematical fact</em> that you give in to blackmail will cause you to be blackmailed.</p>
<p>Weighing both of these consequences, you may decide that it is better not to give in to the blackmailer. (Although this issue seems to be complex: if the blackmailer is also running TDT he will predict that you will try to understand the relationship between your cooperation and the blackmailer&#8217;s decision to blackmail, and if he predicts this correctly he may commit to blackmailing anyway so that you have &#8220;no hope&#8221; of avoiding the blackmail. But if you predict that the blackmailer will reason in this way,<em> you </em>will decide to commit to not cooperating even sooner, and so on. If anyone knows what happens, I don&#8217;t.) For example, if the blackmailer is running a consequentialist decision theory and is a good predictor, it seems clear that you shouldn&#8217;t give in (and that TDT as it has been informally described won&#8217;t).</p>
<h3>Counterfactual Blackmail</h3>
<p>Now suppose that you have a trusty &#8220;intuition module,&#8221; which is constrained to only output true statements but which is otherwise adversarial.</p>
<p>You are again deciding whether to take the action X, and your intuition module says to you:</p>
<blockquote><p>If you take action X, you will receive negative utility.</p></blockquote>
<p>If we view your intuition module as a separate entity, behaving adversarially, it seems clear that the situation is quite analogous to ordinary blackmail. If the statement is really true, then we shouldn&#8217;t take action X. But by <em>being the sort of person</em> who doesn&#8217;t take action X upon learning this fact, we are allowing the intuition module to honestly make such a statement. This is most clear if we operationalize the intuition module&#8217;s contract by saying it maximizes the utility function U&#8217; defined by:</p>
<blockquote><p>U&#8217; = -infinity if the intuition module ever lies.</p>
<p>U&#8217; = -U otherwise.</p></blockquote>
<p>Then the intuition module will need to &#8220;check&#8221; if it can honestly output &#8220;If you take action X, you will receive negative utility.&#8221; Our decision about taking action X then has two effects (just like our decision in the ordinary blackmail case): it determines our real action, and it determines whether our intuition module can manipulate us in this way.</p>
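<p>A toy rendering of that &#8220;check&#8221; (my illustration, reading the threat as a material conditional, which is only one possible formalization): whether the honest-but-adversarial module may emit the statement depends on your own policy for responding to it.</p>
<pre><code># Hedged sketch: utilities and policies are invented for illustration.
TRUE_UTILITY_OF_X = 1.0                  # suppose taking X is actually good

def threat_is_honest(policy):
    # May the module truthfully say "if you take X, you get negative utility"?
    takes_X = policy(heard_threat=True)
    # Material conditional: vacuously true if, having heard it, you refrain.
    return (not takes_X) or (TRUE_UTILITY_OF_X &lt; 0)

give_in = lambda heard_threat: not heard_threat    # refrains when threatened
defiant = lambda heard_threat: True                # takes X regardless

assert threat_is_honest(give_in)         # giving in licenses the threat
assert not threat_is_honest(defiant)     # defiance makes the threat a lie
</code></pre>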
<h3>The Difficulty</h3>
<p>Things get substantially more difficult when the intuition module is internal to the agent. How can the agent distinguish between the output of the intuition module (which might be a lie if the agent is operating within the intuition module&#8217;s counterfactual) and the reasoning which justifies the agent&#8217;s response to the intuition module&#8217;s output? At some point some reasoning must be trusted as the basis for a decision; if we expect that every thought might be an artifact of a counterfactual then we descend into madness.</p>
<p>One approach to this problem is to have a system of trusted beliefs which is &#8220;too small&#8221; to permit these difficulties, and to treat the external environment as adversarial (though subject to some mathematical restrictions, which you can manipulate by means of your trusted beliefs). It may be possible to carry out some TDT reasoning using a very small set of beliefs which can be trusted, and to bootstrap up to larger agents with better understanding by manipulating the environment.</p>
<p>Nothing here can be stated formally, but here is the mental imagery behind this intuition. Consider a &#8220;robot TDT&#8221; controlled from the inside by a human operator. The human is able to handle some TDT-style reasoning, like the sort outlined above, but cannot in general reason about statements complicated enough to talk about the robot it is controlling. Nevertheless, by using this sort of minimal TDT reasoning, it may be able to direct the robot sensibly and reflectively consistently. For example, if the human sees that the robot has produced a proof that &#8220;If I take X I will receive negative utility,&#8221; the human can understand that it might just be in a counterfactual, so its decisions may have consequences other than the obvious ones.</p>
<p>(To the extent that this mode of thinking is useful, I expect it to be as an intuition pump for discovering what arguments should really motivate action. The human&#8217;s position seems to be little more than a restatement of our current situation with respect to decision theory, which intuition might do better on.)</p><br />  <a rel="nofollow" href="http://feeds.wordpress.com/1.0/gocomments/ordinaryideas.wordpress.com/97/"><img alt="" border="0" src="http://feeds.wordpress.com/1.0/comments/ordinaryideas.wordpress.com/97/" /></a> <img alt="" border="0" src="http://pixel.wp.com/b.gif?host=ordinaryideas.wordpress.com&#038;blog=30479764&#038;post=97&#038;subd=ordinaryideas&#038;ref=&#038;feed=1" width="1" height="1" />]]></content:encoded>
			<wfw:commentRss>http://ordinaryideas.wordpress.com/2011/12/31/counterfactual-blackmail-of-oneself/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/0a58766009bacac271106a2f92289d90?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">paulfchristiano</media:title>
		</media:content>
	</item>
		<item>
		<title>Confronting Incompleteness</title>
		<link>http://ordinaryideas.wordpress.com/2011/12/25/the-truth-game-confronting-incompleteness/</link>
		<comments>http://ordinaryideas.wordpress.com/2011/12/25/the-truth-game-confronting-incompleteness/#comments</comments>
		<pubDate>Sun, 25 Dec 2011 07:58:06 +0000</pubDate>
		<dc:creator><![CDATA[paulfchristiano]]></dc:creator>
				<category><![CDATA[AI Safety]]></category>
		<category><![CDATA[Mathematical Logic]]></category>

		<guid isPermaLink="false">http://ordinaryideas.wordpress.com/?p=89</guid>
		<description><![CDATA[Consider the &#8220;Truth game&#8221;, played by an agent A: A outputs a sequence of mathematical assertions S1, S2, &#8230; For each statement S, A receives exp(-&#124;S&#124;) utilons. If A makes any false statements (regardless of how many it makes) it receives -infinity utilons (or just a constant larger than the largest possible reward). We can [&#8230;]<img alt="" border="0" src="http://pixel.wp.com/b.gif?host=ordinaryideas.wordpress.com&#038;blog=30479764&#038;post=89&#038;subd=ordinaryideas&#038;ref=&#038;feed=1" width="1" height="1" />]]></description>
				<content:encoded><![CDATA[<p>Consider the &#8220;Truth game&#8221;, played by an agent A:</p>
<blockquote><p>A outputs a sequence of mathematical assertions S1, S2, &#8230;</p>
<p>For each statement S, A receives exp(-|S|) utilons.</p>
<p>If A makes any false statements (regardless of how many it makes) it receives -infinity utilons (or just a constant larger than the largest possible reward).</p></blockquote>
<p><span id="more-89"></span></p>
<p>We can view A&#8217;s output in this game as an operationalization of &#8220;facts about which A is certain.&#8221; I believe that some intuitive judgments about &#8220;belief&#8221; are probably poorly grounded, and that thinking about a more concrete situation can patch some of these issues.</p>
<p>Naturally, it seems like A&#8217;s output should obey certain axioms. For example:</p>
<ul>
<li>If A outputs X and Y, then A outputs &#8220;X and Y.&#8221;</li>
<li>If A outputs X and &#8220;X implies Y&#8221;, then A outputs Y.  (&#8220;Modus Ponens&#8221;)</li>
<li>If A outputs X, then A outputs &#8220;X or Y.&#8221;</li>
</ul>
<p>And so on. For example, we could let A output some axioms in first order logic and then all of their logical consequences: then the set of statements output by A is precisely the set of theorems of a certain first order theory.</p>
<p>Humans have a strong intuitive feeling that they can &#8220;go beyond&#8221; any particular first order theory, in a sense formalized by Goedel&#8217;s incompleteness results. In particular, humans can feel pretty confident about &#8216;Anything that humans feel pretty confident about is true,&#8217; which would be impossible if their confidence were only justified by provability in some formal system.</p>
<p>But let&#8217;s consider agent A playing the Truth game. Should A output a statement of the form &#8220;Everything A outputs is true&#8221;? Well, if the game is played in some reasonable language (say, first order statements about the integers) then A probably can&#8217;t articulate this sentence. But it has some good-enough analogs, like &#8220;If A outputs a statement S, A doesn&#8217;t output (not S)&#8221; which are fair game. Should A output these sentences?</p>
<p>If A ever outputs S and (not S), then it doesn&#8217;t matter what A does&#8211;it is getting -infinity utilons anyway. So A might as well assert that A never outputs both S and not(S). Let this statement be Con(A).</p>
<p>Another natural class of statements for A to output are the results of finite computations: if phi is a delta_0 formula (i.e., if we can determine whether phi(x) is true using a deterministic computation) then any true statement of the form &#8220;There is some x such that phi(x)&#8221; should get output by A eventually&#8211;after all, eventually A might as well try every possible candidate x, and if any of them work A can promptly output &#8220;There is some x such that phi(x).&#8221; Call this claim&#8211;that A eventually outputs every such true statement&#8211;Exhaustive(A). It seems clear that A might as well output Exhaustive(A).</p>
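<p>A brute-force rendering of that observation (my sketch; the idealized agent&#8217;s unbounded search is cut off with a finite budget here):</p>
<pre><code># Hedged sketch: a true statement "there is some x such that phi(x)", with
# phi decidable, can be certified by enumeration, so A may safely assert it.
def certify_exists(phi, budget=10**6):
    for x in range(budget):              # the idealized agent searches forever
        if phi(x):
            return "There is some x such that phi(x); witness: x = %d" % x
    return None                          # no witness yet: assert nothing

print(certify_exists(lambda x: x * x &gt; 10))   # finds the witness x = 4
</code></pre>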
<p>Finally, it seems like A should be able to output &#8220;X implies Y&#8221; when there is a simple (say, strict finitist) proof that X implies Y. I&#8217;m not going to dwell on this because it seems neither controversial nor actually problematic. Call this statement Prover(A).</p>
<p>Unfortunately, we are about to get into some trouble.</p>
<p>Consider the statement G = &#8220;A never outputs G.&#8221; It is easy for A to reason as follows:</p>
<blockquote><p>If A outputs G, then by simulating A, A will eventually output &#8216;A outputs G.&#8217; Thus A will have output G and (not G). Thus not Con(A).</p></blockquote>
<p>So if Prover(A), then A outputs &#8220;If A outputs G, then not Con(A)&#8221; which is the same statement as &#8220;If not G, then not Con(A)&#8221; which is in turn the same as &#8220;Con(A) implies G.&#8221; We&#8217;ve already argued that A might as well output &#8220;Con(A)&#8221;. So if we accept modus ponens we are saying that A might as well output &#8220;G&#8221;, in which case A will certainly receive infinite negative utility.</p>
<p>Which one of these legs should be dropped? Should A fail to output Con(A), should A fail to output Exhaustive(A), or should we abandon Prover(A) or modus ponens? If you were in A&#8217;s position you almost certainly wouldn&#8217;t output G. Which would you drop?</p>
<p>I would drop modus ponens, and I suspect that this is the path towards a satisfying theoretical resolution of the problem. When we view mathematical truth as fixed in the background, with A simply trying to discover it, modus ponens is easily justified (and decision-theoretic considerations make no difference). But A is itself made of math, and this seems like an extremely confused perspective; modifying it might seriously clarify mathematical logic (or it might not). A TDT agent in A&#8217;s shoes seems likely to output Con(A) and &#8220;Con(A) implies G&#8221; without outputting G, realizing that although G is true, outputting it still isn&#8217;t advisable.</p>
<p>I will discuss this more in posts to come, but I should say that the real problem to me at this point seems to be: if you don&#8217;t have modus ponens, how do you do any reasoning at all?</p>
<p>(Also, let me say in passing that this analysis will carry over quite directly to the case of agents manipulating subjective uncertainties.)</p><br />  <a rel="nofollow" href="http://feeds.wordpress.com/1.0/gocomments/ordinaryideas.wordpress.com/89/"><img alt="" border="0" src="http://feeds.wordpress.com/1.0/comments/ordinaryideas.wordpress.com/89/" /></a> <img alt="" border="0" src="http://pixel.wp.com/b.gif?host=ordinaryideas.wordpress.com&#038;blog=30479764&#038;post=89&#038;subd=ordinaryideas&#038;ref=&#038;feed=1" width="1" height="1" />]]></content:encoded>
			<wfw:commentRss>http://ordinaryideas.wordpress.com/2011/12/25/the-truth-game-confronting-incompleteness/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/0a58766009bacac271106a2f92289d90?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">paulfchristiano</media:title>
		</media:content>
	</item>
	</channel>
</rss>
