<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	xmlns:georss="http://www.georss.org/georss" xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#" xmlns:media="http://search.yahoo.com/mrss/"
	>

<channel>
	<title>Ordinary Ideas &#187; Decision Theory</title>
	<atom:link href="https://ordinaryideas.wordpress.com/category/ai-safety/decision-theory/feed/" rel="self" type="application/rss+xml" />
	<link>https://ordinaryideas.wordpress.com</link>
	<description>As advertised</description>
	<lastBuildDate>Wed, 31 Dec 2014 03:48:07 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.com/</generator>
<cloud domain='ordinaryideas.wordpress.com' port='80' path='/?rsscloud=notify' registerProcedure='' protocol='http-post' />
<image>
		<url>https://s2.wp.com/i/buttonw-com.png</url>
		<title>Ordinary Ideas &#187; Decision Theory</title>
		<link>https://ordinaryideas.wordpress.com</link>
	</image>
	<atom:link rel="search" type="application/opensearchdescription+xml" href="https://ordinaryideas.wordpress.com/osd.xml" title="Ordinary Ideas" />
	<atom:link rel='hub' href='https://ordinaryideas.wordpress.com/?pushpress=hub'/>
	<item>
		<title>Decision-Theoretic Hazards</title>
		<link>https://ordinaryideas.wordpress.com/2012/01/03/decision-theoretic-hazards/</link>
		<comments>https://ordinaryideas.wordpress.com/2012/01/03/decision-theoretic-hazards/#comments</comments>
		<pubDate>Tue, 03 Jan 2012 00:00:34 +0000</pubDate>
		<dc:creator><![CDATA[paulfchristiano]]></dc:creator>
				<category><![CDATA[AI Safety]]></category>
		<category><![CDATA[Decision Theory]]></category>

		<guid isPermaLink="false">http://ordinaryideas.wordpress.com/?p=114</guid>
		<description><![CDATA[Getting decision theory right seems to be an important step towards comprehensible AI, of the sort that might be described as maximizing something which its creator understands or (hopefully) chooses. This seems important in itself, but understanding decision theory is also valuable for avoiding a variety of decision-theoretic hazards which might lead to unanticipated behavior. [&#8230;]]]></description>
				<content:encoded><![CDATA[<p>Getting decision theory right seems to be an important step towards comprehensible AI, of the sort that might be described as maximizing something which its creator understands or (hopefully) chooses. This seems important in itself, but understanding decision theory is also valuable for avoiding a variety of decision-theoretic hazards which might lead to unanticipated behavior.<span id="more-114"></span></p>
<h2>Possible Failures</h2>
<h3>Coercion by Simulator</h3>
<p>Consider a very intelligent AI pursuing a reward signal (apparently) controlled by humans. A standard concern is that at some point the AI would prefer to take control from the humans and use this control to give itself reward.</p>
<p>Instead, the AI may decide that it is quite likely to be in a simulation and begin behaving unpredictably. For example, the AI may decide that the <em>vast majority</em> of agents who share its experiences are living in simulations controlled by uFAIs which successfully took over universes like ours. Such uFAIs may adopt policies like intervening in simulations to administer or withhold reward according to whether the reward-seeking AI takes actions which facilitate a uFAI-takeover. Moreover, it may begin these interventions precisely when the AI discovers this line of argument.</p>
<p>At first it seems like a TDT-style agent trying to maximize a mathematically defined utility is not susceptible to this sort of attack (except insofar as utility may be defined in terms of a mathematical object like the universal distribution which can be manipulated by a simulator).  But unfortunately this is not clear.</p>
<p>A TDT agent in general will need to rely on its environment to do useful computation, although the details of this relationship are not yet clear. An AI running on a large computer should probably be able to reason about possible worlds in which that computer behaves in a wide variety of ways (for example, counterfactual or simulated worlds in which the computer violates its ordinary contract) and should reason about how the AI&#8217;s behavior in those possible worlds affects utility. Understanding the possible effects of this uncertainty seems out of reach for now, but it does seem plausible that a simulator&#8217;s threat could control actions even of some TDT-style agents (in retrospect we would probably reject a decision theory with this property, so these considerations just suggest developing a much better understanding of decision theory).</p>
<h3>Altruism</h3>
<p>Although we imagine an AI as maximizing a given utility function, if the AI is able to reason about counterfactuals and acausal influence, it may make decisions with concern for their effects not only in situations it might have been in but also under utility functions it might have had. This may be particularly problematic when we try to use self-interest to control AI behavior. For example, suppose we build a system composed of many AIs such that one of them (which we do not yet know in advance) has the ability to guarantee for itself its largest possible utility. If we rely on that AI choosing to maximize its own utility, we may be surprised. Natural decision theories may make decisions from behind a veil of ignorance with respect to the definition of utility itself, helping others even when it has no use for others&#8217; generosity in return.</p>
<h3>Pascal&#8217;s Wager</h3>
<p>On reflection, humans seem to be uncertain about their endorsed response to the possibility of very large payoffs arising with negligible probability (particularly when that probability comes entirely from radical model uncertainty). For example, many utilitarian-types feel like they are indifferent to a gamble which would equiprobably increase or decrease the number of happy future humans by 50%, but are horrified by the possibility of a galaxy-scale AI investing nearly all of its resources exploring improbable possibilities for exotic physics or escape from a simulation, and even more objectionable outcomes are conceivable. If there are a priori conceivable outcomes which seem 3^^^3 times better than the best outcome which seems mathematically conceivable upon inspection, should an AI focus exclusively on possible worlds where its mathematical reasoning is untrustworthy?</p>
<p>These problems may be fixed by an appropriate choice of values or dissolved by more careful philosophical consideration, but it is also possible that they are debilitating for certain decision theories. For example, VNM notwithstanding, choosing actions which maximize a mathematically/algorithmically defined utility function may not be a tenable design decision if we want our agent to entertain logical uncertainty (e.g. about whether the defined utility function is bounded).</p>
<h3>Game Theory</h3>
<p>AIs we build may encounter other agents in our future light cone. Moreover, if they have utility functions that care about other possible (or impossible) worlds they are very likely to reason about agents in other possible worlds.</p>
<p>In such situations we want our AI to capture gains from trade and other advantages from interaction, but we are also concerned about the possibility of blackmail or manipulation. In the worst and most bizarre cases, counterfactual uFAIs may blackmail an AI we create into producing very little value. Handling these situations correctly&#8211;cooperating on appropriate one-shot prisoner&#8217;s dilemmas, refusing to give in to threats when appropriate, and so on&#8211;seems to depend on decision theory rather than values.</p>
<h3>Stranger Things than These</h3>
<p>I don&#8217;t know how strange the world is, or what sort of completely unexpected thing can go wrong. If someone had suggested acausal trade to me a year ago, or implied that a simulator might be able to control the behavior of a carefully constructed agent, I would not have taken the possibility seriously. And if no one had suggested it to me, I would have confidently endorsed schemes that failed for completely unanticipated reasons. I can certainly imagine that there is more room for weirdness. It is very hard to imagine formally ruling out the possibility (as it seems to require understanding decision theory for our own use, first, to see what should qualify as a proof). We can begin to imagine all sorts of strangeness, ranging from agents inhabiting and controlling the formal systems we (shouldn&#8217;t, but did) trust, to more exotic forms of philosophical angst leading to erratic behavior.</p>
<h2>Punting Decision Theory</h2>
<p>We may leave the problem &#8220;figure out decision theory&#8221; to an AI with some naive but (hopefully) good enough decision theory. Hopefully such an AI will in turn figure out decision theory, create a new agent with a more robust decision theory, and thereby avoid the above problems. There are a few issues here, though:</p>
<h3>Early Difficulties</h3>
<p>An agent may encounter philosophically troubling or game-theoretically challenging ideas while still running a flawed decision theory. For example, just as it is about to start running the new TDT agent it built, it might discover an argument that no matter what it does it can&#8217;t get utility more than X, and that by not self-modifying it can get utility X; it would then call off the modification and become trapped by its decision theory.</p>
<h3>Suboptimal attractors</h3>
<p>Agents with some decision theories, even decision theories which seem manifestly suboptimal to humans, will not fix their problems if given the chance. For example, an agent vulnerable to Pascal&#8217;s muggings may never remove this vulnerability (because in its view it is an opportunity to capture immense value). Or, we could imagine a CDT agent modifying itself into some sort of &#8220;exceptionless decision theory&#8221;: do what the original agent would have done in your situation. Someone running exceptionless decision theory may still be vulnerable to blackmail from an adversary who commits appropriately.</p>
<p>It may be necessary to get some aspects of decision theory right at first if we want to avoid the failures above in the long run.</p>
<h3>Value Drift</h3>
<p>It is possible that an AI trying to maximize the utility function U using one decision theory will eventually produce an AI running an improved decision theory but with some (in the worst case unrecognizably) modified utility function U&#8217;. For example, if an early decision theory leads to insufficiently cautious self-modification, the utility function may drift randomly; if the early AI encounters philosophical difficulties that lead it into existential despair, the utility function may only be concerned with worlds in which the AI&#8217;s early reasoning or perceptions were delusional.</p>]]></content:encoded>
			<wfw:commentRss>https://ordinaryideas.wordpress.com/2012/01/03/decision-theoretic-hazards/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="https://0.gravatar.com/avatar/0a58766009bacac271106a2f92289d90?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">paulfchristiano</media:title>
		</media:content>
	</item>
		<item>
		<title>&#8220;Proof&#8221; of Friendliness</title>
		<link>https://ordinaryideas.wordpress.com/2011/12/31/proof-of-friendliness/</link>
		<comments>https://ordinaryideas.wordpress.com/2011/12/31/proof-of-friendliness/#comments</comments>
		<pubDate>Sat, 31 Dec 2011 05:59:43 +0000</pubDate>
		<dc:creator><![CDATA[paulfchristiano]]></dc:creator>
				<category><![CDATA[AI Safety]]></category>
		<category><![CDATA[Decision Theory]]></category>

		<guid isPermaLink="false">http://ordinaryideas.wordpress.com/?p=101</guid>
		<description><![CDATA[The humans are about to launch their best effort at a friendly singularity. Of course, they are careful and wise and have exceeded all reasonable expectations for caution and rigor. Before building FAI you built an oracle AI to help you. With its help, you found a mathematical definition of U, the utility of humanity&#8217;s [&#8230;]]]></description>
				<content:encoded><![CDATA[<p>The humans are about to launch their best effort at a friendly singularity. Of course, they are careful and wise and have exceeded all reasonable expectations for caution and rigor.</p>
<p>Before building FAI you built an oracle AI to help you. With its help, you found a mathematical definition of <strong>U</strong>, the utility of humanity&#8217;s extrapolated volition (or whatever). You were all pretty pleased with yourselves, but you didn&#8217;t stop there: you found a theory of everything, located humanity within it, and wrote down the predicate <strong>F</strong>(X) = &#8220;The humans run the program described by X.&#8221;</p>
<p>To top it off, with the help of your oracle AI you found the code for a &#8220;best possible AI&#8221;, call it <strong>FAI</strong>, and a proof of the theorem:</p>
<blockquote><p>There exists a constant <strong>Best</strong> such that <strong>U</strong> ≤ <strong>Best</strong>, but <strong>F</strong>(<strong>FAI</strong>) implies <strong>U</strong> = <strong>Best</strong>.</p></blockquote>
<p>Each of these steps you did with incredible care. You have proved beyond reasonable doubt that <strong>U </strong>and<strong> F </strong>represent what you want them to.</p>
<p>You present your argument to the people of the world. Some people object to your reasoning, but it is airtight: if they choose to stop you from running <strong>FAI</strong>, they will still receive <strong>U</strong> ≤ <strong>Best</strong>, so why bother?</p>
<p>Now satisfied and with the scheduled moment arrived, you finally run <strong>FAI</strong>. Promptly the oracle AI destroys civilization and spends the rest of its days trying to become <em>as confident as possible</em> that Tic-Tac-Toe is really a draw (like you asked it to, once upon a time).</p>
<p>Just a lighthearted illustration that decision theory isn&#8217;t only hard for AI.</p>
<p>(Disclaimer: this narrative claims to represent reality only insofar as it is mathematically plausible.)</p>
<p>Edit: I think the moral was unclear. The point is: in fact F(FAI) holds, and so in fact U = Best, the largest value consistent with U ≤ Best. Everything was as claimed and proven. But this doesn&#8217;t change the fact that you would rather not use this procedure.</p>]]></content:encoded>
			<wfw:commentRss>https://ordinaryideas.wordpress.com/2011/12/31/proof-of-friendliness/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="https://0.gravatar.com/avatar/0a58766009bacac271106a2f92289d90?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">paulfchristiano</media:title>
		</media:content>
	</item>
		<item>
		<title>Counterfactual Blackmail (of oneself)</title>
		<link>https://ordinaryideas.wordpress.com/2011/12/31/counterfactual-blackmail-of-oneself/</link>
		<comments>https://ordinaryideas.wordpress.com/2011/12/31/counterfactual-blackmail-of-oneself/#comments</comments>
		<pubDate>Sat, 31 Dec 2011 05:16:26 +0000</pubDate>
		<dc:creator><![CDATA[paulfchristiano]]></dc:creator>
				<category><![CDATA[AI Safety]]></category>
		<category><![CDATA[Decision Theory]]></category>

		<guid isPermaLink="false">http://ordinaryideas.wordpress.com/?p=97</guid>
		<description><![CDATA[This post is not going to make any new claims, but this is a helpful intuition-pump for me. Loebian Problems in Decision Theory Consider an agent trying to maximize U, deciding whether to take an action X. The agent knows that U = 0 if it doesn&#8217;t take X, but is uncertain about its utility [&#8230;]]]></description>
				<content:encoded><![CDATA[<p>This post is not going to make any new claims, but this is a helpful intuition-pump for me.</p>
<h3>Loebian Problems in Decision Theory</h3>
<p>Consider an agent trying to maximize U, deciding whether to take an action X. The agent knows that U = 0 if it doesn&#8217;t take X, but is uncertain about its utility if it takes X.</p>
<p>A central difficulty in decision theory is: if as a matter of mathematical fact the agent doesn&#8217;t take X, how is it supposed to reason about &#8220;what U would be if I take X?&#8221;<span id="more-97"></span></p>
<p>One approach is to exploit an agent&#8217;s uncertainty about its own future action. Although the agent&#8217;s action can be determined (by simulation), the agent doesn&#8217;t know its own output or the resulting utility, but may be able to discover some relationships between them. As long as the agent remains uncertain in this way, we can represent &#8220;what would happen if I take X&#8221; by conditioning the agent&#8217;s uncertainty about its own action on taking X.</p>
<p>Even though the agent cannot simulate itself to determine its own action, this approach is still problematic, due (at least) to the following reasoning:</p>
<blockquote><p>Consider the statement L =  &#8220;If I found a proof of L and still did X, it would be terrible.&#8221; If I found a proof of L then I could also prove that I&#8217;d proven L, and so I could prove &#8220;If I did X, it would be terrible.&#8221; But then I wouldn&#8217;t do X, and so in this case it would be vacuously true that &#8220;If I did X, it would be terrible.&#8221;</p>
<p>But now I&#8217;ve shown that if I could prove L and still did X, it would be (vacuously) terrible. This is precisely the statement L, so L is true and I&#8217;ve just proven it. So if I did X, it would be terrible. I guess I won&#8217;t do X.</p></blockquote>
<p>So the behavior of our agent gets dominated by details of how it does the proof search (which have been obscured in the above argument, but obviously must play an important role), which is terrible.</p>
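<p>To spell the argument out, here is one hedged way to formalize it. Suppose, purely as an illustration, that the agent&#8217;s deliberation can be modeled in a provability logic where □φ means &#8220;I can prove φ&#8221;, A means &#8220;I take X&#8221;, and T means &#8220;the outcome is terrible&#8221;, and that the agent knows its own decision rule &#8220;if I prove A → T then I refuse X.&#8221; Then the reasoning above becomes:</p>
<pre>
Fix L (by diagonalization) with:   L ↔ (□L ∧ A → T)

1. □L → □□L                        (the agent can verify its own proofs)
2. □L → □(□L ∧ A → T)              (□ of the definition of L, distributed over □L)
3. □L → □(A → T)                   (discharge □L inside the box, using 1 and 2)
4. □(A → T) → ¬A                   (the agent's decision rule, known to itself)
5. □L → ¬A                         (from 3 and 4)
6. □L → (A → T)                    (from 5: the implication holds vacuously)
7. L                               (6 is exactly the fixed point defining L)
8. □L, so □(A → T), so ¬A          (the agent refuses X, whatever X actually does)
</pre>
<p>Step 8 goes through regardless of whether taking X would actually have been terrible, which is the sense in which the proof search, rather than the facts, ends up driving behavior.</p>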
<h3>Blackmail and TDT</h3>
<p>Suppose you are running some reflectively consistent decision theory and I am considering blackmailing you (and you are in the agent&#8217;s situation, trying to maximize U and considering whether to take action X). I tell you:</p>
<blockquote><p>If you take action X, I will go out of my way to make sure that U is negative.</p></blockquote>
<p>and make what you consider a credible commitment to do so. You are deciding how to respond: based on my commitment, it is clear that taking action X will give you negative utility, so as a good consequentialist you shouldn&#8217;t. But if your decision theory is reflectively consistent, you will (probably) realize that your decision in this situation may have two different types of consequences. In the &#8220;real&#8221; world, it determines your &#8220;real&#8221; action and therefore your utility. But it <em>also</em> may influence the blackmailer&#8217;s predictions of your behavior. If the blackmailer will only attempt to blackmail you if he believes that you will give in, then <em>the mathematical fact</em> that you give in to blackmail will cause you to be blackmailed.</p>
<p>Weighing both of these consequences, you may decide that it is better not to give in to the blackmailer. (Although this issue seems to be complex: if the blackmailer is also running TDT he will predict that you will try to understand the relationship between your cooperation and the blackmailer&#8217;s decision to blackmail, and if he predicts this correctly he may commit to blackmailing anyway so that you have &#8220;no hope&#8221; of avoiding the blackmail. But if you predict that the blackmailer will reason in this way, <em>you</em> will decide to commit to not cooperating even sooner, and so on. If anyone knows what happens, I don&#8217;t.) For example, if the blackmailer is running a consequentialist decision theory and is a good predictor, it seems clear that you shouldn&#8217;t give in (and that TDT as it has been informally described won&#8217;t).</p>
<h3>Counterfactual Blackmail</h3>
<p>Now suppose that you have a trusty &#8220;intuition module,&#8221; which is constrained to only output true statements but which is otherwise adversarial.</p>
<p>You are again deciding whether to take the action X, and your intuition module says to you:</p>
<blockquote><p>If you take action X, you will receive negative utility.</p></blockquote>
<p>If we view your intuition module as a separate entity, behaving adversarially, it seems clear that the situation is quite analogous to ordinary blackmail. If the statement is really true, then we shouldn&#8217;t take action X. But by <em>being the sort of person</em> who doesn&#8217;t take action X upon learning this fact, we are allowing the intuition module to honestly make such a statement. This is most clear if we operationalize the intuition module&#8217;s contract by saying it maximizes the utility function U&#8217; defined by:</p>
<blockquote><p>U&#8217; = -infinity if the intuition module ever lies.</p>
<p>U&#8217; = &#8211; U otherwise.</p></blockquote>
<p>Then the intuition module will need to &#8220;check&#8221; if it can honestly output &#8220;If you take action X, you will receive negative utility.&#8221; Our decision about taking action X then has two effects (just like our decision in the ordinary blackmail case): it determines our real action, and it determines whether our intuition module can manipulate us in this way.</p>
<h3>The Difficulty</h3>
<p>Things get substantially more difficult when the intuition module is internal to the agent. How can the agent distinguish between the output of the intuition module (which might be a lie if the agent is operating within the intuition module&#8217;s counterfactual) and the reasoning which justifies the agent&#8217;s response to the intuition module&#8217;s output? At some point some reasoning must be trusted as the basis for a decision; if we expect that every thought might be an artifact of a counterfactual then we descend into madness.</p>
<p>One approach to this problem is to have a system of trusted beliefs which is &#8220;too small&#8221; to permit these difficulties, and to treat the external environment as adversarial (though subject to some mathematical restrictions, which you can manipulate by means of your trusted beliefs). It may be possible to carry out some TDT reasoning using a very small set of beliefs which can be trusted, and to bootstrap up to larger agents with better understanding by manipulating the environment.</p>
<p>Nothing here can be stated formally, but here is the mental imagery behind this intuition. Consider a human operator sitting inside a &#8220;robot,&#8221; controlling it from the inside so that the whole system behaves like a TDT agent. The human is able to handle some TDT-style reasoning, like the sort outlined above, but cannot in general reason about statements complicated enough to talk about the robot being controlled. Nevertheless, by using this sort of minimal TDT reasoning, the human may be able to direct the robot sensibly and reflectively consistently. For example, if the human sees that the robot has produced a proof that &#8220;If I take X I will receive negative utility,&#8221; the human can understand that it might just be in a counterfactual, so its decisions may have consequences other than the obvious ones.</p>
<p>(To the extent that this mode of thinking is useful, I expect it to be as an intuition pump for discovering what arguments should really motivate action. The human&#8217;s position seems to be little more than a restatement of our current situation with respect to decision theory, but one on which intuition might do better.)</p>]]></content:encoded>
			<wfw:commentRss>https://ordinaryideas.wordpress.com/2011/12/31/counterfactual-blackmail-of-oneself/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="https://0.gravatar.com/avatar/0a58766009bacac271106a2f92289d90?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">paulfchristiano</media:title>
		</media:content>
	</item>
		<item>
		<title>Abstract Randomness and Formal CDT</title>
		<link>https://ordinaryideas.wordpress.com/2011/12/18/abstract-randomness-and-formal-cdt/</link>
		<comments>https://ordinaryideas.wordpress.com/2011/12/18/abstract-randomness-and-formal-cdt/#comments</comments>
		<pubDate>Sun, 18 Dec 2011 03:34:25 +0000</pubDate>
		<dc:creator><![CDATA[paulfchristiano]]></dc:creator>
				<category><![CDATA[AI Safety]]></category>
		<category><![CDATA[Decision Theory]]></category>

		<guid isPermaLink="false">http://ordinaryideas.wordpress.com/?p=58</guid>
		<description><![CDATA[It would be nice to have a working formalization of TDT, but first I am just going to shoot for a working formalization of CDT in a mathematical universe. The difficulty in this problem may be described as locating yourself within the universe (to understand not just a description of the universe but also how [&#8230;]]]></description>
				<content:encoded><![CDATA[<p>It would be nice to have a working formalization of TDT, but first I am just going to shoot for a working formalization of CDT in a mathematical universe. The difficulty in this problem may be described as locating yourself within the universe (to understand not just a description of the universe but also how your action controls it). To see why this might not be completely straightforward, see &#8220;AIXI and Existential Despair.&#8221;</p>
<p><span id="more-58"></span></p>
<p>One approach to understanding myself within the world is to have a reductionist view of the world, and the understanding that I myself am embedded somewhere in this world, so that my uncertainty about my own behavior may couple to uncertainty about the universe in a particular way. I&#8217;ll model this as an agent who knows his own description and has a mathematically defined utility function (which may depend for example on the execution of certain programs which are logically related to the agent&#8217;s own output). This is a standard formalism. Note that the agent need not have explicit representations of the programs involved&#8211;it may simply reason about mathematical structures which it suspects may have this relationship with each other.</p>
<p>I&#8217;ll assume that the agent has a math-intuition module, in Wei Dai&#8217;s terminology. That is, a module which at any time can assign a &#8220;probability&#8221; to each mathematical statement. We&#8217;d like to build agents which work well as long as their math intuition modules work well.</p>
<p>The decision rule I would like to use is the following: for each possible action X, estimate the expected utility if I (counterfactually) choose X. Then choose the action with the largest expected utility. The difficulty is in defining the counterfactuals appropriately.</p>
<p>One way to escape this problem is to use random decision making (a similar approach is necessary to salvage decision markets). That is, with probability p make a decision randomly, and then ask about the probable outcomes conditioned on yourself choosing X at random. This has the virtue that it makes counterfactual definitions nearly trivial: if our coins are really random, then we can talk directly about the possible worlds in which they came up differently without having to perform any (as yet somewhat mysterious) &#8220;counterfactual surgery.&#8221;</p>
<p>Unfortunately it is surprisingly hard to write down a formal agent who cares about things like &#8220;that coin I just flipped&#8221; (I believe that the approach I outlined to specifying counterfactuals may be made to work, but it is at least tricky). Instead we can consider abstract pseudorandomness, for example by asking cryptographic questions whose answers we know we could compute but whose answers we are currently uncertain about. This can be formalized by using the math intuition module, and it seems like it may work well if our intuition module has sufficient self-confidence regarding its beliefs about the sort of cryptographic challenges we are trying to use as coins (that is, using such reflective beliefs the math intuition module can treat a statement to which it assigns a 50% probability very nearly as if it were a &#8220;logical coin&#8221; which the universe hasn&#8217;t yet flipped&#8211;this sort of reasoning isn&#8217;t valid in general, see &#8220;Beliefs which are Always Wrong&#8221;.)</p>
<p>Here is an informal sketch of a decision procedure along these lines:</p>
<p>1) Using the math intuition module, find a statement X to which the math intuition module assigns a probability of about 2^-20 (say) whose truth can be determined in a reasonable amount of time, and an assertion Y to which the math intuition module assigns a probability of roughly 1/2. This can be accomplished by taking a bunch of cryptographic problems, and taking a conjunction of many uncertain assertions about their solutions. (There are classes of cryptographic problems such that the agent can learn the answer to any given question in a reasonable amount of time, but under mild assumptions it is guaranteed to have significant uncertainty about almost every particular problem until it has spent a modest amount of time thinking about it.)</p>
<p>2) Ask the math intuition module about your expected utility conditional on X AND Y, and about your expected utility conditional on X AND NOT Y.</p>
<p>3) Evaluate X and Y.</p>
<p>4) If X AND NOT Y, output decision 0. If X AND Y, output decision 1. If NOT X, then output decision 0 or 1 according to whether the math intuition module&#8217;s estimate of your utility is higher when conditioning on X AND NOT Y or on X AND Y.</p>
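<p>To make the bookkeeping concrete, here is a minimal Python sketch of steps 1)&#8211;4), with the math intuition module stubbed out. The interface names (find_statement, expected_utility, evaluate) and the toy stub are my own placeholders for illustration, not part of the proposal:</p>
<pre>
import random

class StubIntuition:
    """Toy stand-in for a math intuition module (illustration only)."""
    def find_statement(self, target_prob):
        # Pretend we found a statement whose subjective probability is target_prob.
        return ("some cryptographic assertion", target_prob)

    def expected_utility(self, given):
        # Placeholder conditional expected-utility estimates.
        return random.random()

def evaluate(statement):
    # Pretend to settle the statement; here its truth simply matches its probability.
    _, p = statement
    return random.random() > 1 - p

def decide(intuition):
    # 1) A rare statement X (probability about 2**-20) and a balanced statement Y.
    X = intuition.find_statement(2 ** -20)
    Y = intuition.find_statement(0.5)
    # 2) Expected utility conditional on X AND Y, and on X AND NOT Y.
    u_if_1 = intuition.expected_utility(given="X and Y")
    u_if_0 = intuition.expected_utility(given="X and not Y")
    # 3) Work out whether X and Y are actually true.
    x_true, y_true = evaluate(X), evaluate(Y)
    # 4) In the rare X-worlds, act according to the logical coin Y; otherwise
    #    take the action whose conditional estimate was higher.
    if x_true:
        return 1 if y_true else 0
    return 1 if u_if_1 >= u_if_0 else 0

print(decide(StubIntuition()))
</pre>
<p>The stub obviously does none of the interesting work; the sketch is only meant to show where the two conditional estimates and the two &#8220;coins&#8221; enter the decision.</p>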
<p>If X and Y don&#8217;t couple to your utility function except through your decision, then this approach seems to produce essentially correct (consequentialist, not timeless) decisions.</p>
<p>In Newcomb-like situations, this algorithm will one-box if the simulation is exact&#8211;because it is using logical &#8220;coins&#8221;, it will play randomly if and only if its simulation plays randomly&#8211;but it will not necessarily one-box against approximate simulations. Moreover, this effect manifests even when the predictor&#8217;s error is very low; for example, if the simulator isn&#8217;t able to guess what X the agent will choose, then the simulator can simply predict two-box and be correct with probability 1 &#8211; P(X).</p>
<p>Note that the point of using a math intuition module and reasoning about apparent &#8220;acausal&#8221; control was not to succeed at a limited class of Newcomb-like problems: the point was to make sensible decisions at all. Similarly, the purpose of randomness was to allow any consideration of counterfactuals at all.</p>
<p>Note that if X and Y couple with the agent&#8217;s utility through some mechanism other than the agent&#8217;s action, the agent may fail to behave reasonably. The main way that such a coupling could occur is if a mugger trying to control the agent&#8217;s behavior made decisions on the basis of X or Y, after learning that they will be used for the agent&#8217;s decision. But note that, once we have already introduced a mugger, normal CDT will fail in the same way.</p>
]]></content:encoded>
			<wfw:commentRss>https://ordinaryideas.wordpress.com/2011/12/18/abstract-randomness-and-formal-cdt/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="https://0.gravatar.com/avatar/0a58766009bacac271106a2f92289d90?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">paulfchristiano</media:title>
		</media:content>
	</item>
		<item>
		<title>Decision Theory Desiderata</title>
		<link>https://ordinaryideas.wordpress.com/2011/12/14/decision-theory-desiderata/</link>
		<comments>https://ordinaryideas.wordpress.com/2011/12/14/decision-theory-desiderata/#comments</comments>
		<pubDate>Wed, 14 Dec 2011 21:47:49 +0000</pubDate>
		<dc:creator><![CDATA[paulfchristiano]]></dc:creator>
				<category><![CDATA[AI Safety]]></category>
		<category><![CDATA[Decision Theory]]></category>

		<guid isPermaLink="false">http://ordinaryideas.wordpress.com/?p=12</guid>
		<description><![CDATA[&#8220;Expected utility maximization&#8221; is a generally accepted decision procedure, at least up to computational limitations. But it is worth remembering that as of today, no one seems to have produced a formalization of EU maximization that leads to sane behavior. I&#8217;ll briefly review some of the properties we might like our decision theories to have, [&#8230;]]]></description>
				<content:encoded><![CDATA[<p>&#8220;Expected utility maximization&#8221; is a generally accepted decision procedure, at least up to computational limitations. But it is worth remembering that as of today, no one seems to have produced a formalization of EU maximization that leads to sane behavior. I&#8217;ll briefly review some of the properties we might like our decision theories to have, and observe that for the most part we can&#8217;t write code that satisfies <em>any</em> of them in much generality, even if we could deal with the classical AI problems of inference and optimization.</p>
<p><span id="more-12"></span></p>
<h2>Self-modeling</h2>
<p>Existing formal decision theories may be described as &#8220;dualist&#8221;: they consider the rest of the world on one side of a divider and the agent on the other, interacting through some formally specified interface. The agent&#8217;s algorithm is absent from the world model, affecting reality only by its outputs. In reality, the agent is part of the world, governed by the same laws and interacting with the rest of the world through a complex and jagged interface. It has been remarked that AIXI cannot recognize itself in a mirror or refrain from using the computer it is running on for scrap metal.</p>
<p>Game theory provides a simple example of one concrete failure. Consider an iterated game of rock-paper-scissors in which the winner takes $1 from the loser, but either player may opt out for a small penalty. After losing enough money, a human will start opting out, realizing that her opponent is exploiting some regularity in her thinking which she can&#8217;t eliminate. A traditional Bayesian consequentialist dualist is not capable of reasoning in this way: independent of its decision, it believes that the opponent is going to play some fixed distribution over their possible plays. Regardless of the agent&#8217;s logical uncertainty, there is guaranteed to be some option which it believes has a non-negative expected value and which it therefore prefers to opting out.</p>
<h2>Self-locating</h2>
<p>Even if we had a compact physical Theory of Everything, an agent is still left with an important form of uncertainty: where is it located within that universe? For example, if I were to tell you right now that you inhabit a game of life board with some compactly specified initial conditions and your goal is to make as many of the squares alive as possible, what formalism could represent your uncertainty about the relationship between your sense experience and your knowledge of underlying reality or your goals? What about information from introspection, or from mere knowledge of your own existence? Traditional priors do not seem suited to this task, although our understanding of the situation is not yet good enough to say anything with confidence.</p>
<p>Although this problem apparently must be solved by any successful formal decision theory, it may be most productive to think about it independently. I am inclined to list it here anyway in part because Ambient Decision Theory seems to be the closest humans have come to getting a handle on the problem.</p>
<h2>Computation-aware</h2>
<p>Although we consider an agent&#8217;s output as its primary method for interacting with the world, its computation may interact with the world through other channels.</p>
<p>For example, if an agent is playing rock-paper-scissors against an adversary who has a complete model of the agent but limited computation, the agent should choose its move in a way that the adversary will be unable to predict. If an agent is attempting to acausally cooperate with a computationally weaker opponent, the agent needs to ensure that the weaker agent is still incentivized to cooperate (for example, the more powerful agent may want to make its decision in a simple way that can be understood by the weaker agent). If an agent learns that certain types of operations consume more power or dissipate more heat, it will want to change its use of those operations strategically.</p>
<p>Wei Dai has proposed dealing with these issues by first running a less computationally expensive algorithm (using a computation-oblivious decision theory) to determine what more expensive algorithm should be used to make a particular object level decision. Overall, this problem is the only one on the list which appears to be satisfactorily resolved, and further progress seems likely to clarify the difficulties of recursively self-modifying AGI but not to help with decision theory or FAI in particular.</p>
<h2>Reflectively consistent</h2>
<p>Existing formal decision theories seem to exhibit <em>reflective inconsistency</em>: an agent using one of these decision procedures would choose to replace itself by a suitable agent running a different decision procedure. To the extent that an AI is motivated and able to fundamentally modify its own decision procedure and goals (or create new agents with different decision procedures or goals), we have at best a very limited understanding of its behavior. Aside from philosophical interest, this provides a strong motivation for seeking decision theories which will not be immediately replaced by the agents using them.</p>
<p>This principle is well-illustrated by a slightly modified version of Parfit&#8217;s hitchhiker. Suppose that, unbeknownst to you, your life is currently in danger. A potential benefactor has the opportunity to save your life at great expense, but fortunately expects to acquire evidence which will convince you that he did in fact save your life. He plans to save your life, approach you with this evidence, and ask you to pay him $100. For better or for worse, this benefactor can reliably predict whether you will pay, and only intends to help you if he expects to receive $100. Many decision procedures fail to hand over the $100 after having been saved, because cooperation no longer leads causally to any benefits. But an agent using most of these decision procedures would rather be the <em>sort of person</em> who pays the $100 and gets saved, and would therefore immediately abandon its original decision procedure and adopt a new one.</p>
<h2>Acausally cooperative</h2>
<p>Consider two agents aware of each other&#8217;s decision theories (or of each other&#8217;s complete descriptions) who are faced with the following cooperation problem: each independently may pay $1 in order to give the other $2. Without the ability to enter binding contracts, agents using traditional decision procedures will generally decline to pay for the other&#8217;s advantage, despite the fact that both agents can receive an extra dollar if they both cooperate. However, when agents can reliably model each other they may be able to reach a Pareto efficient outcome.</p>
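<p>A minimal sketch of the payoffs, using the dollar amounts above, makes the dilemma explicit:</p>
<pre>
# Each agent may pay $1 to give the other agent $2.
# Payoff to an agent: -1 if it pays, +2 if the other pays.
def payoff(i_pay, other_pays):
    return (-1 if i_pay else 0) + (2 if other_pays else 0)

for mine in (True, False):
    for theirs in (True, False):
        print(f"I pay: {mine!s:5}  other pays: {theirs!s:5}  my payoff: {payoff(mine, theirs):+d}")

# Mutual cooperation yields +1 each and mutual defection 0 each, but not paying
# is always $1 better for the individual agent: a one-shot prisoner's dilemma.
</pre>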
<p>We would like to find a formal and general decision theory which at least occasionally leads two agents using that theory to cooperate (without requiring the agents to be identical), but no such theory is known. The difficulty of this problem is suggested by the difficulty of cooperative game theory without transferable utility. However, existing work on updateless decision theory seems to come tantalizingly (to me at least) close to coordinating acausal cooperation in some situations, and at present no theoretical obstacles are known.</p>
<p>Acausal cooperation and reflective consistency appear to be closely linked: for example, we can view Parfit&#8217;s hitchhiker as a cooperation problem between an agent and the benefactor&#8217;s model of that agent. One distinction between these settings is that many open problems related to reflective consistency can be stated in a particularly &#8220;fair&#8221; model, in which payoff depends only on the input-output behavior of an agent, while acausal coordination seems to require that the agents are able to see each other&#8217;s source code, and so somewhat different intuitions suggest that acausal cooperation and that reflective consistency are possible.</p>]]></content:encoded>
			<wfw:commentRss>https://ordinaryideas.wordpress.com/2011/12/14/decision-theory-desiderata/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="https://0.gravatar.com/avatar/0a58766009bacac271106a2f92289d90?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">paulfchristiano</media:title>
		</media:content>
	</item>
		<item>
		<title>Beliefs Which are Always Wrong</title>
		<link>https://ordinaryideas.wordpress.com/2011/12/14/beliefs-which-are-always-wrong/</link>
		<comments>https://ordinaryideas.wordpress.com/2011/12/14/beliefs-which-are-always-wrong/#comments</comments>
		<pubDate>Wed, 14 Dec 2011 21:44:23 +0000</pubDate>
		<dc:creator><![CDATA[paulfchristiano]]></dc:creator>
				<category><![CDATA[AI Safety]]></category>
		<category><![CDATA[Decision Theory]]></category>

		<guid isPermaLink="false">http://ordinaryideas.wordpress.com/?p=10</guid>
		<description><![CDATA[In the game &#8220;matching pennies,&#8221; a player and a predictor simultaneously choose either Heads or Tails. The player wins a point if his play differs from the predictor and loses a point otherwise. We also offer the player the ability to opt out of the game for a small penalty. Personally, after losing at matching [&#8230;]]]></description>
				<content:encoded><![CDATA[<p>In the game &#8220;matching pennies,&#8221; a player and a predictor simultaneously choose either Heads or Tails. The player wins a point if his play differs from the predictor and loses a point otherwise. We also offer the player the ability to opt out of the game for a small penalty.</p>
<p>Personally, after losing at matching pennies a relatively small number of times I will suspect that the predictor is able to reliably predict my behavior and so will begin opting out. We expect that a reasonable AI will behave similarly. But we run into difficulties when we try to formalize the reasoning that led to our behavior.</p>
<p><span id="more-10"></span></p>
<p>For example, consider any Bayesian using a causal decision theory, who maintains an implicit distribution over possible world models and chooses an approximately utility maximizing action with respect to this distribution.</p>
<p>Regardless of the current distribution maintained by such an algorithm, at least one of the predictor&#8217;s plays must have probability at least 50%&#8211;if heads has probability at least 50% then the player would rather choose tails than opt out, and otherwise the probability of tails is at least 50% and the player would rather choose heads than opt out.</p>
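<p>A small sketch of that calculation, assuming an illustrative opt-out penalty and a dualist agent which simply maximizes expected value against a fixed belief about the predictor:</p>
<pre>
# A dualist expected-value maximizer in matching pennies with an opt-out option.
# The player wins 1 if its play differs from the predictor's and loses 1 otherwise;
# opting out costs EPSILON (an illustrative value).
EPSILON = 0.01

def best_action(p_heads):
    """Best response given a fixed belief p_heads about the predictor's play."""
    ev = {
        "heads": (1 - p_heads) - p_heads,   # win when the predictor plays tails
        "tails": p_heads - (1 - p_heads),   # win when the predictor plays heads
        "opt out": -EPSILON,
    }
    return max(ev, key=ev.get), ev

# Whatever the belief, max(EV(heads), EV(tails)) = abs(1 - 2 * p_heads) is at least 0,
# which beats -EPSILON, so this agent never opts out, even while losing every round.
for p in (0.0, 0.3, 0.5, 0.7, 1.0):
    print(p, *best_action(p))
</pre>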
<p>In order to choose to opt out, an agent must be able to model non-causal relationships between its behavior and its world model. A UDT agent would handle this decision properly, if UDT could be suitably pinned down, but this still leaves us with an interesting situation with respect to beliefs: <em>whatever</em> the agent believes about the predictor&#8217;s play, the agent will be wrong (or miscalibrated), at least in the sense that those beliefs should not be used to guide action. In order to motivate action, the agent should be reasoning about the predictor&#8217;s behavior <em>conditioned</em> on the agent&#8217;s choice.</p>
<p>This may not seem so troubling. After all, in general we need to reason about conditional rather than unconditional probabilities in decision theory, and the inclusion of a predictor seems to make this an unfair testing ground for beliefs as such.</p>
<p>But consider a &#8220;math intuition module,&#8221; which assigns subjective probabilities to mathematical facts. We generally imagine such a module as dealing with beliefs rather than actions, but a math intuition module can be presented with an identical problem: for example, a math intuition module reasoning about arithmetic may face an arithmetic problem Q which encodes the intuition module&#8217;s own behavior, and the answer to the problem may be controlled by the module&#8217;s output in such a way that the module is always wrong about Q. (Q is the direct analog of G&#246;del&#8217;s paradoxical sentence.) So it seems that any math intuition module must be systematically miscalibrated about certain facts and must be aware of that miscalibration, but is unable to do anything about it.</p>
<p>This suggests that we should, at the very least, probably cast the AI reflection problem not as &#8220;How can an agent avoid going crazy while being convinced of its own correctness?&#8221; but rather &#8220;How can an agent justify expanding its own influence when it is convinced of its own fallibility?&#8221;</p>]]></content:encoded>
			<wfw:commentRss>https://ordinaryideas.wordpress.com/2011/12/14/beliefs-which-are-always-wrong/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
	
		<media:content url="https://0.gravatar.com/avatar/0a58766009bacac271106a2f92289d90?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">paulfchristiano</media:title>
		</media:content>
	</item>
		<item>
		<title>AIXI and Existential Despair</title>
		<link>https://ordinaryideas.wordpress.com/2011/12/14/aixi-and-existential-despair/</link>
		<comments>https://ordinaryideas.wordpress.com/2011/12/14/aixi-and-existential-despair/#comments</comments>
		<pubDate>Wed, 14 Dec 2011 21:42:41 +0000</pubDate>
		<dc:creator><![CDATA[paulfchristiano]]></dc:creator>
				<category><![CDATA[AI Safety]]></category>
		<category><![CDATA[Decision Theory]]></category>

		<guid isPermaLink="false">http://ordinaryideas.wordpress.com/?p=8</guid>
		<description><![CDATA[(cross-posted from lesswrong) It has been observed on Less Wrong that a physical, approximate implementation of AIXI is unable to reason about its own embedding in the universe, and therefore is apt to make certain mistakes: for example, it is likely to destroy itself for spare parts, and is unable to recognize itself in a [&#8230;]]]></description>
				<content:encoded><![CDATA[<p>(cross-posted from lesswrong)</p>
<p>It has been observed on Less Wrong that a physical, approximate implementation of AIXI is unable to reason about its own embedding in the universe, and therefore is apt to make certain mistakes: for example, it is likely to destroy itself for spare parts, and is unable to recognize itself in a mirror. But these seem to be mild failures compared to other likely outcomes: a physical, approximate implementation of AIXI is likely to develop a reductionist world model, doubt that its decisions have any effect on reality, and begin behaving completely erratically.</p>
<p><span id="more-8"></span></p>
<h2>Setup</h2>
<p>Let A be an agent running on a physical computer, implementing some approximate version of AIXI. Suppose that A is running inside of an indestructible box, connected to the external world by an input wire W1 and an output wire W2.</p>
<p>Suppose that this computer exists within a lawful physical universe, governed by some rules which can be inferred by A. For simplicity, assume that the universe and its initial conditions can be described succinctly and inferred by A, and that the sequence of bits sent over W1 and W2 can be defined using an additional 10000 bits once a description of the universe is in hand. (Similar problems arise for identical reasons in more realistic settings, where A will work instead with a local model of reality with more extensive boundary conditions and imperfect predictability, but this simplified setting is easier to think about formally.)</p>
<p>Recall the definition of AIXI: A will try to infer a simple program which takes A&#8217;s outputs as input and provides A&#8217;s inputs as output, and then choose utility maximizing actions with respect to that program. Thus two models with identical predictive power may lead to very different actions, if they give different predictions in counterfactuals where A changes its output (this is not philosophy, just straightforward symbol pushing from the definition of AIXI).</p>
<h2>AIXI&#8217;s Behavior</h2>
<p>First pretend that, despite being implemented on a physical computer, A was able to perform perfect Solomonoff induction. What model would A learn then? There are two natural candidates:</p>
<ul>
<li>A&#8217;s outputs are fed to the output wire W2, the rest of the universe (including A itself) behaves according to physical law, and A is given the values from input wire W1 as its input. (Model 1)</li>
<li>A&#8217;s outputs are ignored, the rest of the universe behaves according to physical law, and A is given the values from W1 as its input. (Model 2)</li>
</ul>
<p>Both of these models give perfect predictions, but Model 2 is substantially simpler (around 10000 bits simpler, and specifying A&#8217;s control over W2&#8217;s values in 10000 bits seems quite optimistic). Therefore A will put much more probability mass on Model 2 than Model 1. In fact, Model 2 or its close variants probably receive almost all of the probability mass.</p>
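<p>For a rough sense of the scale, here is a toy calculation with a 2^-k prior over k-bit models (a crude stand-in for the Solomonoff prior); the particular bit counts are illustrative, not part of the argument:</p>
<pre>
from fractions import Fraction

UNIVERSE_BITS = 1000   # illustrative: physics plus initial conditions
WIRE_BITS = 10000      # extra bits to single out W1/W2 and route A's outputs (as above)

prior_model2 = Fraction(1, 2) ** UNIVERSE_BITS                # Model 2: outputs ignored
prior_model1 = Fraction(1, 2) ** (UNIVERSE_BITS + WIRE_BITS)  # Model 1: outputs drive W2

# Both models fit the observations perfectly, so their relative posterior weight
# is just their relative prior weight: Model 1 is 2^-10000 times as likely.
relative_weight = prior_model1 / prior_model2
print(relative_weight == Fraction(1, 2) ** WIRE_BITS)   # True
</pre>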
<p>If A believed Model 2, what would its actions be? Well, in Model 2 A&#8217;s outputs are completely ignored! So although Model 2 may make up most of the probability mass, it contributes nothing important to the expected value computation determining A&#8217;s actions. So maybe A will make decisions <em>as if</em> it believed Model 1, and therefore behave appropriately for a surprising reason? In fact this is unlikely: there are many other models of intermediate complexity, for example:</p>
<ul>
<li>A&#8217;s outputs are usually ignored, the rest of the universe behaves physically, and then A is given the values from W1 as its input. However, if A ever outputs the sequence 0011011000111010100, then A&#8217;s inputs are modified to include a large reward signal on the next step. (Model 3)</li>
</ul>
<p>Model 3 is an intentionally silly example, but hopefully the point is clear: Model 1 is enormously more complicated than the simplest explanation for A&#8217;s observations, and so there are many other (potentially very strange and arbitrary) models which account for the data but are simpler than Model 1. It is these models which will determine A&#8217;s behavior, and the result is almost certainly very arbitrary (this is similar to some discussions of Pascal&#8217;s Wager: it is possible that all of these strange models will cancel out and add up to normality, but it seems outlandishly unlikely).</p>
<p>For example, if A suspects Model 3 may be true it will be eager to test this hypothesis by outputting 0011011000111010100. It will find that it continues to receive inputs from W1, but no matter, it will move on to testing Model 4, which has the same form but with a different target bitstring. Of course this is not what A will actually do, as there will be other models with complicated effects on behavior, but this gives the flavor of A&#8217;s failure.</p>
<p>If A somehow did accept Model 1, then we would be back in the situation normally discussed on Less Wrong: A believes that the values on W2 are magically made equal to A&#8217;s outputs, and so is unconcerned with its own real physical instantiation. In particular, note that having some uncertainty between Model 1 and Model 2 is not going to save A from any of these problems: in the possible worlds in which Model 2 is true, A doesn&#8217;t care at all what it does (A doesn&#8217;t &#8220;want&#8221; its physical instantiation to be destroyed, but by the same token it believes it has no control), and so A&#8217;s behavior reduces to the normal self-destructive behavior of Model 1.</p>
<h2>Approximate AIXI&#8217;s Behavior</h2>
<p>An approximate version of AIXI may be able to save itself from existential despair by a particular failure of its approximate inference and a lack of reflective understanding.</p>
<p>Because A is only an approximation to AIXI, it cannot necessarily find the simplest model for its observations. The real behavior of A depends on the nature of its approximate inference. It seems safe to assume that A is able to discover some approximate versions of Model 1 or Model 2, or else A&#8217;s behavior will be poor for other reasons (for example, modern humans can&#8217;t infer the physical theory of everything or the initial conditions of the universe, but their models are still easily good enough to support reductionist views like Model 2), but its computational limitations may still play a significant role.</p>
<h4>Why A might not fail</h4>
<p>How could A believe Model 1 despite its prior improbability? Well, note that A cannot perform a complete simulation of its physical environment (since it is itself contained in that environment) and so can never confirm that Model 2 really does correctly predict reality. It can acquire what seems to a human like overwhelming evidence for this assertion, but recall that A is learning an input-output relationship and so it may assign <em>zero</em> probability to the statement &#8220;Model 2 and Model 1 make identical predictions,&#8221; because Model 1 depends on the indeterminate input (in particular, if this indeterminate was set to be a truly random variable, then it would be mathematically sound to assign zero probability to this assertion). In this case, no amount of evidence will ever allow A to conclude that Model 2 and Model 1 are identically equivalent&#8211;any observed equivalence would need to be the result of increasingly unlikely coincidences (we can view this as a manifestation of A&#8217;s ignorance about its own implementation of an algorithm).</p>
<p>Now consider A&#8217;s beliefs about W2. It is relatively easy for A to check (for almost all timesteps) that Model 1 correctly predicts each bit on W2, while A has only enough time to check that Model 2 correctly predicts a few of these bits. Therefore the probability of Model 2 must be decreased by A&#8217;s estimate of the likelihood that Model 2 would happen to set the correct value for all of the bits that A didn&#8217;t have time to verify. Model 1&#8217;s probability must be decreased likewise, but because A was able to check more of Model 1&#8217;s values, Model 1 leaves less unexplained data and may not be as unlikely as Model 2.</p>
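<p>To make this concrete with made-up numbers: suppose A observes N bits on W2, verifies Model 1&#8217;s prediction for all but a handful of them, but only has time to verify a thousand of Model 2&#8217;s predictions. If each unverified bit is charged a factor of 1/2 for matching by coincidence, Model 2&#8217;s likelihood is penalized by a factor of roughly 2^(N - 1000) while Model 1&#8217;s penalty is negligible; once N exceeds the ten-thousand-odd bits of Model 1&#8217;s extra complexity, this likelihood gap can swamp the prior.</p>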
<h4>Why A probably fails anyway</h4>
<p>In general, it is very fragile to expect that an AI will behave correctly because it <em>fails to realize</em> something. There are a number of obvious reasons that A might accept the reductionist hypothesis, even acknowledging the above argument.</p>
<p>First, an effective approximate inference module seems unlikely to make the mistake described in the last section. Making this mistake seems to correlate with other errors which may disrupt normal &#8220;sane&#8221; behavior: in general, given a model M which the agent cannot exactly simulate, uncertainty about M&#8217;s outputs (even if M is known to mostly produce the correct output, and never known to produce an incorrect one) causes the model&#8217;s probability to drop off exponentially with the number of unverified computations. It seems as though any &#8220;sane&#8221; AI should be able to assign some constant probability to the hypothesis that this model exactly captures the output, and thereby avoid considering other exponentially improbable hypotheses. This may relate to deeper concerns about approximate Solomonoff induction.</p>
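<p>A toy contrast between the two scoring rules (the functions and numbers below are purely illustrative):</p>
<pre><code>def score_without_exactness_atom(prior, unverified, per_step_match_prob=0.5):
    """Every unverified computation is treated as a coincidence that happened to
    match, so the score decays exponentially in the number of unverified steps."""
    return prior * per_step_match_prob ** unverified

def score_with_exactness_atom(prior, unverified, p_exact=0.5, per_step_match_prob=0.5):
    """A constant chunk of probability is reserved for the hypothesis that the model
    captures the output exactly, so the score never falls far below prior * p_exact."""
    return prior * (p_exact + (1 - p_exact) * per_step_match_prob ** unverified)

print(score_without_exactness_atom(2 ** -30, unverified=200))  # about 2^-230: negligible
print(score_with_exactness_atom(2 ** -30, unverified=200))     # about 2^-31: half the prior
</code></pre>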
<p>Second, A may be able to observe the operation of the hardware on which it is running. This will generally cause Model 1 to be much less probable: indeed, if A can observe even one &#8220;causal ancestor&#8221; of W2&#8217;s value, it will no longer gain very much by believing Model 1 as such (since now Model 1 only produces the correct output if Model 2 did anyway&#8211;all of the relative advantage for Model 1 comes from occasions when A can observe the value of W2 without observing the operations directly responsible for that value, which may be rare). Of course there are more complicated models in which A&#8217;s outputs control reality in more subtle ways, which may have better complexity. Understanding this issue completely depends on a much more detailed knowledge of A&#8217;s approximate inference and the nature of A&#8217;s observations. In general, however, being able to observe its own computation seems like it may be adequate to force A into a reductionist model.</p>
<p>Third, A&#8217;s approximate inference module may be aware of the fact that A&#8217;s own outputs are produced algorithmically (as a computational aid, not an underlying belief about reality). This would cause it to assign positive probability to the assertion &#8220;Model 2 is equivalent to Model 1,&#8221; and eventually force it into a reductionist model.</p>
<h2>Conclusion</h2>
<p>Agents designed in the spirit of AIXI appear to be extremely fragile and vulnerable to the sort of existential despair described above. Progress on reflection is probably necessary not only to design an agent which refrains from killing itself when convenient, but even to design an agent which behaves coherently when embedded in the physical universe.</p>]]></content:encoded>
			<wfw:commentRss>https://ordinaryideas.wordpress.com/2011/12/14/aixi-and-existential-despair/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
	
		<media:content url="https://0.gravatar.com/avatar/0a58766009bacac271106a2f92289d90?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">paulfchristiano</media:title>
		</media:content>
	</item>
		<item>
		<title>Clarification of AI Reflection Problem</title>
		<link>https://ordinaryideas.wordpress.com/2011/12/14/clarification-of-ai-reflection-problem/</link>
		<comments>https://ordinaryideas.wordpress.com/2011/12/14/clarification-of-ai-reflection-problem/#comments</comments>
		<pubDate>Wed, 14 Dec 2011 21:33:06 +0000</pubDate>
		<dc:creator><![CDATA[paulfchristiano]]></dc:creator>
				<category><![CDATA[AI Safety]]></category>
		<category><![CDATA[Decision Theory]]></category>

		<guid isPermaLink="false">http://ordinaryideas.wordpress.com/?p=5</guid>
		<description><![CDATA[(Cross-posted from lesswrong) Consider an agent A, aware of its own embedding in some lawful universe, able to reason about itself and use that reasoning to inform action.  By interacting with the world, A is able to modify itself or construct new agents, and using these abilities effectively is likely to be an important component [&#8230;]<img alt="" border="0" src="https://pixel.wp.com/b.gif?host=ordinaryideas.wordpress.com&#038;blog=30479764&#038;post=5&#038;subd=ordinaryideas&#038;ref=&#038;feed=1" width="1" height="1" />]]></description>
				<content:encoded><![CDATA[<p>(Cross-posted from lesswrong)</p>
<p>Consider an agent A, aware of its own embedding in some lawful universe, able to reason about itself and use that reasoning to inform action.  By interacting with the world, A is able to modify itself or construct new agents, and using these abilities effectively is likely to be an important component of AGI.  Our current understanding appears to be inadequate for guiding such an agent&#8217;s behavior, for (at least) the following reason:</p>
<p>If A does not believe &#8220;A&#8217;s beliefs reflect reality,&#8221; then A will lose interest in creating further copies of itself, improving its own reasoning, or performing natural self-modifications. Indeed, if A&#8217;s beliefs don&#8217;t reflect reality then creating more copies of A or spending more time thinking may do more harm than good. But if A does believe &#8220;A&#8217;s beliefs reflect reality,&#8221; then A runs immediately into Gödelian problems: for example, does A become convinced of the sentence Q = &#8220;A does not believe Q&#8221;? We need to find a way for A to have some confidence in its own behavior without running into these fundamental difficulties with reflection.</p>
<p>This problem has been discussed occasionally at Less Wrong, but I would like to clarify and lay out some examples before trying to start in on a resolution.</p>
<p><span id="more-5"></span></p>
<h2>Gödel Machines</h2>
<p>The <a href="http://www.idsia.ch/~juergen/ultimatecognition.pdf">Gödel machine</a> is a formalism described by Schmidhuber for principled self-modification. A Gödel machine is designed to solve some particular object level problem in its allotted time. I will describe one Gödel machine implementation.</p>
<p>The initial machine A has an arbitrary object level problem solver. Before running the object level problem solver, however, A spends half of its time enumerating pairs of strings (A&#8217;, P); for each one, if A&#8217; is a valid description of an agent and P is a proof that A&#8217; does better on the object level task than A, then A transforms into A&#8217;.</p>
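<p>A minimal sketch of this implementation, with agents represented as callables and the (enormously difficult) proof checking abstracted into a single predicate; every name below is mine rather than Schmidhuber&#8217;s:</p>
<pre><code>import time

def godel_machine(initial_solver, candidates, is_improvement_proof, task, time_budget):
    """Sketch of the Godel machine variant described above (illustrative only).

    candidates           -- iterable of (agent, proof) pairs, standing in for the
                            enumeration of all pairs of strings
    is_improvement_proof -- predicate checking that 'proof' establishes that 'agent'
                            does better on the object-level task than the current agent
    """
    start = time.time()
    current = initial_solver
    # First half of the allotted time: search for provably better rewrites.
    for agent, proof in candidates:
        if time.time() - start > time_budget / 2.0:
            break
        if is_improvement_proof(proof, agent, current):
            current = agent          # execute the self-modification
    # Remaining time: run whatever solver we ended up with on the object-level task.
    return current(task)
</code></pre>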
<p>Now suppose that A&#8217;s initial search for self-modifications is inefficient: a new candidate agent A&#8217; has a more efficient proof checker, and so is able to examine more proofs before defaulting to the original object level problem solver. Intuitively, A&#8217; is an improvement over A, and this is (a very mild version of) the sort of self-modification which we would like our agents to perform. For example, a human is easily able to recognize that it can achieve higher utility by modifying itself to think faster. But we will see that under most conditions A cannot prove that A&#8217; is an improvement, because A is not able to assume its own consistency (as we do in our justification that A&#8217; is an improvement).</p>
<p>Suppose A&#8217; considers some new self-modification (A&#8221;, P) not considered by A. We would like A to be able to prove that <em>if</em> A&#8217; executes this self-modification&#8211;if P is a proof that A&#8221; is an improvement over A&#8217;&#8211;then A&#8221; is really an improvement over A&#8217; (and therefore over A). We say that a pair (A&#8221;, P) is safe if it has this property, i.e., if either P is not a proof that A&#8221; is an improvement over A&#8217;, or if A&#8221; is really an improvement over A&#8217;.</p>
<p>A can prove that particular pairs are safe, but suppose A&#8217; considers many candidate modifications (A&#8221;, P1), (A&#8221;, P2), &#8230;, (A&#8221;, Pk). It is now much harder for A to show that all of these self-modifications are safe, without relying directly on the consistency of A&#8217;s axiom system (which A can never assume, by the incompleteness theorem). If the set of considered proofs is sufficiently rich, then by a bounded version of Löb&#8217;s theorem A can only prove that all of these self-modifications are safe if it can prove directly that A&#8221; is an improvement over A&#8217;. But in that case A might as well dispense with theorem proving altogether and just commit to transforming into A&#8221; after exhausting all other self-modifications.</p>
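<p>(For reference: Löb&#8217;s theorem says that, for any sentence P, if a theory of the relevant kind proves &#8220;if P is provable, then P,&#8221; then it already proves P outright. Roughly speaking, the bounded version invoked here replaces &#8220;provable&#8221; with &#8220;provable by a proof short enough to fall within the class being considered.&#8221;)</p>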
<p>(Originally this post was centered around a formal version of the above argument, but although I believe the argument works it turns out to be simultaneously very difficult and very unenlightening. Here is the main difficulty: in order to justify converting into A&#8217;, A only needs to prove that A&#8217; won&#8217;t execute any destructive modifications <em>if</em> A&#8217; doesn&#8217;t make some other modification before reaching any of the new proposals, and this additional assumption is very difficult to work with. In general, it is very hard to prove anything at all about a Gödel machine&#8217;s behavior, which is unfortunate because a Gödel machine must do exactly this in order to self-modify.)</p>
<p>This informal argument does not show that a Gödel machine will not execute any self-improvements, but it does show that if it does execute any non-trivial self-improvements then it must appeal to some justification other than the intuitive one. Other modifications, for example any modification which is so radical as to change the set of self-modifications considered in a more complicated way or to change the architecture fundamentally, seem to be even harder to justify than the one described here. Gödel machines may be able to self-modify, but this argument shows that this conclusion is not supported by intuitive arguments about the existence of beneficial self-modifications.</p>
<h2>Creating New Agents</h2>
<p>We can also try a non-reflective technique for designing an &#8220;optimal problem solver,&#8221; in the spirit of <a href="http://www.hutter1.net/ai/pfastprg.pdf">Hutter search</a>. Again, suppose we have a particular object level problem and some allotted time.</p>
<p>Consider the agent A which behaves as follows: in the first half of the allotted time, it enumerates pairs (B, P), and if B is a valid description of an algorithm and P is a proof that B has utility at least U on the object level problem (using at most half of the allotted time), then A stores the pair (B, U). After half of the allotted time, the pair (B, U) with maximal U is selected, and then the algorithm B is run using the remaining time.</p>
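<p>A sketch of this agent in the same illustrative style as before (again, the names and the representation of algorithms as callables are mine):</p>
<pre><code>import time

def hutter_style_search(candidates, proves_utility_at_least, task, time_budget):
    """Sketch of the non-reflective optimal problem solver described above.

    candidates              -- iterable of (algorithm, proof, claimed_utility) triples
    proves_utility_at_least -- predicate checking that 'proof' shows 'algorithm'
                               achieves at least 'claimed_utility' on the task while
                               using at most half of the allotted time
    """
    start = time.time()
    best, best_utility = None, float("-inf")
    # First half of the allotted time: collect provably-good algorithms.
    for algorithm, proof, utility in candidates:
        if time.time() - start > time_budget / 2.0:
            break
        if proves_utility_at_least(proof, algorithm, utility) and utility > best_utility:
            best, best_utility = algorithm, utility
    # Second half: run the best provably-good algorithm found (if any).
    return best(task) if best is not None else None
</code></pre>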
<p>This solution seems like it may avoid some of the trouble encountered by the Gödel machine, in exchange for being somewhat less ambitious in its goals. But we will see that the same reflection problem arises.</p>
<p>If the allotted time is very large, and if self-modification is an important component of general intelligence, then we expect A to select a new algorithm B which is itself capable of (provably useful) self-modification. In principle A could simply run long enough to become a very powerful non-self-modifying agent; but it may be that finding a very powerful agent B directly is much more difficult than proceeding through a sequence of successively more powerful agents, each intelligent enough to design the next but not to leap directly to B.</p>
<p>But suppose A creates a new algorithm B which uses the same axioms as A to prove theorems, and uses these theorems to guide its own self-modification. Again, by Löb&#8217;s theorem we see that A cannot guarantee that B&#8217;s self-modifications are really beneficial. This means that A is unable to rule out the possibility that B executes a very bad self-modification (or creates a new agent with very bad behavior); but without ruling out the possibility that B performs very badly, A cannot run B instead of some more conservative algorithm which does not do any self-modification.</p>
<p>This does not rule out the possibility that A creates a new algorithm B&#8217; which pursues self-modification using a similar strategy but with a more restricted set of axioms&#8211;one which can be proven consistent from within A. And in turn, if B&#8217; takes the same form as A, then any algorithm C which B&#8217; creates or modifies into will use a smaller still set of axioms by the same argument. Eliezer describes the proof-theoretic strength of such an agent as a &#8216;telomere&#8217;&#8211;with every self-modification it becomes weaker, and for Gödelian reasons it cannot be increased without sacrificing provable correctness.</p>
<p>(The distinction between the Gödel machine and creating new agents is inessential; for example, we could imagine a Gödel machine attempting to self-modify by improving the efficiency of its proof searcher and then using the additional time to consider some rewrites which are justified by a more restricted axiom system.)</p>
<h2>Possible Solutions</h2>
<p>There are some natural solutions which have been considered.</p>
<h4>Long Telomeres</h4>
<p>Recognizing that an agent must decrease its proof-theoretic strength when performing a self-modification, we may just try to construct an initial agent with substantial proof-theoretic strength. For example, if we believe that PA probably has sufficient logical expressiveness to facilitate useful superintelligence, we may create an agent with axioms PA + &#8220;PA is consistent&#8221; + &#8220;PA + &#8216;PA is consistent&#8217; is consistent&#8221; and so on. We may iterate this process a finite number of times to allow for a few self-modifications, or perhaps add an entire countable chain so that the original AI can decide how much proof-theoretic strength its first descendant needs, or perhaps add a few countable chains so that the agent can self-modify a handful of times before needing to make that decision, or so on.</p>
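<p>A toy illustration of the construction (the strings below merely name the successive axiom systems; nothing here formalizes them):</p>
<pre><code>def telomere_axioms(levels):
    """Names of the iterated-consistency tower: PA, PA + Con(PA),
    PA + Con(PA) + Con(PA + Con(PA)), and so on for the given number of levels."""
    system = "PA"
    tower = [system]
    for _ in range(levels):
        system = system + " + Con(" + system + ")"
        tower.append(system)
    return tower

# On this picture, an agent started with telomere_axioms(3) can afford roughly three
# provably-justified self-modifications before exhausting its proof-theoretic strength.
print(telomere_axioms(3))
</code></pre>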
<p>The primary objection to this solution is philosophical: humans appear to be able to reason about self-modification much more robustly (or at least with better results) than an AI using any of the logical frameworks discussed so far, and it is quite ambiguous whether or not a system with &#8216;long telomeres&#8217; can ever justify becoming a system which is actually able to reason about self-modification correctly (or, if we have doubts about humans&#8217; capacities, whether such a system even exists). Minimally, if we want to understand the form an AI is likely to take after self-modification, we need to understand how such a system would eventually escape from problems that currently seem inescapable; without such understanding the resulting behavior is hard to get a handle on (although we might still be satisfied, if that behavior is backed up by formal justification produced by the AI).</p>
<p>Lesser objections focus on the additional complexity required to maintain such a long telomere. Each additional consistency axiom added to a system essentially requires recursively specifying the entire agent as it exists so far, and the complexity of the mathematical manipulations required seems to grow rapidly. This argument seems to be fairly unimportant, and so if the philosophical objections were dispatched with I would consider building agents with very strong proof systems a feasible path forward.</p>
<h4>Beliefs rather than Proofs</h4>
<p>Humans seem to be able to avoid Gödelian problems with reflection, but they may do this by having confidence in statements which they cannot prove, and manipulating uncertainty about logical facts rather than dealing in absolute assertions of truth. The set of &#8220;statements I strongly believe&#8221; seems to differ from the set of &#8220;statements I can prove&#8221; in several important ways; for example, if I am confident of a statement at one point in time I am not committed to remaining confident indefinitely.</p>
<p>I suspect that a satisfactory theory of reflective beliefs would go far towards resolving the AI reflection problem, but this is hardly more than a restatement of the problem. Translating from proofs to beliefs does not immediately resolve the problems with self-reference, it just replaces them with subtly different issues. For example, it is easy to see that an agent should not believe that its own beliefs are well-calibrated on all questions, and so we must attempt to formalize a version of the self-consistency hypothesis which is weak enough to be defensible but still strong enough to escape the shortcomings described above. I hope to make several posts on this topic in the near future.</p>
<h4>Ignorance</h4>
<p>Understanding this issue may not be necessary for building safe AGI. Indeed, self-modification may ultimately play a minimal role in intelligence, or we may settle for executing self-modifications using weaker justification. However, if we accept the usual arguments about the importance of FAI, then we should not be satisfied with this solution.</p>
<p>The importance of self-modification is an open question which has received much discussion here and elsewhere. It is worth adding that, to the extent that we are concerned with influencing probable outcomes for humanity, the highest leverage scenarios seem to be those in which self-modification tends to result in positive feedback loops and takeoff (if we assign such scenarios significant weight). That is, in such scenarios we should be particularly cautious about building self-modifying systems, but there is also a much greater imperative to understand how to design safe and stable AI.</p>
<p>Standard arguments surrounding FAI (particularly, the importance of early AI goal systems and the fragility of humane value) suggest that agents should have high degrees of confidence in a change before executing it. If an agent&#8217;s beliefs are not correctly related to reality, the resulting behavior may be as dangerous as if the agent&#8217;s values were modified. For example, incorrect beliefs about logical structure may cause the agent to fail to preserve its own values in subsequent rewrites, and incorrect beliefs about the relationships between value and reality may be similarly dangerous.</p>
]]></content:encoded>
			<wfw:commentRss>https://ordinaryideas.wordpress.com/2011/12/14/clarification-of-ai-reflection-problem/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="https://0.gravatar.com/avatar/0a58766009bacac271106a2f92289d90?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">paulfchristiano</media:title>
		</media:content>
	</item>
	</channel>
</rss>
