<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	xmlns:georss="http://www.georss.org/georss" xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#" xmlns:media="http://search.yahoo.com/mrss/"
	>

<channel>
	<title>Ordinary Ideas &#187; Uncategorized</title>
	<atom:link href="https://ordinaryideas.wordpress.com/category/uncategorized/feed/" rel="self" type="application/rss+xml" />
	<link>https://ordinaryideas.wordpress.com</link>
	<description>As advertised</description>
	<lastBuildDate>Wed, 31 Dec 2014 03:48:27 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.com/</generator>
<cloud domain='ordinaryideas.wordpress.com' port='80' path='/?rsscloud=notify' registerProcedure='' protocol='http-post' />
<image>
		<url>https://s2.wp.com/i/buttonw-com.png</url>
		<title> &#187; Uncategorized</title>
		<link>https://ordinaryideas.wordpress.com</link>
	</image>
	<atom:link rel="search" type="application/opensearchdescription+xml" href="https://ordinaryideas.wordpress.com/osd.xml" title="Ordinary Ideas" />
	<atom:link rel='hub' href='https://ordinaryideas.wordpress.com/?pushpress=hub'/>
	<item>
		<title>Recent thoughts</title>
		<link>https://ordinaryideas.wordpress.com/2014/12/30/recent-thoughts/</link>
		<comments>https://ordinaryideas.wordpress.com/2014/12/30/recent-thoughts/#comments</comments>
		<pubDate>Tue, 30 Dec 2014 05:44:59 +0000</pubDate>
		<dc:creator><![CDATA[paulfchristiano]]></dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://ordinaryideas.wordpress.com/?p=436</guid>
		<description><![CDATA[I’ve recently spent some more time thinking about speculative issues in AI safety: Ideas for building useful agents without goals: approval-directed agents, approval-directed bootstrapping, and optimization and goals. I think this line of reasoning is very promising. A formalization of one piece of the AI safety challenge: the steering problem. I am eager to see more precise, high-level discussion [&#8230;]<img alt="" border="0" src="https://pixel.wp.com/b.gif?host=ordinaryideas.wordpress.com&#038;blog=30479764&#038;post=436&#038;subd=ordinaryideas&#038;ref=&#038;feed=1" width="1" height="1" />]]></description>
				<content:encoded><![CDATA[<p class="p1">I’ve recently spent some more time thinking about speculative issues in AI safety:</p>
<ol>
<li class="p1">Ideas for building useful agents without goals: <a href="https://medium.com/@paulfchristiano/model-free-decisions-6e6609f5d99e">approval-directed agents</a>, <a href="https://medium.com/@paulfchristiano/approval-directed-bootstrapping-5d49e886c14f">approval-directed bootstrapping</a>, and <a href="https://medium.com/@paulfchristiano/optimization-and-goals-ca524f745852">optimization and goals</a>. I think this line of reasoning is very promising.</li>
<li class="p1">A formalization of one piece of the AI safety challenge: <a href="https://docs.google.com/document/d/1_ggFw8KbvW77Z3gCQUDyz3_IrR3pVyFZ2wkuBgMvoVU/edit?usp=sharing">the steering problem</a>. I am eager to see more precise, high-level discussion of AI safety, and I think this article is a helpful step in that direction. Since articulating the steering problem I have become much more optimistic about versions of it being solved in the near term. This mostly means that the steering problem fails to capture the hardest parts of AI safety. But it’s still good news, and I think it may eventually cause some people to revise their understanding of AI safety.</li>
<li class="p1">Some ideas for getting useful work out of self-interested agents, based on arguments: <a href="https://medium.com/@paulfchristiano/of-arguments-and-wagers-ee16a0e84cf7">of arguments and wagers</a>, <a href="https://ordinaryideas.wordpress.com/2014/07/18/adversarial-collaboration/">adversarial collaboration</a> [older], and <a href="https://medium.com/@paulfchristiano/delegating-to-a-mixed-crowd-dda2b8e22cd8">delegating to a mixed crowd</a>. I think these are interesting ideas in an interesting area, but they have a ways to go until they could be useful.</li>
</ol>
<p class="p1">I’m excited about a few possible next steps:</p>
<p class="p1"><span id="more-436"></span></p>
<ol>
<li class="p1">Under the (highly improbable) assumption that various deep learning architectures could yield human-level performance, could they also predictably yield safe AI? I think we have a good chance of finding a solution&#8212;i.e. a design of plausibly safe AI, under roughly the same assumptions needed to get human-level AI&#8212;for some possible architectures. This would feel like a big step forward.</li>
<li class="p1">For what capabilities can we solve the steering problem? I had originally assumed none, but I am now interested in trying to apply the ideas from the approval-directed agents post. From easiest to hardest, I think there are natural lines of attack using any of: natural language question answering, precise question answering, sequence prediction. It might even be possible using reinforcement learners (though this would involve different techniques).</li>
<li class="p1">I am very interested in implementing effective debates, and am keen to test some unusual proposals. The connection to AI safety is more impressionistic, but in my mind these techniques are closely linked with approval-directed behavior.</li>
<li class="p1">I’m currently writing up a concrete architecture for approval-directed agents, in order to facilitate clearer discussion about the idea. This kind of work that seems harder to do in advance, but at this point I think it’s mostly an exposition problem.</li>
</ol><br />  <a rel="nofollow" href="http://feeds.wordpress.com/1.0/gocomments/ordinaryideas.wordpress.com/436/"><img alt="" border="0" src="http://feeds.wordpress.com/1.0/comments/ordinaryideas.wordpress.com/436/" /></a> <img alt="" border="0" src="https://pixel.wp.com/b.gif?host=ordinaryideas.wordpress.com&#038;blog=30479764&#038;post=436&#038;subd=ordinaryideas&#038;ref=&#038;feed=1" width="1" height="1" />]]></content:encoded>
			<wfw:commentRss>https://ordinaryideas.wordpress.com/2014/12/30/recent-thoughts/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="https://0.gravatar.com/avatar/0a58766009bacac271106a2f92289d90?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">paulfchristiano</media:title>
		</media:content>
	</item>
		<item>
		<title>Confronting Gödelian difficulties (reprise)</title>
		<link>https://ordinaryideas.wordpress.com/2014/08/30/confronting-godelian-difficulties-reprise/</link>
		<comments>https://ordinaryideas.wordpress.com/2014/08/30/confronting-godelian-difficulties-reprise/#comments</comments>
		<pubDate>Sat, 30 Aug 2014 03:43:06 +0000</pubDate>
		<dc:creator><![CDATA[paulfchristiano]]></dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://ordinaryideas.wordpress.com/?p=413</guid>
		<description><![CDATA[My current attitude towards the Löbian obstacle is &#8220;just live with it.&#8221; This post outlines that view and some of the underlying intuitions. To show I&#8217;m being a good sport, I&#8217;ll also propose a new reflection principle. But my best guess for the right answer remains &#8220;live with it.&#8221; Gödelian arguments for probabilistic reasoners The problem Write P(x) [&#8230;]<img alt="" border="0" src="https://pixel.wp.com/b.gif?host=ordinaryideas.wordpress.com&#038;blog=30479764&#038;post=413&#038;subd=ordinaryideas&#038;ref=&#038;feed=1" width="1" height="1" />]]></description>
				<content:encoded><![CDATA[<p>My current attitude towards the <a href="https://www.google.com/url?sa=t&amp;rct=j&amp;q=&amp;esrc=s&amp;source=web&amp;cd=1&amp;cad=rja&amp;uact=8&amp;ved=0CB4QFjAA&amp;url=http%3A%2F%2Flesswrong.com%2Flw%2Fhmt%2Ftiling_agents_for_selfmodifying_ai_opfai_2%2F&amp;ei=dBABVPnLBe_EiwLf2YDYCg&amp;usg=AFQjCNGgBiZZgsPHw5VeGWcZQzidV5GQ4Q&amp;sig2=6UgPJXAdgDxV1oJHNpYgyQ&amp;bvm=bv.74115972,d.cGE">Löbian obstacle</a> is &#8220;just live with it.&#8221; This post outlines that view and some of the underlying intuitions.</p>
<p>To show I&#8217;m being a good sport, I&#8217;ll also propose a new reflection principle. But my best guess for the right answer remains &#8220;live with it.&#8221;<span id="more-413"></span></p>
<h1>Gödelian arguments for probabilistic reasoners</h1>
<h2>The problem</h2>
<p>Write P(x) for the probability I assign to a proposition x. Suppose that I am the student of a wise and careful teacher. Write Q(x) for the probability the teacher assigns to x. We&#8217;re going to talk about diagonalization, so we&#8217;ll need to be very careful about the interpretation of Q. For concreteness, let&#8217;s say that Q(x) is the probability that my teacher would have assigned to x at 2:14 PM this Tuesday.</p>
<p>Now suppose this teacher is not just wise, but <em>very</em> wise: I believe that for everything I know, they know it, and for everything that I have thought of, they&#8217;ve thought of it. If I learn that this teacher assigns x a probability of 70%, then I ignore my other evidence and assign x a probability of 70%.</p>
<p>In symbols, we might formalize this as follows: P( x | Q(x) = p ) = p, or perhaps P( x | Q(x) = p ) ≈ p.</p>
<p>The Gödelian counterexample is the sentence G := Q(G) &lt; 0.5. In words, G asserts &#8220;The wise teacher assigns G a probability less than 0.5 at 2:14 PM on Tuesday.&#8221;</p>
<p>Suppose that my teacher assigns G a probability of 0.499 at 2:14:00 PM on Tuesday. Of course, being a perfect introspector, by 2:14:01 PM they will have revised this probability up to 1. But at 2:14:00 PM there was no way that the teacher could have known <em>exactly</em> what they would believe (for if they had known exactly, then they would necessarily have been exactly wrong).</p>
<p>But I shouldn&#8217;t agree with my teacher in this case. It is crazy to have  P( G | Q(G) = 0.499 ) ≈ 0.499. After all, Q(G) = 0.499 implies G, so we should have P( G | Q(G) = 0.499 ) = 1.</p>
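<p>To make the inconsistency concrete, here is a minimal Python sketch (a toy model of my own: G is represented only by the rule that G is true exactly when Q(G) &lt; 0.5, and we just check a few candidate values of Q(G)):</p>
<pre><code># Toy model: G := "Q(G) &lt; 0.5", so conditioning on the exact value of Q(G)
# settles G's truth value, and exact deference P(G | Q(G) = p) = p cannot hold.
for p in (0.3, 0.499, 0.5, 0.7):
    forced = 1.0 if p &lt; 0.5 else 0.0   # what the value of Q(G) entails about G
    print(f"Q(G) = {p}: deference asks for P(G) = {p}, but coherence forces {forced}")
</code></pre>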
<p>This leads to a problem. How do we formalize my trust for this teacher, if not with the equation P( x | Q(x) = p ) ≈ p?</p>
<h2>Some responses I don&#8217;t like</h2>
<p>There are many standard answers, ported to the probabilistic setting from mathematical logic:</p>
<ul>
<li>Maybe we revise our epistemic theory, so that the teacher can be agnostic rather than assigning a probability to sentences like G. I find this proposal pretty unsatisfying. The teacher believes that G is a coin flip, they don&#8217;t assign it some special epistemic status. They would be happy to take a bet at favorable odds, if their counterparty were equally ignorant. This response is especially unhelpful if we were considering actually <em>delegating</em> a task to the teacher. Reality won&#8217;t generally give them the option to abstain.</li>
<li>Maybe we limit our language to eliminate self-referential sentences. But self-referential sentences are just like any other sentences (if I give you a diophantine equation, can you even tell whether it describes your own beliefs?) so this seems doomed. It is even more doomed if we are specifically interested in reflective reasoning (for example an agent reasoning about its own future behavior, in order to estimate the value of information).</li>
<li>Maybe we weaken our trust, and assert &#8220;I trust the teacher up to <a href="http://intelligence.org/files/DefinabilityTruthDraft.pdf">infinitesimal error</a>&#8221; or a variation on this theme. This approach seems unsuitable when Q is just a Turing machine, however. What does it mean to be &#8220;infinitesimally&#8221; wrong about the output of a Turing machine? Even a sufficiently small constant error ε seems to cause <a href="http://mathoverflow.net/questions/145032/an-approximate-infinite-dimensional-fixed-point-theorem">trouble</a>.</li>
</ul>
<p>It feels to me that we should be trying to reproduce human judgments about these cases (at least as one task; we might have further goals). That is, I actually have some intuitions about how I would update on the judgments of a very wise teacher. So why not try to make a machine do the same thing? I think this is the knee-jerk response of most people who hear about the problem, and one reason it has a hard time getting traction with non-mathematicians.</p>
<p>I don&#8217;t think this is a trivial issue at all; I think that most humans&#8217; intuitions about how their brains handle this problem are wrong, and I have a lot of sympathy for people who work on this problem and grow impatient with the knee-jerk response. But I do think that there is a lot to be said for this intuition, of trying to reproduce our actual beliefs about cases like this, at least as a first step before judging them to be unsatisfactory.</p>
<h1>My favored response</h1>
<h2>What humans do</h2>
<p>What do I really believe about the teacher&#8217;s judgment? If I were to be more careful I might say: &#8220;if the teacher knows everything that I do about a situation, <em>then</em> I should trust their judgment in that situation.&#8221;</p>
<p>For example, suppose I flip a coin and look at it. I see it&#8217;s heads, but don&#8217;t show my teacher. I ask my teacher what they think, and they guess that it&#8217;s 50-50. Of course I&#8217;m not going to change my belief back to 50-50, no matter how wise I think they are.</p>
<p>This formulation, P( x | Q(x) = p ) ≈ p, bakes in the universal claim that there isn&#8217;t any information I have which the teacher lacks. That sounds OK so far. This is a perfectly legitimate thing to assume, though it&#8217;s almost never going to be satisfied in a meaningful case, including cases to do with self-improvement. But let&#8217;s set that aside for the moment.</p>
<p>The reason this assumption actually leads to trouble is that P <em>is</em> conditioning on a further fact, namely that Q(x) = p, and this further fact is most definitely <em>not</em> something that Q knows&#8211;that&#8217;s the whole point of the Gödelian sentence under consideration, that it exploits Q&#8217;s imperfect self-knowledge. In symbols, P( Q(x) = p | Q(x) = p ) = 1, even if x implies Q(x) ≠ p. But this really doesn&#8217;t seem surprising or objectionable; P is just conditioning on a further fact that wasn&#8217;t available to Q, and we already know that P <em>should</em> disagree with Q when it has extra information.</p>
<p>So the principle I typically recommend is P( x | Q(x) = p ) ≈ p, unless there is a good reason that it shouldn&#8217;t be, such as P knowing that Q&#8217;s judgment doesn&#8217;t reflect some particular datum available to P. This is not a rule of inference which we should bake into P if we want to build an absolutely-Q-trusting agent, this is just a fact which we think ought to be true about P and Q, if in fact P has adequate reason to judge Q to be trustworthy.</p>
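<p>A trivial illustration of this exception, as a Python sketch (the coin example above, with a hypothetical <code>my_posterior</code> function of my own standing in for P&#8217;s update):</p>
<pre><code># Deference only makes sense when the teacher knows everything I know. Here I
# have seen the coin land heads while the teacher Q has not looked; my own
# observation overrides Q's report of 0.5.
def my_posterior(saw_heads, teacher_report):
    """P(heads | my observation, Q(heads) = teacher_report)."""
    if saw_heads is not None:
        return 1.0 if saw_heads else 0.0   # my observation settles the question
    return teacher_report                  # no extra information: defer to Q

print(my_posterior(saw_heads=True, teacher_report=0.5))   # 1.0, not 0.5
print(my_posterior(saw_heads=None, teacher_report=0.7))   # 0.7: defer
</code></pre>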
<p>This entire discussion also applies if Q = P, or if Q is a modified version of P, or a version that knows some extra facts, or whatever. And the issue is identical: generally the version of me tomorrow knows everything I know, but if I condition on the version of me tomorrow believing X, now I know a whole extra fact (and of course there are typically many other pragmatic reasons that I might know something that I won&#8217;t reliably know tomorrow).</p>
<h2>So now what?</h2>
<p>In the context of self-modification, this doesn&#8217;t resolve the problem of how an agent comes to trust itself. It&#8217;s just telling us what the goal is. That is, we should endeavor to build an agent P such that P( x | P(x) = p ) ≈ p in general, except when it shouldn&#8217;t be.</p>
<p>I admit, that&#8217;s a pretty unsatisfying answer to the question &#8220;What should an agent believe about itself?&#8221; But it is a philosophically <em>acceptable </em> answer to that question. If I ask &#8220;what should an agent believe about a wise and reliable teacher?&#8221; it seems clear that the right answer is messy. There&#8217;s not a simple formula expressing &#8220;is a wise and reliable teacher.&#8221; They should believe the things that lead them to make productive use of the teacher&#8217;s advice, in precisely those cases where the teacher&#8217;s advice is actually useful.</p>
<p>This is also a fine answer from an engineering perspective. An agent reasoning about its own survival, or about building a near-copy of itself, can do expected utility maximization just like the rest of us. It need not have perfect trust in its own reliability, and there need not be any universal argument that trying to survive is good. We just hope that <em>if</em> survival is actually good, and <em>if</em> the agent we build is a good reasoner, then it will correctly reason and arrive at the correct conclusion. This is exactly the same thing we hope our agents do in other cases.</p>
<p>In fact, in cases where the agent justifiably deviates from self-trust, it is right to do so. In the face of an appropriate Gödelian difficulty&#8211;a situation in which the agent&#8217;s values are satisfied by doing whatever the agent wouldn&#8217;t do&#8211;it can be rational for the agent to behave surprisingly.</p>
<p>In addition to being an unsatisfying desideratum, this answer also leads to a new question: <em>how</em> does an agent actually arrive at this kind of self-trust? If we can&#8217;t engineer in self-trust as a rule of inference, then where will it come from?</p>
<p>Unsurprisingly, my answer is: it comes from the same mix of inductive and deductive reasoning that leads to most of our beliefs. If I see that the algorithm I&#8217;m running works well in a bunch of cases, I suspect it will work well in new cases. If I can reason about the algorithm and see what it&#8217;s doing and why it&#8217;s working, I can be more confident. I can write proofs about a system, and those proofs relate my views about some propositions (like axioms) to other propositions (like the system&#8217;s trustworthiness). None of this is mysterious.</p>
<h2>Reasoning about yourself</h2>
<p>This raises yet another question: how can we know that the systems we build will actually successfully carry out this reasoning?</p>
<p>My answer to this is:</p>
<ol>
<li>How do we know it will successfully carry out any reasoning? Probably this is going to be on the same footing. You have to think about why the agent reasons well in general, and try to build agents that reason well. You can&#8217;t answer this question without knowing more about why the agent&#8217;s reasoning works in general, since it&#8217;s exactly the same question.</li>
<li>I don&#8217;t really see how we end up with the situation where an agent is a great general-purpose reasoner, able to make human-level inferences (both inductively and deductively) but completely unable to reason about itself. In fact, the relevant threshold seems to be whether it has human-level ability to reason about the fidelity of its own successors. If it had human-level ability, there is no way that we would be able to say &#8220;This process will reliably lead to catastrophic value drift&#8221; but the agent itself wouldn&#8217;t recognize that! And we can all grant that there <em>do</em> exist modifications for which normal human-level reasoning gives adequate confidence, for example no-ops, observations, and minimal capacity improvements. So an agent with merely human-level reasoning about this issue wouldn&#8217;t be forced to do something low-fidelity, it would just have to be cautious, in the same kind of ways that we might be cautious.</li>
<li>We can, in advance, try to study the kind of reflective reasoning that an agent would be carrying out in these cases. I don&#8217;t know if this is at the top of the priority list of things to do, but it is certainly an interesting exercise that I would expect to substantially improve our understanding of the situation.</li>
</ol>
<p>[3] requires more elaboration, because I think this really is the answer to &#8220;what now?&#8221; for people who are interested in reflection <em>per se </em>or who prefer it to other problems for various pragmatic reasons.</p>
<p>I see an analogy with the kind of reflection that an automated mathematician is capable of. That is, <em>if</em> an automated mathematician has a truth predicate which it can apply to its axioms, <em>and</em> it believes its axioms are true, <em>then</em> it infers that everything it proves is also true. And that seems great. This allows the automated mathematician to develop the abstraction of truth, and the truth of the underlying axioms, by inductive reasoning about the world, and then to infer from that its own correctness (and moreover to reason about what changes to its own algorithm would be productive). It also gives us a lot of confidence that &#8220;proof&#8221; is a useful notion.</p>
<p>In the case of probabilistic reasoning we don&#8217;t really have analogous arguments, which an agent might use to better understand its own reasoning. We could imagine finding a prior which &#8220;thought itself reasonable,&#8221; such that an agent using that prior could look at its own reasoning process and conclude something about its own reliability. I don&#8217;t think that this would be done by having a prior P with &#8220;P is reasonable&#8221; baked in; instead it would arise in the same way that an agent that trusts proofs will naturally realize that its own deductive process is reasonable, because there <em>is</em> an objective sense in which proofs preserve truth. I can see various lines of attack on this problem, though they also seem quite difficult.</p>
<p>I think that this is a realistic objective, and one that motivated <a href="http://intelligence.org/files/Non-Omniscience.pdf">some of my own work</a> on probabilistic reasoning about mathematics. Certainly a success would be interesting. It also seems conceivable that sorting through these questions will involve significant shifts in our understanding, or that this kind of reflective self-certification will be a useful tool in helping us understand what kind of belief formation process is reasonable in the probabilistic setting (in the same way that the &#8220;proof&#8221; that proofs preserve truth is an important part of the modern view that proof is a useful way to reason).</p>
<p>At the same time, it&#8217;s totally conceivable that there just won&#8217;t be a clear story. There isn&#8217;t any particular reason that there needs to be a further story, and it doesn&#8217;t seem like it would be too terrifying if there weren&#8217;t. The situation seems similar to the one humans have been in for most of history: we had no particular reason to expect there to be a nice framework like probability for reasoning under uncertainty, but there was. We knew no particular reason for there to exist this beautiful complete and sound deduction procedure for first order logic, or even for there to exist a system as awesome as first order logic, but there was. I don&#8217;t think this is a law of nature, but I do think that the historical record suggests that it&#8217;s often worthwhile to be optimistic and try to come up with clean formal frameworks.</p>
<h2>A note on non-monotonic reasoning</h2>
<p>One reason that the intuitive response outlined above is unsatisfying to many people is that it has the form &#8220;You should think X unless there is a reason to think not X.&#8221; Of course this is an exceptionally common pattern of reasoning in ordinary life. That said, it makes it very difficult to prove that your system will reason in the intended way. After all, what if there <em>was</em> a good reason for not X, and you just didn&#8217;t notice?</p>
<p>Proofs, on the other hand, have this desirable characteristic that if you prove X, then you&#8217;ve proved X for good, and under plausible conditions you shouldn&#8217;t prove any false X. You can&#8217;t have overlooked anything, you won&#8217;t revise the view later. This property is closely related to what is called <em>monotonicity</em>. As you learn more facts, you are able to prove more things.</p>
<p>I can see the attraction of this property. But it&#8217;s obviously unattainable for almost all inferences of interest, and I haven&#8217;t really seen good arguments for demanding it in this case. Indeed, we <em>know</em> that there are cases where if you think a bit longer you will realize that you shouldn&#8217;t trust yourself in some particular future case. That&#8217;s just how it goes; we have to live with it.</p>
<p>If you really want to prove theorems about the good behavior of a proposed system, rather than using assumptions like &#8220;Assuming the agent doesn&#8217;t believe anything false,&#8221; I think you should be using assumptions like &#8220;Assuming that we accept the agent&#8217;s assessments of probabilities.&#8221;</p>
<p>Amusingly, articulating this assumption formally runs into exactly the same Gödelian obstructions. My response? Probably we just have to live with it.</p>
<h1>A reflection principle</h1>
<p>Despite all that, I <em>do</em> think it would be convenient if we could have a clean, formal, statement of what it means to regard someone&#8217;s judgment as perfectly trustworthy (including assuming that &#8220;they know everything I do&#8221;). So let&#8217;s say we want to write down a formal rendition of &#8220;P trusts Q completely,&#8221; a replacement for the inconsistent P( x | Q(x) = p ) = p.</p>
<p>It seems like in order to construct such a definition, we need to have P condition on Q&#8217;s probability assessments. So if Q isn&#8217;t good at introspection the whole project seems sunk, since by conditioning on Q&#8217;s probability assessments P will actually be learning something that Q doesn&#8217;t know. This is a further restriction on Q, but I&#8217;m not too troubled by that; I think it&#8217;s reasonable for agents to have a pretty accurate view of their own internal state.</p>
<p>So let&#8217;s require that if Q(x) = p, then Q( Q(x) ≈ p ) ≈ 1. That is, let&#8217;s assume that Q knows its own probability assessments up to some small error. It&#8217;s inconsistent to assume that Q knows its own assessments exactly, but knowing its own assessments approximately doesn&#8217;t seem inherently problematic.</p>
<p>OK, so now we&#8217;d like to have P condition on Q&#8217;s judgments without in the process learning something that Q doesn&#8217;t know. The obvious approach is to consider P( x | Q(x) ≈ p ), but this turns out to be problematic whenever Q(x) ≈ p is replaced by <em>any</em> predicate relating Q(x) and p. The basic problem is that P( Q(x) ≈ p |  Q(x) ≈ p) = 1, regardless of the interpretation of ≈.</p>
<p>Once we&#8217;re thinking about things in this way, a natural approach is to have P condition on a noisy observation of Q(x). That is, let η be a new error term which is close to 0 and which doesn&#8217;t appear in the expression x: concretely, let η be a symbol representing a noise variable that is uniform in [-ε, +ε], and let p&#8217; be uniform in [p &#8211; ε, p + ε]. Then we should have E[P(x | Q(x) + η = p&#8217;)] ≈ p, where the expectation is taken over p&#8217;. In fact we can define the distribution P( • | Q(x) + η = p&#8217; ) using Bayes&#8217; rule without even introducing the symbol η into the language.</p>
<p>This is a tool for capturing the intuition that we should cause P to condition on something, but simultaneously shouldn&#8217;t tell it anything that Q doesn&#8217;t know about x. P has indeed learned something that Q didn&#8217;t know, namely about the error term η. But since Q already mostly knew its own beliefs, the extra data P learned was (mostly) unrelated to the question x.</p>
<p>Of course such an observation will still shift P&#8217;s beliefs about x: it now knows (a bit) more about Q(x) than is reflected in Q(x) itself. But now this change should be reflected in a small shift in P&#8217;s beliefs, because Q already <em>almost</em> knew the value of Q(x). We can now formulate an approximate reflection principle, parametrized by a small ε &gt; 0:</p>
<blockquote><p>For every x, E[P(x | Q(x) + η = p&#8217;)]∈ [p-ε, p+ε], where η is a symbol representing a U(-ε, +ε) random variable that doesn&#8217;t appear in x and p&#8217; is distributed according to U(p-ε , p+ε).</p></blockquote>
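<p>Here is a minimal numerical sketch, in Python, of how this conditioning behaves for the Gödelian sentence G (the model is my own toy illustration, not part of the principle: it assumes ε = 0.05, a uniform prior for P over the value of Q(G), and the rule that G is true exactly when Q(G) &lt; 0.5):</p>
<pre><code>import numpy as np

# Toy model: G := "Q(G) &lt; 0.5". P holds a uniform prior over the value of Q(G)
# and conditions on the noisy observation Q(G) + eta = p', with eta ~ U(-eps, eps).
eps = 0.05
grid = np.linspace(0.0, 1.0, 2001)           # candidate values of Q(G)
prior = np.ones_like(grid) / len(grid)       # uniform prior over Q(G)
truth_of_G = (grid &lt; 0.5).astype(float)      # G is true exactly when Q(G) is below 0.5

def posterior_P_of_G(observation):
    """P(G | Q(G) + eta = observation), computed by Bayes' rule on the grid."""
    likelihood = (np.abs(observation - grid) &lt;= eps).astype(float)
    post = prior * likelihood
    post /= post.sum()
    return float((post * truth_of_G).sum())

q = 0.499                                    # the teacher's actual value of Q(G)
samples = np.random.uniform(q - eps, q + eps, size=4000)   # p' ~ U(q - eps, q + eps)
avg = float(np.mean([posterior_P_of_G(o) for o in samples]))
print(f"E[P(G | noisy observation)] = {avg:.3f}; target interval [{q - eps:.3f}, {q + eps:.3f}]")
# By contrast, conditioning on the exact value would force P(G | Q(G) = 0.499) = 1.
</code></pre>
<p>With these (arbitrary) modeling choices the averaged posterior lands inside the target interval; of course nothing in this sketch establishes the stronger consistency claims discussed below.</p>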
<p>I think this principle dodges the immediate Gödelian problems and is consistent, by the same argument as given in <a href="http://intelligence.org/files/DefinabilityTruthDraft.pdf">this preprint</a>. I don&#8217;t know if it satisfies various stronger criteria, like being satisfied by an actual pair of mathematical definitions P and Q.</p>
<p>There are various other subtleties in the definition. For example, note that in order for this to be satisfied, if G = Q(G) &lt; 0.5 then Q(G) ∈ [0.5 &#8211; ε<sup>2</sup>, 0.5 + ε<sup>2</sup>], which is perhaps not what you&#8217;d expect from the definitions. Nothing looks like an obvious problem, but on the whole it seems unclear whether all the pieces will fit together. But the overall idea seems to me like an attractive one.</p><br />  <a rel="nofollow" href="http://feeds.wordpress.com/1.0/gocomments/ordinaryideas.wordpress.com/413/"><img alt="" border="0" src="http://feeds.wordpress.com/1.0/comments/ordinaryideas.wordpress.com/413/" /></a> <img alt="" border="0" src="https://pixel.wp.com/b.gif?host=ordinaryideas.wordpress.com&#038;blog=30479764&#038;post=413&#038;subd=ordinaryideas&#038;ref=&#038;feed=1" width="1" height="1" />]]></content:encoded>
			<wfw:commentRss>https://ordinaryideas.wordpress.com/2014/08/30/confronting-godelian-difficulties-reprise/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="https://0.gravatar.com/avatar/0a58766009bacac271106a2f92289d90?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">paulfchristiano</media:title>
		</media:content>
	</item>
		<item>
		<title>Challenges for extrapolation</title>
		<link>https://ordinaryideas.wordpress.com/2014/08/27/challenges-for-extrapolation/</link>
		<comments>https://ordinaryideas.wordpress.com/2014/08/27/challenges-for-extrapolation/#comments</comments>
		<pubDate>Wed, 27 Aug 2014 03:57:51 +0000</pubDate>
		<dc:creator><![CDATA[paulfchristiano]]></dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://ordinaryideas.wordpress.com/2014/08/27/challenges-for-extrapolation/</guid>
		<description><![CDATA[My current preferred formalization of extrapolation of an agent&#8217;s preferences rests on imagining what would happen if that agent was provided with an idealized environment in which it could undergo an extensive process of reflection. It is clear that this is not a completely satisfactory account, though there is uncertainty about whether it is &#8220;good enough&#8221; [&#8230;]<img alt="" border="0" src="https://pixel.wp.com/b.gif?host=ordinaryideas.wordpress.com&#038;blog=30479764&#038;post=408&#038;subd=ordinaryideas&#038;ref=&#038;feed=1" width="1" height="1" />]]></description>
				<content:encoded><![CDATA[<p>My <a href="https://ordinaryideas.wordpress.com/2014/08/27/specifying-enlightened-judgment-precisely-reprise/">current preferred formalization</a> of extrapolation of an agent&#8217;s preferences rests on imagining what would happen if that agent was provided with an idealized environment in which it could undergo an extensive process of reflection. It is clear that this is not a completely satisfactory account, though there is uncertainty about whether it is &#8220;good enough&#8221; for the intended use case.</p>
<p>One crisp difficulty is the following: this approach relies completely on the agent <em>wanting</em> you to know its extrapolated preferences.</p>
<p><span id="more-408"></span></p>
<h3>Challenge #1</h3>
<p>If the agent didn&#8217;t want this&#8211;for example, if it expected that in general it would be <em>worse off</em> if other agents knew its extrapolated preferences&#8211;then the proposal breaks down entirely. In this case the agent will simply decline to participate in the hypothetical thought experiment, and the answer to the question &#8220;What would they decide if they had the opportunity to reflect at length?&#8221; is &#8220;nothing.&#8221;</p>
<p>As far as I know, this is a problem for all available formal specifications of an agent&#8217;s enlightened judgment, or idealized beliefs, or extrapolated preferences, or whichever. I now think that if I had a formal answer to this question I would understand the issue substantially better. So this is my first challenge: <strong>how can you define an extrapolation/idealization of an agent&#8217;s preferences, if it doesn&#8217;t want you to?</strong></p>
<p>The point of this question is not that extrapolating an agent&#8217;s preferences against its interests seems useful; rather, it&#8217;s that there seem to be various potential problems with using the kind of proposal I advocate, and they might be broadly classified as &#8220;motivational problems.&#8221; So to make the difficulties more clear, it seems useful to consider this extreme case.</p>
<p>There are various ad hoc approaches around this difficulty, but they seem very problematic. For example, we could consider the agent&#8217;s behavior in a counterfactual that could actually arise, where it had some incentive to show its true colors because it would actually be making meaningful choices. Or we could consider the agent&#8217;s behavior in a counterfactual where it didn&#8217;t understand what was going on, or where it incorrectly believed that its output in the thought experiment would be used in a way that it would be happy with rather than one it would be unhappy with. But neither of these seems at all compatible with the agent learning a lot, both seem difficult to define due to the tricky counterfactuals (though maybe that&#8217;s a reasonable problem to attack if the other issues aren&#8217;t issues), and both seem very fragile and likely to break for unexpected reasons.</p>
<h3>Challenge #2</h3>
<p>There is an alternative view, which says that the way to formalize &#8220;what people want&#8221; or &#8220;what they believe&#8221; is to have a better understanding of how the human brain works / how decision theory works, and then to extract &#8220;what someone wants&#8221; directly from their behavior. This motivates further investigation of just what the brain is doing, or at least optimism that the philosophical problem will be in better shape once these empirical investigations have gotten underway. I see the charm in this proposal, but I have a fairly serious gripe: <em>I can&#8217;t imagine any realistic outcome of the empirical investigation that answered this question</em>.</p>
<p>I can imagine <em>outcomes</em> of the empirical investigation such that the philosophical problem goes away, it&#8217;s just that these don&#8217;t seem realistic. For example, we might learn that the human brain is <em>actually</em> just a bunch of heuristics implementing a very simple algorithm, of the form &#8220;Build model of the world, decide what actions lead to outcome X, and then perform those actions.&#8221; But this seems fairly implausible based on my understanding of the brain (I think most experts agree), and beyond that would actually be deeply philosophically surprising.</p>
<p>What I expect to see in fact is that the human brain is a collection of many interacting subsystems, under some selection pressure for improved reproductive fitness (which means locally that the brain and each of its parts is under selection pressure to fulfill a wide range of instrumental goals, to varying degrees), though even that picture is not very good. And then&#8230; what? I don&#8217;t see how we even extract a picture of what beliefs are &#8220;correct&#8221; on this picture, much less of what we &#8220;really value.&#8221;</p>
<p>Is there an intermediate picture, on which this understanding of the brain / decision theory will lead to a clear picture of human values? This is my second challenge: <strong>what could we plausibly learn about the brain and decision theory such that we would be able to define value extrapolation in a more direct way?</strong></p><br />  <a rel="nofollow" href="http://feeds.wordpress.com/1.0/gocomments/ordinaryideas.wordpress.com/408/"><img alt="" border="0" src="http://feeds.wordpress.com/1.0/comments/ordinaryideas.wordpress.com/408/" /></a> <img alt="" border="0" src="https://pixel.wp.com/b.gif?host=ordinaryideas.wordpress.com&#038;blog=30479764&#038;post=408&#038;subd=ordinaryideas&#038;ref=&#038;feed=1" width="1" height="1" />]]></content:encoded>
			<wfw:commentRss>https://ordinaryideas.wordpress.com/2014/08/27/challenges-for-extrapolation/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="https://0.gravatar.com/avatar/0a58766009bacac271106a2f92289d90?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">paulfchristiano</media:title>
		</media:content>
	</item>
		<item>
		<title>Specifying &#8220;enlightened judgment&#8221; precisely (reprise)</title>
		<link>https://ordinaryideas.wordpress.com/2014/08/27/specifying-enlightened-judgment-precisely-reprise/</link>
		<comments>https://ordinaryideas.wordpress.com/2014/08/27/specifying-enlightened-judgment-precisely-reprise/#comments</comments>
		<pubDate>Wed, 27 Aug 2014 01:28:11 +0000</pubDate>
		<dc:creator><![CDATA[paulfchristiano]]></dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://ordinaryideas.wordpress.com/?p=402</guid>
		<description><![CDATA[Suppose that I have in hand a perfect model of my decision-making process, and I am interested in using this to define what I would believe, want, or do &#8220;upon reflection.&#8221; That is, in general I can use this model to define my current best guess as to the answer, but I might also be [&#8230;]<img alt="" border="0" src="https://pixel.wp.com/b.gif?host=ordinaryideas.wordpress.com&#038;blog=30479764&#038;post=402&#038;subd=ordinaryideas&#038;ref=&#038;feed=1" width="1" height="1" />]]></description>
				<content:encoded><![CDATA[<p>Suppose that I have in hand a perfect model of my decision-making process, and I am interested in using this to define what I would believe, want, or do &#8220;upon reflection.&#8221; That is, in general I can use this model to define my current best guess as to the answer, but I might also be interested in talking about my &#8220;enlightened judgment,&#8221; if I knew all of the facts and considered all of the arguments and were <a href="http://intelligence.org/files/CEV.pdf">more the person I wish I were</a> and so on. Can we give a satisfactory formal definition of my enlightened judgment in terms of this literal model of my decision-making process?</p>
<p><span id="more-402"></span></p>
<p>This seems particularly important in the case of really hard questions, where even posing the question to me would be essentially impossible. For example, if you wanted to figure out which of two states of the universe I would prefer, it&#8217;s not clear how you could even articulate a question to ask me, or what you could show me so that I would be sufficiently informed to render a judgment. Yet it feels like there is some important sense in which I do have preferences between states of the universe, even those that are too messy and complicated for me to comprehend directly.</p>
<p>As in my last post, I&#8217;d like to add the disclaimer that I think it is unlikely that anything resembling this proposal will ever be used. Instead, I see the value as mostly coming from helping to show us where the largest difficulties might be hiding (in the plausible but still not too likely case where we need to resolve problems similar to this one).</p>
<p>I know of two broad approaches to defining a human&#8217;s &#8220;enlightened judgment:&#8221;</p>
<ol>
<li>Understand and define what it is that humans are doing when they decide, and when they err. From this directly define &#8220;what a person wants&#8221; or &#8220;what they would decide if they did not err&#8221; or some more complicated dynamic that depends on these and other similar notions.</li>
<li>Describe &#8220;ideal&#8221; conditions for reflection under which we think that a person would actually reach their enlightened judgment.</li>
</ol>
<p>Both of these approaches seem to me to have serious problems, especially as a philosophical account of what we mean when we talk about &#8220;enlightened judgment.&#8221; Fortunately I am not particularly interested in the philosophical question, and am more interested in pragmatic issues.</p>
<p>The main reason I care about enlightened judgment is that <a href="https://ordinaryideas.wordpress.com/2014/07/21/approval-seeking/">it would be nice</a> if society were able to build a machine which can understand and behave according to our enlightened preferences about what it would be good for the machine to do. So when deciding between [1] and [2], my question is: which of these approaches would be better for telling a machine what to do?</p>
<p>At the moment, completing project [1] looks like it would be quite difficult; the available routes forward look like fairly long shots, and I&#8217;m far from convinced that the project is even philosophically coherent or should be expected to have a clean or satisfying objective. Project [2] has the virtue of being potentially very straightforward, though it also runs a much greater risk of failing to mean what we want it to mean.</p>
<p>In this post I want to say more about why I consider approach [2] relatively plausible. This means a lot of getting into the details of what those ideal conditions would look like, and what people would do under them. I won&#8217;t really say much about the philosophical issues at stake here; instead this solution will largely lean on claims like &#8220;Whatever we might be able to do as part of project [1], if the ideal conditions are good enough then these could also be done as one step in project [2].&#8221;</p>
<p>I think the largest remaining concern is about the <em>motivations</em> of the person reflecting. For example, if I try to use procedure [2] to reason about the enlightened judgment of someone who doesn&#8217;t care about the underlying issue (or moreover about what I think about what they think about the underlying issue) I would run into difficulties, since they wouldn&#8217;t actually use the opportunity to reflect. In fact it seems like there are even more severe difficulties along these lines, which at a minimum substantially restrict the set of conditions under which this procedure can be applied. My current best guess is that they aren&#8217;t deal-breakers, but they do seem worth thinking about more. Three issues in this space are discussed at the very end of the post.</p>
<h3>Roadmap</h3>
<p>In <strong>section</strong> <strong>1</strong> I specify the problem statement in more detail.</p>
<p>In <strong>section 2</strong> I lay out the formal proposal.</p>
<p>The proposal is based on an interactive loop between a person and a powerful computing environment. In <strong>section</strong> <strong>3</strong> I describe the kinds of things which I expect the person to try to do in that computing environment. These aren&#8217;t technically part of the proposal, but (a) the person whose judgment is being extrapolated in this way will need to have some ability and inclination to implement this project in order for the proposal to be sensible, (b) they are important for evaluating the extent to which the proposal is sensible.</p>
<p>In <strong>section 4</strong> I sample some techniques that I think the person could use, and general reasons to be optimistic that they may be able to succeed at the goals outlined in section 3.</p>
<p>In <strong>section 5</strong> I will evaluate some objections to this proposal.</p>
<h1>1. Problem statement</h1>
<p>I&#8217;ll <a href="https://ordinaryideas.wordpress.com/2014/08/24/specifying-a-human-precisely-reprise/">take as given</a> some process P that implements the input-output behavior of a given person (or group of people) inhabiting a comfortable environment, interacting with a computer terminal, over a period of (say) a month. The problem statement is then to define a process, potentially calling P as a subroutine, which outputs the considered judgment of that person (or group of people) about a given question Q.</p>
<p>It&#8217;s worth emphasizing that we are just looking for a <em>definition</em>, not an <em>implementation</em>. So none of the processes we describe need to be tractable, and they won&#8217;t ever get run. They will merely be reasoned about.</p>
<p>I&#8217;ll also imagine that the process P has appropriate access to facts about the world; for example, it might have the entire contents of the internet, and copious video recordings of our world.</p>
<p> </p>
<p>The question Q is not in fact very important. Almost all of the work will be done by the transformation from P into a much more powerful process. I imagine the process P reflecting for a very, very long time prior to even looking at the question Q.</p>
<p>On this perspective, for example, it makes basically no difference what format Q is written in. We might even be able to skip providing the question Q at all, and leave it to the program to infer the context from which it is being invoked, and what answer would be appropriate to provide in that context.</p>
<h1>2. The definition</h1>
<h3>Basic setup</h3>
<p>Our definition will have the form of an interactive loop between P and a particular computing environment E. For example, if E were an interactive Python session, we could define the result of P interacting with it by describing a loop that alternates between advancing P by one millisecond and advancing E by one millisecond.</p>
<p>We will then output the result of that interaction. So the problem comes down to designing the computing environment E with which we want P to interact. Needless to say, the interaction between P and a normal interactive Python interpreter would not result in a particularly &#8220;enlightened&#8221; judgment.</p>
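<p>As a minimal sketch of this loop in Python (with toy stand-ins of my own for P and E, since the real P is a person-model and the real E is defined in the sections below):</p>
<pre><code>class EchoEnvironment:
    """Toy stand-in for E: a shell that just echoes whatever it receives."""
    def step(self, msg):
        return None if msg is None else f"E saw: {msg}"

class ScriptedPerson:
    """Toy stand-in for P: emits a fixed script of inputs, one per time slice."""
    def __init__(self, script):
        self.script = list(script)
    def step(self, msg):
        return self.script.pop(0) if self.script else "done"

def interact(P, E, max_steps):
    """Alternately advance P and E by one time slice, relaying messages between them."""
    msg_to_E, msg_to_P = None, None
    for _ in range(max_steps):
        msg_to_E = P.step(msg_to_P)   # advance the person-model by one slice
        msg_to_P = E.step(msg_to_E)   # advance the computing environment by one slice
    return msg_to_P                   # whatever E last displayed

print(interact(ScriptedPerson(["print(1+1)"]), EchoEnvironment(), max_steps=3))
</code></pre>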
<p>Because we are merely interested in defining a process, we can talk about a computing environment with unbounded computational power. So at a minimum, we could take E to be an interactive shell with infinite computing power.</p>
<p>It&#8217;s not clear that this really improves the situation much (also it&#8217;s not yet formally defined; &#8220;a computer with infinite computing power&#8221; isn&#8217;t quite a thing). But I think it opens up some possibilities. In the next sections we will define some improvements to this environment, and in the following section I will define it formally.</p>
<p>This will complete the formal definition of E, and with it the entirety of our formal definition of enlightened judgment.</p>
<h3>Using infinite computation</h3>
<p>We can make the situation a little better by providing a shell equipped with some helper functions.</p>
<ol>
<li>An implementation of P itself.</li>
<li>An implementation of E itself.</li>
<li>Various utilities designed to allow different instantiations of E and P to interact with each other, to save the states of implementations of E and P, etc.</li>
<li>Whatever else we can think of. Utilities to inspect the specifications of E and P, to easily write modified versions, to inspect the available data, and generally to save labor and minimize the risk of error.</li>
</ol>
<p>We&#8217;ll take E to be such a shell, and let P<sup>+</sup> be the system consisting of P interacting with E.</p>
<p>Using these utilities, P<sup>+</sup> can begin to implement some fairly powerful operations. For example, P<sup>+</sup> could set up 10<sup>100</sup> copies of itself. Each of them could be sent a message describing a random approach to solving a particular problem. They could then participate in a tournament, where pairs of copies of P<sup>+</sup> debated which of them had come up with the better solution. The best solution could then be selected.</p>
<p>Also note that using these utilities, P<sup>+</sup> can implement new useful utilities, and then pass off control to a new instantiation of itself which has access to an enlarged set of utilities. So hopefully there won&#8217;t be much sensitivity to the detailed choices.</p>
<p>(I certainly wouldn&#8217;t advocate this kind of brute-force approach, for reasons that will hopefully be clear by the end of the post, but it gives some indication of what you could do using these utilities and infinite computing power.)</p>
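<p>For concreteness, here is a toy Python sketch of the tournament described above, at a miniature scale (the functions <code>spawn_copy</code> and <code>debate</code> are placeholders of my own; the real proposal would instantiate copies of P<sup>+</sup> and have them argue at length):</p>
<pre><code>import random

def spawn_copy(seed):
    """Placeholder for instantiating a copy of P+ seeded with a random approach."""
    random.seed(seed)
    return {"seed": seed, "proposal": random.random()}

def debate(a, b):
    """Placeholder for two copies debating; here the higher-scored proposal wins."""
    return a if a["proposal"] >= b["proposal"] else b

copies = [spawn_copy(i) for i in range(16)]    # 16 rather than 10**100
while len(copies) > 1:                         # single-elimination tournament
    copies = [debate(copies[i], copies[i + 1]) for i in range(0, len(copies), 2)]
print("winning copy:", copies[0])
</code></pre>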
<h3>Defining infinite computing</h3>
<p>There are still some details missing from this specification. For example, what is this computer with infinite computing power actually doing? How can it reason about the output of new copies of itself? Aren&#8217;t there problems with diagonalization, no matter how you define it?</p>
<p>A first pass definition would be a machine which can run arbitrarily long computations, but which hangs if it goes into an infinite loop. Such a machine can call itself, and is in many ways quite satisfactory. However, if the machine actually hangs, it would do a terrible job of reflecting the enlightened judgment of P, and we should endeavor to make this outcome as improbable as possible.</p>
<p>I should say at the outset: I don&#8217;t think these problems are serious ones at all, I think they are essentially annoying technical details. I think the standard for resolving them is: resolve them sufficiently well that we can completely ignore them.</p>
<p>One way to prevent this outcome is to change the machine so that it throws an exception if it enters an infinite loop. Unfortunately this attempted definition is impossible due to a standard problem with diagonalization: what is the output of the program f which calls itself, returns 0 if it sees an infinite loop, and enters an infinite loop if it sees the output 0?</p>
<p>A compromise solution is to allow &#8220;protected&#8221; calls, which throw an error if they reach an infinite loop, but to ensure that the protected calls can&#8217;t form a cycle. We can then require that each function called from the interactive shell E is protected, making it impossible for the overall system to throw an error. The tradeoff is that we now need to bound the number of times that P can delegate authority to a copy of itself (though we need not bound the number of copies that can be spun up in parallel).</p>
<p>One way of doing this is to simply label each such call with an integer, and to require these integers to decrease. This in turn requires labeling each invocation of P<sup>+</sup> with an integer, such that the integers do not increase. For example, we might use P<sup>+</sup>[100], which isn&#8217;t able to invoke further copies of P<sup>+</sup>[100] but is allowed to invoke copies of P<sup>+</sup>[99].</p>
<p>Since we never really cared about infinities, but were happy enough with astronomically large finite numbers, we could just use P<sup>+</sup>[2<sup>10000000</sup>], and I would be shocked if the distinction ever mattered. To make things even safer, we can index E&#8217;s by arbitrary ordinals. So for example, we can imagine P<sup>+</sup>[ω] which can invoke copies of P<sup>+</sup>[n] for any integer n (e.g. it can write a loop which instantiates a copy of P<sup>+</sup>[n] in the nth iteration).</p>
<p>And we can imagine P<sup>+</sup>[ω+1], which can invoke P<sup>+</sup>[ω], and we can go on in this way to P<sup>+</sup>[2ω] which can invoke any P<sup>+</sup>[ω+n]; to P<sup>+</sup>[ω<sup>2</sup>], P<sup>+</sup>[ω<sup>ω</sup>], and much larger ordinals. For our purposes, I think this capability is completely sufficient and indeed even overkill.</p>
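<p>A toy Python sketch of the labelling scheme (illustrative only: labels here are pairs (a, b) standing in for the ordinal aω + b, compared lexicographically, and each invocation may only spawn invocations with strictly smaller labels):</p>
<pre><code>def strictly_smaller(label):
    """Yield a few example labels strictly below `label` (not exhaustive)."""
    a, b = label
    if b > 0:
        yield (a, b - 1)        # decrease the finite part
    if a > 0:
        yield (a - 1, b + 5)    # trading away an omega buys any finite amount

def p_plus(label, depth=0):
    """Stand-in for invoking P+[label]; it may only invoke smaller labels."""
    a, b = label
    print("  " * depth + f"P+[{a}*omega + {b}]")
    if depth >= 3:              # truncate the printout; termination is guaranteed
        return                  # by the well-founded ordering, not by this cutoff
    for child in strictly_smaller(label):
        p_plus(child, depth + 1)

p_plus((1, 2))                  # i.e. P+[omega + 2]
</code></pre>
<p>Because there is no infinite descending chain of such labels, no call tree built this way can recurse forever, which is the point of the restriction.</p>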
<p>From now on I am just going to go back to pretending that we have access to unlimited computational resources.</p>
<p>Of course we should also work with an idealized machine possessing infinite memory etc., and a sufficiently high-level language that we can run it on this idealized machine without running into technical difficulties.</p>
<h1>3. Goals</h1>
<p>It remains to speculate about what P might actually do, placed in this &#8220;ideal&#8221; environment. Indeed, it seems that P&#8217;s state of ignorance is roughly as severe as our own, and it&#8217;s not clear that we have offered any plausible route towards defining an enlightened judgment at all. In this section I will describe some of the projects and goals that I think P could pursue, each of which I think would leave it in a substantively better position than we are currently in.</p>
<h3>Infrastructure improvements</h3>
<p>One simple project for P<sup>+</sup> is improving its own infrastructure: building a better programming language with more useful tools, designing a nicer environment for itself to live in, and so on. P<sup>+</sup> can target changes that are extremely unlikely to change its enlightened judgment in a substantive or morally relevant way, but which make it more and more able to quickly and efficiently communicate its intentions formally, write new programs, explore new possibilities, and so on.</p>
<p>These improvements can build up across a massive number of generations. If this were the safest way forward, P<sup>+</sup> could avoid doing anything else until its ability to quickly write reliable code and visualize the behavior of programs was radically beyond anything we are familiar with today.</p>
<p>These might also include improvements in P itself. For example, if the program P only captures a few months of time before needing to spin off another copy, P<sup>+</sup> might work towards a version of P which persisted for years or longer (though this would naturally require more improvements in the environment, if we&#8217;d like to avoid insanity). P<sup>+</sup> could also experiment with more approaches for sharing information between different instances of itself.</p>
<h3>Simulations</h3>
<p>P<sup>+</sup> is equipped with some amount of data about the outside world. The sparsity of this data presents a challenge for P<sup>+</sup> contributing to progress in the empirical sciences in a normal way. However, P<sup>+</sup> has many advantages (see section [4]), and if it were possible to overcome the data sparsity problem it would likely be able to achieve radically more progress than we are familiar with.</p>
<p>For some areas of science, for example theoretical physics (or the psychology or group dynamics of copies of P, economics of societies populated by copies of P, etc.), P<sup>+</sup> might be able to make huge amounts of progress using the data it has available (over the course of billions of subjective years).</p>
<p>In other domains, P<sup>+</sup> can circumvent data sparsity problems by investigating simulations. Because P<sup>+</sup> has access to unbounded computing resources, it can pursue simulations at a greater scale, higher degree of fidelity, and under a greater range of parameters than we would normally consider possible.</p>
<p>For example, P<sup>+</sup> can consider vast numbers of possible universes governed by different laws of physics, and invoke copies of itself to inspect each of them, at many points in time and in many different ways. A sufficiently exhaustive search would eventually turn up worlds that resembled P<sup>+</sup>&#8216;s home in arbitrary detail, and a copy of P<sup>+</sup> assigned to inspecting that world from an appropriate vantage point would be able to notice this. This could give P<sup>+</sup> access to a simulated copy of its home, together with infinite time to investigate, understand, and modify that simulation. It could also give P<sup>+</sup> access to many additional individuals and groups, which it could consult or take on board as peers.</p>
<p>Perhaps most relevantly, if any civilizations anywhere in the space of possible civilizations resolve the problems that P<sup>+</sup> is interested in, it could inspect their solutions, consider them at infinite length, and adopt them if they looked like adequate solutions. This suggests that at the point where P<sup>+</sup> is equipped to explore many simulations of different physical laws, it may already be in a position to &#8220;win.&#8221;</p>
<h3>Human enhancement</h3>
<p>Especially if P<sup>+</sup> were able to perform biology in simulation, but even if it were only able to inspect and work with the code defining P itself, P<sup>+</sup> could aim to improve its own capabilities. Over a long time period, a sequence of small changes could amount to radical increases in capabilities, while still preserving the spirit of P&#8217;s original intentions.</p>
<p>For example, P<sup>+</sup> could improve its own short-term or long-term memory, its ability to think carefully or quickly, or its lifespan. P<sup>+</sup> could simply acquire more knowledge; it could improve its ability to represent that knowledge well, to quickly train itself in new areas; it could build up more strategies for problem-solving, research, and planning, and more frameworks for developing and reasoning about knowledge.</p>
<h3>Psychology and philosophy</h3>
<p>The same psychology and philosophy projects that we might pursue in order to define enlightened judgment can also be pursued by P<sup>+</sup>. If it were to succeed at such a project it might simply be able to directly define its enlightened judgment and then use its computational resources to evaluate that. If not, these projects could still inform its approach to the other projects mentioned.</p>
<p>P<sup>+</sup> could work to become more aware of its own motives, and work to enshrine them in ways that reduced the risk of value drift while pursuing other projects. And so on.</p>
<h3>Artificial intelligence</h3>
<p>The same artificial intelligence projects that humans might pursue in order to build smarter agents that shared their values could also be pursued by P<sup>+</sup>. Even if it were impossible to build agents that shared P&#8217;s values when operating autonomously, P<sup>+</sup> would have access to a very broad range of approaches which leveraged the availability of infinitely many copies of P<sup>+</sup> (whether as overseers of other agents, participants in an institution, or components of an AI).</p>
<h1>4. Techniques and advantages</h1>
<p>When pursuing any of the projects above, P<sup>+</sup> would have a number of very significant and distinctive advantages, suggesting it is more likely to succeed (and to succeed in a way more faithful to its goals) than any real group of humans. Most of these advantages would be available from a very early stage.</p>
<h3>Taking it slow</h3>
<p>Perhaps most importantly, P<sup>+</sup> has the luxury of nearly unlimited time. In the real world, there are competitive pressures that lead us to work quickly, sometimes accepting some probability of error in exchange for faster progress. We also face risks of negative outcomes which force us to move more quickly than we might otherwise like. Even if P<sup>+</sup> had no other advantages, I think this one would probably be significant enough to carry the argument. There are a large number of advantages to proceeding slowly if your goal is maximizing the probability of success, including the ability to quickly back off from potential problems, to look at something from many perspectives, to take very small and safe steps, to develop as much machinery as is helpful to resolve the problem, and so on.</p>
<h3>A million pairs of eyes</h3>
<p>P<sup>+</sup> also has the luxury of instantiating an almost unlimited number of critics of any proposal that it is considering implementing. That is, it can generate a million copies of itself and ask each one to look in a certain direction for flaws, until finally being presented with the most compelling issues that any copy identified. It can also look over (or invoke copies to look over) processes at a very high level of detail, for example stepping through code line by line, or watching another copy&#8217;s judgment unfold at 0.001x speed. When arbitrating a dispute between two copies of P<sup>+</sup> that came to different judgments, the arbiter can invoke a thousand times more resources than either copy and make manipulation very difficult or impossible. If motivation allows, this process can be taken to comical lengths, for example developing an entire branch of mathematics or an entire civilization to characterize the behavior of a particular program with extremely high confidence.</p>
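<p>As a rough illustrative sketch of this fan-out-and-filter pattern (in Python, with a hypothetical <code>review</code> function standing in for a copy of P<sup>+</sup> examining one angle; none of these names come from the proposal itself):</p>
<pre><code># Toy sketch: fan a proposal out to many independent critics, each assigned
# one direction to look in, then keep only the most compelling issues.

def review(proposal, angle):
    # Hypothetical stand-in for a copy of the deliberator inspecting
    # `proposal` from one `angle`; returns a list of (severity, issue) pairs.
    return []

def most_compelling_issues(proposal, angles, keep=10):
    issues = []
    for angle in angles:
        issues.extend(review(proposal, angle))
    issues.sort(key=lambda pair: pair[0], reverse=True)  # worst issues first
    return issues[:keep]
</code></pre>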
<h3>Recursion and tail recursion</h3>
<p>A very simple but important fact for P<sup>+</sup>&#8216;s ability to get things done in limited time is that P<sup>+</sup> can recurse in two ways. It can spin off copies to work on subproblems. But if the end of its time approaches and the overall problem has not yielded, P<sup>+</sup> can also spin up copies to work on the original problem, now endowed with some additional info that P<sup>+</sup> has acquired, the ability to have discussions with copies of the current version of P<sup>+</sup>, and so on. In each of its lifetimes, P<sup>+</sup> only needs to get incrementally closer to its goal.</p>
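<p>A minimal sketch of the two recursive moves, in Python; <code>make_progress</code> is a made-up placeholder for one lifetime&#8217;s worth of work, not anything specified in the post:</p>
<pre><code># Toy illustration: ordinary recursion spins off copies on subproblems inside
# make_progress; tail recursion hands the *original* problem, plus accumulated
# notes, to a fresh copy when this instance runs out of time.

def work_on(problem, notes, budget):
    for step in range(budget):
        result, notes = make_progress(problem, notes)
        if result is not None:
            return result
    # Time is nearly up and the problem has not yielded: tail-recurse.
    return work_on(problem, notes, budget)

def make_progress(problem, notes):
    # Hypothetical stand-in for one bounded chunk of work.
    notes = notes + ["partial insight about " + problem]
    if len(notes) == 5:
        return "solution to " + problem, notes
    return None, notes

print(work_on("the original problem", [], budget=2))
</code></pre>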
<h3>Reviewing outputs</h3>
<p>P<sup>+</sup> is free to run whatever process it wants, and then to apply final sanity-checks to the output, ensuring that it is reasonable and avoiding mistakes.</p>
<h3>Infinite computing</h3>
<p>Having the <em>option</em> of using infinite computing seems likely to be a significant aid for P<sup>+</sup>, though it&#8217;s not as obviously and materially helpful as the other advantages mentioned. For example, if we think that understanding psychology is a useful step towards defining enlightened judgment, and that unlimited computational resources or unlimited numbers of simulated people could be used to improve psychology, then P<sup>+</sup> would be at an advantage. If we thought it was possible to design an AI that shared our values and could make better judgments, but that it would only make such good judgments by being astronomically slow, then that would be an option for P<sup>+</sup> but not for us. Moreover, P<sup>+</sup> would actually have the ability to do experiments with such AIs.</p>
<h1>5. Objections</h1>
<p>In general when evaluating a proposal like this, it seems instructive to consider the tradeoff between making a more careful or complicated definition ourselves vs. using a simple definition with an extra level of indirection (i.e. delegating the problem). The natural risks of using the extra level of indirection are that (a) you might mess up the indirection itself, and (b) the people you delegate to might mess up. Delegating less labor means taking more chances to mess up yourself, and leaving fewer to the people you are delegating to.</p>
<p>In this case we are delegating to the process P<sup>+</sup>, which seems much better equipped to do a good job than we are, and whose time is much less valuable. So my inclination is to delegate as much as possible while we are at it. Even implementing relatively basic functionality seems to go against our comparative advantage: it increases the time required by the project, and our probability of messing up is substantially larger than P<sup>+</sup>&#8216;s. (That is, we should stop improving P<sup>+</sup>&#8216;s environment once we think that P<sup>+</sup> can do it more reliably than we can.)</p>
<p>That said, there are a few natural concerns:</p>
<ol>
<li>We might mess up the first level of indirection; this is adding an extra layer of complexity, and therefore unnecessarily increasing the probability of a mistake somewhere.</li>
<li>We have some capabilities that P<sup>+</sup> lacks, in particular we have access to a research community with many diverse humans rather than a small group of people. We also have good empirical evidence that the research community can make progress, whereas we are less sure about this group.</li>
<li>P<sup>+</sup> is being asked to solve a much harder problem, e.g. since it probably has to solve physics and simulate our universe as a preliminary step.</li>
<li>Lots of things have to go right for this to work. That just seems generically improbable.</li>
<li>Although P<sup>+</sup> is better equipped than we are to make progress, it may not be so inclined.</li>
</ol>
<p>Of these, I consider the last the most serious objection. My overall view is that unless we had a clearer sense of the issues in the last bullet point we should be a bit concerned about implementing a procedure like this, and we would at a minimum need to take a lot of care in choosing people. My expectation is that these difficulties afflict most procedures for defining enlightened judgments.</p>
<h3>Messing up the indirection</h3>
<p>The relevant comparison we face is:</p>
<ul>
<li>Our ability to correctly specify P<sup>+</sup>, and P<sup>+</sup>&#8216;s ability to specify P&#8217;s enlightened judgment.</li>
<li>Our ability to correctly specify P&#8217;s enlightened judgment.</li>
</ul>
<p>I think that P<sup>+</sup> is substantially better able to specify P&#8217;s enlightened judgment, so we are gaining on that term; but how much do we lose by needing first to correctly specify P<sup>+</sup>?</p>
<p>My own best guess is that we don&#8217;t lose that much. Specifying P may be complicated but is relatively easy to test and is basically a common difficulty of any sensible approach (though in many settings &#8220;specify&#8221; may not involve so much precision, or may be able to take a very different approach). Specifying E appears to be extremely easy. Moreover, most ways in which the system could fail (which would result in the enlightened judgment being incoherent or nonsensical) would be easily detectable. The only failure cases that may be subtle to detect are those in which the process is &#8220;hijacked&#8221; by a goal-oriented agent, who is motivated not only to manipulate our definition of P&#8217;s enlightened judgment, but also to conceal this hijacking.</p>
<p>I think this is plausible as a result of messing up the indirection (rather than P<sup>+</sup> messing up), but not too likely. The main way it could happen is by the specification of P itself being hijacked, a possibility I have discussed in the past and which I think can probably be avoided if it is kept in mind.</p>
<h3>We have capabilities P<sup>+</sup> lacks</h3>
<p>This complaint becomes more severe the more minimal the specification of P becomes. If the specified P only persists for an hour, is a little bit crazier than the original, and lives in a psychologically straining environment, then there may be many respects in which we are better-informed decision-makers than P<sup>+</sup>. If the specified P involves a reasonable range of researchers in a comfortable environment, then the gap becomes smaller but still exists. Intermediate specifications carry intermediate risks. I don&#8217;t know how hard specifying more realistic and more varied versions of P would be, though my guess would be &#8220;not that much harder than specifying any P at all.&#8221;</p>
<p>Over the very long term I suspect that psychological stability and diversity of perspectives would have substantial value, e.g. for solving the kind of philosophical problems that P<sup>+</sup> might want to solve. However, the more pressing question is how important these capabilities are for P<sup>+</sup> in the first part of its task, where it is attempting to improve on the original specification of P.</p>
<p>One approach to this improvement involves a brute-force search through the space of possible worlds to find a world like ours, identifying people like us within it, and then expanding P<sup>+</sup> to consult those people for advice and incorporate them into the decision-making process. This looks to me like much more of an engineering problem than a philosophical problem, on which the value of a diversity of perspectives is much reduced (and can apparently be largely replaced with more effort by P itself). And if this problem is successfully resolved, then our advantage over P<sup>+</sup> is largely eroded.</p>
<h3>P<sup>+</sup> is being asked to solve a really hard problem</h3>
<p>Even once it has picked all of the low-hanging fruit of self-improvement and enhancement, it may seem that the problem P<sup>+</sup> faces is just way harder than the problems we face. The main way in which this problem may be more challenging is that it might require finding a simulation of a world like our own (a problem which we would consider fantastically difficult to actually solve here in the real world).</p>
<p>I think this is mostly a failure to imagine the impact of an astronomical number of people or of having a billion years. Theoretical physics seems like a relatively easy problem if you are allowed to do human-guided brute-force searches over possible hypotheses. Searching through all of the plausible physical theories seems quite feasible if you can simulate every program of fewer than a million lines of code and have someone look by hand through every one of them, searching for signs of life.</p>
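<p>To make the flavor of this brute-force search concrete, here is a toy sketch; the enumeration, the tiny alphabet, and the <code>shows_signs_of_life</code> check are all illustrative placeholders, and the real search over million-line programs is of course far beyond anything that could literally be run today:</p>
<pre><code>import itertools

def candidate_theories(alphabet, max_length):
    # Enumerate every "program" up to a given length over a toy alphabet;
    # in the thought experiment this would be all programs of up to a
    # million lines.
    for length in range(1, max_length + 1):
        for symbols in itertools.product(alphabet, repeat=length):
            yield "".join(symbols)

def shows_signs_of_life(theory):
    # Hypothetical stand-in for "simulate this candidate universe and have
    # someone inspect it by hand for signs of life."
    return False

def find_home_world(alphabet, max_length):
    for theory in candidate_theories(alphabet, max_length):
        if shows_signs_of_life(theory):
            return theory
    return None

print(find_home_world("01", max_length=8))
</code></pre>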
<h3>P<sup>+</sup> has to do lots of things right</h3>
<p>In some sense the success of this proposal rests on a substantial conjunction, with us successfully specifying P<sup>+</sup>, P<sup>+</sup> successfully carrying out a number of projects, defining successors who successfully carry out further projects, and so forth.</p>
<p>But in another sense the success of this proposal mostly rests on a massive disjunction. P<sup>+</sup> is being given an incredible amount of time to think and explore the possible approaches to its problem, and it needs only one of these to work. And those designing the protocol and hoping to actually deploy it would similarly have access to a wide range of procedures to improve and test the protocol.</p>
<p>In a similar sense, human technological progress to date seems to have depended on a huge conjunction of individual capabilities. We must develop fire, then better furnaces, then better metals to make better furnaces, then engines to make better engines to facilitate increasingly intricate chemical processes to make tiny machines to make even tinier machines and so on. The reason that we still see progress is that there are so many possible paths forward that even if 99% of them fail at every stage, there are still plenty that succeed.</p>
<p>I expect P<sup>+</sup> to be in fine shape for similar reasons. It also seems quite plausible that P<sup>+</sup> could resolve its problems without taking many steps, for example by just sitting and thinking for 30 years.</p>
<h3>P<sup>+</sup> may be capable, but not motivated</h3>
<p>Trigger warning: this section is much more aggressively weird than the rest of the post. Thinking about it may be unproductive, or in rare cases distressing. If you suspect you might be distressed by something in a blog post by me, probably stop here. If you suspect that you might lose respect for me because I&#8217;m too weird, you should stop here too.</p>
<p>So far I have only argued that P<sup>+</sup> would be capable of doing much better things than we would. But this would be of little use if P were not motivated to apply these capabilities in the intended direction. There are a number of particular reasons to be concerned about this possibility.</p>
<p><strong>Why be motivated at all?</strong> It&#8217;s not clear why P<sup>+</sup> would ever do anything to reflect P&#8217;s enlightened preferences, instead of e.g. playing games all day or going off and building a rich and interesting civilization. There is an implicit assumption that P cares about making its enlightened preferences known and so will put in the effort to do so.</p>
<p>This seems to be a rather severe difficulty with using this process to define the enlightened judgment of most people in most cases. The phrase &#8220;what would I believe if I thought longer&#8230;&#8221; is actually doing a lot of work already via the implicit counterfactual &#8220;what if I gave a damn about the question at issue&#8230;&#8221;</p>
<p>I don&#8217;t see a way around this problem, other than simply restricting the applicability of the scheme to people who care about having their enlightened preferences known. I think this points to a more general difficulty with attempts to define enlightened judgment, but not one I&#8217;ll get into.</p>
<p>I would note that the lives of the people in the simulation can really be quite good, especially if P is a model of a large group. (If P is not a model of a large group, their selfish interest may provide them a further reason to bootstrap up to the phase where P<sup>+</sup> can invoke simulations of many different individuals). That is, the effectiveness of the scheme is only very modestly affected if the participants decide to enjoy themselves most of the time, and only work for a few hours a day. Depending on their views about simulations, they may also be able to &#8220;retire&#8221; when their month is up, running simulations of themselves under whatever conditions they like.</p>
<p>So it&#8217;s not necessarily the case that someone needs a massive motivation to make their enlightened preferences known. They may already be able to live a pretty awesome life, and the question is merely whether they care enough about making their enlightened judgment known that they are willing to make modest sacrifices. I think that for many people, knowing that this is &#8220;their purpose,&#8221; they would be happier spending a small fraction of their time working on it. But nevertheless, this is still a material restriction, and the protocol would be expected to work better with people who cared more about the project.</p>
<p><strong>Views about simulations</strong>. One way in which P differs from the original is that it has different beliefs about its own nature. For example, suppose P were a model of me, the me of this very moment, and we were to ask &#8220;what would P do, if they found themselves in a featureless room with a computer terminal?&#8221; <em>I</em> don&#8217;t expect to find myself transplanted to a featureless room with a computer terminal. And so, if I found myself in the room, I would respond with extreme surprise.</p>
<p>But after a while surprise would give way and I would start to ask further questions. In the hypothetical, my experiences would seem just as real to me as they do right now. I would wonder, why is that? Have I been wrong, about what it is that makes experiences real; about what it is to exist? The original thinks that obviously P is just being considered in a counterfactual; P may feel differently about it.</p>
<p>Predicting what I would actually do is a hard question. It&#8217;s not the kind of situation in which people regularly find themselves. For many people, I think that once they got over the shock they would assume that they were in a simulation, or that there was some kind of trick, or maybe in weird cases they would revise their metaphysics. In any case, they would become very interested in the nature of the reality they now found themselves in. Myself, I like to think that I&#8217;m sensible enough that I&#8217;d correctly infer that I&#8217;m merely in a hypothetical, and I&#8217;d make decisions appropriately. But it&#8217;s hard to say.</p>
<p>And what would they infer about the reality they found themselves in? It&#8217;s hard to say, but it&#8217;s certainly not clear that their behavior would be anything like the proposal we had in mind. Perhaps they would just even more strongly accept the self-interested argument to enjoy themselves, believing that they are probably in a simulation rather than an elaborate thought experiment.</p>
<p>Maybe instead they would doubt that their simulation is as it appears, and be concerned that a simulator would make their future a happy or sad one based on what they did. They would then be in the business of speculating about the simulator&#8217;s motivation, and it&#8217;s hard to say how that would end. This is the human version of the problem I described <a href="https://ordinaryideas.wordpress.com/2012/05/24/solomonoff-induction-and-simulations/">here</a>; to the extent that humans can avoid such problems it shouldn&#8217;t really be any more of a problem in this case than e.g. for the people actually implementing P<sup>+</sup>, who might as well also be in simulations. This makes me less concerned about the problem, though I understand that putting someone in such a radically unfamiliar situation is more likely to lead to trouble. (See the discussion at the end of this subsection.)</p>
<p>(This concern was first articulated as such by Vladimir Slepnev, though similar issues also cropped up in early discussions of the idea.)</p>
<p>Overall this difficulty makes it even more clear that this procedure couldn&#8217;t be used to extract most people&#8217;s enlightened judgments. In addition to possible concerns about capability and motivation, it&#8217;s also the case that many people would probably behave unpredictably once they found themselves in such an implausible situation.</p>
<p>If we are in the regime of just wanting to find anyone who responds to the situation in the intended way, then it becomes plausible that we could find good candidates.</p>
<p>The most principled solution to these difficulties would be to find the perfectly altruistic P that cared <em>only</em> about its enlightened judgment being known, but this seems like a stretch. The next best solution would be either to make a precommitment not to respond to such blackmail, or to find a human who used the kind of decision procedure that led them to reject blackmail on principle. A third best option would be to find humans who simply don&#8217;t buy the simulation argument for any number of reasons, e.g. they have a higher prior on being in a hypothetical than in a simulation, they use a very different framework for reasoning about these questions, etc. That said, if there were philosophical disagreement about the relevant issues, then there would be a significant risk that on reflection P&#8217;s views would change, and the problem would reemerge. I think that these are realistic ways to solve the problem at the stage of selecting a person or group, but they greatly limit the pool and are a further thing to keep in mind.</p>
<p>That said, I would not be surprised if after only a short while people came to adjust to the hypothetical as the new normal, accept plausible philosophical assumptions concerning hypotheticals, and generally return to the behavior which they originally intended, mostly ignoring the prospect of being in a simulation in the same way that we do now (and just as today, they could rest rationally secure in the knowledge that they are doing the right thing on altruistic grounds).</p>
<p>A positive aspect of this concern is that it involves confronting a fixed problem rather than an adversary who may behave in an unknown way. That is, we are concerned about a simulator manipulating P not by revealing any new data to P, but by the mere situation changing P&#8217;s outlook. This is a prospect which can be more plausibly evaluated before the fact. It also may be possible to run tests in low-stress environments to understand how people behave (or how smart agents expect people to behave) in these situations. Since all that matters will typically be how smart agents think the process would unfold, such experiments directly reveal the characteristics we care about. Of course the actual behavior of manipulating simulators may vary between low-stakes and high-stakes situations, but the psychological response of the humans involved is less likely to vary along with it.</p>
<p>The problem is sharpened if we expect P<sup>+</sup> to eventually learn a great deal about its universe, for example by simulating many universes and inspecting their properties. If in fact simulators think that it would be worthwhile to simulate it to affect its judgments, then P<sup>+</sup> would eventually learn this by seeing it rather directly (and simulators, realizing this, would be more keen to do so). At this point it is plausible that P<sup>+</sup>&#8216;s behavior would be under further pressure, as it might be able to see the prospect of extortion directly. Subprocesses that noticed this fact may be pressured to report it to their parents or to manipulate their parents in other ways, and it could propagate out to the original agent. There are various complications in this scenario I don&#8217;t want to dwell on, but I think the basic point goes through. I think this particularly extreme form of the problem can be in large part avoided by P constructing explicit precommitments (or taking other measures) in advance of coming into contact with this kind of observation.</p>
<p>Overall I think this issue is kind of muddled, and my current status is &#8220;it seems problematic but definitely not a clear deal-breaker.&#8221;</p>
<p><strong>Change of identity</strong>. Even if P didn&#8217;t respond dramatically to the possibility of being simulated, over time we should expect its identity to diverge from the identity of the original person. This might cause its preferences (or perhaps its beliefs) to change. As Wei Dai points out, this seems particularly and almost necessarily true for selfish preferences. P&#8217;s self-interest is no longer aligned with the original person&#8217;s self-interest, even if the model is perfect. So instead of eliciting the original subject&#8217;s enlightened preferences, we&#8217;ll get P&#8217;s enlightened preferences.</p>
<p>I consider this a genuine problem, but not a massive one. There are a number of reasons for my nonchalance:</p>
<ul>
<li>It seems probably possible to find people who are willing to participate in the experiment and who are pretty happy to &#8220;do their job&#8221; of reporting the original&#8217;s preferences, even if on reflection the original would conclude that they care about selfish values. This is most difficult if people reliably converge to similar views upon reflection and these views are mostly selfish; I consider this plausible but not too likely, and even conditioned on this I think that finding someone who would execute the plan is hard but not impossible.</li>
<li>From my current vantage point, I am not so concerned if a simulation of me replaces me in the universe, so even if they care a lot about <em>them</em> per se (and want to replace me with them) I don&#8217;t mind too much. From behind the veil of ignorance, it&#8217;s not something worth sacrificing value for.</li>
<li>I don&#8217;t think that very narrowly selfish values are a large part of what the resources in the universe will get spent on, due to diminishing returns and expanding circle type dynamics. I think that everyone involved can coexist relatively peacefully with respect to selfish preferences. I recognize that this view could easily be overturned by evidence.</li>
</ul>
<p>Overall I would agree that this is a philosophical issue worth thinking more about if this approach actually looks like something we might do, and it&#8217;s not one which can be easily delegated via this procedure.</p><br />  <a rel="nofollow" href="http://feeds.wordpress.com/1.0/gocomments/ordinaryideas.wordpress.com/402/"><img alt="" border="0" src="http://feeds.wordpress.com/1.0/comments/ordinaryideas.wordpress.com/402/" /></a> <img alt="" border="0" src="https://pixel.wp.com/b.gif?host=ordinaryideas.wordpress.com&#038;blog=30479764&#038;post=402&#038;subd=ordinaryideas&#038;ref=&#038;feed=1" width="1" height="1" />]]></content:encoded>
			<wfw:commentRss>https://ordinaryideas.wordpress.com/2014/08/27/specifying-enlightened-judgment-precisely-reprise/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
	
		<media:content url="https://0.gravatar.com/avatar/0a58766009bacac271106a2f92289d90?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">paulfchristiano</media:title>
		</media:content>
	</item>
		<item>
		<title>Specifying a human precisely (reprise)</title>
		<link>https://ordinaryideas.wordpress.com/2014/08/24/specifying-a-human-precisely-reprise/</link>
		<comments>https://ordinaryideas.wordpress.com/2014/08/24/specifying-a-human-precisely-reprise/#comments</comments>
		<pubDate>Sun, 24 Aug 2014 21:43:25 +0000</pubDate>
		<dc:creator><![CDATA[paulfchristiano]]></dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://ordinaryideas.wordpress.com/?p=399</guid>
		<description><![CDATA[Suppose I want to provide a completely precise specification of &#8220;me,&#8221; or rather of the input/output behavior that I implement. How can I do this? I might be interested in this problem, for example, because it appears to be a primary difficulty in providing a precise specification of &#8220;maximize the extent to which I would approve of your [&#8230;]<img alt="" border="0" src="https://pixel.wp.com/b.gif?host=ordinaryideas.wordpress.com&#038;blog=30479764&#038;post=399&#038;subd=ordinaryideas&#038;ref=&#038;feed=1" width="1" height="1" />]]></description>
				<content:encoded><![CDATA[<p>Suppose I want to provide a completely precise specification of &#8220;me,&#8221; or rather of the input/output behavior that I implement. How can I do this? I might be interested in this problem, for example, because it appears to be a primary difficulty in providing a precise specification of &#8220;maximize the extent to which I would approve of your decision upon reflection.&#8221; (I <a href="https://ordinaryideas.wordpress.com/2014/07/21/approval-seeking/">have suggested</a> that we would be happy with a powerful AI that made decisions according to this maxim.)</p>
<p>I have written about this issue in <a href="https://ordinaryideas.wordpress.com/2011/12/14/specifying-humans-formally-using-an-oracle-for-physics/">the past</a>; in this post I&#8217;ll outline a slightly improved scheme (now with 100% fewer faraday cages). The technical changes are relatively modest, but I&#8217;m also taking a somewhat different approach to the issue, and overall I think it seems much more like the kind of thing that could actually be done. I also want to take the opportunity to try to clarify and expand the exposition some, since I think that the amount of discussion and thought that this idea has gotten now vastly surpasses the amount of care that went into crafting the original exposition.</p>
<p>I welcome additional objections to this scheme. As usual I think the literal proposal laid out here is extremely unlikely to ever be used. However, finding problems with this proposal can still be useful for shedding light on the problem, and in particular on how difficult it is and where the difficulties lie.</p>
<p><span id="more-399"></span></p>
<h2>A note on induction</h2>
<p>In this post I&#8217;ll talk about the universal prior and Solomonoff induction; I don&#8217;t think anything in the post really depends on <em>how</em> we approach induction, but having a particular model in mind will help the discussion be more concrete. If our goal is just to produce any definition then we are free to use whatever kind of induction we like so long as we can define it formally; in other contexts it might be important to use a flavor of induction which is compatible with the architecture of a machine intelligence. But again, I think that everything in this post should be very broadly applicable.</p>
<h1>A precise enough specification for human consumption</h1>
<p>First, the goal: I would like to define an interactive machine <em>f</em> : (Stream of inputs) → (Stream of outputs) which captures my input/output behavior. That is, I should be able to call <em>f</em>(&#8220;What is your favorite color?&#8221;) and get the output &#8220;Grey,&#8221; at least for some appropriate encoding of the question and answer. I should also be able to simulate interactions using <em>f</em>. For example, if I had a function <em>f</em> capturing my input/output behavior and a function <em>g</em> capturing your input/output behavior, I should be able to connect them up appropriately and simulate a conversation between us.</p>
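<p>In code, the object I have in mind is just a stateful function from incoming messages to outgoing messages, and connecting two of them yields a conversation. The following Python sketch is only a schematic of that interface, with scripted replies standing in for actual behavior:</p>
<pre><code># Schematic interface: an "agent" maps a stream of inputs to a stream of
# outputs. Here each agent is a stateful callable from one incoming message
# to one reply; alternating calls simulates a conversation.

class ScriptedAgent:
    """Toy stand-in for f or g: replies from a fixed script."""
    def __init__(self, replies):
        self.replies = list(replies)

    def __call__(self, message):
        return self.replies.pop(0) if self.replies else "..."

def converse(first, second, opening, turns):
    transcript, message = [], opening
    for i in range(turns):
        speaker = first if i % 2 == 0 else second
        message = speaker(message)
        transcript.append(message)
    return transcript

g = ScriptedAgent(["What is your favorite color?", "Just curious."])
f = ScriptedAgent(["Grey.", "Why do you ask?"])
print(converse(g, f, opening="hello", turns=4))
</code></pre>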
<p>So the first step of the proposal is to isolate some input and output channels. These aren&#8217;t something that a human &#8220;comes with&#8221; naturally, and so we have to make a substantive decision about what format the input and output will take.</p>
<p>One choice that seems particularly attractive from an engineering perspective is to use a computer interface. That is, we can imagine a computer (which I&#8217;ll call the &#8220;terminal&#8221;) implementing the following process:</p>
<ul>
<li>Take some bits from the outside world. Interpret these as what to display on the screen. You can do this 60 times a second or whatever. (We could include audio as well in the same way, or other output modalities.) I&#8217;m going to call these the <em>outputs</em> of the terminal. I&#8217;m sorry if this is confusing, but the alternative is also confusing.</li>
<li>Whenever you get input from the human, feed it back to the outside world. This input could be in the form of key presses, or recordings from a microphone, or images over a video camera, or whatever other input device we can connect to a computer. Similarly, we can do this 1000 times a second or whatever. I&#8217;m going to call these the <em>inputs</em> to the terminal.</li>
</ul>
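<p>Schematically, the terminal is just a loop shuttling bits in both directions; a minimal sketch (the frame rates, encodings, and device hookups are placeholders, not part of the proposal):</p>
<pre><code># Toy sketch of the terminal process: bits from the outside world get rendered
# on the screen (the terminal's *outputs*), and whatever the human does at the
# keyboard gets sent back out (the terminal's *inputs*).

def run_terminal(read_bits_from_world, display, read_keypresses, send_to_world, frames):
    for tick in range(frames):
        frame = read_bits_from_world()   # e.g. one screenful, roughly 60 times a second
        display(frame)
        keys = read_keypresses()         # e.g. polled roughly 1000 times a second
        send_to_world(keys)

# Tiny demo with canned data standing in for real devices.
shown = iter(["What is your favorite color?"])
typed = iter(["Grey"])
outbox = []
run_terminal(lambda: next(shown, ""), print, lambda: next(typed, ""), outbox.append, frames=1)
print(outbox)
</code></pre>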
<p>Now if we imagine me sitting at a computer running this program, we have a function <em>f</em>: its inputs are the bits being fed to this computer by the outside world (which are then displayed on the screen), and its outputs are the bits being fed back by the computer to the outside world (which describe the things I typed). Note that the inputs of <em>f</em> are the outputs of the terminal, and the outputs of <em>f</em> are the inputs to the terminal. Sorry!</p>
<p>Indeed, I could use this function <em>f</em> to ask a question like &#8220;what is your favorite color&#8221; (for example by feeding it inputs leading it to display &#8220;What is your favorite color?&#8221; on the screen) and then get back an answer. I could connect up two copies of <em>f</em>, together with some intermediate processing, and simulate a video call between two copies of myself. And so on.</p>
<p>My responses depend not only on what I have seen from the terminal, but also everything else that is happening in the world. In the original proposal I suggested encapsulating the whole system in a well-insulated box, to make the external influences as simple as possible (and make it as easy as possible to model the contents of the box without thinking of the rest of the world).</p>
<p>This time, I am not going to make that recommendation. Even if I allow the rest of the world to influence me, there is still some function <em>f</em> describing the input/output behavior of the system. It may be a more complicated system, but it is also one which better reflects the normal conditions under which humans operate. It also has other advantages: you can get food and other amenities, you can have a room bigger than a casket, and maybe you can even have periodic contact with the outside world to remain sane. That said, this concession also introduces some extra complications, which I&#8217;ll mention at the end of the post.</p>
<p>The timescales involved here need not be too long; I have in mind something like a day. Because we can re-run the simulation many times, and ultimately craft an improved definition, having only one day of contiguous intellectual work is not as significant a drawback as it appears. That said, it may be useful to work with much longer timescales such as weeks or months, and I don&#8217;t see immediate problems with doing so.</p>
<p>So far I&#8217;ve been talking about one person, but we could just as well talk about a whole group of people. We could also allow them to inhabit a larger environment rather than imagining them constrained to sitting at a computer; we could provide them with more resources in that environment, give them several terminals, and so on.</p>
<h1>A much more precise specification</h1>
<p>Having specified what exactly we mean by the input/output behavior a person (or some people) implement to our own satisfaction, it remains to specify this input/output behavior in a way that can be communicated to the intended audience.</p>
<p>If we are talking to someone who shares a basic conceptual language with us, we might just be able to give them a description like the one above. I think that many readers will at this point have the same image in mind that I do, modulo some details. I could also imagine machines that would in a similar sense be able to communicate with me. Of course there are still many details of the setup which we would need to specify, for example by picking a particular venue, particular people and states of mind for those people, protocols for interaction, etc. However, that could all be specified extremely unambiguously by actually setting up the physical experiment: gathering the appropriate people to the appropriate place, setting out the computers that will be used, choosing a time, etc. Moreover, I think we should endeavor to use these concepts in a way which is <a href="https://ordinaryideas.wordpress.com/2014/07/21/approval-seeking/">robust to imprecision</a>, and so I think it is plausible we could use a natural language description that didn&#8217;t involve physically setting up the experiment or specifying every detail.</p>
<p>More pessimistically, we might imagine trying to communicate this idea to a more alien mind, which either lacks the conceptual framework we use to communicate these ideas to each other, or which has this conceptual framework but lacks a common language to talk about it. I think that even in this case we can provide a relatively satisfactory definition. I find this particularly comforting; it suggests to me that for a very broad range of machine intelligences, we will be able to communicate this proposal.</p>
<p>I think that this more pessimistic assumption is more likely to be warranted if we are interested in using these concepts to define the goals of an agent, because I think that normal approaches to communication between agents <a href="https://ordinaryideas.wordpress.com/2014/08/23/straightforward-vs-goal-oriented-communication/">may break down in this case</a>. </p>
<p>In the pessimistic setting, I want to use a two-step strategy for communicating this concept:</p>
<ol>
<li>Acquire some training data, which I think is sufficient to pin down the concept.</li>
<li>Specify how the concept relates to the training data, and how to obtain the concept using induction on the training data.</li>
</ol>
<h1>The training data</h1>
<p>I want to include two kinds of training data.</p>
<p>The first is a huge stock of observations of the world, of the kind that we might want our AI to have anyway in order to build a good model of the world. These might be passive observations, or it might be important that they include interactions between the AI and the world which allow it to actively improve its model. The assumptions I want to make of this data are that (1) it is sufficient to allow the AI to build a pretty good model of the world, and (2) it includes observations of the individuals and environment which will be used in the final construction.</p>
<p>In practice I think it is quite possible that you would not even have to include the first set of training data explicitly, because an AI reasoning about the world will already have a wealth of such data and will reason in such a way that this data automatically informs its inferences about a new batch of training data. But if we don&#8217;t have that data already, or it wouldn&#8217;t automatically affect how induction about a new concept worked, we should include it explicitly.</p>
<p>The second class of data is used to directly define the input-output behavior <em>f</em>, as defined in the first section. It is defined in the same way as in my original proposal. That is, to collect it we will do the following:</p>
<ol>
<li>We will actually get together the people or person that we want to model, and bring them to the intended venue under the intended conditions. They will actually spend a day (or whatever) interacting with the intended terminal (which would probably be an actual computer).</li>
<li>We will choose what data to provide them over that computing terminal. For example, we might choose to have a video call with them. We might choose to let them interact with a programming environment, or answer moral questions, or whatever. Ideally we would include a wide variety of kinds of interaction, and for safety&#8217;s sake we would make sure that there was sufficient stochasticity in our choice of data.</li>
<li>We would record every output that was displayed on the terminal, and all of the inputs entered on the terminal. These will comprise the second set of training data.</li>
</ol>
<p>The hope is that a learned mapping from (Terminal outputs) to (Terminal inputs) will be precisely the desired function <em>f</em>. That is, the learned mapping will predict what the people <em>would have</em> typed, <em>if</em> something different had been displayed on the terminal.</p>
<p>We might also try including some terminal outputs that are likely to rattle participants or simulate aspects of relevant counterfactuals. This allows us to review their behavior at some length (to make sure we think they would have behaved well under the appropriate counterfactual) before actually deciding to use the training data in an important application. Where necessary, this could be facilitated by making the counterfactuals we actually care about look as similar as possible to those used in training.</p>
<p>The purpose of including stochasticity in the choice of data is to ensure that the map <em>f</em> has to actually make use of the (Terminal outputs) in order to compute (Terminal inputs), rather than just figuring out the (Terminal inputs) directly from physics. This is probably not necessary, given that the universe is pretty stochastic anyway, but it seems relatively cheap. To the same end, it would be nice to collect our first training data <em>before</em> generating the second training data, so that (Terminal outputs) are uncertain even given the first training data.</p>
<h1>The learning</h1>
<p>Now we have two pieces of data: a set of observations O about the universe, and a pair of sequences (Outputs, Inputs) from the terminal. We would like to extract a map <em>f</em> that correctly reproduces the counterfactual outputs of the people in the room, if they had been shown a different set of inputs.</p>
<p>The idea is that a parsimonious explanation of O will hopefully involve a good representation of all of the stuff happening in the environment. So if we have in hand a parsimonious explanation of O, the observations about <em>f</em> can be very simply described by just describing where and when the terminal is. More informally, for an agent who already knows what&#8217;s going on in the world, it should be able to quickly notice that the second set of training data is just what would be typed into a particular terminal in response to a particular output being displayed on the terminal.</p>
<p>More formally, let P be the simplest program (or a simplicity-weighted mixture of programs) such that:</p>
<ul>
<li>On input 0, it outputs the set of observations O.</li>
<li>On input 1 followed by a list of [Terminal outputs], it outputs the next [Terminal input], i.e. it outputs the thing that the humans type after observing a given sequence of screenshots.</li>
</ul>
<p>Our implementation of <em>f</em> is then given by the behavior of P on input 1.</p>
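<p>As a toy illustration of this selection rule (actual Solomonoff-style induction is uncomputable, so here the &#8220;programs&#8221; are just a small hand-supplied list of candidates with complexity weights; all of the names below are mine, not part of the definition):</p>
<pre><code># Toy sketch: among candidate programs, pick the simplest one that reproduces
# the world-observations O on input 0 and the recorded terminal inputs on
# input 1, then read f off as the behavior of that program on input 1.

def consistent(program, observations, terminal_pairs):
    if program(0, ()) != observations:
        return False
    for shown_so_far, typed_next in terminal_pairs:
        if program(1, shown_so_far) != typed_next:
            return False
    return True

def select_P(candidates, observations, terminal_pairs):
    # `candidates` is a list of (complexity, program) pairs; this stands in
    # for taking the simplest program under the universal prior.
    ok = [(c, p) for c, p in candidates if consistent(p, observations, terminal_pairs)]
    if not ok:
        return None
    return min(ok, key=lambda pair: pair[0])[1]

def make_f(P):
    # f maps a history of screens shown so far to the next thing typed.
    return lambda shown_so_far: P(1, shown_so_far)
</code></pre>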
<p>The purpose of requiring P to output O on input 0 is to force it to build up a parsimonious representation of its environment. This hopefully ensures that the second set of training data can be described by an extremely simple modification to this program, one that simply picks out the terminal from that environment.</p>
<p>This is intended to patch the original proposal&#8217;s difficulty with acquiring enough information about a person to uniquely determine them. Now most of the information required to specify me is actually being obtained from O, and the data from the terminal is only being used to pinpoint the input/output channel. I don&#8217;t know whether it is possible to know everything about me based on all of the recorded data about the world, but I think it is possible to know enough. First, you can infer the basic details of what I value and how I would behave, which I expect is good enough. Second, across the ensemble of all worlds like this one (i.e. with the same recorded history), each version of me gets the right total amount of influence. I think this is probably good enough, even assuming (I think somewhat implausibly) that there were big value divergences between versions of me in very similar worlds.</p>
<p>A significant concern with this procedure is the one <a href="https://ordinaryideas.wordpress.com/2011/12/15/hazards/">I</a> <a href="https://ordinaryideas.wordpress.com/2012/05/24/solomonoff-induction-and-simulations/">discussed</a> <a href="https://ordinaryideas.wordpress.com/2011/12/21/avoiding-simulation-warfare-with-bounded-complexity-measures/">before</a>, essentially that Solomonoff induction might end up believing the simulation argument and so this definition might be manipulated by motivated simulators. I now think this is somewhat less likely than I did a year ago, and overall I think that we have a good enough suite of options to avoid this issue if we are aware of it. For example, the proposal of finding <em>efficient, simple</em> predictors rather than merely <em>simple</em> predictors, seems like it might be able to avoid predictors which work by locating us in the physical universe, and instead focus on predictors which build a model of their environment.</p>
<p>Vladimir Slepnev&#8217;s and Wei Dai&#8217;s concerns about value changes continue to afflict this proposal. That is, it may be that someone interacting with the terminal in a counterfactual would have selfish preferences about what happens in the counterfactual rather than the real world, or would become quite confident that they are in a simulation and become concerned with how the simulator will respond to their decisions. I imagine that psychologically speaking, a more conventional environment would slightly alleviate these problems, but only very slightly. One consideration on our side when dealing with these difficulties is that they don&#8217;t involve dealing with unknown challenges; they just involve understanding people&#8217;s responses to an unprecedented but predictable situation.</p>
<h1>Difficulties from interacting with the world</h1>
<p>In contrast with my original proposal, this specification for &#8220;what I would say&#8221; actually depends on what is happening in the rest of the world. For the most part I think this is fine, and as mentioned it comes with some advantages. It introduces a few new difficulties though.</p>
<p>Namely, if &#8220;what I would say&#8221; is what would get typed on an actual physical terminal that exists in the world, then it can be subject to manipulation by anyone who cares about what I would say and could get access to the terminal. This is not intrinsically problematic, but it does mean that I should take care to ensure that e.g. an AI who cares about what I would say does not have any possible way to influence the experiment. As described above this is almost true for free, because the training data must be collected prior to creating an AI which cares about what I would say; as long as the AI can&#8217;t travel back in time it won&#8217;t be able to manipulate the results. The only thing to keep in mind is that there may be other agents who have an incentive to manipulate the experiment, and we should ensure that they don&#8217;t have an opportunity to do so. These issues are quite sensitive to e.g. the decision-theories used by the involved agents.</p>
<p>There is also a potential difficulty arising from the counterfactual in which the terminal displays unphysical messages. The consequences of this counterfactual may be significant. For example, people from the outside world who observed the phenomenon might radically change their attitude towards this experiment, and a small number of people might respond in an extreme way. So it seems sensible to limit the flow of information between the people interacting with the terminal and the rest of the world, even if we don&#8217;t go so far as to put the whole thing in a faraday cage with no connection to the outside world.</p><br />  <a rel="nofollow" href="http://feeds.wordpress.com/1.0/gocomments/ordinaryideas.wordpress.com/399/"><img alt="" border="0" src="http://feeds.wordpress.com/1.0/comments/ordinaryideas.wordpress.com/399/" /></a> <img alt="" border="0" src="https://pixel.wp.com/b.gif?host=ordinaryideas.wordpress.com&#038;blog=30479764&#038;post=399&#038;subd=ordinaryideas&#038;ref=&#038;feed=1" width="1" height="1" />]]></content:encoded>
			<wfw:commentRss>https://ordinaryideas.wordpress.com/2014/08/24/specifying-a-human-precisely-reprise/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
	
		<media:content url="https://0.gravatar.com/avatar/0a58766009bacac271106a2f92289d90?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">paulfchristiano</media:title>
		</media:content>
	</item>
		<item>
		<title>Straightforward vs. goal-oriented communication</title>
		<link>https://ordinaryideas.wordpress.com/2014/08/23/straightforward-vs-goal-oriented-communication/</link>
		<comments>https://ordinaryideas.wordpress.com/2014/08/23/straightforward-vs-goal-oriented-communication/#comments</comments>
		<pubDate>Sat, 23 Aug 2014 00:07:42 +0000</pubDate>
		<dc:creator><![CDATA[paulfchristiano]]></dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://ordinaryideas.wordpress.com/2014/08/23/straightforward-vs-goal-oriented-communication/</guid>
		<description><![CDATA[Will machine intelligences communicate with humans by directly exposing or reporting properties of their internal state, or will they tend to communicate by strategically choosing utterances that they think will have the intended effect on the listener? In this post I try to lay out the distinction more clearly and describe some differences. (Disclaimer: I [&#8230;]<img alt="" border="0" src="https://pixel.wp.com/b.gif?host=ordinaryideas.wordpress.com&#038;blog=30479764&#038;post=396&#038;subd=ordinaryideas&#038;ref=&#038;feed=1" width="1" height="1" />]]></description>
				<content:encoded><![CDATA[<p>Will machine intelligences communicate with humans by directly exposing or reporting properties of their internal state, or will they tend to communicate by strategically choosing utterances that they think will have the intended effect on the listener? In this post I try to lay out the distinction more clearly and describe some differences.</p>
<p><span id="more-396"></span></p>
<p>(Disclaimer: I expect this is a distinction that has been explicated elsewhere, but I&#8217;m not aware of it. Pointers are appreciated. I&#8217;m mostly writing this post because it is a distinction I want to make use of in upcoming posts.)</p>
<h2>The distinction</h2>
<p>If I ask someone &#8220;is there a book on the floor?&#8221; I expect them to answer the question by translating the words into a proposition in their internal language of thought, and then to translate their beliefs about that proposition back into words.</p>
<p>By contrast, I <strong>don&#8217;t</strong> typically expect them to listen to my question, use it to infer something about my mental state, backwards chain from their life goals, and then find an utterance which will lead me to respond in the way that best suits their ends. Clearly <em>some</em> amount of goal-oriented reasoning is present: for example, a good communicator tries to understand how their listener is thinking and chooses responses that will be understood or that will engender goodwill. But I think of it as a process which can modulate the much easier &#8220;straightforward&#8221; procedure, rather than the main dynamic.</p>
<p>I think most researchers expect machine intelligence to work in the same way: to translate a natural language utterance into a proposition in an internal language of thought, to assess that proposition, and to translate its beliefs back into language.</p>
<p>By contrast, I think most researchers <strong>don&#8217;t</strong> expect a question-answering system to listen to my question, use it to infer something about my mental state, backwards chain from some ultimate goal (such as user satisfaction), and try to find the utterance which will best satisfy that goal. I think that most researchers would agree that <em>some</em> amount of this goal-oriented reasoning is necessary to really match human-level question-answering performance. But again, it is seen as a process which modulates the &#8220;straightforward&#8221; procedure, rather than the main dynamic.</p>
<p>Let&#8217;s call the first of these two procedures &#8220;straightforward communication&#8221; and the second &#8220;goal-oriented communication.&#8221; Of course there is also a wide range of behaviors that are intermediate between these two, but I think there are two fundamentally different forces at work which can lead to useful communication.</p>
<p>A similar distinction is at work on the listening side. If I hear what you say, I can respond in two different ways: I can either directly translate your utterance into a logical form (or whatever) and manipulate that logical form, or I can treat your utterance as evidence and try to figure out what characteristics of your mental state would have led you to make that utterance. For example, you would say &#8220;I saw Bob this morning&#8221; if you have the memory of having seen Bob, which is most likely if in fact you saw Bob&#8211;and I can perform these inferences regardless of how different my internal model of the world is to yours.</p>
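<p>Treating an utterance as evidence is just Bayesian inference over what would have led you to say it; a minimal numerical sketch, with all probabilities made up for illustration:</p>
<pre><code># Toy Bayesian listener: how strongly should I believe you saw Bob, given
# that you said "I saw Bob this morning"? All numbers are invented.

prior_saw_bob = 0.1        # my credence before you said anything
p_say_given_saw = 0.6      # chance you'd mention it if you had seen him
p_say_given_not = 0.01     # chance you'd say it anyway (mistake, joke, lie)

joint_saw = prior_saw_bob * p_say_given_saw
joint_not = (1 - prior_saw_bob) * p_say_given_not
posterior_saw_bob = joint_saw / (joint_saw + joint_not)

print(round(posterior_saw_bob, 3))  # roughly 0.87: the utterance is strong evidence
</code></pre>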
<h2>The difference</h2>
<ul>
<li>Successful goal-oriented communication requires some common values between the speaker and the listener. If the speaker doesn&#8217;t want the listener to understand them, then there is no hope for goal-oriented communication.</li>
<li>For straightforward communication, a &#8220;near-miss&#8221; is most likely to lead to unclarity or a complete failure to communicate. For goal-oriented communication a &#8220;near-miss&#8221; in which the speaker&#8217;s goals diverge from those of the listener can lead to more perverse failures: the speaker may deliberately mislead or manipulate the listener, tell them what they want to hear, etc., without observable indicators of failure. Moreover, a goal-oriented communicator which works well in some contexts may fail to work well in other contexts, since its decision to communicate accurately is contingent on the belief that accurate communication is useful.</li>
<li>Successful straightforward communication requires the speaker and the listener to have a sufficiently similar internal representation, or to have an explicit procedure for converting between them. Even if the speaker could predict that language is being used in a different way or that the listener won&#8217;t understand a sentence, by default this wouldn&#8217;t lead them to change their behavior, since they are simply relaying their thoughts rather than strategically choosing language to achieve a goal (such as understanding).</li>
<li>Designing a goal-oriented communicator is conceptually straightforward, at least if we set aside the severe difficulties posed by resource limitations. Indeed, communication follows naturally from the desire to coordinate with or manipulate other individuals, and a &#8220;smart enough&#8221; goal-oriented agent of almost any kind would develop communication if it were useful for their goals. But it is not immediately clear how you would design a straightforward communicator at all.</li>
<li>Goal-oriented communication naturally adjusts to different listeners given a strong enough underlying reasoning process, while straightforward communication does not adjust at all without some explicit additional provisions.</li>
<li>Goal-oriented communication naturally replicates many features of human communication:
<ul>
<li>tending to use unambiguous language in unfamiliar situations, for surprising info, or in the presence of noise</li>
<li>using language (e.g. adjectives) in a context-dependent way</li>
<li>leveraging the assumption that the speaker is trying to be informative</li>
<li>employing poetic, elegant, or inspiring language, and so on. </li>
</ul>
</li>
<li>Straightforward communication might be used to communicate goals to an agent, while for goal-oriented communication this appears to beg the question, since a goal-oriented communicator must already have goals of its own.</li>
</ul>
<p>&#8220;Communication&#8221; here may mean something broader than natural language. For example, I may want to inspect the internal representations being used by a program. I might do this by having some straightforward procedure for translating those internal representations into something that I can understand. But this probably requires a &#8220;good enough&#8221; alignment between the internal representation of the program and the way that I think about the world. Alternatively, I might try to build an agent which strategically structures this information in a way which makes it understandable by me.</p>
<p>For a variety of reasons, particularly the desire to build systems that fail gracefully, I am interested in understanding the feasibility of straightforward communication. That said, the list of differences above mostly suggests to me that goal-oriented communication is a better default for powerful systems. It seems to me that humans are goal-oriented communicators, though we use our common language and cognitive architecture as a (significant) computational expedient, which in practice often leads us to communicate straightforwardly (and even to systematically delude ourselves rather than communicate strategically). Today, question-answering systems are mostly straightforward communicators.</p>
<h2>The problem</h2>
<p>There are very simple formal models of goal-oriented communicators (e.g. any formal model of prediction &amp; planning). At the moment, I&#8217;m not aware of good formal models for learning a &#8220;transparent&#8221; representation that could potentially be understood by humans, even neglecting resource limitations. I think there are a lot of plausible approaches to getting traction on this, but none of them seem particularly satisfying yet. I don&#8217;t know how much more work would be required to find satisfying solutions.</p>
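<p>As a rough illustration of why the goal-oriented side is so easy to write down, here is a toy sketch (the vocabulary, probabilities, and names below are invented for the example, not taken from any real system). The listener treats an utterance as evidence about the speaker&#8217;s state; the goal-oriented speaker plans against that listener model, while the straightforward speaker just relays its internal state through a fixed encoding, with no way to notice when the encoding misses:</p>
<pre><code># Toy sketch: a Bayesian listener, a goal-oriented (planning) speaker, and a
# straightforward speaker that relays its state through a fixed encoding.
MEANINGS = ["saw Bob", "saw Rob"]
UTTERANCES = ["Bob", "Rob", "someone"]

# The listener's model of literal usage: P(utterance | meaning).
LIKELIHOOD = {
    "Bob":     {"saw Bob": 0.9, "saw Rob": 0.1},
    "Rob":     {"saw Bob": 0.1, "saw Rob": 0.9},
    "someone": {"saw Bob": 0.5, "saw Rob": 0.5},
}

def listener_posterior(utterance):
    # Treat the utterance as evidence and infer the speaker's state (uniform prior).
    joint = {m: LIKELIHOOD[utterance][m] / len(MEANINGS) for m in MEANINGS}
    total = sum(joint.values())
    return {m: p / total for m, p in joint.items()}

def goal_oriented_speaker(meaning):
    # Plan: pick whichever utterance leaves the listener most confident in the intended meaning.
    return max(UTTERANCES, key=lambda u: listener_posterior(u)[meaning])

def straightforward_speaker(meaning, encoding):
    # No model of the listener: just relay the internal representation as given.
    return encoding[meaning]

print(goal_oriented_speaker("saw Bob"))                          # "Bob"
print(straightforward_speaker("saw Bob",
                              {"saw Bob": "someone", "saw Rob": "Rob"}))  # a near-miss encoding stays vague
</code></pre>
<p>The point of the sketch is only that the first speaker falls out of generic prediction and planning, while the second depends entirely on being handed a good encoding; nothing in it says where such an encoding would come from, which is exactly the missing piece.</p>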
<p>If we think that straightforward communication would be a useful tool for building robustly useful AI, this seems like it might be a worthwhile problem to work on: (1) it might directly improve our ability to make useful AI, and it might be better to work on this problem further in advance of the arrival of AI, (2) it might ultimately prove useful for overcoming computational limitations, if we think that human behavior is a good model for tractable intelligence, and (3) it might improve our ability to reason about and discuss future AI.</p>]]></content:encoded>
			<wfw:commentRss>https://ordinaryideas.wordpress.com/2014/08/23/straightforward-vs-goal-oriented-communication/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
	
		<media:content url="https://0.gravatar.com/avatar/0a58766009bacac271106a2f92289d90?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">paulfchristiano</media:title>
		</media:content>
	</item>
		<item>
		<title>Approval-seeking</title>
		<link>https://ordinaryideas.wordpress.com/2014/07/21/approval-seeking/</link>
		<comments>https://ordinaryideas.wordpress.com/2014/07/21/approval-seeking/#comments</comments>
		<pubDate>Mon, 21 Jul 2014 23:08:25 +0000</pubDate>
		<dc:creator><![CDATA[paulfchristiano]]></dc:creator>
				<category><![CDATA[AI Safety]]></category>
		<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://ordinaryideas.wordpress.com/?p=350</guid>
<description><![CDATA[I suggest that it would be desirable if powerful agents acted in the way we would most approve of, upon reflection. I argue that this definition can be made unusually precise, would lead to positive outcomes, and is robust to conceptual imprecision, philosophical errors, and unfavorable circumstances. I described this proposal in a previous post; however, [&#8230;]]]></description>
				<content:encoded><![CDATA[<p>I suggest that it would be desirable if powerful agents acted in the way we would most approve of, upon reflection. I argue that this definition can be made unusually precise, would lead to positive outcomes, and is robust to conceptual imprecision, philosophical errors, and unfavorable circumstances.<br /> <span id="more-350"></span></p>
<p>I described this proposal in a <a href="https://ordinaryideas.wordpress.com/2012/04/21/indirect-normativity-write-up/">previous post</a>; however, that post focused on technical details, and presented an implausible but extremely precise version. In fact I believe that this is a particularly promising angle for thinking about what we would like powerful agents to do. Here I&#8217;d like to look at the more qualitative picture and provide some of the basis for these intuitions.</p>
<h1>What exactly is the proposal?</h1>
<p>I want to define a utility function such that I would be happy if it were exceptionlessly maximized, for example by a powerful AI. The proposal is intended to be straightforward rather than particularly novel or ingenious.</p>
<p>The schema for my proposed utility function is: &#8220;The extent to which <strong>I</strong> would <strong>approve of the AI&#8217;s decisions and their consequences</strong> after <strong>an extensive process of reflection</strong>.&#8221; The three bolded terms require further specification.</p>
<p><strong>&#8220;I&#8221;</strong> refers to the decision-making process that I currently implement (rather than to a concept which could be modified by e.g. changing my physical constitution or making a persuasive argument). I mean to refer to an input/output behavior, without concern for its internal structure.</p>
<p><strong>&#8220;Extensive process of reflection&#8221;</strong> is used to refer to a situation in which a decision-making process is granted extremely extensive resources and may choose how to use them in order to arrive at a conclusion. In this case the conclusion is whether or not to approve of the AI&#8217;s behavior (see below). These resources are intended to be very extensive; for example, they should be sufficient to implement a successor (or any number of such successors) which is considered to better represent the original process&#8217;s &#8220;considered judgment&#8221;; this might be a smarter or more experienced version of itself, a committee of experts, a formalization of &#8220;what I would say if I didn&#8217;t make epistemic errors,&#8221; etc.</p>
<p>Note in particular that the original decision-maker may appoint a successor defined in terms of further stages of reflection, so that the process is effectively iterated until convergence.</p>
<p>Although this definition may appear very philosophical or vague, it can be made precise by clearly specifying &#8220;extensive resources&#8221; and the format in which the successor is described. Also note that &#8220;extensive resources&#8221; is intended to considerably exceed what is actually available to any agents existing in the real world, perhaps including access to unbounded computational resources which could be used to perform extensive searches for simple computations. The intractability of the procedure does not preclude the AI&#8217;s ability to reason about what its output would be, any more than it prevents us from reasoning about whether or not a particular action would be likely to be met with approval after massive amounts of reflection. At the same time, these extensive resources allow the evaluator to evaluate possibilities which are beyond the comprehension of any living person, and ensure that the evaluator can (predictably) see through any attempts by the AI to game the system.</p>
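<p>Schematically (this is only my own sketch of the shape of the definition; the function names and the integer &#8220;budget&#8221; are crude stand-ins for the resources being described), the iterated structure looks something like:</p>
<pre><code># Sketch: reflection as a loop in which the current decision-making process either
# returns a verdict in [0, 1] or appoints a successor (a smarter version of itself,
# a committee of experts, etc.) and hands over the remaining resources.
def reflect(process, question, budget):
    while budget:
        verdict, successor = process(question, budget)
        if verdict is not None:
            return verdict        # a number in [0, 1]
        process = successor       # delegate to the appointed successor
        budget = budget - 1       # bookkeeping for the resources used by this stage
    return 0.0                    # fail closed if reflection never settles on an answer

# Example: a process that immediately delegates to a (hypothetical) wiser successor,
# which then returns its verdict.
wiser = lambda q, b: (0.8, None)
initial = lambda q, b: (None, wiser)
print(reflect(initial, "do I approve of this behavior?", budget=3))   # 0.8
</code></pre>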
<p><strong>&#8220;Approve of the AI&#8217;s decisions and their consequences&#8221;</strong> may mean that the evaluator has access to the literal list of decisions made by the AI, and may then use its own great resources to evaluate their consequences in detail. It then outputs a number in [0, 1] representing the extent to which it approves of this behavior.  The AI&#8217;s goal is to maximize this number.</p>
<p>In a more extreme interpretation the evaluator may not have access to the outputs of the AI directly, but may be tasked with inferring them from what it already knows about the world, the code of the AI, and so on. I think this is a much more attractive, though less intuitively comprehensible, proposal: &#8220;outputs of the AI&#8221; does not seem to be a reliably meaningful concept, and in general it seems good to leave as much conceptual work as possible to the evaluator rather than to us. In this case the process of reflection would need to continue for a very long time, before a radically transformed version used a massive supply of resources to make appropriate inferences about the AI&#8217;s behavior. In this case the AI would also need to use a decision theory such as EDT/UDT/TDT, rather than CDT, since the AI&#8217;s behavior has no causal impact on the evaluator&#8217;s evaluation&#8211;the entire impact flows through what the evaluator will infer about how the AI would behave.</p>
<p>Note that the concept of &#8220;approval&#8221; need not coincide with some intuitive standard of behavior like &#8220;not actively bad;&#8221; instead, the evaluator can set the line arbitrarily. I use the expression &#8220;most approve of&#8221; rather than &#8220;think best,&#8221; because I want to emphasize the bounded nature of the judgments, and stress that the evaluator is free to avoid bad outcomes by disapproving of them, even if (in the AI&#8217;s judgment) there is a small probability that those outcomes would be judged extremely good. Without the boundedness condition I would not expect the procedure to lead to good outcomes, and if the evaluator exclusively judged the AI&#8217;s behavior by how good the resulting outcome was, I would not necessarily expect good outcomes.</p>
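<p>Putting the pieces of this section together in code-like shorthand (my own notation; the explicit distribution over worlds and the function names are placeholders rather than part of the proposal), the AI&#8217;s decision rule is just to maximize the expected value of the evaluator&#8217;s bounded score:</p>
<pre><code># Sketch of the objective: choose the action with the highest expected approval,
# where approval is whatever number in [0, 1] the reflected evaluator would output.
def expected_approval(action, worlds, approval_after_reflection):
    # `worlds` maps each possible world-model to the AI's probability for it.
    return sum(p * approval_after_reflection(action, w) for w, p in worlds.items())

def choose_action(actions, worlds, approval_after_reflection):
    return max(actions, key=lambda a: expected_approval(a, worlds, approval_after_reflection))

# Example with made-up numbers: two worlds, two candidate actions.
worlds = {"world_a": 0.7, "world_b": 0.3}
scores = {("act1", "world_a"): 0.9, ("act1", "world_b"): 0.2,
          ("act2", "world_a"): 0.5, ("act2", "world_b"): 0.8}
print(choose_action(["act1", "act2"], worlds, lambda a, w: scores[(a, w)]))   # "act1"
</code></pre>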
<h1>Why is this desirable?</h1>
<p>The most straightforward virtue of this proposal is that it is simple, can be made relatively precise, and appears likely to lead to good outcomes if implemented in the anticipated way. Indeed, there is some reason to think it would be as good as any proposal that we could think of (since we could arrive at such a proposal on reflection). See <a href="https://ordinaryideas.wordpress.com/2012/04/21/indirect-normativity-write-up/">this post</a> for an example of an attempt to make this proposal completely precise, and the later section of the same post for some arguments that the resulting outcome would be good. By contrast, I am not aware of many simple proposals that lead to desirable outcomes, or which have any realistic prospect of being rendered precise.</p>
<p>Perhaps more importantly, this proposal is <em>robust</em> in a number of very important respects:</p>
<p><strong>Robust to error: </strong>if we attempt to specify what we value directly, we seem to run into a host of potential errors. We may be wrong about what we want, may use the wrong framework for reasoning under normative or empirical uncertainty, may have an incorrect view of epistemology or have incorrect prior beliefs, may make decisions in the wrong way, may be vulnerable to blackmail, may be incorrectly obsessed with small probabilities of big payoffs, may incorrectly ignore small probabilities of big payoffs, may place too much or too little emphasis on further philosophical inquiry, <i>etc.</i></p>
<p>The approval-based proposal has the virtue that it can avoid essentially all of these problems, by (1) delegating the underlying philosophical questions to the process of reflection, which is in a substantially better position than we are to resolve them, and (2) allowing the evaluator to encourage the AI to revise arbitrarily fundamental aspects of its framework. For example, the evaluator could adopt a rule like &#8220;I approve of behavior if and only if it leads to the replacement of this AI by a new AI using framework Q.&#8221; This encourages an AI to reason at a very fundamental level about what kinds of design decisions its creators would have made differently given more reflection, and to make appropriate changes if it thinks the evaluator would prefer that it do so. It also allows the AI to make principled tradeoffs between course-corrections and more immediate concerns in a way which it believes best reflects the desires of its creators (since its creators can decide whether or not they feel it made such tradeoffs in an appropriate way and approve or disapprove as appropriate).</p>
<p><strong>Robust to unfavorable circumstances: </strong>many proposals based on the promise of active collaboration between machine and human intelligence assume either that such collaboration is nearly as productive as unilateral action by machine intelligences, or assume that users of machine intelligences will coordinate to make (potentially modest) productivity sacrifices in exchange for greater fidelity. But in reality &#8220;keeping a human in the loop&#8221; may be difficult, especially in the long run as human labor becomes very expensive compared to computing resources. This solution does not require any assumptions about the availability of human labor, and it allows an AI to develop as rapidly as possible.</p>
<p><strong>Robust to uncertainty:</strong> under realistic conditions a machine will be uncertain about what we would approve of. Because approval is bounded in [0, 1] there are limits on how much this uncertainty can distort its behavior. For example, suppose that the machine entertains a probability of 0.1 that we will behave in a very counter-intuitive way. If we behave in a counter-intuitive way, it may want to take counter-intuitive actions to maximize our approval. These might be actions which we would not approve of if we behaved in the intended way. However, because the machine can only receive 0.1 more expected approval by satisfying our counter-intuitive judgment, this consideration will never justify sacrificing more than 0.11 expected approval <em>conditioned</em> on our behaving in the anticipated way. It seems quite likely that we will have a large enough margin of error that this will make almost no difference to the long-run outcome (see the other points in this section).</p>
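<p>For concreteness, here is the arithmetic behind that bound (just my own check of the numbers in the paragraph above):</p>
<pre><code># Approval is bounded in [0, 1], and the machine gives probability p = 0.1 to the
# counter-intuitive case. Catering to that case gains at most p in expected approval,
# so it can only justify a sacrifice of x (conditioned on the anticipated case)
# when (1 - p) * x is at most p.
p = 0.1
max_gain = p * 1.0             # best case: approval jumps from 0 to 1 in the unlikely branch
max_sacrifice = p / (1 - p)    # the largest x with (1 - p) * x equal to p
print(max_gain, round(max_sacrifice, 3))   # 0.1 and 0.111, i.e. roughly the 0.11 in the text
</code></pre>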
<p><b>Robust to conceptual imprecision: </b>when we use language with humans, the best guarantee we typically obtain is that our words probably mean approximately the same thing to us and to them, in the kinds of situations we have encountered so far. The meaning may not be exactly the same, and in unfamiliar environments the concepts may diverge considerably. Let&#8217;s say that a concept communicated in this way is (at best) &#8220;probably approximately correct&#8221; (PAC).</p>
<p>Maximizing a PAC concept of goodness is likely to be problematic; if there are any possible outcomes on which the PAC concept and our actual concept diverge, then the process of maximizing the concept is unusually likely to push us into regions of divergence. These problems become even greater if the domain in which an agent operates is quite different from the one in which the concept was trained, for example if an agent is making decisions which will have implications for the long-run future.</p>
<p>However, the approval-seeking proposal appears to be robust to the use of PAC concepts for a number of reasons:</p>
<ul>
<li>&#8220;What I would say&#8221; may vary as the concept of &#8220;me&#8221; is varied, but on most readings of &#8220;me&#8221; I will tend to say things that are similar <em>on the axes that are most important to me</em>.  The variation due to conceptual uncertainty is channeled into dimensions where I am particularly ambivalent.</li>
<li>By channeling &#8220;what I would say&#8221; through a process of reflection, we can avoid asking &#8220;what I would say&#8221; in some extreme situation. For example, we need not ask &#8220;would I approve of your decision if I somehow understood all of the long-term consequences?&#8221;; this is good, since PAC concepts of &#8220;what I would say&#8221; might diverge in such an unusual and extreme case (in which we would have to extrapolate what it means to &#8220;really understand&#8221; something far outside of the usual range of human abilities). Instead, we can ask &#8220;What kinds of decision-making procedures do you think would be slightly wiser than your current self?&#8221; This is a question which can be posed to me as I currently exist, and two people who agreed about my behavior in a broad enough range of everyday situations would be likely to agree about my answers to this question.</li>
<li>The process of reflection and approval can be specified unusually precisely; for example, they can be given a purely formal definition in terms of a given input/output behavior. This ensures that only the concept of &#8220;I&#8221; is subject to significant conceptual imprecision.</li>
<li>The concept of &#8220;my decision process&#8221; also admits a relatively precise definition, which minimizes the extent of imprecision. This is because the physical instantiation of a person can be relatively cleanly separated from their environment, and can be transplanted from one physical environment to another. This includes, for example, an environment in which they interact with a computer. When a person interacts with a computer, they can implement a completely precise input/output behavior (the computer provides an input on the monitor, the human provides an output by typing). Putting these ingredients together, we can relatively unambiguously understand a human as implementing a precise input/output behavior.</li>
<li>The process of approval can act on behaviors in ordinary environments rather than merely acting on the long-term consequences of an AI&#8217;s decisions, and so can yield good outcomes even if no extreme situations are ever encountered. For example, we might approve of an AI&#8217;s behavior only if it avoids doing anything that unnecessarily reduces human influence, or we can approve of the AI&#8217;s behavior only if it avoids making radical changes to its environment. At the same time, the process of approval can meaningfully depend on the long-term consequences of outcomes, and not merely short-term outcomes.</li>
</ul>
<h1>Comparison with CEV</h1>
<p>This proposal has many important similarities to Eliezer Yudkowsky&#8217;s <a href="https://intelligence.org/files/CEV.pdf">CEV</a>; I read that document some time ago and I expect it has had an influence on my thinking. The primary difference is that this proposal relies on iterating the single relation &#8220;A &#8211;&gt; decision-making procedure that A endorses,&#8221; defined in a relatively straightforward way (for example, by presenting A with the tools to define a new decision-making procedure), rather than by relying on a more complex understanding of motivation, reflection, and idealization. One consequence is that this proposal is more amenable to formalization. Another is that this proposal leaves room for A to err (or for A&#8217;s replacement to err, and so on).</p>
<p>The difference seems to be due primarily to disagreement about &#8220;how good does the initial decision-making dynamic need to be, in order to avoid an unacceptable risk of significant error?&#8221; My overall view is that we can adopt a relatively straightforward rule such as this one and improve the rule with relatively straightforward modifications (such as taking a long time to think, consulting many copies of ourselves, having a discussion with a slightly enhanced version of ourselves to verify that their reasoning remains convincing to us, and a very wide range of less pithy techniques). If it is important to resolve hard philosophical problems, for example those required to define &#8220;volition,&#8221; this can then be done from a much-enhanced future state. It may be that a relatively straightforward rule is <em>not</em> sufficient, but in that case it seems clear that the existing human society is already doomed: after all, the process we are currently using to decide what successors to create is just one example of a particular simple rule for deciding on a decision-making procedure that we endorse.</p>
<p>A more charitable interpretation of this criticism is that <em>one person</em> operating in isolation is at a significant disadvantage when compared to the existing community of thinkers. I find this objection relatively uncompelling, because the move from &#8220;what I would do&#8221; to &#8220;what I would do with appropriate experimental feedback&#8221; to &#8220;what a bunch of copies of me would do with appropriate experimental feedback&#8221; to &#8220;what our society would do under ideal conditions with appropriate experimental feedback&#8221; does not seem very complex or difficult. I agree that it may be more robust to work out formal versions of those decision-making procedures in advance, prior to communicating the objective to an AI, rather than trying to bake them into the reflective process. However, if this is the only sticking point with this proposal, it seems to be an engineering difficulty which should be addressed in the future as the need arises.</p>
<p>Another possible complaint is that CEV is handicapped in a way that is advantageous, by preventing the individuals owning the AI from securing a personal advantage. By contrast, I am mostly interested in the transition to AI because I think it presents a unique risk that the influence of humane values will be diminished. From this perspective, I am happy with proposals which simply allow individuals to maintain control over AIs of their own construction, and I expect this would lead to a world with broadly similar prospects to our own. It may further be the case that a transition to AI could lead to massive inequality; I would prefer to tackle this as a separate (probably primarily political) problem. I am open to the possibility that technical work could help address that problem, but I don&#8217;t see a comparably good argument that we should be doing that technical work now. (If I wanted to work on that problem now, I would probably pursue policies aimed at promoting more equal ownership of capital, or policies aimed particularly at ensuring that the fruits of an AI-based productivity boost were broadly shared.) It may also be the case that one could use a transition to AI as an opportunity to achieve a substantial change in social values, but in general I would be quite hesitant about such proposals.</p>]]></content:encoded>
			<wfw:commentRss>https://ordinaryideas.wordpress.com/2014/07/21/approval-seeking/feed/</wfw:commentRss>
		<slash:comments>8</slash:comments>
	
		<media:content url="https://0.gravatar.com/avatar/0a58766009bacac271106a2f92289d90?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">paulfchristiano</media:title>
		</media:content>
	</item>
		<item>
		<title>Some rambling AI prognostication</title>
		<link>https://ordinaryideas.wordpress.com/2013/01/22/some-rambling-ai-prognostication/</link>
		<comments>https://ordinaryideas.wordpress.com/2013/01/22/some-rambling-ai-prognostication/#comments</comments>
		<pubDate>Tue, 22 Jan 2013 00:04:04 +0000</pubDate>
		<dc:creator><![CDATA[paulfchristiano]]></dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://ordinaryideas.wordpress.com/?p=244</guid>
<description><![CDATA[I want to get in the habit of sharing more of my unpolished thoughts about topics I consider important. The hope is to shift from an equilibrium where I say little (and therefore feel like I&#8217;m endorsing whatever I do say as having an unusually high quality, causing me to say even less), to an [&#8230;]]]></description>
				<content:encoded><![CDATA[<p>I want to get in the habit of sharing more of my unpolished thoughts about topics I consider important. The hope is to shift from an equilibrium where I say little (and therefore feel like I&#8217;m endorsing whatever I do say as having an unusually high quality, causing me to say even less), to an equilibrium where I say much more and feel more comfortable sharing unpolished thoughts. I think &#8220;quiet&#8221; is an OK equilibrium (most people who should read some of my thoughts shouldn&#8217;t read most of them, and it would make sense for me to try and be selective). But it seems like a suboptimal equilibrium, since there are at least a few people who do care what I think, often to better understand our disagreements.</p>
<p>A similar social problem is possible, where a broader intellectual community tends towards a &#8220;quiet&#8221; equilibrium in which any public speech comes with an implicit claim of interestingness and worthwhileness. I think intellectual communities vary considerably in where they sit on this spectrum. Bloggers qua bloggers often write quite a bit; academics in computer science tend to be much more guarded about what they say. I think there are some virtues to the noisier equilibrium, particularly in increasing our ability and inclination to notice, understand, and resolve disagreements.</p>
<p>Anyway, in that spirit, <a href="https://workflowy.com/shared/15df86ce-1b8e-57ca-dbb2-30a42d949a59/">here</a> is some of my thinking about AI&#8212;an outline of the development scenario I consider most likely, and a general discussion of the impacts of consequentialist automation. Criticism is welcome, and you can leave comments on the (quite hard-to-navigate) google doc <a href="https://docs.google.com/document/d/1tKCysndd8-SRWnXG6lp-XzmPppGYucmeG4r4lUfhjWY/edit">here</a>.</p>]]></content:encoded>
			<wfw:commentRss>https://ordinaryideas.wordpress.com/2013/01/22/some-rambling-ai-prognostication/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="https://0.gravatar.com/avatar/0a58766009bacac271106a2f92289d90?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">paulfchristiano</media:title>
		</media:content>
	</item>
		<item>
		<title>Improbable simple hypotheses are unbelievable</title>
		<link>https://ordinaryideas.wordpress.com/2012/12/11/improbable-simple-hypotheses-are-unbelievable/</link>
		<comments>https://ordinaryideas.wordpress.com/2012/12/11/improbable-simple-hypotheses-are-unbelievable/#comments</comments>
		<pubDate>Tue, 11 Dec 2012 01:33:27 +0000</pubDate>
		<dc:creator><![CDATA[paulfchristiano]]></dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://ordinaryideas.wordpress.com/?p=236</guid>
<description><![CDATA[Intuitively, it seems that any simple hypothesis should receive a reasonable prior probability. It seems unreasonable to assign a simple hypothesis a ridiculously low probability. In some sense this is a matter of preference, but here I&#8217;ll briefly argue: If H is a simple hypothesis and the prior probability of H is very small, then [&#8230;]]]></description>
				<content:encoded><![CDATA[<p>Intuitively, it seems that any simple hypothesis should receive a reasonable prior probability. It seems unreasonable to assign a simple hypothesis a ridiculously low probability. In some sense this is a matter of preference, but here I&#8217;ll briefly argue:</p>
<blockquote><p>If H is a simple hypothesis and the prior probability of H is very small, then there is nothing you could possibly see which would convince you of H.</p></blockquote>
<p><span id="more-236"></span>Which suggests that this is at a minimum a very <em>dangerous</em> thing to do (though I&#8217;m open to the possibility that there are things you should simply never believe, no matter what you see).</p>
<p>The argument is basically a formalization of the platitude that you can only ever get so much evidence, because hey, it&#8217;s not that unlikely that you are crazy. In fact that claim is false, as <a href="http://jsteinhardt.wordpress.com/">Jacob Steinhardt</a> was quick to point out when I made it recently. The possibility that you are crazy or in a simulation doesn&#8217;t preclude you making arbitrarily large updates. It just precludes making arbitrarily large updates in favor of simple hypotheses.</p>
<p>First, by &#8220;simple hypothesis&#8221; I mean a hypothesis which has a short description in whatever internal language you use to represent hypotheses. So &#8220;God exists&#8221; and &#8220;God doesn&#8217;t exist&#8221; count, so long as your brain assigns them subjective probabilities.</p>
<p>Second, by &#8220;prior probability&#8221; I really mean the probability you assign at some arbitrary point in time, in particular after updating on all of your introspective access. Then the result will imply that at no <em>future</em> time can you come to believe H, without forgetting the evidence you have already encountered.</p>
<p>Third, by &#8220;very small&#8221; I mean much smaller than 2<sup>-|H|</sup>. And by &#8220;much smaller&#8221; in that sentence I mean by a large factor (which doesn&#8217;t depend on H). 2<sup>100</sup> should be plenty.</p>
<p>Fourth, by &#8220;convince of X&#8221; I mean &#8220;cause you to believe X as the result of a correct Bayesian update.&#8221;</p>
<p>Finally, I&#8217;m going to make the assumption that you don&#8217;t assign negligible probability to skeptical scenarios (like the simulation hypothesis, insanity, or your experiences being an elaborate hoax). This can be justified by observing that such skeptical scenarios apply to a non-negligible fraction of observers in many reasonable worlds (for example, physical universes like ours). In fact, in many reasonable worlds skeptical scenarios apply to all observers. So in order to assign skeptical scenarios negligible probabilities, you&#8217;ve got to assign a whole lot of otherwise reasonable worlds negligible probabilities.</p>
<p>Having made those definitions the claim is now pretty straightforward. Consider some arbitrary evidence E you might receive, which might constitute substantial evidence in favor of H. I&#8217;ll exhibit a hypothesis Q such that P(H) &lt;&lt; P(Q) and P(E|H) = P(E|Q), where P is your prior. This implies that P(H|E) / P(Q|E) = P(H) / P(Q) &lt;&lt; 1, so in particular P(H|E) &lt;&lt; 1.</p>
<p>Q is the hypothesis: H is false, however your experiences are manufactured according to the distribution P( * | H). Clearly then P(E|Q) = P(E|H). To see that P(Q) &gt;&gt; P(H), consider the claim S: <em>a hypothesis H&#8217; was selected uniformly at random using my internal representation of hypotheses, and then my experiences were manufactured according to the distribution P( * | H&#8217;)</em>.</p>
<p>If you assign skeptical hypotheses reasonable probability, P(S) should not be too small. Moreover, P(Q|S) is at least 2<sup>-|H|</sup>. Thus P(Q) &gt;&gt; P(H), as desired.</p>]]></content:encoded>
			<wfw:commentRss>https://ordinaryideas.wordpress.com/2012/12/11/improbable-simple-hypotheses-are-unbelievable/feed/</wfw:commentRss>
		<slash:comments>7</slash:comments>
	
		<media:content url="https://0.gravatar.com/avatar/0a58766009bacac271106a2f92289d90?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">paulfchristiano</media:title>
		</media:content>
	</item>
		<item>
		<title>Are you special? Pascal&#8217;s wager, anthropic reasoning, and decision theory</title>
		<link>https://ordinaryideas.wordpress.com/2012/12/05/are-you-special-pascals-wager-anthropic-reasoning-and-decision-theory/</link>
		<comments>https://ordinaryideas.wordpress.com/2012/12/05/are-you-special-pascals-wager-anthropic-reasoning-and-decision-theory/#comments</comments>
		<pubDate>Wed, 05 Dec 2012 08:18:40 +0000</pubDate>
		<dc:creator><![CDATA[paulfchristiano]]></dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://ordinaryideas.wordpress.com/?p=221</guid>
<description><![CDATA[Here is an argument that some people might find compelling: It may be that the world is mad, and that as the only sane person around it falls on me to make sure we don&#8217;t all kill ourselves. If that&#8217;s the case, then my impact on the world may be huge. Let&#8217;s say that in [&#8230;]]]></description>
				<content:encoded><![CDATA[<p>Here is an argument that some people might find compelling:</p>
<blockquote><p>It may be that the world is mad, and that as the only sane person around it falls on me to make sure we don&#8217;t all kill ourselves. If that&#8217;s the case, then my impact on the world may be huge. Let&#8217;s say that in this case, I can improve the world by 1%.</p>
<p>Maybe the claim that I&#8217;m particularly influential, call it proposition P, isn&#8217;t certain. But at least there&#8217;s a good chance.  Subjectively it feels like about 1%, since if I looked at 100 similarly surprising facts, I would expect one of them to be true. (I wouldn&#8217;t be <i>that</i> surprised to discover that I&#8217;m the most important person ever&#8230;) That still leaves me with the ability to improve the world by 0.01% in expectation, which looks pretty good. I might as well not even worry about stuff I could do that would improve the world by a mere 0.001%, like being an extraordinarily successful entrepreneur.</p></blockquote>
<p><span id="more-221"></span></p>
<p>What is wrong with this argument? Intuitively, the trouble is that out of the 7 billion people on Earth, at most a handful can be so important. So even if you discovered evidence that suggested P quite strongly, you ought to remain skeptical. Even if a magic 8-ball which lied only one time in a million told you that you were the most influential person alive, you should still bet against it; after all, 7,000 people will hear this particular lie, while only one person will hear the truth. (Setting aside the fact that your mere possession of such an 8-ball constitutes much stronger than million-to-one evidence!)</p>
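<p>The counting here is simple enough to check in a couple of lines (my own restatement of the numbers above):</p>
<pre><code># 7 billion people, and an 8-ball that errs one time in a million.
people = 7_000_000_000
error_rate = 1 / 1_000_000
false_positives = people * error_rate   # about 7,000 people wrongly told they are the most influential
true_positives = 1                      # one person actually is
print(true_positives / (true_positives + false_positives))   # about 0.00014: still bet against it
</code></pre>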
<p>If you have some clever argument that you aren&#8217;t &#8220;in the same reference class&#8221; as those other 7 billion people, you need to be awfully sure that it would be difficult to manufacture that argument if you <i>weren&#8217;t</i> in fact the most influential person. If you had a 1 in a million chance of being able to delude yourself into thinking you were special, you&#8217;d still be wrong nearly 99.99% of the time. But replies the skeptic&#8230;</p>
<blockquote><p>Putting a prior probability of 1 in 7 billion on something plausible is ridiculous! After all, if the argument you just gave has even a 1% chance of being wrong, then I might have a prior probability of up to 1%! Do you think you could make 100 arguments that compelling, before you messed one of them up?</p></blockquote>
<p>One response to the situation is to say that you really are so confident, because this kind of anthropic prior improbability is a special case. I think this is probably untenable, because your reasoning really <i>isn&#8217;t</i> that good. If you had to make 7 billion independent arguments as complicated as this one, I&#8217;d be surprised if you didn&#8217;t mess up one of them on a technicality.</p>
<p>Another response to this situation is to throw up your hands and discount the possibility P as an instance of <a href="http://www.nickbostrom.com/papers/pascal.pdf">Pascal&#8217;s mugging</a>. Maybe we don&#8217;t understand <i>why</i> we shouldn&#8217;t act on the basis of such small possibilities of large upsides, but it&#8217;s intuitively obvious it would be wrong.</p>
<p>If we take the perspective of evidential or timeless decision theory, however, this problem vanishes. In these theories, we use a different decision rule: take the action which you would be happiest to learn that someone in your situation had taken. To decide what to do in situation S, compute E[ U | &#8220;in situation S I would pick action A&#8221; ] and E[U | &#8220;in situation S I would pick action B&#8221; ], and choose whichever action leads to the higher utility.</p>
<p>In this framework, we should no longer assign a non-negligible probability to being confused about anthropic questions, because such questions are never asked&#8211;the relevant properties are baked directly into the decision rule. Suppose that there are a billion people, P is true for exactly one of them, and I receive some evidence that is a million times more likely if P is true. Then I&#8217;m given the option to take some gambit, which increases U by 100 if P holds, and decreases U by 1 otherwise. Now if I am 99% sure that my basic picture about reality is correct, I can reason:</p>
<blockquote><p>In 99% of (impossible) possible worlds, there are 1000 observers with the evidence I have. P is true for one of them and not true for 999 of them. So if I choose to take the gambit, I will lose 999 utility and gain 100, which is a net loss.</p>
<p>In the remaining 1% of possible worlds, maybe it&#8217;s just me, and maybe property P is true. And in those worlds I would gain 100 utility. This is only 1 utility in expectation, which doesn&#8217;t offset the -900 from the other worlds.</p></blockquote>
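<p>Spelling out that arithmetic (my own check with the same numbers; the &#8220;-900&#8221; above is the first term below, loosely rounded):</p>
<pre><code># 99% chance the basic picture is right: 1000 observers share my evidence, and P
# holds for exactly one of them. 1% chance it is "just me" and P holds.
p_normal = 0.99
observers = 1000
gain_if_P, loss_otherwise = 100, 1

# Expected total utility, summed over everyone in my situation, if we all take the gambit:
normal_worlds = p_normal * (1 * gain_if_P - (observers - 1) * loss_otherwise)   # 0.99 * (100 - 999)
weird_worlds = (1 - p_normal) * gain_if_P                                       # 0.01 * 100 = 1
print(normal_worlds + weird_worlds)   # about -889: the gambit still looks bad
</code></pre>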
<p>Of course, I&#8217;ve swept a few important things under the rug; most importantly I&#8217;ve assumed that U is non-indexical. (It works fine if U = &#8220;total # of happy years of life&#8221; or U = &#8220;total # of happy years of life for people with my experiences so far&#8221; or so on. But if U = &#8220;# of happy years of life <i>I</i> have&#8221; then it is going to come down to anthropic questions in the definition of &#8220;I&#8221;.)</p>
<h3>The original Pascal&#8217;s mugging</h3>
<p>Incidentally, Pascal&#8217;s mugging is structurally identical to the argument we just discussed. Nick Bostrom describes an unarmed mugger who approaches M. Pascal:</p>
<blockquote><p>Mugger: Let us say that the 10 livres that you have in your wallet are worth to you the equivalent of one happy day. Let’s call this quantity of good 1 Util. So I ask you to give up 1 Util. In return, I could promise to perform the magic tomorrow that will give you an extra 10 quadrillion happy days, i.e. 10 quadrillion Utils. Since you say there is a 1 in 10 quadrillion probability that I will fulfill my promise, this would be a fair deal. The expected Utility for you would be zero. But I feel generous this evening, and I will make you a better deal: If you hand me your wallet, I will perform magic that will give you an extra 1,000 quadrillion happy days of life.</p>
<p>Pascal: I admit I see no flaw in your mathematics.</p></blockquote>
<p>Whatever clever argument we might suggest Pascal could use to decide that the mugger&#8217;s offer is unattractive, the mugger could always ask: &#8220;But surely, M. Pascal, there is some chance that you are mistaken?&#8221; This seems to be something of a reductio against unbounded utility. Robin Hanson is reported to have observed that, in any world large enough to contain 10 quadrillion (or whatever number) of valued objects, there are (reasonably likely to be) a comparable number of observers; most of those who believe that they have the power to create or destroy so much value must be deluded. But more importantly, each person who is so deluded could create a constant amount of value themselves. So my large EV from the possibility I&#8217;m not deluded is balanced by my large EV from controlling more folks&#8217; actions if I am deluded. And now the situation is transparently the same as with our proposition P. We don&#8217;t need to assign any probabilities near 1 to avoid the trouble.</p>
<p>(Of course, you are still going to run into divergent sums if you accept the kind of arguments Pascal does in the example, which seems to be a <a href="http://singularity.org/files/Convergence-EU.pdf">fundamental problem</a> with unbounded utilities. But Pascal&#8217;s mugging is already a problem if you take some mind-boggling upper-bound on the size of the universe, and at least this works then.)</p>]]></content:encoded>
			<wfw:commentRss>https://ordinaryideas.wordpress.com/2012/12/05/are-you-special-pascals-wager-anthropic-reasoning-and-decision-theory/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
	
		<media:content url="https://0.gravatar.com/avatar/0a58766009bacac271106a2f92289d90?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">paulfchristiano</media:title>
		</media:content>
	</item>
	</channel>
</rss>
