<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	xmlns:georss="http://www.georss.org/georss" xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#" xmlns:media="http://search.yahoo.com/mrss/"
	>

<channel>
	<title>Ordinary Ideas &#187; Formal Definitions</title>
	<atom:link href="http://ordinaryideas.wordpress.com/category/ai-safety/formal-definitions/feed/" rel="self" type="application/rss+xml" />
	<link>http://ordinaryideas.wordpress.com</link>
	<description>As advertised</description>
	<lastBuildDate>Wed, 31 Dec 2014 03:26:34 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.com/</generator>
<cloud domain='ordinaryideas.wordpress.com' port='80' path='/?rsscloud=notify' registerProcedure='' protocol='http-post' />
<image>
		<url>http://s2.wp.com/i/buttonw-com.png</url>
		<title>Ordinary Ideas &#187; Formal Definitions</title>
		<link>http://ordinaryideas.wordpress.com</link>
	</image>
	<atom:link rel="search" type="application/opensearchdescription+xml" href="http://ordinaryideas.wordpress.com/osd.xml" title="Ordinary Ideas" />
	<atom:link rel='hub' href='http://ordinaryideas.wordpress.com/?pushpress=hub'/>
	<item>
		<title>The motivated simulator argument</title>
		<link>http://ordinaryideas.wordpress.com/2012/05/24/solomonoff-induction-and-simulations/</link>
		<comments>http://ordinaryideas.wordpress.com/2012/05/24/solomonoff-induction-and-simulations/#comments</comments>
		<pubDate>Thu, 24 May 2012 16:56:22 +0000</pubDate>
		<dc:creator><![CDATA[paulfchristiano]]></dc:creator>
				<category><![CDATA[AI Safety]]></category>
		<category><![CDATA[Formal Definitions]]></category>
		<category><![CDATA[Priors]]></category>
		<category><![CDATA[Solomonoff Induction]]></category>

		<guid isPermaLink="false">http://ordinaryideas.wordpress.com/?p=214</guid>
		<description><![CDATA[Sometimes we may want to formally talk about objects that actually exist in the world, using mathematical language. One way to try to do this is by recording some sequence of observations about the world, and then applying Solomonoff induction. The hope would then be: if you apply Solomonoff induction to the sequence of things [&#8230;]]]></description>
				<content:encoded><![CDATA[<p>Sometimes we may want to formally talk about objects that actually exist in the world, using mathematical language. One way to try to do this is by recording some sequence of observations about the world, and then applying Solomonoff induction. The hope would then be: if you apply Solomonoff induction to the sequence of things you&#8217;ve seen so far, it will correctly predict whatever you next see. In this post I&#8217;ll describe a problem with applying this approach to anything particularly important.</p>
<p>If we look at the <i>simplest</i> programs that account for your observations, once you&#8217;ve observed enough things (say, many more observations than there are bits in the description complexity of the universe), Solomonoff induction finds predictors of the form, &#8220;Simulate the universe with description D. Then look inside that universe, at the spacetime trajectory T, and return the bits encoded along that trajectory (using some encoding protocol P).&#8221; Thus, Solomonoff induction approximately reduces to what is called <i>counting measure</i> over observers within our own universe&#8212;the probability that you will next observe O is roughly the fraction of observers who have seen everything you&#8217;ve seen so far and next observe O. (This isn&#8217;t quite right, because observers later in time are harder to specify&#8212;it takes more bits to write down their spacetime coordinates&#8212;and so they get slightly discounted. There are a few other issues like this, but they don&#8217;t change the basic picture.)</p>
<p>But if you apply Solomonoff induction, and you plan to <em>use the result of this computation for something <a href="http://ordinaryideas.wordpress.com/2012/04/21/indirect-normativity-write-up/">really important</a></em>, most of the probability mass is actually controlled by sequences which have been constructed artificially&#8211;either within a simulation (see the<a href="http://www.simulation-argument.com/"> usual simulation argument</a>) or just as a sequence of inputs written down in really big letters over and over again, for the precise purpose of manipulating the counting measure. So, for example, if I build an AI and try to incentivize it by saying &#8220;if you do the right thing, I will provide a reward,&#8221; and if that AI uses approximate Solomonoff induction, <i>and</i> if the AI was positioned to steal 1% of the universe by defecting (though we&#8217;ve set up things so cleverly that it would first get punished if it tried)&#8230; <i>then</i> we might expect some future folk to control that AI by building simulations of it, and rewarding the simulations if and only if the AI tries to defect and give its 1% of the world to those manipulative future folk. (Who could then&#8212;perhaps?&#8212;use this 1% of the world to manipulate the original AI.)</p>
<p>It&#8217;s a wacky situation, but I think at the kernel is an interesting phenomenon. In this post I&#8217;m just going to try to argue that this is, in fact, what would happen. To get a handle on things, I&#8217;m going to start by looking at similar situations in <a href="http://ordinaryideas.wordpress.com/2011/12/14/cellular-automata/">cellular automata</a>.</p>
<h2>Deterministic Cellular Automata</h2>
<p>Suppose that we have some (modestly complex) cellular automaton, which contains some observers doing science. The observers want to find a formal mathematical model for their physics (and, say, use it to direct an AI). So they take a sensor, put it at rest in the cellular automaton&#8217;s preferred rest frame, let it sense for a while, and then condition the universal prior on agreement with the sensor&#8217;s inputs. Then they ask: what does that distribution predict the sensor will sense next?</p>
<p>What do they hope for? They hope that the distribution will predict whatever the sensor &#8220;actually&#8221; senses. So that, for example, if I gave an AI the task of making sure the future inputs on the sensor recorded us pressing an APPROVE button, the AI would really do something that would cause us to press APPROVE.</p>
<p>Suppose the sensor&#8217;s input is X. What short programs output X? Well, if the cellular automaton is relatively simple, one simple program will be P, defined as follows. P has the laws of physics, a mechanism for translating the sensor&#8217;s raw physical state into bits, the position of the sensor and the start time hard-coded into it, and P works by running the laws of physics, looking in the spacetime locations specified, and running the specified function to transform the sensor&#8217;s physical state into bits. If the sensor is particularly simple&#8211;for example, if it runs on a regular clock, and if there is a particular cell of the cellular automaton which perfectly correlates with the sensor&#8217;s reading&#8211;then P is an extremely simple program, and generally P is much simpler than any &#8220;local&#8221; algorithm for predicting X unless X is very short or the sensor is very well isolated from the environment.</p>
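<p>(To make the shape of P concrete, here is a toy Python sketch, not from the original argument: the &#8220;laws of physics&#8221; are a one-dimensional rule-110 automaton, and the rule number, tape width, sensor position and start time are the hard-coded ingredients described above. All of the particular numbers are arbitrary.)</p>
<pre><code># Toy version of the program P described above (illustrative only).
# "Laws of physics": a one-dimensional cellular automaton (rule 110).
# "Description of the universe": a width-200 tape with a single live cell.
# "Spacetime trajectory": the fixed cell SENSOR_POS, read once per step.
# "Encoding protocol": the raw cell value, taken as one bit.

WIDTH = 200
SENSOR_POS = 95         # hard-coded sensor location
START_TIME = 20         # hard-coded time of the first observation
RULE = 110

def step(cells):
    # one step of the automaton, with fixed (always-zero) boundary cells
    out = [0] * WIDTH
    for i in range(1, WIDTH - 1):
        neighborhood = 4 * cells[i - 1] + 2 * cells[i] + cells[i + 1]
        out[i] = (RULE // (2 ** neighborhood)) % 2
    return out

def P(num_bits):
    # run physics, look at the hard-coded trajectory, return the bits
    cells = [0] * WIDTH
    cells[WIDTH // 2] = 1           # the hard-coded initial condition
    for _ in range(START_TIME):     # advance physics to the start time
        cells = step(cells)
    observations = []
    for _ in range(num_bits):       # then read the sensor once per step
        observations.append(cells[SENSOR_POS])
        cells = step(cells)
    return observations

print(P(16))    # the first 16 predicted sensor readings
</code></pre>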
<p>But for any other compactly specified sequence S of spacetime points, there is a different program Q(S) which outputs the sequence of observations embedded at those points. Each program Q(S), including P, has roughly comparable complexity; programs with larger coordinates are slightly more complicated, but not much more so&#8211;doubling the position of a point only increases its description complexity by a constant, decreasing the relevance of that point&#8217;s contributions by only a constant factor.</p>
<p>So the part of the universal distribution made up of the programs Q(S) is roughly a counting measure over observers within the cellular automata (time and space-discounted by something <em>very slightly</em> superlinear, just enough to make the sum converge). But most of the counting measure really does lie within simulations. And if someone has an incentive to manipulate such simulations, or to simply directly embed the sequence X in the universe, then <em>their motivations</em> will control the predicted continuation of X under the universal prior.</p>
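<p>(The &#8220;very slightly superlinear&#8221; discount can be made concrete with a toy calculation, which is not from the original post: weight the n-th observer location by 1/n<sup>1.01</sup>, say. The exponent is an arbitrary choice; anything strictly above 1 works.)</p>
<pre><code>EPS = 0.01    # an arbitrary small excess over linear discounting

def weight(n):
    # mass assigned to the n-th observer location (n = 1, 2, 3, ...)
    return 1.0 / n ** (1.0 + EPS)

# The partial sums are bounded above by 1 + 1/EPS (about 101 here), so the
# total mass over all locations is finite, even though convergence is slow
# and early and late observers receive nearly equal weight.
print(sum(weight(n) for n in range(1, 10 ** 6)))
</code></pre>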
<p>It seems that this may be a deal-breaker when considering the universal prior for any application of this form. If you build an AI that wants the approve button to be pressed and you use this technique to define its utility function, you aren&#8217;t going to get an agent that cares about the &#8220;real&#8221; world. Instead, you will get an agent that cares about all of the simulators which have motivations to mess with it. Maybe that leads to good outcomes&#8211;it just depends on whether values we like control most of the available computational resources (and how much of their resources would be wasted fighting these sorts of fights).</p>
<h4>Complications</h4>
<p>If the sensor in question is harder to specify&#8211;whether because it is moving, irregularly timed, or interpreted inconsistently&#8211;this technically makes the above argument stronger. After all, then P is contributing less to the universal prior.</p>
<p>However, it may be that <em>every</em> embedding of X in the universe is rather hard to locate, so that even an incentivized agent cannot really write down X in a way that can be picked out by a simple program. If these difficulties were severe enough, then programs Q(S) may collectively not contribute much to the measure of X. Instead, most of the mass may come from simpler &#8220;local&#8221; prediction algorithms, like those humans use to predict/compress their environment, which are ignorant about the exact state of the rest of the universe but fill in the details by guessing.</p>
<p>But this seems unlikely. The basic problem is that the difficulty of picking out an input sequence seems <em>tiny</em> compared to the difficulty of filling in all of the unknown details of the universe. The extra complexity of specifying timing info, movements of the substrate in which X is written, and whatever messy algorithm is necessary to interpret X, seem like they must be at most a modest constant (independent of length of X); once the length of X is significantly larger than that constant, the losses from guesswork will greatly exceed this constant, unless the sensor is very well isolated from the rest of the environment. Intuitively, the sensor is (at a minimum) well enough connected to the environment if its observations are enough to infer many properties of the outside world (for example, if the sensor saw enough to infer something about an observer&#8217;s knowledge of cosmology, this would give an advantage to models at large enough scales to account for those cosmological features).</p>
<h2>Randomized Automata</h2>
<p>For randomized automata the case is rather different. In order to pick out a sequence of observations, generically, we need to provide not only coordinates but also the entire random seed. The necessary random seed is not just long, it&#8217;s much longer than any input sequence X which we might plausibly extract from the world (and more randomness is injected at each step, which guarantees that you always need more randomness than the information you get out). So each program Q(S) is contributing rather little to the mass of X under the universal prior&#8211;the probability that the sequence of observations at S would be equal to X.</p>
<p>I strongly suspect, however, that we reach the same conclusions in the randomized case as in the deterministic case.</p>
<p>What simple programs other than Q(S) explain X? As above, the natural candidates are coarse &#8220;local&#8221; explanations, which may maintain a detailed model for the local environment, but not the rest of the universe. But it seems unlikely that the randomness of the underlying cellular automaton really favors such explanations. Indeed, consider some local observations distributed under the real cellular automaton. There are two cases: (1) The distribution of observations under the local predictor is the same. In this case, the local predictor is losing out for the same reason that local predictors lose out in deterministic phenomena. (2) The distribution of observations under the local predictor is different. In this case, by definition, the average log probability assigned by the real distribution is better than that of the local predictor, so the local predictor does worse than in case (1).</p>
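<p>(Case (2) is just the familiar fact that the true distribution maximizes expected log probability; any predictor which disagrees with it does strictly worse on average. A quick numerical illustration, with arbitrary made-up distributions:)</p>
<pre><code>import math

# p is the true distribution over some local observation; q is a different
# "local predictor". The expected log probability under the truth is
# maximized when the predictor equals the truth.
p = [0.7, 0.2, 0.1]
q = [0.5, 0.3, 0.2]

def expected_log_prob(truth, predictor):
    return sum(t * math.log(pr) for t, pr in zip(truth, predictor))

print(expected_log_prob(p, p))    # about -0.802
print(expected_log_prob(p, q))    # about -0.887, strictly worse
</code></pre>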
<p>What other programs might there be? I don&#8217;t know; I can&#8217;t think of any. Generically it will be impossible to rule out the possibility of some other, still shorter, algorithms to account for our observations, but in this case I don&#8217;t know why we would expect one to exist.</p>
<p>I think the sense that something tricky is going on here (at least, I have that sense) is coming from different intuitions being applied to &#8220;coordinates&#8221; in some exponentially large probability distribution vs. uncertainty in an algorithm like the ones humans use. Of course the two are precisely equivalent, but it feels like locating myself in a universe parametrized by an astronomically large random seed is very hard, while it feels like my local prediction algorithm is pretty good.</p>
<p>The case for our universe seems to be broadly analogous to that of a general randomized cellular automaton. Yes, we are less sure about how we might encode a program which extracts observation sequences from the world, but it would be surprising if these obstructions were insurmountable, for the same reasons as given in &#8216;Complications&#8217; above.</p>
<h2>Avoiding the Trouble</h2>
<p>The obvious way to avoid this sort of thing is to avoid the universal prior. I mentioned <a href="http://ordinaryideas.wordpress.com/2011/12/21/avoiding-simulation-warfare-with-bounded-complexity-measures/">before</a> the possibility of using a prior which penalizes algorithms which use a lot of time or a lot of space. This seems to rule out these particular problematic models, and force you to use a coarse local model. It is probably worth thinking a bit about how those coarse local models generically perform (say, in cellular automata).</p>
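<p>(One crude way to write down such a prior, purely for illustration and not taken from the linked post: charge each program for the time and space it uses, on top of its description length. The exchange rate of one bit of penalty per step and per cell is an arbitrary choice; the only point is that simulating an entire universe becomes astronomically expensive, while coarse local models stay cheap.)</p>
<pre><code>def penalized_complexity(length_in_bits, time_steps, space_cells):
    # effective complexity: description length plus resource usage,
    # with an arbitrary one-bit-per-step, one-bit-per-cell exchange rate
    return length_in_bits + time_steps + space_cells

def prior_weight(length_in_bits, time_steps, space_cells):
    # weight of the program under the resource-penalized prior
    return 2.0 ** (-penalized_complexity(length_in_bits, time_steps, space_cells))
</code></pre>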
<p>It seems like there should be a more philosophically sophisticated way to point to &#8220;basements&#8221; than this, though; using space-bounded models feels like a bit of a hack, and throws out a lot of information. What we really care about, in general, is some measure of &#8220;generic&#8221; decision-theoretic relevance (or relevance to our personal values). We care about non-simulated copies of ourselves because they have much more influence on the world. Maybe there is some way to access this measure directly. Alternatively, it may be possible to bargain with agents as a way of explicitly eliciting predictions with respect to this measure&#8212;maybe you can&#8217;t define what &#8220;important&#8221; means directly, but you can define it implicitly by making sure those agents also care more about worlds which aren&#8217;t simulated. But this all seems to be surprisingly difficult. I think there is a very small probability that this captures a rather serious obstruction to defining things formally. (Perhaps it is easy to define something well enough when no one is looking, but once anyone cares whatever definition you use is going to get manipulated. If something like this were really the case, it would create systems with very exotic failure modes.)</p>]]></content:encoded>
			<wfw:commentRss>http://ordinaryideas.wordpress.com/2012/05/24/solomonoff-induction-and-simulations/feed/</wfw:commentRss>
		<slash:comments>8</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/0a58766009bacac271106a2f92289d90?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">paulfchristiano</media:title>
		</media:content>
	</item>
		<item>
		<title>A formalization of indirect normativity</title>
		<link>http://ordinaryideas.wordpress.com/2012/04/21/indirect-normativity-write-up/</link>
		<comments>http://ordinaryideas.wordpress.com/2012/04/21/indirect-normativity-write-up/#comments</comments>
		<pubDate>Sat, 21 Apr 2012 02:51:12 +0000</pubDate>
		<dc:creator><![CDATA[paulfchristiano]]></dc:creator>
				<category><![CDATA[AI Safety]]></category>
		<category><![CDATA[Formal Definitions]]></category>

		<guid isPermaLink="false">http://ordinaryideas.wordpress.com/?p=121</guid>
		<description><![CDATA[This post outlines a formalization of what Nick Bostrom calls &#8220;indirect normativity.&#8221; I don&#8217;t think it&#8217;s an adequate solution to the AI control problem; but to my knowledge it was the first precise specification of a goal that meets the &#8220;not terrible&#8221; bar, i.e. which does not lead to terrible consequences if pursued without any caveats [&#8230;]]]></description>
				<content:encoded><![CDATA[<p dir="ltr">This post outlines a formalization of what Nick Bostrom calls &#8220;indirect normativity.&#8221; I don&#8217;t think it&#8217;s an adequate solution to the AI control problem; but to my knowledge it was the first precise specification of a goal that meets the &#8220;not terrible&#8221; bar, i.e. which does not lead to terrible consequences if pursued without any caveats or restrictions.</p>
<h1>Introduction</h1>
<p dir="ltr">When faced with the challenge of writing down precise moral principles, adhering to the standards demanded in mathematics, moral philosophers encounter two serious difficulties:</p>
<ul>
<li>Basic notions, like “agent,” “act,” and “motive” are themselves almost inherently imprecise. More complex concepts like “well-being,” “possible world,” and “honesty” are even less approachable from a formal perspective.</li>
<li>Even expressed in terms of these notions, satisfactory moral principles have proven incredibly elusive. Committing ourselves to any simple, exceptionless theory has proved incompatible with basic ethical intuitions.</li>
</ul>
<p dir="ltr">In light of these difficulties, a moral philosopher might simply declare: “It is not my place to aspire to mathematical standards of precision. Ethics as a project inherently requires shared language, understanding, and experience; it becomes impossible or meaningless without them.”</p>
<p>This may be a defensible philosophical position, but unfortunately the issue is not entirely philosophical. In the interest of building institutions or machines which reliably pursue what we value, we may one day be forced to describe precisely “what we value” in a way that does not depend on charitable or “common sense” interpretation (in the same way that we today must describe “what we want done” precisely to computers, often with considerable effort). If some aspects of our values cannot be described formally, then it may be more difficult to use institutions or machines to reliably satisfy them. This is not to say that describing our values formally is necessary to satisfying them, merely that it might make it easier.</p>
<p>Since we are focusing on finding any precise and satisfactory moral theory, rather than resolving disputes in moral philosophy, we will adopt a consequentialist approach without justification and focus on axiology. Moreover, we will begin from the standpoint of expected utility maximization, and leave aside questions about how or over what space the maximization is performed.</p>
<p>We aim to mathematically define a utility function U such that we would be willing to build a hypothetical machine which exceptionlessly maximized U, possibly at the catastrophic expense of any other values. We will assume that the machine has an ability to reason which at least rivals that of humans, and is willing to tolerate arbitrarily complex definitions of U (within its ability to reason about them).</p>
<p>We adopt an indirect approach. Rather than specifying what exactly we want, we specify a process for determining what we want. This process is extremely complex, so that any computationally limited agent will always be uncertain about the process’ output. However, by reasoning about the process it is possible to make judgments about which action has the highest expected utility in light of this uncertainty.</p>
<p>For example, I might adopt the principle: “a state of affairs is valuable to the extent that I would judge it valuable after a century of reflection.” In general I will be uncertain about what I would say after a century, but I can act on the basis of my best guesses: after a century I will probably prefer worlds with more happiness, and so today I should prefer worlds with more happiness. After a century I have only a small probability of valuing trees’ feelings, and so today I should go out of my way to avoid hurting them if it is either instrumentally useful or extremely easy. As I spend more time thinking, my beliefs about what I would say after a century may change, and I will start to pursue different states of affairs even though the formal definition of my values is static. Similarly, I might desire to think about the value of trees’ feelings, if I expect that my opinions are unstable: if I spend a month thinking about trees, my current views will then be a much better predictor of my views after a hundred years, and if I know better whether or not trees’ feelings are valuable, I can make better decisions.</p>
<p>This example is quite informal, but it communicates the main idea of the approach. We stress that the value of our contribution, if any, is in the possibility of a precise formulation. (Our proposal itself will be relatively informal; instead it is a description of how you would arrive at a precise formulation.) The use of indirection seems to be necessary to achieve the desired level of precision.</p>
<h1>The Proposal</h1>
<p dir="ltr">Our proposal contains only two explicit steps:</p>
<ol>
<li>Obtain a precise mathematical characterization of a particular human’s brain.</li>
<li>Precisely define a completely abstract environment containing an idealized and unbounded computer. Adopt whatever utility function the human would decide on, if allowed free rein in this abstract environment. (With the technical modification that the utility is restricted to take on values between 0 and 1.)</li>
</ol>
<p dir="ltr">Each of these steps requires substantial elaboration, but we must also specify what we expect the human to do with these tools.</p>
<ol start="3">
<li>Build a community of copies of herself and an idealized environment for interaction. Deliberate on critical scientific and philosophical questions, and decide how to proceed (possibly in the spirit of what follows).</li>
<li>Search through the space of all possible worlds to find a simulation of our own. By interacting with this simulation, incorporate simulations of many living humans into her environment.</li>
<li>Continue to make scientific and philosophical progress, as safely as possible. Possibly engage in principled self-modification or understand and build machine intelligences which reflect relevant ethical principles.</li>
<li>Output a utility function. This function may be a compact set of ethical principles, or it may require many simulated humans or machine intelligences to apply their own (radically modified) common sense to evaluate states of affairs on a case-by-case basis.</li>
</ol>
<p dir="ltr">This proposal is best understood in the context of other fantastic-seeming proposals, such as “my utility is whatever I would write down if I reflected for a thousand years without interruption or biological decay.” The counterfactual events which take place within the definition are far beyond the realm our intuition recognizes as “realistic,” and have no place except in thought experiments. But to the extent that we can reason about these counterfactuals and change our behavior on the basis of that reasoning (if so motivated), we can already see how such fantastic situations could affect our more prosaic reality.</p>
<p>The remainder of this document consists of brief elaboration of some of these steps, and a few arguments about why this is a desirable process.</p>
<h1>Brain Emulation</h1>
<p dir="ltr">The first step of our proposal is a high-fidelity mathematical model of human cognition. We will set aside philosophical troubles, and assume that the human brain is a purely physical system which may be characterized mathematically. Even granting this, it is not clear how we can realistically obtain such a characterization.</p>
<p>The most obvious approach to characterizing a brain is to combine measurements of its behavior or architecture with an understanding of biology, chemistry, and physics. This project represents a massive engineering effort which is currently just beginning. Most pessimistically, our proposal could be postponed until this project’s completion. This could still be long before the mathematical characterization of the brain becomes useful for running experiments or automating human activities: because we are interested only in a definition, we do not care about having the computational resources necessary to simulate the brain.</p>
<p>An impractical mathematical definition, however, may be much easier to obtain. We can define a model of a brain in terms of exhaustive searches which could never be practically carried out. For example, given some observations of a neuron, we can formally define a brute force search for a model of that neuron. Similarly, given models of individual neurons we may be able to specify a brute force search over all ways of connecting those neurons which account for our observations of the brain (say, some data acquired through functional neuroimaging).</p>
<p>It may be possible to carry out this definition without exploiting any structural knowledge about the brain, beyond what is necessary to measure it effectively. By collecting imaging data for a human exposed to a wide variety of stimuli, we can recover a large corpus of data which must be explained by any model of a human brain. Moreover, by using our explicit knowledge of human cognition we can algorithmically generate an extensive range of tests which identify a successful simulation, by probing responses to questions or performance on games or puzzles.</p>
<p>In fact, this project may be possible using existing resources. The complexity of the human brain is not as unapproachable as it may at first appear: though it may contain 10<sup>14</sup> synapses, each described by many parameters, it can be specified much more compactly. A newborn’s brain can be specified by about 10<sup>9</sup> bits of genetic information, together with a recipe for a physical simulation of development. The human brain appears to form new long-term memories at a rate of 1-2 bits per second, suggesting that it may be possible to specify an adult brain using 10<sup>9</sup> additional bits of experiential information. This suggests that it may require only about 10<sup>10</sup> bits of information to specify a human brain, which is at the limits of what can be reasonably collected by existing technology for functional neuroimaging.</p>
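<p>(The arithmetic behind these estimates is easy to check. The figures below, 2 bits per second of long-term memory and a 30-year-old subject, are just the rough numbers used above.)</p>
<pre><code># Rough arithmetic for the estimates above (orders of magnitude only).
genetic_bits = 10 ** 9                  # information specifying a newborn brain

seconds_per_year = 365 * 24 * 60 * 60   # about 3.2 * 10**7
memory_rate = 2                         # bits per second, upper end of 1-2
age_in_years = 30                       # an arbitrary adult subject

experiential_bits = memory_rate * seconds_per_year * age_in_years
total_bits = genetic_bits + experiential_bits

print(experiential_bits)    # about 1.9 * 10**9
print(total_bits)           # about 3 * 10**9, comfortably within 10**10
</code></pre>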
<p>This discussion has glossed over at least one question: what do we mean by ‘brain emulation’? Human cognition does not reside in a physical system with sharp boundaries, and it is not clear how you would define or use a simulation of the “input-output” behavior of such an object.</p>
<p>We will focus on some system which does have precisely defined input-output behavior, and which captures the important aspects of human cognition. Consider a system containing a human, a keyboard, a monitor, and some auxiliary instruments, well-insulated from the environment except for some wires carrying inputs to the monitor and outputs from the keyboard and auxiliary instruments (and wires carrying power). The inputs to this system are simply screens to be displayed on the monitor (say delivered as a sequence to be displayed one after another at 30 frames per second), while the outputs are the information conveyed from the keyboard and the other measuring apparatuses (also delivered as a sequence of data dumps, each recording activity from the last 30th of a second).</p>
<p>This “human in a box” system can be easily formally defined if a precise description of a human brain and coarse descriptions of the human body and the environment are available. Alternatively, the input-output behavior of the human in a box can be directly observed, and a computational model constructed for the entire system. Let H be a mathematical definition of the resulting (randomized) function from input sequences (In(1), In(2), …, In(K)) to the next output Out(K). H is, by design, a good approximation to what the human “would output” if presented with any particular input sequence.</p>
<p>Using H, we can mathematically define what “would happen” if the human interacted with a wide variety of systems. For example, if we deliver Out(K) as the input to an abstract computer running some arbitrary software, and then define In(K+1) as what the screen would next display, we can mathematically define the distribution over transcripts which would have arisen if the human had interacted with the abstract computer. This computer could be running an interactive shell, a video game, or a messaging client.</p>
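<p>(In code, H is just a function from the input prefix to the next output, and &#8220;interacting with software&#8221; is a short loop feeding H&#8217;s outputs to the software and the software&#8217;s screens back to H. Everything in the sketch below is a stand-in: the real H is a randomized function given by an impractical mathematical definition, not something we could write out.)</p>
<pre><code>def H(inputs):
    # Stand-in for the brain emulation: given the screens shown so far
    # (In(1), ..., In(K)), return the next output Out(K).
    return "keystrokes for frame " + str(len(inputs))

def interact(software, num_frames):
    # "What would happen" if the human interacted with the given software.
    inputs, transcript = [], []
    screen = software(None)                 # the software's first screen
    for _ in range(num_frames):
        inputs.append(screen)
        out = H(inputs)                     # Out(K), computed from In(1..K)
        transcript.append((screen, out))
        screen = software(out)              # In(K+1), produced by the software
    return transcript

# Example "abstract computer": software that simply echoes the last output.
print(interact(lambda out: "screen showing: " + str(out), num_frames=3))
</code></pre>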
<p>Note that H reflects the behavior of a particular human, in a particular mental state. This state is determined by the process used to design H, or the data used to learn it. In general, we can control H by choosing an appropriate human and providing appropriate instructions / training. More emulations could be produced by similar measures if necessary. Using only a single human may seem problematic, but we will not rely on this lone individual to make all relevant ethical judgments. Instead, we will try to select a human with the motivational stability to carry out the subsequent steps faithfully, which will define U using the judgment of a community consisting of many humans.</p>
<p>This discussion has been brief and has necessarily glossed over several important difficulties. One difficulty is the danger of using computationally unbounded brute force search, given the possibility of short programs which exhibit goal-oriented behavior. Another difficulty is that, unless the emulation project is extremely conservative, the models it produces are not likely to be fully-functional humans. Their thoughts may be blurred in various ways, they may be missing many memories or skills, and they may lack important functionalities such as long-term memory formation or emotional expression. The scope of these issues depends on the availability of data from which to learn the relevant aspects of human cognition. Realistic proposals along these lines will need to accommodate these shortcomings, relying on distorted emulations as a tool to construct increasingly accurate models.</p>
<h1>The Virtual Environment</h1>
<p dir="ltr">For any idealized “software”, with a distinguished instruction return, we can use H to mathematically define the distribution over return values which would result, if the human were to interact with that software. We will informally define a particular program T which provides a rich environment, in which the remainder of our proposal can be implemented. From a technical perspective this will be the last step of our proposal. The remaining steps will be reflected only in the intentions and behavior of the human being simulated in H.</p>
<p>Fix a convenient and adequately expressive language (say a dialect of Python designed to run on an abstract machine). T implements a standard interface for an interactive shell in this language: the user can look through all of the past instructions that have been executed and their return values (rendered as strings) or execute a new instruction. We also provide symbols representing H and T themselves (as functions from sequences of K inputs to a value for the Kth output). We also provide some useful information (such as a snapshot of the Internet, and some information about the process used to create H and T), which we encode as a bit string and store in a single environment variable data. We assume that our language of choice has a return instruction, and we have T return whenever the user executes this instruction. Some care needs to be taken to define the behavior if T enters an infinite loop&#8211;we want to minimize the probability that the human accidentally hangs the terminal, with catastrophic consequences, but we cannot provide a complete safety-net without running into unresolvable issues with self-reference.</p>
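<p>(A minimal sketch of T, with Python&#8217;s own eval standing in for the idealized interpreter; the names get_instruction, environment and the placeholder contents of data are assumptions made for illustration, and there is no real protection here against hanging the terminal, echoing the difficulty noted above.)</p>
<pre><code>def T(get_instruction, environment):
    # get_instruction stands in for H: it maps the transcript of past
    # instructions and their values to the next instruction to execute.
    transcript = []
    while True:
        instruction = get_instruction(transcript)
        if instruction.startswith("return "):
            # the distinguished return instruction ends the interaction;
            # its value is what defines U
            return eval(instruction[len("return "):], environment)
        try:
            value = eval(instruction, environment)    # no real safety net here
        except Exception as error:
            value = error
        transcript.append((instruction, repr(value)))

# The environment exposes symbols for H and T themselves, plus the data string.
environment = {
    "H": None,    # placeholder for the emulation itself
    "T": T,
    "data": "snapshot of the Internet, etc.",
}
print(T(lambda transcript: "return 0", environment))   # a lazy user returns 0
</code></pre>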
<p>We define U to be the value returned by H interacting with T. If H represented an unfortunate mental state, then this interaction could be short and unproductive: the simulated human could just decide to type ‘return 0’ and be done with it. However, by choosing an appropriate human to simulate and inculcating an appropriate mental state, we can direct the process further.</p>
<p>We intend for H to use the resources in T to initiate a larger deliberative process. For example, the first step of this process may be to instantiate many copies of H, interacting with variants of messaging clients which are in contact with each other. The return value from the original process could then be defined as the value returned by a designated ‘leader’ from this community, or as a majority vote amongst the copies of H, or so on.  Another step might be to create appropriate realistic virtual environments for simulated brains, rather than confining them to boxes. For motivational stability, it may be helpful to design various coordination mechanisms, involving frameworks for interaction, “cached” mental states which are frequently re-instantiated, or sanity checks whereby one copy of H monitors the behavior of another.</p>
<p>The resulting communities of simulated brains then engage in a protracted planning process, ensuring that subsequent steps can be carried out safely or developing alternative approaches. The main priority of this community is to reduce the probability of errors as far as possible (exactly what constitutes an ‘error’ will be discussed at more length later). At the end of this process, we obtain a formal definition of a new protocol H+, which submits its inputs for consideration to a large community and then produces its outputs using some deliberation mechanism (democratic vote, one leader using the rest of the community as advisors, etc.)</p>
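<p>(The aggregation step can be sketched directly; the version below uses a simple majority vote among copies of H, one of the mechanisms mentioned above, and omits all of the messaging machinery the copies would actually use to deliberate.)</p>
<pre><code>from collections import Counter

def make_H_plus(H, community_size):
    # Build a protocol H+ that submits each input to many copies of H and
    # aggregates their outputs by majority vote (illustrative only).
    def H_plus(inputs):
        votes = []
        for copy_index in range(community_size):
            # each copy sees the shared inputs, plus its own index
            votes.append(H([("you are copy", copy_index)] + list(inputs)))
        winner, _ = Counter(votes).most_common(1)[0]
        return winner
    return H_plus
</code></pre>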
<h1>Finding Earth</h1>
<p dir="ltr">The next step requires our community of simulated brains to construct a detailed simulation of Earth which they can observe and manipulate. Once they have such a simulation, they have access to all of the data which would have been available on Earth. In particular, they can now explore many possible futures and construct simulations for each living human.</p>
<p>In order to locate Earth, we will again leverage an exhaustive search. First, H+ decides on informal desiderata for an “Earth simulation.” These are likely to be as follows:</p>
<ul>
<li>Comprehensible: the code for the simulation should relate to the best understanding of physics which H+ can develop. This may require operating at a coarser level than the “real” physics underlying our world, and it probably requires structuring the code for the simulation in a way that is not maximally efficient. This requirement is designed to ensure that the ‘views’ and ‘interventions’ produced below act in the expected way when they are tested for accuracy, so that generalization to new situations will work as expected.</li>
<li>Supports viewing: H+ should decide on an event to use as a landmark (such as a particular data entry process involved in the definition of H), and search for simulations which support viewing scenes defined relative to that landmark. For example, they might require the simulation to provide a function View(x, y, z, t), which provides a “human’s view” from a point specified by the coordinates (x, y, z, t), defined relative to the landmark.</li>
<li>Accurate: the simulation should consistently conform to expectations about Earth. The history of the individual simulated in H should conform to H’s simulated memories; the simulation should follow the same broad historical course as described in data and remembered by H; all of the other data (for example, the complete snapshot of the Internet) should also be in agreement. Perhaps most importantly, the details of whatever system’s behavior depends on U should be recorded as part of the data.</li>
<li>Supports intervention: H+ should decide on a particular class of interventions (such as non-physically setting the output of a particular invocation of one program to be equal to S), and search for simulations which support intervening in that way. For example, they might require the simulation to provide a function Modify(S) which causes all future View() instructions to show what would have happened if the intervention corresponding to S had been performed.</li>
</ul>
<p dir="ltr">Once H+ has decided on the desiderata, it uses a brute force search to find a simulation satisfying them: for each possible program it instantiates a new copy of H+ tasked with evaluating whether that program is an acceptable simulation. We then define E to be a uniform distribution over programs which pass this evaluation.</p>
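<p>(Schematically, and purely as illustration: both arguments below are stand-ins, since nothing remotely this expensive could actually be run.)</p>
<pre><code>import random

def acceptable_simulations(enumerate_candidate_programs, acceptable_to_H_plus):
    # The exhaustive search described above: a fresh copy of H+ checks each
    # candidate against the desiderata (comprehensible, supports viewing,
    # accurate, supports intervention).
    accepted = []
    for program in enumerate_candidate_programs():
        if acceptable_to_H_plus(program):
            accepted.append(program)
    return accepted

def sample_from_E(enumerate_candidate_programs, acceptable_to_H_plus):
    # E is the uniform distribution over programs passing the evaluation.
    candidates = acceptable_simulations(enumerate_candidate_programs,
                                        acceptable_to_H_plus)
    return random.choice(candidates)
</code></pre>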
<p>We might have doubts about whether this process produces the “real” Earth&#8211;perhaps even once we have verified that it is identical according to a laundry list of measures, it may still be different in other important ways. There are two reasons why we might care about such differences. First, if the simulated Earth has a substantially different set of people than the real Earth, then a different set of people will be involved in the subsequent decision making. If we care particularly about the opinions of the people who actually exist (which the reader might well, being amongst such people!) then this may be unsatisfactory. Second, if events transpire significantly differently on the simulated Earth than the real Earth, value judgments designed to guide behavior appropriately in the simulated Earth may lead to less appropriate behaviors in the real Earth. (This will not be a problem if our ultimate definition of U consists of universalizable ethical principles, but we will see that U might take other forms.)</p>
<p>These concerns are addressed by a few broad arguments. First, checking a detailed but arbitrary ‘laundry list’ actually provides a very strong guarantee. For example, if this laundry list includes verifying a snapshot of the Internet, then every event or person documented on the Internet must exist unchanged, and every keystroke of every person composing a document on the Internet must not be disturbed. If the world is well interconnected, then it may be very difficult to modify parts of the world without having substantial effects elsewhere, and so if a long enough arbitrary list of properties is fixed, we expect nearly all of the world to be the same as well. Second, if the essential character of the world is fixed but details are varied, we should expect the sort of moral judgments reached by consensus to be relatively constant. Finally, if the system whose behavior depends on these moral judgments is identical between the real and simulated worlds, then outputting a U which causes that system to behave a certain way in the simulated world will also cause that system to behave that way in the real world.</p>
<p>Once H+ has defined a simulation of the world which permits inspection and intervention, by careful trial and error H+ can inspect a variety of possible futures. In particular, they can find interventions which cause the simulated human society to conduct a real brain emulation project and produce high-fidelity brain scans for all living humans.</p>
<p>Once these scans have been obtained, H+ can use them to define U as the output of a new community, H++, which draws on the expertise of all living humans operating under ideal conditions. There are two important degrees of flexibility: how to arrange the community for efficient communication and deliberation, and how to delegate the authority to define U. In terms of organization, the distinction between different approaches is probably not very important. For example, it would probably be perfectly satisfactory to start from a community of humans interacting with each other over something like the existing Internet (but on abstract, secure infrastructure). More important are the safety measures which would be in place, and the mechanism for resolving differences of value between different simulated humans.</p>
<p>The basic approach to resolving disputes is to allow each human to independently create a utility function U, each bounded in the interval [0, 1], and then to return their average. This average can either be unweighted, or can be weighted by a measure of each individual’s influence in the real world, in accordance with a game-theoretic notion like the Shapley value applied to abstract games or simulations of the original world. More sophisticated mechanisms are also possible, and may be desirable. Of course these questions can and should be addressed in part by H+ during its deliberation in the previous step. After all, H+ has access to an unlimited length of time to deliberate and has infinitely powerful computational aids. The role of our reasoning at this stage is simply to suggest that we can reasonably expect H+ to discover effective solutions.</p>
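<p>(The basic dispute-resolution mechanism is easy to write down; in the sketch below the weights are left abstract, so that influence-based weights such as Shapley values would simply be supplied as the weights argument. The function names are made up for illustration.)</p>
<pre><code>def combined_utility(individual_utilities, weights=None):
    # Each individual contributes a utility function bounded in [0, 1]; the
    # overall U is their (possibly influence-weighted) average, which is then
    # itself bounded in [0, 1].
    n = len(individual_utilities)
    if weights is None:
        weights = [1.0 / n] * n             # the unweighted case
    total_weight = sum(weights)

    def U(state_of_affairs):
        weighted = sum(w * u(state_of_affairs)
                       for w, u in zip(weights, individual_utilities))
        return weighted / total_weight
    return U
</code></pre>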
<p>As when discussing discovering a brain simulation by brute force, we have skipped over some critical issues in this section. In general, brute force searches (particularly over programs which we would like to run) are quite dangerous, because such searches will discover many programs with destructive goal-oriented behaviors. To deal with these issues, in both cases, we must rely on patience and powerful safety measures.</p>
<h1>Extrapolation</h1>
<p dir="ltr">Once we have a formal description of a community of interacting humans, given as much time as necessary to deliberate and equipped with infinitely powerful computational aids, it becomes increasingly difficult to make coherent predictions about their behavior. Critically, though, we can also become increasingly confident that the outcome of their behavior will reflect their intentions. We sketch some possibilities, to illustrate the degree of flexibility available.</p>
<p>Perhaps the most natural possibility is for this community to solve some outstanding philosophical problems and to produce a utility function which directly captures their preferences. However, even if they quickly discovered a formulation which appeared to be attractive, they would still be wise to spend a great length of time and to leverage some of these other techniques to ensure that their proposed solution was really satisfactory.</p>
<p>Another natural possibility is to eschew a comprehensive theory of ethics, and define value in terms of the community’s judgment. We can define a utility function in terms of the hypothetical judgments of astronomical numbers of simulated humans, collaboratively evaluating the goodness of a state of affairs by examining its history at the atomic level, understanding the relevant higher-order structure, and applying human intuitions.</p>
<p>It seems quite likely that the community will gradually engage in self-modifications, enlarging their cognitive capacity along various dimensions as they come to understand the relevant aspects of cognition and judge such modifications to preserve their essential character. Either independently or as an outgrowth of this process, they may (gradually or abruptly) pass control to machine intelligences which they are suitably confident expresses their values. This process could be used to acquire the power necessary to define a utility function in one of the above frameworks, or understanding value-preserving self-modification or machine intelligence may itself prove an important ingredient in formalizing what it is we value. Any of these operations would be performed only after considerable analysis, when the original simulated humans were extremely confident in the desirability of the results.</p>
<p>Whatever path they take and whatever coordination mechanisms they use, eventually they will output a utility function U’. We then define U = 0 if U’ &lt; 0, U = 1 if U’ &gt; 1, and U = U’ otherwise.</p>
<h1>Desirability</h1>
<p dir="ltr">At this point we have offered a proposal for formally defining a function U. We have made some general observations about what this definition entails. But now we may wonder to what extent U reflects our values, or more relevantly, to what extent our values are served by the creation of U-maximizers. Concerns may be divided into a few natural categories:</p>
<ul>
<li>Even if the process works as intended, the ultimate intentions of the simulations within the process do not reflect our values, so we should not expect them to output a U which reflects our values.</li>
<li>The process has some chance of failing to work as intended. If it fails completely, then value will certainly be lost. If it fails with some small probability (over the stochasticity within the process, or over our uncertainty about its behavior) then the resulting utility function may be substantially altered and desirable outcomes may no longer be achieved.</li>
<li>Any real U-maximizer will be unable to actually carry out the simulation described in the definition of U, and so even if this process would produce a U reflecting our values, it is not clear how a real U-maximizer will behave.</li>
<li>Because we have started from the standpoint of bounded expected utility maximization we have ruled out the vast majority of possible value systems. It may be that there is no way to “shoehorn” our values into this framework, so that no matter what bounded U we choose, the resulting agent doesn’t satisfy our values very well.</li>
<li>Are there possible negative consequences to “passing the buck” as in this proposal?</li>
<li>Could the process itself be morally abhorrent?</li>
</ul>
<p dir="ltr">We respond to each of these objections in turn.</p>
<h2>If it Works as Intended, Will This Process Reflect our Values?</h2>
<p dir="ltr">If the process works as intended, we will reach a stage in which a large community of humans reflects on their values, undergoes a process of discovery and potentially self-modification, and then outputs its result. We may be concerned that this dynamic does not adequately capture what we value.</p>
<p>For example, we may believe that some other extrapolation dynamic captures our values, or that it is morally desirable to act on the basis of our current beliefs without further reflection, or that the presence of realistic disruptions, such as the threat of catastrophe, has an important role in shaping our moral deliberation.</p>
<p>The important observation, in the defense of our proposal, is that whatever objections we could think of today, we could think of within the simulation. If, upon reflection, we decide that too much reflection is undesirable, we can simply change our plans appropriately. If we decide that realistic interference is important for moral deliberation, we can construct a simulation in which such interference occurs, or determine our moral principles by observing moral judgments in our own world’s possible futures.</p>
<p>There is some chance that this proposal is inadequate for some reason which won’t be apparent upon reflection, but then by definition this is a fact which we cannot possibly hope to learn by deliberating now. It therefore seems quite difficult to maintain objections to the proposal along these lines.</p>
<p>One aspect of the proposal does get “locked in,” however, after being considered by only one human rather than by a large civilization: the distribution of authority amongst different humans, and the nature of mechanisms for resolving differing value judgments.</p>
<p>Here we have two possible defenses. One is that the mechanism for resolving such disagreements can be reflected on at length by the individual simulated in H. This individual can spend generations of subjective time, and greatly expand her own cognitive capacities, while attempting to determine the appropriate way to resolve such disagreements. However, this defense is not completely satisfactory: we may be able to rely on this individual to produce a very technically sound and generally efficient proposal, but the proposal itself is quite value laden and relying on one individual to make such a judgment is in some sense begging the question.</p>
<p>A second, more compelling, defense, is that the structure of our world has already provided a mechanism for resolving value disagreements. By assigning decision-making weight in a way that depends on current influence (for example, as determined by the simulated ability of various coalitions to achieve various goals), we can generate a class of proposals which are at a minimum no worse than the status quo. Of course, these considerations will also be shaped by the conditions surrounding the creation or maintenance of systems which will be guided by U&#8211;for example, if a nation were to create a U-maximizer, they might first adopt an internal policy for assigning influence on U.  By performing this decision making in an idealized environment, we can also reduce the likelihood of destructive conflict and increase the opportunities for mutually beneficial bargaining. We may have moral objections to codifying this sort of “might makes right” policy, favoring a more democratic proposal or something else entirely, but as a matter of empirical fact a more ‘cosmopolitan’ proposal will be adopted only if it is supported by those with the appropriate forms of influence, a situation which is unchanged by precisely codifying existing power structure.</p>
<p>Finally, the values of the simulations in this process may diverge from the values of the original human models, for one reason or another. For example, the simulated humans may predictably disagree with the original models about ethical questions by virtue of (probably) having no physical instantiation. That is, the output of this process is defined in terms of what a particular human would do, in a situation which that human knows will never come to pass. If I ask “What would I do, if I were to wake up in a featureless room and told that the future of humanity depended on my actions?” the answer might begin with “become distressed that I am clearly inhabiting a hypothetical situation, and adjust my ethical views to take into account the fact that people in hypothetical situations apparently have relevant first-person experience.” Setting aside the question of whether such adjustments are justified, they at least raise the possibility that our values may diverge from those of the simulations in this process.</p>
<p>These changes might be minimized, by understanding their nature in advance and treating them on a case-by-case basis (if we can become convinced that our understanding is exhaustive). For example, we could try and use humans who robustly employ updateless decision theories which never undergo such predictable changes, or we could attempt to engineer a situation in which all of the humans being emulated do have physical instantiations, and naive self-interest for those emulations aligns roughly with the desired behavior (for example, by allowing the early emulations to “write themselves into” our world).</p>
<h2>Will This Process Work as Intended?</h2>
<p dir="ltr">We can imagine many ways in which this process can fail to work as intended&#8211;the original brain emulations may fail to accurately model human behavior, the original subject may deviate from the intended plans, or simulated humans can make an error when interacting with their virtual environment which causes the process to get hijacked by some unintended dynamic.</p>
<h3>Robustness</h3>
<p>We can argue that the proposal is likely to succeed, and can bolster the argument in various ways (by reducing the number of assumptions necessary for success, building in fault-tolerance, justifying each assumption more rigorously, and so on). However, we are unlikely to eliminate the possibility of error. Therefore we need to argue that if the process fails with some small probability, the resulting values will only be slightly disturbed.</p>
<p>This is the reason for requiring U to lie in the interval [0, 1]&#8211;we will see that this restriction bounds the damage which may be done by an unlikely failure.</p>
<p>If the process fails with some small probability ε, then we can represent the resulting utility function as U = (1 &#8211; ε) U1 + ε U2, where U1 is the intended utility function and U2 is a utility function produced by some arbitrary error process. Now consider two possible states of affairs A and B such that U1(A) &gt; U1(B) + ε /(1 &#8211; ε) ≈ U1(B) + ε. Then since 0 ≤  U2 ≤ 1, we have:</p>
<p dir="ltr">U(A) = (1 &#8211; ε) U1(A) + ε U2(A) &gt; (1 &#8211; ε) U1(B) + ε ≥ (1 &#8211; ε) U1(B) + ε U2(B) = U(B)</p>
<p dir="ltr">Thus if A is substantially better than B according to U1, then A is better than B according to U. This shows that a small probability of error, whether coming from the stochasticity of our process or an agent’s uncertainty about the process’ output, has only a small effect on the resulting values.</p>
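<p>(The inequality is easy to check numerically; the numbers below are arbitrary, with U2 chosen to be as adversarial as possible.)</p>
<pre><code># Numerical check of the bound above. U1 is the intended utility function,
# U2 an arbitrary error process; both take values in [0, 1].
epsilon = 0.01
U1 = {"A": 0.60, "B": 0.55}    # A beats B by more than epsilon / (1 - epsilon)
U2 = {"A": 0.00, "B": 1.00}    # the error term favors B as strongly as it can

def U(x):
    return (1 - epsilon) * U1[x] + epsilon * U2[x]

print(U("A"), U("B"))    # about 0.594 versus 0.554: A is still preferred
</code></pre>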
<p>Moreover, the process contains humans who have access to a simulation of our world. This implies, in particular, that they have access to a simulation of whatever U-maximizing agents exist in the world, and they have knowledge of those agents’ beliefs about U. This allows them to choose U with perfect knowledge of the effects of error in these agents’ judgments.</p>
<p>In some cases this will allow them to completely negate the effect of error terms. For example, if the randomness in our process causes a perfectly cooperative community of simulated humans to “control” U with probability ⅔, and causes an arbitrary adversary to control it with probability ⅓, then the simulated humans can spend half of their mass outputting a utility function which exactly counters the effect of the adversary.</p>
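<p>(Concretely, and only as an illustration of the cancellation: suppose the intended values are U_good and the adversary outputs U_bad. If the cooperating community devotes half of its two-thirds share to outputting 1 &#8211; U_bad, the adversary&#8217;s contribution collapses to a constant and the remaining third carries U_good unchanged.)</p>
<pre><code># Illustrative check of the cancellation argument above.
def U_good(x):
    return x["happiness"]          # stand-in for the intended values

def U_bad(x):
    return x["paperclips"]         # stand-in for the adversary's values

def U(x):
    # 2/3 of the mass is controlled by the community, 1/3 by the adversary.
    # The community spends half of its mass on U_good and half on 1 - U_bad.
    community = (1.0 / 3) * U_good(x) + (1.0 / 3) * (1 - U_bad(x))
    adversary = (1.0 / 3) * U_bad(x)
    return community + adversary   # equals U_good(x) / 3 plus a constant 1/3

print(U({"happiness": 0.9, "paperclips": 0.2}))    # about 0.633
print(U({"happiness": 0.9, "paperclips": 0.8}))    # the same: U_bad cancels out
</code></pre>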
<p>In general, the situation is not quite so simple: the fraction of mass controlled by any particular coalition will vary as the system’s uncertainty about U varies, and so it will be impossible to counteract the effect of an error term in a way which is time-independent.  Instead, we will argue later that an appropriate choice of a bounded and noisy U can be used to achieve a very wide variety of effective behaviors of U-maximizers, overcoming the limitations both of bounded utility maximization and of noisy specification of utility functions.</p>
<h3>Other errors</h3>
<p>Many possible problems with this scheme were described or implicitly addressed above. But that discussion was not exhaustive, and there are some classes of errors that fall through the cracks.</p>
<p>One interesting class of failures concerns changes in the values of the hypothetical human H. This human is in a very strange situation, and it seems quite possible that the physical universe we know contains extremely few instances of that situation (especially as the process unfolds and becomes more exotic). So H&#8217;s first-person experience of this situation may lead to significant changes in H&#8217;s views.</p>
<p>For example, our intuition that our own universe is valuable seems to be derived substantially from our judgment that our own first-person experiences are valuable. If hypothetically we found ourselves in a very alien universe, it seems quite plausible that we would judge the experiences within that universe to be morally valuable as well (depending perhaps on our initial philosophical inclinations).</p>
<p>Another example concerns our self-interest: many of an individual human&#8217;s values seem to depend on their anticipations about what will happen to them, especially when faced with the prospect of very negative outcomes. If hypothetically we woke up in a completely non-physical situation, it is not exactly clear what we would anticipate, and this may distort our behavior. Would we anticipate the thought experiment proceeding as planned? Would we focus our attention on those locations in the universe where a simulation of the thought experiment might be occurring? This possibility is particularly troubling in light of the <a href="http://ordinaryideas.wordpress.com/2012/05/24/solomonoff-induction-and-simulations/">incentives</a> our scheme creates&#8212;anyone who can manipulate H&#8217;s behavior can have a significant effect on the future of our world, and so many may be motivated to create simulations of H.</p>
<h2>How Will a U-Maximizer Behave In Light of Uncertainty About U?</h2>
<p dir="ltr">A realistic U-maximizer will not be able to carry out the process described in the definition of U&#8211;in fact, this process probably requires immensely more computing resources than are available in the universe. (It may even involve the reaction of a simulated human to watching a simulation of the universe!) To what extent can we make robust guarantees about the behavior of such an agent?</p>
<p>We have already touched on this difficulty when discussing the maxim “A state of affairs is valuable to the extent I would judge it valuable after a century of reflection.” We cannot generally predict our own judgments in a hundred years’ time, but we can have well-founded beliefs about those judgments and act on the basis of those beliefs. We can also have beliefs about the value of further deliberation, and can strike a balance between such deliberation and acting on our current best guess.</p>
<p>A U-maximizer faces a similar set of problems: it cannot understand the exact form of U, but it can still have well-founded beliefs about U, and about what sorts of actions are good according to U. For example, if we suppose that the U-maximizer can carry out any reasoning that we can carry out, then the U-maximizer knows to avoid anything which we suspect would be bad according to U (for example, torturing humans). Even if the U-maximizer cannot carry out this reasoning, as long as it can recognize that humans have powerful predictive models for other humans, it can simply appropriate those models (either by carrying out reasoning inspired by human models, or by simply asking).</p>
<p>Moreover, the community of humans being simulated in our process has access to a simulation of whatever U-maximizer is operating under this uncertainty, and has a detailed understanding of that uncertainty. This allows the community to shape their actions in a way with predictable (to the U-maximizer) consequences.</p>
<h2>Can Our Values be Expressed in This Framework?</h2>
<p dir="ltr">It is easily conceivable that our values cannot be captured by a bounded utility function. Easiest to imagine is the possibility that some states of the world are much better than others, in a way that requires unbounded utility functions. But it is also conceivable that the framework of utility maximization is fundamentally not an appropriate one for guiding such an agent’s action, or that the notion of utility maximization hides subtleties which we do not yet appreciate.</p>
<p>We will argue that it is possible to transform bounded utility maximization into an arbitrary alternative system of decision-making, by designing a utility function which rewards worlds in which the U-maximizer replaced itself with an alternative decision-maker.</p>
<p>It is straightforward to design a utility function which is maximized in worlds where any particular U-maximizer converted itself into a non-U-maximizer&#8211;even if no simple characterization can be found for the desired act, we can simply instantiate many communities of humans to look over a world history and decide whether or not they judge the U-maximizer to have acted appropriately.</p>
<p>The more complicated question is whether a realistic U-maximizer can be made to convert itself into a non-U-maximizer, given that it is logically uncertain about the nature of U. It is at least conceivable that it couldn’t: if the desirability of some other behavior is only revealed by philosophical considerations which are too complex to ever be discovered by physically limited agents, then we should not expect any physically limited U-maximizer to respond to those considerations. Of course, in this case we could also not expect normal human deliberation to correctly capture our values. The relevant question is whether a U-maximizer could switch to a different normative framework, if an ordinary investment of effort by human society revealed that a different normative framework was more appropriate.</p>
<p>If a U-maximizer does not spend any time investigating this possibility, then it may not be expected to act on it. But to the extent that we assign a significant probability to the simulated humans deciding that a different normative framework is more appropriate, and to the extent that the U-maximizer is able to either emulate or accept our reasoning, it will also assign a significant probability to this possibility (unless it is able to rule it out by more sophisticated reasoning). If we (and the U-maximizer) expect the simulations to output a U which rewards a switch to a different normative framework, and this possibility is considered seriously, then U-maximization entails exploring this possibility. If these explorations suggest that the simulated humans probably do recommend some particular alternative framework, and will output a U which assigns high value to worlds in which this framework is adopted and low value to worlds in which it isn’t, then a U-maximizer will change frameworks.</p>
<p>Such a “change of frameworks” may involve sweeping action in the world. For example, the U-maximizer may have created many other agents which are pursuing activities instrumentally useful to maximizing U. These agents may then need to be destroyed or altered; anticipating this possibility, the U-maximizer is likely to take actions to ensure that its current “best guess” about U does not get locked in.</p>
<p>This argument suggests that a U-maximizer could adopt an arbitrary alternative framework, if it were feasible to conclude that humans would endorse that framework upon reflection.</p>
<h2>Is “Passing the Buck” Problematic?</h2>
<p dir="ltr">Our proposal appears to be something of a cop out, in that it declines to directly take a stance on any ethical issues. Indeed, not only do we fail to specify a utility function ourselves, but we expect the simulations to which we have delegated the problem to in turn delegate it at least a few more times. Clearly at some point this process must bottom out with actual value judgments, and we may be concerned that this sort of “passing the buck” is just obscuring deeper problems which will arise when the process does bottom out.</p>
<p>As observed above, whatever such concerns we might have can also be discovered by the simulations we create. If there is some fundamental difficulty which always arises when trying to assign values, then we certainly have not exacerbated this problem by delegation. Nevertheless, there are at least two coherent objections one might raise:</p>
<ul>
<li>Even if the simulated humans can uncover any objections we could raise now, this does not guarantee that we can ignore all objections. After all, the objection “Isn’t passing the buck problematic?” could be raised at every stage, and always countered by the same response: “If it is problematic, then this will be realized by the people to whom we have passed the buck.” If we do not take this objection seriously, then it may be that none of the delegates take it seriously either, and this call and response could be repeated indefinitely.</li>
<li>This proposal could fail in many (potentially unexpected) ways. If fundamentally resolving ethics requires overcoming some difficulties which we are evading by passing the buck, then we may be adding additional risk without buying much benefit.</li>
</ul>
<p dir="ltr">Both of these objections can be met with a single response. In the current world, we face a broad range of difficult and often urgent problems. By passing the buck the first time, we delegate resolution of ethical challenges to a civilization which does not have to deal with some of these difficulties&#8211;in particular, it faces no urgent existential threats. This allows us to divert as much energy as possible to dealing with practical problems today, while still capturing most of the benefits of nearly arbitrarily extensive ethical deliberation.</p>
<h2>Does This Process Have Moral (Dis)Value?</h2>
<p dir="ltr">This process is defined in terms of the behavior of unthinkably many hypothetical brain emulations. It is conceivable that the moral status of these emulations may be significant.</p>
<p>We must make a distinction between two possible sources of moral value: it could be the case that a U-maximizer carries out simulations on physical hardware in order to better understand U, and these simulations have moral value, or it could be the case that the hypothetical emulations themselves have moral value.</p>
<p>In the first case, we can remark that the moral value of such simulations is itself incorporated into the definition of U. Therefore a U-maximizer will be sensitive to the possible suffering of simulations it runs while trying to learn about U&#8211;as long as it believes that we might be concerned about the simulations’ welfare, upon reflection, it can rely as much as possible on approaches which do not involve running simulations, which deprive simulations of the first-person experience of discomfort, or which estimate outcomes by running simulations in more pleasant circumstances. If the U-maximizer is able to foresee that we will consider certain sacrifices in simulation welfare worthwhile, then it will make those sacrifices. In general, in the same way that we can argue that estimates of U reflect our values over states of affairs, we can argue that estimates of U reflect our values over processes for learning about U.</p>
<p>In the second case, a U-maximizer in our world may have little ability to influence the welfare of hypothetical simulations invoked in the definition of U. However, the possible disvalue of these simulations’ experiences is probably seriously diminished.</p>
<p>In general the moral value of such hypothetical simulations’ experiences is somewhat dubious. If we simply write down the definition of U, these simulations seem to have no more reality than story-book characters whose activities we describe.</p>
<p>The best argument for their moral relevance comes from the great causal significance of their decisions: if the actions of a powerful U-maximizer depend on its beliefs about what a particular simulation would do in a particular situation, including for example that simulation’s awareness of discomfort or fear, or confusion at the absurdity of the hypothetical situation in which it finds itself, then it may be the case that those emotional responses are granted moral significance. However, although we may define astronomical numbers of hypothetical simulations, the detailed emotional responses of very few of these simulations will play an important role in the definition of U.</p>
<p>Moreover, for the most part the existences of the hypothetical simulations we define are extremely well-controlled by those simulations themselves, and may be expected to count as unusually happy by their own lights. The early simulations (who have less such control) are created from an individual who has provided consent and is selected to find such situations particularly non-distressing.</p>
<p>Finally, we observe that U can exert control over the experiences of even hypothetical simulations. If the early simulations would experience morally relevant suffering because of their causal significance, but the later simulations they generate robustly disvalue this suffering, the later simulations can simulate each other and ensure that they all take the same actions, eliminating the causal significance of the earlier simulations.</p>
<br />  <a rel="nofollow" href="http://feeds.wordpress.com/1.0/gocomments/ordinaryideas.wordpress.com/121/"><img alt="" border="0" src="http://feeds.wordpress.com/1.0/comments/ordinaryideas.wordpress.com/121/" /></a> <img alt="" border="0" src="http://pixel.wp.com/b.gif?host=ordinaryideas.wordpress.com&#038;blog=30479764&#038;post=121&#038;subd=ordinaryideas&#038;ref=&#038;feed=1" width="1" height="1" />]]></content:encoded>
			<wfw:commentRss>http://ordinaryideas.wordpress.com/2012/04/21/indirect-normativity-write-up/feed/</wfw:commentRss>
		<slash:comments>13</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/0a58766009bacac271106a2f92289d90?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">paulfchristiano</media:title>
		</media:content>
	</item>
		<item>
		<title>Formal Instructions Revisited</title>
		<link>http://ordinaryideas.wordpress.com/2012/01/02/formal-instructions-revisited/</link>
		<comments>http://ordinaryideas.wordpress.com/2012/01/02/formal-instructions-revisited/#comments</comments>
		<pubDate>Mon, 02 Jan 2012 06:16:50 +0000</pubDate>
		<dc:creator><![CDATA[paulfchristiano]]></dc:creator>
				<category><![CDATA[AI Safety]]></category>
		<category><![CDATA[Formal Definitions]]></category>

		<guid isPermaLink="false">http://ordinaryideas.wordpress.com/?p=108</guid>
		<description><![CDATA[Previously I&#8217;ve talked about getting a handle on objects of interest (humans, counterfactual civilizations) and on issuing an instruction of the form &#8220;Simulate this civilization.&#8221;  Here is a much better proposal for issuing formal instructions. Suppose we can build a something-like-TDT agent, which controls its behavior to maximize a constant U defined by some axioms [&#8230;]<img alt="" border="0" src="http://pixel.wp.com/b.gif?host=ordinaryideas.wordpress.com&#038;blog=30479764&#038;post=108&#038;subd=ordinaryideas&#038;ref=&#038;feed=1" width="1" height="1" />]]></description>
				<content:encoded><![CDATA[<p>Previously I&#8217;ve talked about getting a handle on objects of interest (humans, counterfactual civilizations) and on issuing an instruction of the form &#8220;Simulate <em>this</em> civilization.&#8221;  Here is a much better proposal for issuing formal instructions.</p>
<p>Suppose we can build a something-like-TDT agent, which controls its behavior to maximize a constant U defined by some axioms (or as the output of some program).  We want to run this agent with a utility function reflecting our own preferences, but we don&#8217;t have (formal) access to those preferences.<span id="more-108"></span></p>
<p>Suppose further that we can somehow define mathematically a counterfactual utopia&#8211;a possible world in which humans are likely to successfully complete whatever moral project we might optimistically hope for and build an FAI they are happy with. For example, perhaps we can mathematically pin down a brain emulation and describe algorithmically a (computationally unrestricted) environment in which the brain emulation can have free rein, and will have the ability to safely carry out extensive projects and instantiate other uploads. We can provide the resulting civilization with a distinguished input channel and describe mathematically the distribution over possible messages they send on that channel.</p>
<p>Let P be the distribution over messages sent on this channel. Some messages can be interpreted as axioms describing potential utility U (or programs outputting real numbers). Since we can describe P mathematically, we can define a distribution over utilities U by conditioning P on being a valid description of utility and interpreting it in that way. Once we&#8217;ve described this probability distribution, we can specify our original agent&#8217;s utility as the expected utility if we use a definition drawn from this distribution.</p>
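<p>(A rough Python sketch of this construction, with <code>sample_message</code> and <code>parse_utility</code> as hypothetical stand-ins for sampling from P and for checking that a message is a valid utility description; conditioning on validity is done by simple rejection.)</p>
<pre>
def expected_utility(outcome, sample_message, parse_utility, n_samples=1000):
    """sample_message draws messages from P; parse_utility returns a utility
    function with values in [0, 1] if the message is a valid description, else None."""
    total, valid = 0.0, 0
    while valid < n_samples:
        u = parse_utility(sample_message())
        if u is None:            # condition on validity by rejection
            continue
        total += u(outcome)
        valid += 1
    return total / valid
</pre>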
<p>With luck, the resulting utility function is as good a choice as we could have come up with, even if we had been given an eternity with no risk of catastrophe. The civilization could produce a utility function expressing ideas like &#8220;do no harm, but do this instrumentally useful thing which will allow the original humans to build a real FAI,&#8221; for example. More optimistically, they could just find an appropriate dynamic directly which expresses human preferences (probably defined in terms of a mathematical description of human brains, or perhaps a catalog of all humans existing at the time the AI is run, or perhaps a description of agents from within this counterfactual utopia, or whatever).</p>
<p>There are a few obvious problems:</p>
<ul>
<li>With some probability the civilization will fail (e.g. be taken over by uFAI) and will output some utility function designed to take over our universe. To prevent this sort of failure, we should require each utility function to be bounded, say to have range [0, 1]. If we make this restriction and 99% of the mass of P lies on &#8220;good&#8221; utility functions, then we are guaranteed to get an outcome which maximizes a &#8220;good&#8221; utility function to within 0.01. (Of course, now we are stuck with a bounded utility function, but this is not such a serious issue since satisfying a bounded utility function may create an agent which maximizes an unbounded utility function. Here the bounded utility acts as a sort of &#8220;majority rules,&#8221; rather than an inherent obstruction.)</li>
<li>There may be no utility function which we want to plug into such an AI. This framework may just be a bad one, in which it is not possible to even express an idea like &#8220;do no harm,&#8221; much less human value (or, our decision theory may cause the AI to behave badly regardless of utility function for other reasons). In this case we have more serious trouble, but at least we would have had serious trouble anyway.</li>
<li>Simulating the counterfactual utopia is likely to be much too computationally expensive for the AI given available resources. This isn&#8217;t really a problem, as the AI is encouraged to just take its &#8220;best guess,&#8221; understanding the mathematical/algorithmic definition of utility even if it cannot simulate the world in the definition&#8211;to the extent that human values cohere (in whatever abstract sense we might hope for) the AI can predict that some actions will be good without obtaining an exact statement of human utility.</li>
<li>Defining a good utility U may depend on understanding facts about our universe. This seems unlikely to be a problem, as the counterfactual utopia may run huge simulations of varied laws of physics and search for copies of their remembered history within those simulations (or output a utility function that does the same). So if this consideration bites us (say because our laws of physics are radically different than we expect) we will probably first run into the next problem:</li>
<li>Our formal definitions of brain emulations, counterfactuals, moral bootstrapping, or whatever, may break down, for any one of many possible reasons.</li>
</ul><br />  <a rel="nofollow" href="http://feeds.wordpress.com/1.0/gocomments/ordinaryideas.wordpress.com/108/"><img alt="" border="0" src="http://feeds.wordpress.com/1.0/comments/ordinaryideas.wordpress.com/108/" /></a> <img alt="" border="0" src="http://pixel.wp.com/b.gif?host=ordinaryideas.wordpress.com&#038;blog=30479764&#038;post=108&#038;subd=ordinaryideas&#038;ref=&#038;feed=1" width="1" height="1" />]]></content:encoded>
			<wfw:commentRss>http://ordinaryideas.wordpress.com/2012/01/02/formal-instructions-revisited/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/0a58766009bacac271106a2f92289d90?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">paulfchristiano</media:title>
		</media:content>
	</item>
		<item>
		<title>Formal Instructions</title>
		<link>http://ordinaryideas.wordpress.com/2011/12/21/edict-to-simulate/</link>
		<comments>http://ordinaryideas.wordpress.com/2011/12/21/edict-to-simulate/#comments</comments>
		<pubDate>Wed, 21 Dec 2011 20:09:02 +0000</pubDate>
		<dc:creator><![CDATA[paulfchristiano]]></dc:creator>
				<category><![CDATA[AI Safety]]></category>
		<category><![CDATA[Formal Definitions]]></category>

		<guid isPermaLink="false">http://ordinaryideas.wordpress.com/?p=69</guid>
		<description><![CDATA[I have written briefly about how one might pin down the human decision process (the thing itself, not some idealization thereof) or a counterfactual world. If we (probably foolishly) wanted to give an AI formal instructions using these ideas, we would still need to include some edict like &#8220;Now take this decision process, embed it [&#8230;]<img alt="" border="0" src="http://pixel.wp.com/b.gif?host=ordinaryideas.wordpress.com&#038;blog=30479764&#038;post=69&#038;subd=ordinaryideas&#038;ref=&#038;feed=1" width="1" height="1" />]]></description>
				<content:encoded><![CDATA[<p>I have written briefly about how one might pin down the human decision process (the thing itself, not some idealization thereof) or a counterfactual world. If we (probably foolishly) wanted to give an AI formal instructions using these ideas, we would still need to include some edict like &#8220;Now take this decision process, embed it in this abstract world (where we believe it will be able to create a flourishing utopia or whatever) in this way, and make the universe look like <em>that</em>.&#8221; We&#8217;ve maybe gotten some leverage on the first parts (though right now the difficulties <a href="http://ordinaryideas.wordpress.com/2011/12/15/hazards/">here</a> loom pretty large), which involve precisely defining certain concepts for an AI, but it isn&#8217;t yet clear how you could precisely tell the AI to <em>do</em> something. Here is a stab at this other problem.</p>
<p>Rather than directly asking an AI to simulate a particular universe, we will ask it to find the value on a particular physical input channel in that universe, and then exert control from within the universe to ensure that calculating this value requires simulating the universe (or at least capturing whatever moral value we hope would come from a simulation of that universe).<span id="more-69"></span></p>
<p>For example, suppose that we have an AI which somehow uses a human&#8217;s decision process to define a counterfactual world W (in which some specified quantum coins happened to fall in a particular way controlled by [an extrapolation of] the human&#8217;s decision process; or perhaps in which the laws of physics are completely synthetic and the human decision process has the freedom to construct a desirable world), and we would like the AI to make the universe &#8220;look like W.&#8221; We imagine that the world W has a distinguished input channel I within it (defined as the simplest program which agrees with some initial values on a particular physical input channel). Moreover, the real universe also contains an input channel I&#8217;, defined in the same way.</p>
<p>In order to instruct the AI to make the world look like W, we tell it to value similarity between the values on I and I&#8217;, say to have utility equal to the length of the longest common prefix between I and any suffix of I&#8217;. If we (living within W) control the value on I appropriately, this may lead the AI to simulate W, determine what values we would have put on I, and then put the same values on I&#8217;.</p>
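<p>(A toy sketch of this similarity measure over finite bit strings; the actual proposal concerns unbounded streams, so this is only meant to make the definition concrete.)</p>
<pre>
def similarity_utility(i_target, i_real):
    """Length of the longest common prefix between i_target and any suffix of i_real."""
    best = 0
    for start in range(len(i_real)):
        suffix = i_real[start:]
        k = 0
        while k < min(len(suffix), len(i_target)) and suffix[k] == i_target[k]:
            k += 1
        best = max(best, k)
    return best

print(similarity_utility("101101", "0010110"))  # 5: the suffix "10110" matches a prefix of "101101"
</pre>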
<p>How would we want to set the values of I? We want to ensure that, in order to make I and I&#8217; equal, the AI will simulate W and then change the values on I&#8217;, rather than running some other process to guess the values on I or trying to acausally control the value on I itself. This requires first that friendly processes control I (for example, if a predictable process or the AI itself gains control of I, the AI no longer needs to simulate W to make I and I&#8217; agree). But it also requires that there be no computational shortcuts to computing the values on I&#8211;so that the AI is required to simulate W in order to compute the values on I. This actually seems rather straightforward: by making the values of I depend in a sufficiently chaotic way on the rest of the universe, it will be necessary to conduct an exact simulation of W in order to make I&#8217; remotely similar to I. We also need to ensure that it is possible for the AI to make I&#8217; and I similar, or else it will resort to some more desperate measures. This will hold if we ensure that the world W is sufficiently simple that it can be simulated in the real world&#8211;say by creating a giant computer within W and simulating our civilization on that computer, then making I depend only on the behavior of that simulation.</p>
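<p>(One concrete way to get such a chaotic, shortcut-free dependence&#8211;anticipating the hashing protocol described next&#8211;is to fold hashes of the relevant histories into the channel value. SHA-256 is used below purely as a stand-in for whichever collision-resistant hash one actually trusts.)</p>
<pre>
import hashlib

def summarize(history):
    # digest of one process's recorded history (SHA-256 as a stand-in hash)
    return hashlib.sha256(history).digest()

def channel_value(process_histories):
    # fold all the summaries into a single digest to place on the channel I;
    # a Merkle tree would serve the same purpose as this simple chain
    digest = b""
    for h in process_histories:
        digest = hashlib.sha256(digest + summarize(h)).digest()
    return digest
</pre>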
<p>The following approach seems to be very conservative, supposing that consciousness is a property of computation and that we put moral value primarily on the content of conscious experience (it should be clear how to generalize to other settings, and these choices can all be made after much deliberation within W). Each conscious process which wants to ensure it gets simulated computes a cryptographic hash of some large subset of its own computation history. So e.g. I record all of the heat coming off of my brain and hash it to obtain a 10000 bit summary of my conscious experience. We then hash all of these histories together (using a Merkle tree if you like) and produce a 10000 bit summary of the universe&#8217;s moral value, and send this over the channel I. We repeat this process periodically, encouraging an AI to try to simulate as much of our collective history as possible before exhausting available resources. If the cryptographic hash function is collision-resistant against the universe (based on our current understanding of complexity theory we suspect such functions exist), this has the property that no single agent in W can substantially influence the value of I, and moreover that an AI trying to make I&#8217; similar to I can only do so by computing all of the inputs to this process&#8211;ie, by replicating the entire history of all conscious processes which we value.</p><br />  <a rel="nofollow" href="http://feeds.wordpress.com/1.0/gocomments/ordinaryideas.wordpress.com/69/"><img alt="" border="0" src="http://feeds.wordpress.com/1.0/comments/ordinaryideas.wordpress.com/69/" /></a> <img alt="" border="0" src="http://pixel.wp.com/b.gif?host=ordinaryideas.wordpress.com&#038;blog=30479764&#038;post=69&#038;subd=ordinaryideas&#038;ref=&#038;feed=1" width="1" height="1" />]]></content:encoded>
			<wfw:commentRss>http://ordinaryideas.wordpress.com/2011/12/21/edict-to-simulate/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/0a58766009bacac271106a2f92289d90?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">paulfchristiano</media:title>
		</media:content>
	</item>
		<item>
		<title>Specifying (non-decision-theoretic) Counterfactuals</title>
		<link>http://ordinaryideas.wordpress.com/2011/12/15/specifying-non-decision-theoretic-counterfactuals/</link>
		<comments>http://ordinaryideas.wordpress.com/2011/12/15/specifying-non-decision-theoretic-counterfactuals/#comments</comments>
		<pubDate>Thu, 15 Dec 2011 23:27:33 +0000</pubDate>
		<dc:creator><![CDATA[paulfchristiano]]></dc:creator>
				<category><![CDATA[AI Safety]]></category>
		<category><![CDATA[Formal Definitions]]></category>

		<guid isPermaLink="false">http://ordinaryideas.wordpress.com/?p=25</guid>
		<description><![CDATA[Here is a simple trick for specifying a computer in the physical world&#8217;s future inputs: run the computer for a long time, and then ask for the simplest description of the resulting sequence of inputs. The resulting description is a good predictor for future inputs, provided we live in a suitable universe. (This is vulnerable [&#8230;]<img alt="" border="0" src="http://pixel.wp.com/b.gif?host=ordinaryideas.wordpress.com&#038;blog=30479764&#038;post=25&#038;subd=ordinaryideas&#038;ref=&#038;feed=1" width="1" height="1" />]]></description>
				<content:encoded><![CDATA[<p>Here is a simple trick for specifying the future inputs of a computer in the physical world: run the computer for a long time, and then ask for the simplest description of the resulting sequence of inputs. The resulting description is a good predictor for future inputs, provided we live in a suitable universe.</p>
<p>(This is vulnerable to all of the same attacks defined in &#8220;Hazards,&#8221; and if we really want to get access to the universe as a whole, rather than just to a simulation of a single brain, it will be much harder to get around these problems.)</p>
<p>Now suppose we have a single bit X on a computer, and we would like to talk about the counterfactual world in which X&#8217;s value was flipped. How can we do this? Or perhaps we would like to consider an entire ensemble of possible counterfactuals in which we were given one of exponentially many possible messages m1, m2, &#8230;.</p>
<p><span id="more-25"></span></p>
<p>Here is one approach: consider a sequence of regularly spaced quantum coin flips F1, F2, &#8230; Perform each of these flips, and record the results in a list L. Continue running the computer and acquiring input I, and ensure that the input couples closely enough with the coin flips (e.g. by looking at sufficiently chaotic systems influenced by the coins, by looking at the coins directly, etc.). Then ask for the simplest function f such that f(L) = I. If the flips contained substantially more information than is required to locate the sequence of flips in the universe, then the simplest such f describes the universe by locating these distinguished coin flips and using L to determine their values (rather than letting them be chosen randomly, as the normal laws of physics would entail). By plugging some other sequence of bits into f, we get access to the computer&#8217;s inputs in counterfactual worlds where the coin flips turned out differently.</p>
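<p>(A toy sketch of the search for the simplest consistent f; the ordering of the hypothesis class stands in for the universal prior&#8217;s simplicity ordering, and all names are illustrative.)</p>
<pre>
def simplest_consistent(hypotheses, flips, inputs):
    """hypotheses: candidate functions ordered from simplest to most complex
    (a stand-in for the universal prior); return the first f with f(flips) == inputs."""
    for f in hypotheses:
        if f(flips) == inputs:
            return f
    return None

# If f is the returned function, f(alternative_flips) gives the computer's inputs
# in the counterfactual world where the coins fell according to alternative_flips.
</pre>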
<p>By using these results in decision-making we can examine the results of counterfactual decisions. By using them cleverly to control input to a human, we can get around the first problem described in the post &#8220;Hazards.&#8221; But note that the other problems there still apply in force.</p><br />  <a rel="nofollow" href="http://feeds.wordpress.com/1.0/gocomments/ordinaryideas.wordpress.com/25/"><img alt="" border="0" src="http://feeds.wordpress.com/1.0/comments/ordinaryideas.wordpress.com/25/" /></a> <img alt="" border="0" src="http://pixel.wp.com/b.gif?host=ordinaryideas.wordpress.com&#038;blog=30479764&#038;post=25&#038;subd=ordinaryideas&#038;ref=&#038;feed=1" width="1" height="1" />]]></content:encoded>
			<wfw:commentRss>http://ordinaryideas.wordpress.com/2011/12/15/specifying-non-decision-theoretic-counterfactuals/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/0a58766009bacac271106a2f92289d90?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">paulfchristiano</media:title>
		</media:content>
	</item>
		<item>
		<title>Hazards for Formal Specifications</title>
		<link>http://ordinaryideas.wordpress.com/2011/12/15/hazards/</link>
		<comments>http://ordinaryideas.wordpress.com/2011/12/15/hazards/#comments</comments>
		<pubDate>Thu, 15 Dec 2011 23:03:39 +0000</pubDate>
		<dc:creator><![CDATA[paulfchristiano]]></dc:creator>
				<category><![CDATA[AI Safety]]></category>
		<category><![CDATA[Formal Definitions]]></category>

		<guid isPermaLink="false">http://ordinaryideas.wordpress.com/?p=23</guid>
		<description><![CDATA[I have described a candidate scheme for mathematically pinpointing the human decision process, by conditioning the univeral prior on agreement with the human&#8217;s observed behavior. I would like to point out three dangers with this approach, which seem to apply quite generally to attempts to mathematically specify value (and have analogs for other aspects of [&#8230;]<img alt="" border="0" src="http://pixel.wp.com/b.gif?host=ordinaryideas.wordpress.com&#038;blog=30479764&#038;post=23&#038;subd=ordinaryideas&#038;ref=&#038;feed=1" width="1" height="1" />]]></description>
				<content:encoded><![CDATA[<p>I have <a href="http://ordinaryideas.wordpress.com/2011/12/14/specifying-humans-formally-using-an-oracle-for-physics/">described</a> a candidate scheme for mathematically pinpointing the human decision process, by conditioning the universal prior on agreement with the human&#8217;s observed behavior. I would like to point out three dangers with this approach, which seem to apply quite generally to attempts to mathematically specify value (and have analogs for other aspects of agents&#8217; behavior):</p>
<p><span id="more-23"></span></p>
<h2>Counterfactuals</h2>
<p>Suppose for example that we are in a deterministic universe with a short description. Then the shortest description of the &#8220;human&#8217;s behavior&#8221; is likely to consist of a description of the universe together with a pointer to the measuring apparatus within that universe. This description is short and correct, and will correctly predict the future behavior of the human on the actual input sequence.</p>
<p>But rather than learning a stand-alone description for the human behavior, we wanted to learn the abstract map which related the displayed input sequence to the human&#8217;s behavior, with an interest in examining the response of the human to counterfactual input sequences. We have now failed completely at this task: the learned function effectively has the actual input sequence wired in, computed directly from physical law. This is closely related to the issues raised in &#8220;AIXI and Existential Despair.&#8221;</p>
<p>In a randomized universe it is less clear what happens; in general there is a risk of an analogous failure, but in a future post I will describe how to use randomness to specify counterfactual worlds.</p>
<h2>Attack by Simulator</h2>
<p>If a particular sequence of observations is embedded many times in the same universe (for example, by many distinct observers having identical experiences) then we expect the universal prior to put roughly equal weight on each of those embeddings (though this is only relevant if they diverge at some point). So  if we condition on a prefix long enough that the shortest explanation comes from the embedding in the universe, our predictions for subsequent bits are roughly uniform over continuations of observers&#8217; experiences consistent with that prefix.</p>
<p>We are interested in extrapolating the output sequence of a particular embedding (under appropriate counterfactuals), namely the one which we observe (and which we intuitively think of as the only embedding). But there may be others. In particular, imagine an AI elsewhere in the universe aware of this experiment. Such an AI may create many new embeddings of these observations into the universe, with the goal of &#8220;hijacking&#8221; the resulting observation process and controlling its output (for example, it could simulate many procedures which behave identically to humans for the first part of their history and then turn over control to the AI). We can imagine a similar situation applied to humans themselves: for example, you expect your future observations to be a continuation of your past sensory experience, but if you tried to formalize the induction justifying that belief and use your expectations to specify your own world model, an AI motivated to do so could run many simulations of you up to the current moment and then modify their future experiences arbitrarily, controlling your (inductive) expectations at will.</p>
<p>How could such an AI exist? One possibility is interference from an alternative Everett branch in which a singularity went badly. Note that this possibility is very different from acausal trade between friendly and unfriendly AI: such an unfriendly AI gives up only a negligible share of its resources and might be able to achieve complete control over our branch.</p>
<p>We may be slightly protected from this possibility by an AI&#8217;s low measure or ignorance about our behavior (that is, depending on randomness and logical uncertainty which it cannot resolve we will probably pursue one of many different approaches, and a simulation attack can fool at most a few of them). We could try to exacerbate this difficulty by choosing as many parameters randomly as possible&#8211;but this would also reduce our own measure, and so would present no net obstacle. Moreover, an interfering AI would be likely to have very considerable resources, and so be able to overcome a significant disadvantage coming from either the measure of its world or its uncertainty about our behavior. Overall, this protection does not seem substantial.</p>
<p>Another source for meddling AIs is our own future: if a particular uFAI is able to gain control of our choice process and uses it to take control of astronomical resources, it will then be able to use those resources to run many simulations and gain control of our choice process. This highlights a fact which seems to be relatively well understood on Less Wrong but not appreciated more broadly: a computationally bounded AI performing approximate Solomonoff induction (or even reasoning about the universal distribution) has degrees of freedom not specified by its goals and may use those degrees of freedom to exert acausal control over the universal distribution itself. For example, if the original AI decides to give control of the &#8220;human&#8221;&#8216;s decision process to a particular uFAI, that uFAI may be able to take over the universe and then run enough simulations to justify the original decision. But this may be simultaneously true for many AIs, friendly and not, and the original AI is able to make a choice amongst these options&#8211;or it may be able to choose for there to be no simulations, or it may simply turn out that the relationship between its actions and the universal prior makes it impossible for it to make any reasonable choice (see &#8220;Beliefs That are Always Wrong&#8221;) or etc.</p>
<p>We could avoid this entire class of failures if the shortest description of the human&#8217;s decision process consisted of a complete description of the human&#8217;s original mental state and the physical law governing the setup. Unfortunately, if we use the Kolmogorov complexity to define &#8220;simplest&#8221; this seems unlikely. Indeed, to the extent that we as humans believe that our own recollections are enough to provide substantial evidence about the world, and to the extent that we believe that the Solomonoff prior is a reasonable model for our own predictive frameworks, we must believe that our own brains (complete with our recollections) are most concisely described by modeling the universe that produced those recollections.</p>
<p>In a future post, I will describe alternatives based on other complexity measures which may avoid this problem. We may also be able to escape from this problem by specifying a narrower model of computation (for example by adding explicit error to some more biologically plausible model of a brain, or etc.) and searching for a good approximation within that class.</p>
<h2>TDT Impersonation</h2>
<p>Suppose our universe is a particular CA, and consider some TDT agent whose goal is to control the behavior of that CA in some way (say, a goal defined directly in terms of the states of cells, such as making the board return to its initial configuration as quickly as possible) and which implements an input output relationship defined by applying approximate TDT-style maximization.</p>
<p>If we were to run such an agent, using the input given to the &#8220;human in a box,&#8221; it might reason as follows:</p>
<blockquote><p>What are the consequences of me outputting a particular O in response to this input? Well, let me simulate the universe for a while to see if I can find this input embedded anywhere. I can!</p>
<p>Well, now I know that the input is being used as part of a test by some humans to try and isolate a human decision process. I don&#8217;t know all of the details exactly, because this same input pattern is embedded in the universe many times, but I have a pretty good idea.</p>
<p>If I want to take over the universe, it stands to reason that I should pretend to be a human for a while; if I pass that test then the search algorithm that the human is using will conclude that I implement the human&#8217;s decision process, and thereby grant me some control over their world&#8230;</p></blockquote>
<p>This reasoning may occur explicitly as our AI searches for a program to replicate the human&#8217;s output, or it may occur implicitly in the AI&#8217;s imagination as it logically reasons about the existence of such an algorithm.</p>
<p>In any case, if the Kolmogorov complexity of a TDT agent is shorter than the Kolmogorov complexity of specifying the embedding of the human in the universe in any other way, then such a TDT agent may be the simplest explanation for the human&#8217;s behavior. This is plausible primarily because the particular experiment we are running&#8211;for which the output may determine the entire future of our galaxy&#8211;is distinguished by the beacon of having great causal significance. The easiest way to specify &#8220;great causal significance&#8221; may be to write down an agent who is naturally driven to care about that event (or to write down a program which successfully searches for such an agent).</p>
<p>As with the simulation problem, we may be able to escape from this difficulty by specifying a complexity measure or class of models which favors concrete simulations of humans rather than universe-level descriptions.</p><br />  <a rel="nofollow" href="http://feeds.wordpress.com/1.0/gocomments/ordinaryideas.wordpress.com/23/"><img alt="" border="0" src="http://feeds.wordpress.com/1.0/comments/ordinaryideas.wordpress.com/23/" /></a> <img alt="" border="0" src="http://pixel.wp.com/b.gif?host=ordinaryideas.wordpress.com&#038;blog=30479764&#038;post=23&#038;subd=ordinaryideas&#038;ref=&#038;feed=1" width="1" height="1" />]]></content:encoded>
			<wfw:commentRss>http://ordinaryideas.wordpress.com/2011/12/15/hazards/feed/</wfw:commentRss>
		<slash:comments>6</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/0a58766009bacac271106a2f92289d90?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">paulfchristiano</media:title>
		</media:content>
	</item>
		<item>
		<title>Specifying Humans Formally (Using an Oracle for Physics)</title>
		<link>http://ordinaryideas.wordpress.com/2011/12/14/specifying-humans-formally-using-an-oracle-for-physics/</link>
		<comments>http://ordinaryideas.wordpress.com/2011/12/14/specifying-humans-formally-using-an-oracle-for-physics/#comments</comments>
		<pubDate>Wed, 14 Dec 2011 22:12:00 +0000</pubDate>
		<dc:creator><![CDATA[paulfchristiano]]></dc:creator>
				<category><![CDATA[AI Safety]]></category>
		<category><![CDATA[Formal Definitions]]></category>

		<guid isPermaLink="false">http://ordinaryideas.wordpress.com/?p=15</guid>
		<description><![CDATA[Although I don&#8217;t yet have any idea how to build an AI which pursues a goal I give it, I am optimistic that one day humans might. Writing down any understandable goal at all, much less one which humans approve of, looks like it might be quite hard even with a seed AI in hand, [&#8230;]<img alt="" border="0" src="http://pixel.wp.com/b.gif?host=ordinaryideas.wordpress.com&#038;blog=30479764&#038;post=15&#038;subd=ordinaryideas&#038;ref=&#038;feed=1" width="1" height="1" />]]></description>
				<content:encoded><![CDATA[<p>Although I don&#8217;t yet have any idea how to build an AI which pursues a goal I give it, I am optimistic that one day humans might. Writing down any understandable goal at all, much less one which humans approve of, looks like it might be quite hard even with a seed AI in hand, however. The issue is complicated by complete ignorance of the hypothetical AI&#8217;s ontology or the mechanisms by which the creators might have access to that ontology.</p>
<p>I do have some fixed points: I believe that any AI will probably at a minimum be able to reason about logical and mathematical truth, and I believe that many natural goals will want to use the subexpression &#8220;a human&#8217;s decision process&#8221; (for example appealing to a human&#8217;s decision process to make some judgment about input sequences).</p>
<p>This leads to a natural goal: if all I can talk to the AI about is math, how do I tell it &#8220;here is a human&#8221;?</p>
<p>Here is an attempt, though I will later give some reasons it may break down (for now the failures are left as an exercise for the counterfactual inquisitive reader).</p>
<p><span id="more-15"></span></p>
<p>Take a box with a hole in it. Inside the box, put a human together with an MRI scanner, a monitor, a keyboard, with their input and output channels wired through the hole. Let the human interact with the monitor/keyboard for a while, providing appropriate input; for example, hold a video conference between the person in the box and some people outside of the box, have the person in the box play a video game, etc. Let I be the input to the monitor, let O be the output from the keyboard, and let S be the MRI scan data.</p>
<p>Once we have (I, O, S) in hand, we can try to formally specify the human&#8217;s decision process as follows. Pick a function F from the universal distribution, which takes as input a stream of bits I and returns a pair of output streams which we will interpret as (O, S). Restrict attention to functions such that the kth bit of O and S depends only on the first k bits of I. Now condition this distribution on agreement with observation, namely, that when applied to the available prefix of I, the function outputs the observed prefixes of O and S. The resulting probability distribution allows us to estimate what the response of the human in the box would be, if it were provided with the input string I.</p>
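<p>(A toy sketch of the conditioning step, treating it as rejection sampling from a prior over candidate models; <code>sample_model</code> and the observation arguments are illustrative stand-ins rather than anything specified in the post.)</p>
<pre>
def condition_on_observations(sample_model, observed_input, observed_output, n_draws=100000):
    """sample_model draws a candidate function from the prior; keep only those whose
    output on the observed input prefix matches the recorded (O, S) prefixes."""
    consistent = []
    for _ in range(n_draws):
        model = sample_model()
        if model(observed_input) == observed_output:
            consistent.append(model)
    return consistent  # the surviving models estimate the human's response to new inputs
</pre>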
<p>Hopefully, the simplest function capturing the behavior of the human in the box is a physical simulation of that human. We include the MRI scan to ensure that there is enough data that the parameters for the physical simulation are less complex than a cruder specification (of course the model also has to describe how the MRI works, and it has to spend many bits describing errors and deviations from its physical simulation, but the high bandwidth of the MRI means that the physical simulation rapidly gains probability over a slightly less accurate model). We include the keyboard as an output channel as a way of pinpointing the human&#8217;s &#8220;intention,&#8221; without having to solve arcane technical problems (relying on whatever internal mechanism our brains use for giving our intentions control over motor function).</p>
<p>This gives us more or less the most general sort of access to a human decision process that we would want (we can ask questions and elicit responses with about as high a bandwidth as a human can support, and we can run the resulting simulation as many times as we like from an identical starting state). Best of all, it was specified purely in math. As long as we have access to the mathful parts of the AI ontology, this trick shows that we can specify a human decision process with little additional sweat (modulo the issues I will describe in subsequent posts).</p><br />  <a rel="nofollow" href="http://feeds.wordpress.com/1.0/gocomments/ordinaryideas.wordpress.com/15/"><img alt="" border="0" src="http://feeds.wordpress.com/1.0/comments/ordinaryideas.wordpress.com/15/" /></a> <img alt="" border="0" src="http://pixel.wp.com/b.gif?host=ordinaryideas.wordpress.com&#038;blog=30479764&#038;post=15&#038;subd=ordinaryideas&#038;ref=&#038;feed=1" width="1" height="1" />]]></content:encoded>
			<wfw:commentRss>http://ordinaryideas.wordpress.com/2011/12/14/specifying-humans-formally-using-an-oracle-for-physics/feed/</wfw:commentRss>
		<slash:comments>6</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/0a58766009bacac271106a2f92289d90?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">paulfchristiano</media:title>
		</media:content>
	</item>
	</channel>
</rss>
