Specifying “enlightened judgment” precisely (reprise)

Suppose that I have in hand a perfect model of my decision-making process, and I am interested in using this to define what I would believe, want, or do “upon reflection.” That is, in general I can use this model to define my current best guess as to the answer, but I might also be interested in talking about my “enlightened judgment,” if I knew all of the facts and considered all of the arguments and were more the person I wish I were and so on. Can we give a satisfactory formal definition of my enlightened judgment in terms of this literal model of my decision-making process?

This seems particularly important in the case of really hard questions, where even posing the question to me would be essentially impossible. For example, if you wanted to figure out which of two states of the universe I would prefer, it’s not clear how you could even articulate a question to ask me, or what you could show me so that I would be sufficiently informed to render a judgment. Yet it feels like there is some important sense in which I do have preferences between states of the universe, even those that are too messy and complicated for me to comprehend directly.

As in my last post, I’d like to add the disclaimer that I think it is unlikely that anything resembling this proposal will ever be used. Instead, I see the value as mostly coming from helping to show us where the largest difficulties might be hiding (in the plausible but still not too likely case where we need to resolve problems similar to this one).

I know of two broad approaches to defining a human’s “enlightened judgment”:

  1. Understand and define what it is that humans are doing when they decide, and when they err. From this directly define “what a person wants” or “what they would decide if they did not err” or some more complicated dynamic that depends on these and other similar notions.
  2. Describe “ideal” conditions for reflection under which we think that a person would actually reach their enlightened judgment.

Both of these approaches seem to me to have serious problems, especially as a philosophical account of what we mean when we talk about “enlightened judgment.” Fortunately I am not particularly interested in the philosophical question, and am more interested in pragmatic issues.

The main reason I care about enlightened judgment is that it would be nice if society were able to build a machine which can understand and behave according to our enlightened preferences about what it would be good for the machine to do. So when deciding between [1] and [2], my question is: which of these approaches would be better for telling a machine what to do?

At the moment, completing project [1] looks like it would be quite difficult; the available routes forward look like fairly long shots, and I’m far from convinced that the project is even philosophically coherent or should be expected to have a clean or satisfying objective. Project [2] has the virtue of being potentially very straightforward, though it also runs a much greater risk of failing to mean what we want it to mean.

In this post I want to say more about why I consider approach [2] relatively plausible. This means a lot of getting into the details of what those ideal conditions would look like, and what people would do under them. I won’t really say much about the philosophical issues at stake here; instead this solution will largely lean on claims like “Whatever we might be able to do as part of project [1], if the ideal conditions are good enough then these could also be done as one step in project [2].”

I think the largest remaining concern is about the motivations of the person reflecting. For example, if I try to use procedure [2] to reason about the enlightened judgment of someone who doesn’t care about the underlying issue (or moreover about what I think about what they think about the underlying issue) I would run into difficulties, since they wouldn’t actually use the opportunity to reflect. In fact it seems like there are even more severe difficulties along these lines, which at a minimum substantially restrict the set of conditions under which this procedure can be applied. My current best guess is that they aren’t deal-breakers, but they do seem worth thinking about more. Three issues in this space are discussed at the very end of the post.


In section 1 I specify the problem statement in more detail.

In section 2 I lay out the formal proposal.

The proposal is based on an interactive loop between a person and a powerful computing environment. In section 3 I describe the kinds of things which I expect the person to try to do in that computing environment. These aren’t technically part of the proposal, but (a) the person whose judgment is being extrapolated in this way will need to have some ability and inclination to implement this project in order for the proposal to be sensible, and (b) they are important for evaluating the extent to which the proposal is sensible.

In section 4 I sample some techniques that I think the person could use, and general reasons to be optimistic that they may be able to succeed at the goals outlined in section 3.

In section 5 I will evaluate some objections to this proposal.

1. Problem statement

I’ll take as given some process P that implements the input-output behavior of a given person (or group of people) inhabiting a comfortable environment, interacting with a computer terminal, over a period of (say) a month. The problem statement is then to define a process, potentially calling P as a subroutine, which outputs the considered judgment of that person (or group of people) about a given question Q.

It’s worth emphasizing that we are just looking for a definition, not an implementation. So none of the processes we describe need to be tractable, and they won’t ever get run. They will merely be reasoned about.

I’ll also imagine that the process P has appropriate access to facts about the world; for example, it might have the entire contents of the internet, and copious video recordings of our world.


The question Q is not in fact very important. Almost all of the work will be done by the transformation from P into a much more powerful process. I imagine the process P reflecting for a very, very long time prior to even looking at the question Q.

On this perspective, for example, it makes basically no difference what format Q is written in. We might even be able to skip providing the question Q at all, and leave it to the program to infer the context from which it is being invoked, and what answer would be appropriate to provide in that context.

2. The definition

Basic setup

Our definition will have the form of running an interactive loop between P and a particular computing environment E. For example, if E were an interactive Python session, we could define the result of P interacting with an interactive Python interpreter by describing a loop that alternates between advancing P by one millisecond and advancing E by one millisecond.

We will then output the result of that interaction. So the problem comes down to designing the computing environment E with which we want P to interact. Needless to say, the interaction between P and a normal interactive Python interpreter would not result in a particularly “enlightened” judgment.
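The alternating loop can be sketched in a few lines. This is only an illustration under loud assumptions: the step signature, the names run_interaction, toy_p and toy_e, and the idea of passing short text messages are all hypothetical stand-ins, and the toy "P" is a script that types one character per tick rather than anything like a model of a person.

```python
from typing import Callable, Tuple

# Hypothetical step signature: (state, incoming message) -> (new state, outgoing message).
Step = Callable[[object, str], Tuple[object, str]]


def run_interaction(step_p: Step, p_state, step_e: Step, e_state, ticks: int):
    """Alternate between advancing P by one tick and advancing E by one tick."""
    to_p, to_e = "", ""
    for _ in range(ticks):
        p_state, to_e = step_p(p_state, to_p)  # P sees E's last output and acts
        e_state, to_p = step_e(e_state, to_e)  # E sees P's last output and responds
    return p_state, e_state


# Toy stand-in for P: types one character of a fixed script per tick.
def toy_p(state, incoming):
    script, pos = state
    out = script[pos] if pos < len(script) else ""
    return (script, min(pos + 1, len(script))), out


# Toy stand-in for E: accumulates whatever was typed and echoes the buffer back.
def toy_e(state, incoming):
    buffer = state + incoming
    return buffer, buffer


p_final, e_final = run_interaction(toy_p, ("1+1", 0), toy_e, "", ticks=5)
```

The "result of the interaction" here is just the pair of final states; a real definition would have to say which part of that state counts as the answer.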

Because we are merely interested in defining a process, we can talk about a computing environment with unbounded computational power. So at a minimum, we could take E to be an interactive shell with infinite computing power.

It’s not clear that this really improves the situation much (also it’s not yet formally defined; “a computer with infinite computing power” isn’t quite a thing). But I think it opens up some possibilities. In the next sections we will define some improvements to this environment, and in the following section I will define it formally.

This will complete the formal definition of E, and with it the entirety of our formal definition of enlightened judgment.

Using infinite computation

We can make the situation a little better by providing a shell equipped with some helper functions.

  1. An implementation of P itself.
  2. An implementation of E itself.
  3. Various utilities designed to allow different instantiations of E and P to interact with each other, to save the states of implementations of E and P, etc.
  4. Whatever else we can think of. Utilities to inspect the specifications of E and P, to easily write modified versions, to inspect the available data, and generally to save labor and minimize the risk of error.

We’ll take E to be such a shell, and let P+ be the system consisting of P interacting with E.

Using these utilities, P+ can begin to implement some fairly powerful operations. For example, P+ could set up 10^100 copies of itself. Each of them could be sent a message describing a random approach to solving a particular problem. They could then participate in a tournament, where pairs of copies of P+ debated with each other about which came up with a better solution. The best solution could then be selected.

Also note that using these utilities, P+ can implement new useful utilities, and then pass off control to a new instantiation of itself which has access to an enlarged set of utilities. So hopefully there won’t be much sensitivity to the detailed choices.

(I certainly wouldn’t advocate this kind of brute-force approach, for reasons that will hopefully be clear by the end of the post, but it gives some indication of what you could do using these utilities and infinite computing power.)
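To make the shape of the tournament concrete, here is a minimal sketch of a single-elimination bracket. Everything in it is an assumption for illustration: the function names, the pairing scheme, and above all the "debate", which is replaced by a trivial numeric comparison (random guesses at the square root of 2, judged by how close their square is to 2).

```python
import random
from typing import Callable, List, TypeVar

T = TypeVar("T")


def tournament(candidates: List[T], better: Callable[[T, T], T]) -> T:
    """Single elimination: pair candidates off round by round until one remains."""
    if not candidates:
        raise ValueError("need at least one candidate")
    pool = list(candidates)
    while len(pool) > 1:
        next_round = [better(pool[i], pool[i + 1]) for i in range(0, len(pool) - 1, 2)]
        if len(pool) % 2 == 1:
            next_round.append(pool[-1])  # the odd one out gets a bye
        pool = next_round
    return pool[0]


# Toy stand-in for a debate between two copies: keep the guess whose square is closer to 2.
def debate(a: float, b: float) -> float:
    return a if abs(a * a - 2) <= abs(b * b - 2) else b


rng = random.Random(0)
guesses = [rng.uniform(0.0, 2.0) for _ in range(1000)]
winner = tournament(guesses, debate)
```

With n candidates this takes about log2(n) rounds, which is one reason a bracket is a natural shape even for an astronomically large pool. It only picks the true best candidate if the judging is consistent, which is exactly what a real debate might not be.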

Defining infinite computing

There are still some details missing from this specification. For example, what is this computer with infinite computing power actually doing? How can it reason about the output of new copies of itself? Aren’t there problems with diagonalization, no matter how you define it?

A first pass definition would be a machine which can run arbitrarily long computations, but which hangs if it goes into an infinite loop. Such a machine can call itself, and is in many ways quite satisfactory. However, if the machine actually crashes it would do a terrible job of reflecting the enlightened judgment of P, and we should endeavor to make this outcome as improbable as possible.

I should say at the outset: I don’t think these problems are serious ones at all, I think they are essentially annoying technical details. I think the standard for resolving them is: resolve them sufficiently well that we can completely ignore them.

One way to prevent this outcome is to change the machine so that it throws an exception if it enters an infinite loop. Unfortunately this attempted definition is impossible due to a standard problem with diagonalization: what is the output of the program f which calls itself, returns 0 if that call enters an infinite loop, and enters an infinite loop if that call outputs 0?

A compromise solution is to allow “protected” calls, which throw an error if they reach an infinite loop, but to ensure that the protected calls can’t form a cycle. We can then require that each function called from the interactive shell E is protected, making it impossible for the overall system to throw an error. The tradeoff is that we now need to bound the number of times that P can delegate authority to a copy of itself (though we need not bound the number of copies that can be spun up in parallel).

One way of doing this is to simply label each such call with an integer, and to require these integers to decrease. This in turn requires labeling each invocation of P+ with an integer, such that the integers do not increase. For example, we might use P+[100], which isn’t able to invoke further copies of P+[100] but is allowed to invoke copies of P+[99].
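A minimal sketch of the integer labeling, assuming a hypothetical Agent class standing in for P+[n] and a hypothetical invoke method as the only way to start a copy. The point is just that the strict decrease is checked at every call, so no chain of delegation can be longer than the starting label.

```python
class LevelError(Exception):
    """Raised when a copy tries to invoke a copy at an equal or higher level."""


class Agent:
    """Hypothetical stand-in for P+[level]."""

    def __init__(self, level: int):
        if level < 0:
            raise ValueError("level must be non-negative")
        self.level = level

    def invoke(self, child_level: int, task):
        """Run `task` on a fresh copy whose label is strictly smaller than ours."""
        if not (0 <= child_level < self.level):
            raise LevelError(f"P+[{self.level}] cannot invoke P+[{child_level}]")
        return task(Agent(child_level))


def delegation_depth(agent: Agent) -> int:
    """Delegate as deep as the labels allow and count the hand-offs."""
    if agent.level == 0:
        return 0  # P+[0] cannot delegate any further
    return 1 + agent.invoke(agent.level - 1, delegation_depth)


depth = delegation_depth(Agent(100))
```

Nothing here limits how many copies a single agent starts side by side; only the depth of sequential delegation is bounded.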

Since we never really cared about infinities, but were happy enough with astronomically large finite numbers, we could just use P+[2^10000000], and I would be shocked if the distinction ever mattered. To make things even safer, we can index E’s by arbitrary ordinals. So for example, we can imagine P+[ω] which can invoke copies of P+[n] for any integer n (e.g. it can write a loop which instantiates a copy of P+[n] in the nth iteration).

And we can imagine P+[ω+1], which can invoke P+[ω], and we can go on in this way to P+[2ω] which can invoke a P+[ω+n]; to P+[ω^2], P+[ω^ω], and much larger orderings. For our purposes, I think this capability is completely sufficient and indeed even overkill.
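The ordinal labels can be illustrated with a small sketch that only covers labels of the form ω·a + b, represented as a pair (a, b). The representation and the names can_invoke and descend are assumptions of this sketch. Python compares tuples lexicographically, which happens to agree with the ordering on these labels.

```python
from typing import Tuple

Ordinal = Tuple[int, int]  # (a, b) stands for ω·a + b; an assumption of this sketch

OMEGA: Ordinal = (1, 0)


def can_invoke(parent: Ordinal, child: Ordinal) -> bool:
    """A copy may only invoke copies carrying a strictly smaller label."""
    return child < parent  # tuple comparison is lexicographic


def descend(start: Ordinal, drop_to: int) -> int:
    """Follow one strictly descending chain from `start` down to 0 and count its steps.

    A limit label (b == 0) has no immediate predecessor, so at each limit the
    chain has to commit to some finite label; here it always picks `drop_to`.
    """
    a, b = start
    steps = 0
    while (a, b) != (0, 0):
        if b > 0:
            b -= 1
        else:
            a, b = a - 1, drop_to
        steps += 1
    return steps
```

So P+[ω] may start at any finite label it likes, but once it has chosen one the remaining chain is finite. Every descending chain terminates even though there is no bound, fixed in advance, on its length.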

From now on I am just going to go back to pretending that we have access to unlimited computational resources.

Of course we should also work with an idealized machine possessing infinite memory etc., and a sufficiently high-level language that we can run it on this idealized machine without running into technical difficulties.

3. Goals

It remains to speculate about what P might actually do, placed in this “ideal” environment. Indeed, it seems that P’s state of ignorance is roughly as severe as our own, and it’s not clear that we have offered any plausible route towards defining an enlightened judgment at all. In this section I will describe some of the projects and goals that I think P could pursue, each of which I think would leave it in a substantively better position than we are currently in.

Infrastructure improvements

One simple project for P+ is improving its own infrastructure: building a better programming language with more useful tools, designing a nicer environment for itself to live in, and so on. P+ can target changes that are extremely unlikely to change its enlightened judgment in a substantive or morally relevant way, but which make it more and more able to quickly and efficiently communicate its intentions formally, write new programs, explore new possibilities, and so on.

These improvements can build up across a massive number of generations. If this were the safest way forward, P+ could avoid doing anything else until its ability to quickly write reliable code and visualize the behavior of programs was radically beyond anything we are familiar with today.

These might also include improvements in P itself. For example, if the program P only captures a few months of time before needing to spin off another copy, P+ might work towards a version of P which persisted for years or longer (though this would naturally require more improvements in the environment, if we’d like to avoid insanity). P+ could also experiment with more approaches for sharing information between different instances of itself.


P+ is equipped with some amount of data about the outside world. The sparsity of this data presents a challenge for P+ contributing to progress in the empirical sciences in a normal way. However, P+ has many advantages (see section [4]), and if it were possible to overcome the data sparsity problem it would likely be able to achieve radically more progress than we are familiar with.

For some areas of science, for example theoretical physics (or the psychology or group dynamics of copies of P, economics of societies populated by copies of P, etc.), P+ might be able to make huge amounts of progress using the data it has available (over the course of billions of subjective years).

In other domains, P+ can circumvent data sparsity problems by investigating simulations. Because P+ has access to unbounded computing resources, it can pursue simulations at a greater scale, higher degree of fidelity, and under a greater range of parameters than we would normally consider possible.

For example, P+ can consider vast numbers of possible universes governed by different laws of physics, and invoke copies of itself to inspect each of them, at many points in time and in many different ways. A sufficiently exhaustive search would eventually turn up worlds that resembled P+’s home in arbitrary detail, and a copy of P+ assigned to inspecting that world from an appropriate vantage point would be able to notice this. This could give P+ access to a simulated copy of its home, together with infinite time to investigate, understand, and modify that simulation. It could also give P+ access to many additional individuals and groups, which it could consult or take on board as peers.

Perhaps most relevantly, if any civilizations anywhere in the space of possible civilizations resolve the problems that P+ is interested in, it could inspect their solutions, consider them at infinite length, and adopt them if they looked like adequate solutions. This suggests that at the point where P+ is equipped to explore many simulations of different physical laws, it may already be in a position to “win.”

Human enhancement

Especially if P+ were able to perform biology in simulation, but even if it were only able to inspect and work with the code defining P itself, P+ could aim to improve its own capabilities. Over a long time period, these small changes could amount to radical increases in capabilities, while still preserving the spirit of P’s original intentions.

For example, P+ could improve its own short-term or long-term memory, its ability to think carefully or quickly, its lifespan. P+ could simply acquire more knowledge; it could improve its ability to represent that knowledge well, to quickly train itself in new areas; it could build up more strategies for problem-solving, research, and planning, and frameworks for developing and reasoning about knowledge.

Psychology and philosophy

The same psychology and philosophy projects that we might pursue in order to define enlightened judgment can also be pursued by P+. If it were to succeed at such a project it might simply be able to directly define its enlightened judgment and then use its computational resources to evaluate that. If not, these projects could still inform its approach to the other projects mentioned.

P+ could work to become more aware of its own motives, and work to enshrine them in ways that reduced the risk of value drift while pursuing other projects. And so on.

Artificial intelligence

The same artificial intelligence projects that humans might pursue in order to build smarter agents that shared their values could also be pursued by P+. Even if it were impossible to build agents that shared P’s values when operating autonomously, P+ would have access to a very broad range of approaches which leveraged the availability of infinitely many copies of P+ (whether as overseers of other agents, participants in an institution, or components of an AI).

4. Techniques and advantages

When pursuing any of the projects above, P+ would have a number of very significant and distinctive advantages, suggesting it is more likely to succeed (and to succeed in a way more faithful to its goals) than any real group of humans. Most of these advantages would be available from a very early stage.

Taking it slow

Perhaps most importantly, P+ has the luxury of nearly unlimited time. In the real world, there are competitive pressures that lead us to work quickly, sometimes risking a probability of error in exchange for faster progress. We also face risks of negative outcomes which force us to move more quickly than we might otherwise like. Even if P+ had no other advantages, I think this one would probably be significant enough to carry the argument. There are a large number of advantages from proceeding slowly if your goal is maximizing probability(success), including the ability to quickly back off from potential problems, to look at something from many perspectives, to take very small and safe steps, to develop as much machinery as is helpful to resolve the problem, and so on.

A million pairs of eyes

P+ also has the luxury of instantiating an almost unlimited number of critics of any proposal that it is considering implementing. That is, it can generate a million copies of itself and ask each one to look in a certain direction for flaws, until finally being presented with the most compelling issues that any copy identified. It can also look over (or invoke copies to look over) processes at a very high level of detail, for example stepping through code line by line, or watching another copy’s judgment unfold at 0.001x speed. When arbitrating a dispute between two copies of P+ that came to different judgments, the arbiter can invoke a thousand times more resources than either copy and make manipulation very difficult or impossible. If motivation allows, this process can be taken to comical lengths, for example developing an entire branch of mathematics or an entire civilization to characterize the behavior of a particular program with extremely high confidence.

Recursion and tail recursion

A very simple but important fact for P+’s ability to get things done in limited time is that P+ can recurse in two ways. It can spin off copies to work on subproblems. But if the end of its time approaches and the overall problem has not yielded, P+ can also spin up copies to work on the original problem, now endowed with some additional info that P+ has acquired, the ability to have discussions with copies of the current version of P+, and so on. In each of its lifetimes, P+ only needs to get incrementally closer to its goal.
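The second kind of recursion, handing the original problem to a successor along with what has been learned so far, can be sketched with a toy problem. The names lifetime and solve, the step budget, and the choice of problem (finding the smallest n with n·n at least some target, by bisection) are all assumptions made for illustration. Each "lifetime" is only allowed a few steps, and the "additional info" it passes on is just the narrowed search interval.

```python
from typing import Optional, Tuple

Notes = Tuple[int, int]  # the interval still known to contain the answer


def lifetime(target: int, notes: Notes, budget: int) -> Tuple[Optional[int], Notes]:
    """One bounded lifetime: narrow the interval by at most `budget` bisection steps."""
    lo, hi = notes
    for _ in range(budget):
        if lo == hi:
            break
        mid = (lo + hi) // 2
        if mid * mid < target:
            lo = mid + 1  # the answer lies strictly above mid
        else:
            hi = mid  # mid itself might be the answer
    answer = lo if lo == hi else None
    return answer, (lo, hi)


def solve(target: int, notes: Notes, budget: int, generation: int = 1) -> Tuple[int, int]:
    """Work for one lifetime; if unfinished, hand the same problem to a successor."""
    answer, notes = lifetime(target, notes, budget)
    if answer is not None:
        return answer, generation
    return solve(target, notes, budget, generation + 1)  # the successor inherits the notes


answer, generations = solve(10**6, (0, 10**6), budget=3)
```

No single lifetime has enough steps to finish, but each one leaves the problem strictly smaller, so the chain of successors eventually does.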

Reviewing outputs

P+ is free to run whatever process it wants, and then to apply final sanity checks to the output, ensuring that it is reasonable and avoiding mistakes.

Infinite computing

Having the option of using infinite computing seems likely to be a significant aid for P+, though it’s not as obviously and materially helpful as the other consequences mentioned. For example, if we think that understanding psychology is a useful step towards defining enlightened judgment, and that unlimited computational resources or unlimited numbers of simulated people could be used to improve psychology, then P+ would be at an advantage. If we thought it was possible to design an AI that shared our values and could make better judgments, but that it would only make such good judgments by being astronomically slow, then that would be an option for P+ but not for us. Moreover, P+ would actually have the ability to do experiments with such AIs.

5. Objections

In general when evaluating a proposal like this, it seems instructive to consider the tradeoff between making a more careful or complicated definition ourselves vs. using a simple definition with an extra level of indirection (i.e. delegating the problem). The natural risks of using the extra level of indirection are that (a) you might mess up the indirection itself, (b) the people you delegate to might mess up. Delegating less labor means taking more chances to mess up yourself, and leaving fewer to the people you are delegating to.

In this case we are delegating to the process P+, which seems much better equipped to do a good job than we are, and whose time is much less valuable. So my inclination is to delegate as much as possible while we are at it. Even implementing relatively basic functionality seems to go against our comparative advantage: it increases the time required by the project, and our probability of messing up is substantially larger than P+’s. (That is, we should stop improving P+’s environment once we think that P+ can do it more reliably than we can.)

That said, there are a few natural concerns:

  1. We might mess up the first level of indirection; this is adding an extra layer of complexity, and therefore unnecessarily increasing the probability of a mistake somewhere.
  2. We have some capabilities that P+ lacks, in particular we have access to a research community with many diverse humans rather than a small group of people. We also have good empirical evidence that the research community can make progress, whereas we are less sure about this group.
  3. P+ is being asked to solve a much harder problem, e.g. since it probably has to solve physics and simulate our universe as a preliminary step.
  4. Lots of things have to go right for this to work. That just seems generically improbable.
  5. Although P+ is better equipped than we are to make progress, it may not be so inclined.

Of these, I consider the last the most serious objection. My overall view is that unless we had a clearer sense of the issues in the last bullet point we should be a bit concerned about implementing a procedure like this, and we would at a minimum need to take a lot of care in choosing people. My expectation is that these difficulties afflict most procedures for defining enlightened judgments.

Messing up the indirection

The relevant comparison we face is:

  • Our ability to correctly specify P+, and P+’s ability to specify P’s enlightened judgment.
  • Our ability to correctly specify P’s enlightened judgment.

I think that P+ is substantially better able to specify P’s enlightened judgment, so we are gaining on that term; but how much do we lose by needing first to correctly specify P+?

My own best guess is that we don’t lose that much. Specifying P may be complicated but is relatively easy to test and is basically a common difficulty of any sensible approach (though in many settings “specify” may not involve so much precision, or may be able to take a very different approach). Specifying E appears to be extremely easy. Moreover, most ways in which the system could fail (which would result in the enlightened judgment being incoherent or nonsensical) would be easily detectable. The only failure cases that may be subtle to detect are those in which the process is “hijacked” by a goal-oriented agent, who is motivated not only to manipulate our definition of P’s enlightened judgment, but also to conceal this hijacking.

I think this is plausible as a result of messing up the indirection (rather than P+ messing up), but not too likely. The main way it could happen is by the specification of P itself being hijacked, a possibility I have discussed in the past and which I think can probably be avoided if it is kept in mind.

We have capabilities P+ lacks

This complaint becomes more severe the more minimal the specification of P becomes. If the specification of P only persists for an hour, is a little bit crazier than the original, and lives in a psychologically straining environment, then there may be many respects in which we are a more informed decision-maker than P+. If the specification of P involves a reasonable range of researchers in a comfortable environment, then the gap becomes smaller but still exists. In between, the risks are in between. I don’t know how hard specifying more realistic and more varied versions of P would be, though my guess would be “not that much harder than specifying any P at all.”

Over the very long term I suspect issues like the psychological stability and diversity of perspectives would have substantial value, e.g. for solving the kind of philosophical problems that P+ might want to solve. However, the more relevant question is instead how relevant these capabilities are for P+ in the first part of its task, where it is attempting to improve on the original specification of P.

One approach to this improvement involves the brute-force search through the space of possible worlds to find a world like ours, identifying people like us within it, and then expanding P+ to consult those people for advice and incorporate them into the decision-making process. This looks to me like much more of an engineering problem than a philosophical problem, on which the value of a diversity of perspectives is much reduced (and can apparently be largely replaced with more effort by P itself). And if this problem is successfully resolved, then our advantage over P+ is largely eroded.

P+ is being asked to solve a really hard problem

Even once it’s done all of the low-hanging improvement and enhancing and so on, it may seem that the problem P+ faces is just way harder than the problems we face. The main way in which this problem may be more challenging is that it might require finding a simulation of a world like our own (a problem which we would consider fantastically difficult to actually solve here in the real world).

I think this is mostly a failure of imagining the impact of an astronomical number of people or having a billion years. Theoretical physics seems like a relatively easy problem to solve if you are allowed to do human-guided brute force searches over possible hypotheses. Searching through all of the plausible physical theories seems like a relatively easy problem to solve if you can simulate every program of less than a million lines of code, and have someone look by hand through every one of them searching for signs of life.

P+ has to do lots of things right

In some sense the success of this proposal rests on a substantial conjunction, with us successfully specifying P+, P+ successfully carrying out a number of projects, defining successors who successfully carry out further projects, and so forth.

But in another sense the success of this proposal mostly rests on a massive disjunction. P+ is being given an incredible amount of time to think and explore the possible approaches to its problem, and it needs only one of these to work. And those designing the protocol and hoping to actually deploy it would similarly have access to a wide range of procedures to improve and test the protocol.

In a similar sense, human technological progress to date seems to have depended on a huge conjunction of individual capabilities. We must develop fire, then better furnaces, then better metals to make better furnaces, then engines to make better engines to facilitate increasingly intricate chemical processes to make tiny machines to make even tinier machines and so on. The reason that we still see progress is that there are so many possible paths forward that even if 99% of them fail at every stage, there are still plenty that succeed.
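The arithmetic behind this can be sketched under one strong simplifying assumption that the argument itself does not make: that the paths at each stage succeed or fail independently, each failing with probability 0.99. The function names are hypothetical.

```python
def p_stage(paths: int, p_fail: float = 0.99) -> float:
    """Chance that at least one of `paths` independent attempts at a stage succeeds."""
    return 1.0 - p_fail ** paths


def p_all_stages(paths: int, stages: int, p_fail: float = 0.99) -> float:
    """Chance of clearing every stage, with `paths` fresh attempts available at each."""
    return p_stage(paths, p_fail) ** stages


one_path = p_all_stages(paths=1, stages=10)       # a pure conjunction
many_paths = p_all_stages(paths=1000, stages=10)  # a disjunction at every stage
```

With a single path per stage the ten-stage conjunction is hopeless, while with a thousand paths per stage it succeeds with probability above 0.999. Real paths are correlated, so this overstates the effect, but it shows why the number of available routes matters more than the per-route failure rate.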

I expect P+ to be in fine shape for similar reasons. It also seems quite plausible that P+ could resolve its problems without taking many steps, for example by just sitting and thinking for 30 years.

P+ may be capable, but not motivated

Trigger warning: this section is much more aggressively weird than the rest of the post. Thinking about it may be unproductive, or in rare cases distressing. If you suspect you might be distressed by something in a blog post by me, probably stop here. If you suspect that you might lose respect for me because I’m too weird, you should stop here too.

So far I have only made arguments that P+ would be capable of doing much better things than we would. But this would be little use if P would not be motivated to apply these capabilities in the intended direction. There are a number of particular reasons to be concerned about this possibility.

Why be motivated at all? It’s not clear why P+ would ever do anything to reflect P’s enlightened preferences, instead of e.g. playing games all day or going off and building a rich and interesting civilization. There is an implicit assumption that P cares about making its enlightened preferences known and so will put in the effort to do so.

This seems to be a rather severe difficulty with using this process to define the enlightened judgment of most people in most cases. The phrase “what would I believe if I thought longer…” is actually doing a lot of work already via the implicit counterfactual “what if I gave a damn about the question at issue…”

I don’t see a way around this problem, other than simply restricting the applicability of the scheme to people who care about having their enlightened preferences known. I think this points to a more general difficulty with attempts to define enlightened judgment, but not one I’ll get into.

I would note that the lives of the people in the simulation can really be quite good, especially if P is a model of a large group. (If P is not a model of a large group, their selfish interest may provide them a further reason to bootstrap up to the phase where P+ can invoke simulations of many different individuals). That is, the effectiveness of the scheme is only very modestly affected if the participants decide to enjoy themselves most of the time, and only work for a few hours a day. Depending on their views about simulations, they may also be able to “retire” when their month is up, running simulations of themselves under whatever conditions they like.

So it’s not necessarily the case that someone needs a massive motivation to make their enlightened preferences known. They may already be able to live a pretty awesome life, and the question is merely whether they care enough about making their enlightened judgment known that they are willing to make modest sacrifices. I think that for many people, knowing that this is “their purpose,” they would be happier spending a small fraction of their time working on it. But nevertheless, this is still a material restriction, and the protocol would be expected to work better with people who cared more about the project.

Views about simulations. One way in which P differs from the original is that it has different beliefs about its own nature. For example, suppose P were a model of me, the me of this very moment, and we were to ask “what would P do, if they found themselves in a featureless room with a computer terminal?” I don’t expect to find myself transplanted to a featureless room with a computer terminal. And so, if I found myself in the room, I would respond with extreme surprise.

But after a while surprise would give way and I would start to ask further questions. In the hypothetical, my experiences would seem just as real to me as they do right now. I would wonder, why is that? Have I been wrong, about what it is that makes experiences real; about what it is to exist? The original thinks that obviously P is just being considered in a counterfactual; P may feel differently about it.

Predicting what I would actually do is a hard question. It’s not the kind of situation in which people regularly find themselves. For many people, I think that once they got over the shock they would assume that they were in a simulation, or that there was some kind of trick, or maybe in weird cases they would revise their metaphysics. In any case, they would become very interested in the nature of the reality they now found themselves in. Myself, I like to think that I’m sensible enough that I’d correctly infer that I’m merely in a hypothetical and I’d make decisions appropriately. But it’s hard to say.

And what would they infer about the reality they found themselves in? It’s hard to say, but it’s certainly not clear that their behavior would be anything like the proposal we had in mind. Perhaps they would just even more strongly accept the self-interested argument to enjoy themselves, believing that they are probably in a simulation rather than an elaborate thought experiment.

Maybe instead they would doubt that their simulation is as it appears, and be concerned that a simulator would make their future a happy or sad one based on what they did. They would then be in the business of speculating about the simulator’s motivation, and it’s hard to say how that would end. This is the human version of the problem I described here; to the extent that humans can avoid such problems it shouldn’t really be any more of a problem in this case than e.g. for the people actually implementing P+, who might as well also be in simulations. This makes me less concerned about the problem, though I understand that putting someone in such a radically unfamiliar situation is more likely to lead to trouble. (See the discussion at the end of this subsection.)

(This concern was first articulated as such by Vladimir Slepnev, though similar issues also cropped up in early discussions of the idea.)

Overall this difficulty makes it even more clear that this procedure couldn’t be used to extract most people’s enlightened judgments. In addition to possible concerns about capability and motivation, it’s also the case that many people would probably behave unpredictably once they found themselves in such an implausible situation.

If we are in the regime of just wanting to find anyone who responds to the situation in the intended way, then it becomes plausible that we could find good candidates.

The most principled solution to these difficulties would be to find the perfectly altruistic P that cared only about its enlightened judgment being known, but this seems like a stretch. The next best solution would be to either make a precommitment to not respond to such blackmail, or to find a human who used the kind of decision procedure that led them to reject blackmail on principle. A third best option would be to find humans who simply don’t buy the simulation argument for any number of reasons, e.g. they have a higher prior on being in a hypothetical than in a simulation, they use a very different framework for reasoning about these questions, etc. That said, if there were philosophical disagreement about the relevant issues, then there would be a significant risk that on reflection P’s views would change, and the problem would reemerge. I think that these are realistic ways to solve the problem at the stage of selecting a person or group, but they considerably limit the pool and are a further thing to keep in mind.

That said, I would not be surprised if after only a short while people came to adjust to the hypothetical as the new normal, accept plausible philosophical assumptions concerning hypotheticals, and generally return to the behavior which they originally intended, mostly ignoring the prospect of being in a simulation in the same way that we do now (and just as today, they could rest rationally secure in the knowledge that they are doing the right thing on altruistic grounds).

A positive aspect of this concern is that it involves confronting a fixed problem rather than an adversary who may behave in an unknown way. That is, we are concerned about a simulator manipulating P not by revealing any new data to P, but by the mere situation changing P’s outlook. This is a prospect which can be more plausibly evaluated before the fact. It also may be possible to run tests in low-stress environments to understand how (smart agents expect) people to behave in these situations. Since all that matters will typically be how smart agents think the process would unfold, the experiments directly reveal the characteristics we care about. Of course the actual behavior of manipulating simulators may vary between low-stakes and high-stakes situations, but the psychological response of humans involved is less likely to vary along with it.

The problem is sharpened if we expect P+ to eventually learn a great deal about its universe, for example by simulating many universes and inspecting their properties. If in fact simulators think that it would be worthwhile to simulate it to affect its judgments, then P+ would eventually learn this by seeing it rather directly (and simulators, realizing this, would be more keen to do so). At this point it is plausible that P+’s behavior would be under further pressure, as it might be able to see the prospect of extortion directly. Subprocesses that noticed this fact may be pressured to report it to their parents or to manipulate their parents in other ways, and it could propagate out to the original agent. There are various complications in this scenario I don’t want to dwell on, but I think the basic point goes through. I think this particularly extreme form of the problem can be in large part avoided by P constructing explicit precommitments (or taking other measures) in advance of coming into contact with this kind of observation.

Overall I think this issue is kind of muddled, and my current status is “it seems problematic but definitely not a clear deal-breaker.”

Change of identity. Even if P didn’t respond dramatically to the possibility of being simulated, over time we should expect its identity to diverge from the identity of the original person. This might cause its preferences (or perhaps beliefs) to change. As Wei Dai points out, this seems particularly and almost necessarily true for selfish preferences. P’s self-interest is no longer aligned with the original person’s self-interest, even if the model is perfect. So instead of eliciting the original subject’s enlightened preferences, we’ll get P’s enlightened preferences.

I consider this issue problematic, but not massively so. There are a number of reasons for my nonchalance:

  • It seems probably possible to find people who are willing to participate in the experiment and who are pretty happy to “do their job” of reporting the original’s preferences, even if on reflection the original would conclude that they care about selfish values. This is most difficult if people reliably converge to similar views upon reflection and these views are mostly selfish; I consider this plausible but not too likely, and even conditioned on this I think that finding someone who would execute the plan is hard but not impossible.
  • From my current vantage point, I am not so concerned if a simulation of me replaces me in the universe, so even if the simulations care a lot about themselves per se (and want to replace me with themselves) I don’t mind too much. From behind the veil of ignorance, it’s not something worth sacrificing value for.
  • I don’t think that very narrowly selfish values are a large part of what the resources in the universe will get spent on, due to diminishing returns and expanding circle type dynamics. I think that everyone involved can coexist relatively peacefully with respect to selfish preferences. I recognize that this view could easily be overturned by evidence.

Overall I would agree that this is a philosophical issue worth thinking more about if this approach actually looks like something we might do, and it’s not one which can be easily delegated via this procedure.

