My current preferred formalization of extrapolating an agent’s preferences rests on imagining what would happen if that agent were provided with an idealized environment in which it could undergo an extensive process of reflection. This is clearly not a completely satisfactory account, though there is uncertainty about whether it is “good enough” for the intended use case.
One crisp difficulty is the following: this approach relies completely on the agent wanting you to know its extrapolated preferences.
If the agent didn’t want this (for example, if it expected that in general it would be worse off if other agents knew its extrapolated preferences), then the proposal breaks down entirely. In that case the agent will simply decline to participate in the hypothetical thought experiment, and the answer to the question “What would they decide if they had the opportunity to reflect at length?” is “nothing.”
As far as I know, this is a problem for all available formal specifications of an agent’s enlightened judgment, idealized beliefs, extrapolated preferences, or whatever we call it. I now think that if I had a formal answer to this question I would understand the issue substantially better. So this is my first challenge: how can you define an extrapolation or idealization of an agent’s preferences if the agent doesn’t want you to?
The point of this question is not that extrapolating an agent’s preferences against its interests seems useful; rather, it’s that the kind of proposal I advocate faces various potential problems that might be broadly classified as “motivational problems.” To make those difficulties clearer, it seems useful to consider this extreme case.
There are various ad hoc ways around this difficulty, but they seem very problematic. For example, we could consider the agent’s behavior in a counterfactual that could actually arise, where it had some incentive to show its true colors because it would actually be making meaningful choices. Or we could consider the agent’s behavior in a counterfactual where it didn’t understand what was going on, or where it incorrectly believed that its output in the thought experiment would be used in a way it would be happy with rather than one it would be unhappy with. But neither of these seems at all compatible with the agent learning a lot; both seem difficult to define because of the tricky counterfactuals (though maybe that is a reasonable problem to attack, if the other issues turn out not to be issues); and both seem very fragile and likely to break for unexpected reasons.
There is an alternative view, which says that the way to formalize “what people want” or “what they believe” is to develop a better understanding of how the human brain works or how decision theory works, and then to extract “what someone wants” directly from their behavior. This motivates further investigation of just what the brain is doing, or at least optimism that the philosophical problem will be in better shape once these empirical investigations have gotten underway. I see the charm in this proposal, but I have a fairly serious gripe: I can’t imagine any realistic outcome of the empirical investigation that would answer this question.
I can imagine outcomes of the empirical investigation such that the philosophical problem goes away; it’s just that these don’t seem realistic. For example, we might learn that the human brain is actually just a bunch of heuristics implementing a very simple algorithm, of the form “Build a model of the world, decide what actions lead to outcome X, and then perform those actions.” But this seems fairly implausible based on my understanding of the brain (I think most experts agree), and beyond that it would be deeply philosophically surprising.
What I expect we will see instead is that the human brain is a collection of many interacting subsystems under selection pressure for improved reproductive fitness (which means, locally, that the brain and each of its parts is under selection pressure to fulfill a wide range of instrumental goals, to varying degrees), though even that picture is not very good. And then… what? I don’t see how we would extract an account of which beliefs are “correct” on this picture, much less of what we “really value.”
Is there an intermediate picture, on which this understanding of the brain and decision theory would lead to a clear picture of human values? This is my second challenge: what could we plausibly learn about the brain and decision theory such that we could define value extrapolation in a more direct way?