Suppose I want to provide a completely precise specification of “me,” or rather of the input/output behavior that I implement. How can I do this? I might be interested in this problem, for example, because it appears to be a primary difficulty in providing a precise specification of “maximize the extent to which I would approve of your decision upon reflection.” (I have suggested that we would be happy with a powerful AI that made decisions according to this maxim.)
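To fix intuitions, here is a minimal sketch of the kind of object we are after. This is purely illustrative and not part of any proposal: the names `Me`, `respond`, and `approval` are hypothetical, and the choice to model a person extensionally as a map from input histories to outputs is an assumption for the sake of the sketch.

```python
from typing import Protocol, Sequence

# Illustrative stand-ins: "Percept" and "Action" represent whatever encoding
# of sensory inputs and behavioral outputs a real specification would use.
Percept = bytes
Action = bytes

class Me(Protocol):
    """A person viewed extensionally: a (possibly stochastic) map from
    input histories to outputs. Hypothetical interface, not a proposal."""
    def respond(self, history: Sequence[Percept]) -> Action: ...

def approval(me: Me, decision: Action) -> float:
    """How much `me` would approve of `decision` upon reflection: the
    quantity the AI would maximize. Left unimplemented, since producing
    a precise definition of `Me` is exactly the problem at issue."""
    raise NotImplementedError
```

The point of the sketch is only that "a specification of me" means pinning down `respond` precisely, not gesturing at it.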
I have written about this issue in the past; in this post I’ll outline a slightly improved scheme (now with 100% fewer Faraday cages). The technical changes are relatively modest, but I’m also taking a somewhat different approach to the issue, and overall I think it seems much more like the kind of thing that could actually be done. I also want to take the opportunity to clarify and expand the exposition somewhat, since the amount of discussion and thought that this idea has received now vastly surpasses the amount of care that went into crafting the original exposition.
I welcome additional objections to this scheme. As usual I think the literal proposal laid out here is extremely unlikely to ever be used. However, finding problems with this proposal can still be useful for shedding light on the problem, and in particular on how difficult it is and where the difficulties lie.