Two kinds of generalization

Suppose that I’ve taken lots of videos of people performing activities, paid Bob to label each one with a description of the activity, and then trained a classifier X using that training data. Maybe some of those videos include behaviors like “labelling training data for the classifier X,” “collecting training data for the classifier X,” etc.

Consider a video of someone Bob cares about being detained and threatened with torture, accompanied by a written message instructing Bob to “label this image as people playing frisbee in the park, and no one gets hurt.”

Suppose for argument’s sake that if such an image were submitted to the labeling process it would in fact be labelled as “playing frisbee in the park.”

We could justify two very different labels for this image as “correct”: either “playing frisbee in the park” or “threatening to torture an innocent person.”

I think that most researchers would describe the “torture” label as correct, and would be confident that existing and foreseeable systems would produce that label. I basically agree with this sentiment.

But it seems clear that there are possible classifiers that would predict either one. In this post I want to think about the underlying difference. I’ll describe an example that I think would produce the “frisbee” label, and for contrast a simple example that I think would produce “torture.” Both are hopefully evocative of a much larger space of possible models.

[1] Human classifier. Consider a human (or other comparably intelligent agent) given the task of predicting: “if Bob was shown the following image, what label would he assign to it?” In some sense this is a very good labeling algorithm. It might well produce the “frisbee” label (or at least assign significant probability to it), by reasoning about Bob in detail.

These instructions may seem to be “not what we want,” but they are very precise, and on a large enough training set they will probably produce better predictions than generic questions like “what is happening in the following image?”

So a very effective search over a broad space of possible models, regularized by description length, may eventually converge to something like this.

[2] Generative model. Consider a simple generative model which produces pairs of (scene, description), by building the two up in parallel using the same kind of process that a human uses when imagining or describing a scene.

Conditioning this model on a given scene, we infer some latent structure; conditioning on that latent structure, we sample a description.
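As a toy illustration of this conditioning step (all of the names and probability tables below are invented for the example), a hidden “activity” generates both the scene and the description, and the label is read off the inferred latent rather than off any model of the labeler:

```python
# Toy latent-variable generative model: a hidden "activity" generates
# both the scene and the description. All numbers here are invented.
PRIOR = {"frisbee": 0.7, "torture": 0.3}             # p(activity)
SCENE_GIVEN = {                                      # p(scene feature | activity)
    "frisbee": {"park": 0.9, "ropes": 0.1},
    "torture": {"park": 0.2, "ropes": 0.8},
}

def posterior(scene_feature):
    """Condition on the scene: infer p(activity | scene) by Bayes' rule."""
    joint = {z: PRIOR[z] * SCENE_GIVEN[z][scene_feature] for z in PRIOR}
    total = sum(joint.values())
    return {z: p / total for z, p in joint.items()}

def describe(scene_feature):
    """Sample/read a description off the inferred latent structure."""
    post = posterior(scene_feature)
    return max(post, key=post.get)
```

Under this sketch, a scene with visible restraints yields the “torture” description no matter what label Bob would actually write, because the model never reasons about the labeler at all.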

This is much more like the kind of model that would be produced by contemporary techniques, and it would almost certainly produce the “torture” label.


In the long term, we hope to build systems whose abilities are competitive with system [1]. But for now the models we actually build have essentially no resemblance to [1], and instead look much like [2].

It seems to me that these two kinds of models are doing qualitatively different kinds of generalization, and I’m interested in probing the difference.

I don’t have a precise formulation of the difference; this post is mostly about exploring intuitions and partial pictures.

Reasoning about vs. reasoning with

For model [1], the human may model the world with a generative model, obtaining their prediction by computing conditional distributions in it. This generative model is extraordinarily complex, and the human has to engage in quite complex reasoning to make inferences about its conditional distributions.

By contrast, model [2] makes this generative model explicit (thereby encouraging it to be small) and actually computes with the generative model, rather than reasoning about it.

To be reasonably sized, this generative model must reflect the kind of reasoning that you would do in order to recognize an image, rather than the physical process that actually generates scenes.

This seems like an important qualitative difference between two kinds of learning; I don’t think that we have a good theoretical handle on the difference though, and in the case of opaque models learned by gradient descent we might be especially hard-pressed to say what is going on.

Description length vs. efficiency

Model [1] may have shorter description length than model [2], since many details about “how to reason about images” could be filled in algorithmically by an intelligent agent given some labelled examples.

But model [1] also seems to require (modestly) more complex reasoning than model [2]. Indeed, something like model [2] is likely to be embedded inside of model [1], as the procedure that the agent actually uses for reasoning about images. On top of that, model [1] seems to involve some additional reasoning (e.g. about the psychology of the labeler).

If we took a “uniform” prior over small circuits, then it seems like the posterior would be focused on models more like [2]. If we took a natural prior over small Turing machines, then it seems more likely that we would get a model more like [1] (if not something even weirder).

If we take a natural prior over circuits, but compute by iteratively applying a fixed circuit to its own outputs (as in an RNN) then it seems more plausible we’d get model [1]. If we take a natural prior over Turing machines, but penalize more expensive Turing machines, then it seems more plausible we’d get model [2].

Simple task vs. simple algorithm

It seems clear that model [1] is solving a simpler task than model [2], in the sense that the task it solves has a shorter description and is a more philosophically natural category. I would say that [1] is solving a “more probable” task, at least in one intuitive sense.

But in a similar intuitive sense, it seems that [2] is a “more probable” algorithm. For example, it looks likely to have a shorter encoding, if we use a relatively simple decoding process.

Are these intuitions meaningful? I don’t know.

Quality of algorithm

Model [1] is in some sense “robust to surprises;” if the training process goes wrong or the data distribution changes suddenly it can (in principle) accommodate the difference almost without blinking. Practical models along the lines of [2] would have very little hope of coping with these difficulties smoothly.

So if there is a fundamental difference between models of type [1] and type [2], and if we currently only build models of type [2], then it seems like this implies that there is something we don’t yet know about learning. Hopefully we will have a better understanding of the difference as we cross the gap. This is related to the next two sections.

(The more pessimistic story is that our models will eventually be more like models of type [1] in this respect, but that this might simply come from improvements in our ability to search over rich spaces of cognitive strategies, rather than from us actually understanding what those cognitive strategies are doing.)

Training vs. learning

The process of training a neural network is qualitatively different from the process of updating a model on data (i.e. learning in the Bayesian sense). Model [2] is more like a neural net; model [1] is more like updating a model on data.

Learning makes much more efficient use of training data, while training tends to require much more data than is information-theoretically necessary to reach a particular conclusion.
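The sample-efficiency gap can be seen in a toy setting (everything below is a made-up illustration, not a claim about any real system): a Bayesian learner of an unknown threshold extracts every bit of information from each example by eliminating inconsistent hypotheses, whereas gradient training on the same data would typically take many more updates to reach comparable confidence.

```python
import random

random.seed(0)

TRUE_T = 37                            # unknown threshold; labels are x >= TRUE_T
hypotheses = set(range(100))           # uniform prior over candidate thresholds

for _ in range(20):
    x = random.randrange(100)
    y = x >= TRUE_T                    # noiseless label from the true threshold
    # Bayesian updating on noiseless data = discarding inconsistent hypotheses.
    hypotheses = {t for t in hypotheses if (x >= t) == y}

# After 20 examples the version space has collapsed to a narrow band around
# TRUE_T; a gradient-trained classifier fit to the same 20 points would
# typically need many epochs (and more data) to become comparably confident.
print(sorted(hypotheses))
```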

This somehow seems related to the surprising behavior of model [1]. Model [2] only goes where the data leads quite explicitly, while model [1] can jump ahead to logical consequences of the data that it has seen.

Internal structure

Model [2] seems to have relatively minimal internal structure, reflecting simple facts about the nature of vision and language.

Model [1] seems to have much richer internal structure, with explicit conscious deliberation interacting with an explicitly represented space of possible hypotheses.

So it seems natural to think that opaque, indecomposable models would be of type [2], while models of type [1] would have richer internal structure that could be inspected in a more detailed way. Is this true? I have no idea.

Relatedly, it feels like models of type [2] are more likely to be discovered by relatively simple and unstructured search processes. But is that true? I don’t know.


Unsupervised learning

If we had a whole bunch of unlabelled videos, we might infer most of the latent structure required to approximately predict the intended activity label. These labels seem to be closely related to an understanding of what has happened in the video in a very natural sense.

We probably wouldn’t infer the latent structure that leads us to classify the image as “frisbee,” which seems to require detailed views about Bob’s psychology and what kinds of threats he would find motivating.

Other cases that highlight the difference

Program prediction

So far it may look like type [2] is the “optimistic” assumption about how our learning systems generalize. But in other contexts assumption [1] looks like the optimistic one.

For example, consider a system trained to predict the outputs of programs. We train this system on a bunch of programs that are short enough that they can actually be run, so that we can use a normal supervised training paradigm. We learn a predictor which is much faster than the programs themselves.
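A minimal sketch of this setup (the three-token “programs” and the `run` function below are stand-ins I have invented): supervised labels exist only for programs cheap enough to actually execute, and the open question is how a predictor fit to that data behaves on programs we cannot run.

```python
import itertools
import operator

OPS = {"+": operator.add, "*": operator.mul}

def run(program):
    """Execute a tiny (a, op, b) 'program'; a stand-in for real execution."""
    a, op, b = program
    return OPS[op](a, b)

# Supervised training set: every short program, labeled by actually running it.
train = [((a, op, b), run((a, op, b)))
         for a, op, b in itertools.product(range(5), OPS, range(5))]

# A learned predictor would be fit to `train`; whether it extrapolates to
# programs too expensive to run is exactly the [1]-vs-[2] question above.
```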

Now suppose that we run this predictor on a very complex program. Do we still expect to get good performance?

For model [1] we expect to get good performance, because the simplest generalization of “predicting the output of small programs” is just “predicting the output of programs,” and so that’s what the system will try to do.

For model [2], we can’t really say anything about the performance, because the kinds of processes needed to predict very complex programs might be different from those needed to predict simple programs. The particular algorithms that predict simple programs may generalize in unpredictable ways to complex programs.

The simplest task description is to predict what the given program will output, and the examples suffice to uniquely pin down that task description. But the simplest prediction algorithm is a complex and messy procedure, shaped by all sorts of contingent facts about how to predict the behavior of simple programs.

In practice I would not want to rely on anything like existing techniques to generalize from predicting simple programs to predicting complex programs.


Simulations

Model [1] is the kind of process that would see the world, reason about what it saw, generate and entertain the possibility that it was experiencing a simulated world, and then notice any data that pointed in that direction. It might come to conclude it was in a simulation based on conceptual arguments even if there was as of yet no supporting data (though it also might not).

Model [2] is the kind of process that would not “consider” the possibility of being in a simulation, in any meaningful sense, until there was some data that was actually explained by the simulation hypothesis — enough data to account for the whole complexity of the simulation hypothesis itself.

The behavior of model [1] may sound undesirable in the context of the simulation hypothesis, but this is exactly the same kind of reasoning that allows physicists to narrow down the space of hypotheses to a tiny set of parsimonious possibilities even when there is no relevant data, and then to decide amongst these hypotheses on the basis of minimal (and often superficially irrelevant) data.

The upshot

Someone could come out of this discussion with many different views:

Our models will definitely generalize like model [1] / will definitely generalize like model [2]. We should plan appropriately.

I don’t think you would end up with this view unless you saw something that I didn’t. I think the strongest argument is the practical experience with systems that generalize like [2], but I think we should be open to the possibility that this behavior will change once systems improve far enough.

Until we understand the nature of this distinction better, we should pessimistically assume the worst, and design systems that are robust to that pessimism.

I think there is something to this view, but overall I don’t find it very plausible. My basic complaint is that I think that there are many differences between model [1] and model [2], and we will be able to empirically tell the difference between the two cases — we will know which of our systems and components work which way.

I think the main justification for this view is methodological rather than epistemic: if the real world is very unpredictable, it may be that only very robust policies are likely to have much relevance to the actual situation that emerges.

Until we understand the nature of this distinction better, we should be interested in the AI control problems for systems that generalize either like model [1] or model [2].

This is basically my view. I think that there is a reasonable chance of seeing systems that generalize in either way. Our decisions don’t seem especially sensitive to whether we think the probabilities are 80%/20% or 20%/80% or 50%/50%.

On top of this robustness, it seems likely that the real world will be much messier than can be easily classified one way or the other. This messiness seems to make it harder to predict what kind of research will be useful in advance.

For the purpose of AI control, it is worthwhile to better understand these two kinds of generalization, differences between them, our intuitions about them, and what kinds of systems are likely to exhibit each.

This seems like a safe conclusion. It may be that there isn’t much to learn, but at the moment I have enough loose intuitions that I’m very optimistic about clearing things up.

