Some thoughts on priors

Intro

This is a short note and a chance for me to jot down a few ideas I've been having lately about models of learning, and specifically about how one can talk about Bayesian priors. It is motivated by two distinct phenomena: (1) the tendency for discussions of learning within linguistics to revolve separately around expressivity and probability, and (2) the fascinating (to me) characterization of neural networks as Gaussian processes. I consistently find myself redefining terms when dealing with both of these, so I feel I am best served by centralizing my thoughts. Frankly, I find (1) boring but necessary and (2) exciting and underdiscussed.

Groundwork

For both of the sections below we will consider a Bayesian learning problem in which a learner must learn some function $f_\theta(x) = p(y \mid x; \theta)$ that maps samples $x \in X$ to probability distributions $P(Y)$ over outcomes $y \in Y$. The learner is assumed to be imbued with some prior $\pi$ defining a distribution over parameters $\theta$. Having observed a dataset $D \ni (x, y)$ of samples, the learner attempts to induce a posterior estimate, either through Bayes' rule, through MAP estimation, or through some other prior-informed mechanism.
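To make the setup concrete, here is a minimal sketch of the learning problem on a grid of parameter values. The logistic form of $f_\theta$, the Gaussian-shaped prior, and the four data points are all my own illustrative assumptions; nothing in the argument depends on them.

```python
import numpy as np

def likelihood(theta, x, y):
    """p(y | x; theta) for binary y. The logistic form is an assumption for illustration."""
    p1 = 1.0 / (1.0 + np.exp(-theta * x))
    return np.where(y == 1, p1, 1.0 - p1)

# Discretize theta so the posterior can be computed by direct normalization.
thetas = np.linspace(-3.0, 3.0, 61)

# Prior pi over theta: a discretized standard normal shape (illustrative choice).
prior = np.exp(-0.5 * thetas**2)
prior /= prior.sum()

# Hypothetical observations D, each a pair (x, y).
D = [(1.0, 1), (2.0, 1), (-1.0, 0), (0.5, 1)]

def posterior(prior, thetas, data):
    """Bayes' rule on the grid: posterior is proportional to prior times likelihood."""
    log_post = np.log(prior)
    for x, y in data:
        log_post = log_post + np.log(likelihood(thetas, x, y))
    log_post -= log_post.max()  # subtract the max for numerical stability
    post = np.exp(log_post)
    return post / post.sum()

post = posterior(prior, thetas, D)
theta_map = thetas[np.argmax(post)]      # MAP estimate: the posterior mode
theta_mean = float(np.sum(thetas * post))  # posterior mean, for comparison
```

The full posterior `post` and the point estimate `theta_map` are two of the "prior-informed mechanisms" mentioned above; both depend on `prior`, which is the object the rest of this note is about.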

The Expressivity/Plausibility Divide

The first distinction is one routinely made within the cognitive science literature, between two camps who insist they are talking about different things. Some are very interested in learners whose expressivity is limited, which is to say learners for which not all possible distributions $P(Y)$ are learnable. Others are interested in which of the learnable hypotheses the learner finds more or less probable. These can be construed as two ways of talking about priors. We are concerned with a prior's expressivity when we ask which regions of the hypothesis space it assigns nonzero mass to (its support). Otherwise, we may concern ourselves with a prior's plausibility (i.e., how the mass is distributed within that support).
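A toy sketch of the difference, under assumptions of my own: three hypotheses about a coin's bias (so the input $x$ is ignored entirely), and hypothetical data of 40 heads in 50 flips. One prior excludes the third hypothesis outright; the other merely finds it implausible.

```python
import numpy as np

# Three hypotheses about P(heads). The input x plays no role in this toy case.
thetas = np.array([0.2, 0.5, 0.8])

# Expressivity restriction: the third hypothesis lies outside the prior's support.
prior_restricted = np.array([0.5, 0.5, 0.0])

# Plausibility only: same hypothesis space, but the third is just unlikely a priori.
prior_skeptical = np.array([0.495, 0.495, 0.01])

# Hypothetical data: 40 heads out of 50 flips.
heads, tails = 40, 10
lik = thetas**heads * (1.0 - thetas)**tails

def bayes_update(prior, lik):
    """Posterior is proportional to prior times likelihood."""
    unnorm = prior * lik
    return unnorm / unnorm.sum()

post_restricted = bayes_update(prior_restricted, lik)
post_skeptical = bayes_update(prior_skeptical, lik)
```

No amount of data moves `post_restricted` onto the excluded hypothesis, while `post_skeptical` ends up concentrated on it. Both behaviors come from the same object, the prior; they differ only in whether a mass is exactly zero or merely small.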

Why is this at all interesting? Discussions about expressivity restrictions play an outsized role in the psychology of language learning. I'm tired of it! In a Bayesian framework, expressivity is a special case of the prior, and talking only about expressivity ignores a whole lot of other important things about Bayesian learning. This is especially true since, under suitable (and broadly applicable) conditions, decision-making functions can be coerced into a Bayesian form [1]. Of course, that doesn't mean you should.

The Intensional/Extensional Divide

Far more interesting is the distinction between two ways of defining a prior: according to the parameterization $\theta$ itself, or according to what $\theta$ induces over the data. We can call such priors intensional and extensional, respectively. Why do I think this is interesting? It should be stated from the beginning that almost all priors as conventionally used in psychological theory are intensional. For example, regression with weight decay implies a zero-mean Gaussian prior over the weights. Likewise, MDL defines priors according to the complexity of the function in some system of representation. This is useful, but it becomes difficult to talk about the priors of learners whose representations are black boxes, including the primary object of psychological study.
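The weight-decay example can be checked numerically. The sketch below assumes linear regression with Gaussian noise of standard deviation `sigma` and an isotropic Gaussian prior of standard deviation `tau` on the weights (both values, and the synthetic data, are illustrative). Under those assumptions the MAP estimate coincides with the ridge solution with penalty $\lambda = \sigma^2 / \tau^2$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 3
sigma, tau = 0.5, 2.0  # assumed noise std and prior std (illustrative values)

# Synthetic regression data: y = X w_true + Gaussian noise.
X = rng.standard_normal((n, d))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + sigma * rng.standard_normal(n)

# Ridge regression (L2 weight decay) with penalty lam, in closed form.
lam = sigma**2 / tau**2
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Ordinary least squares, for comparison: no prior, no shrinkage.
w_ols = np.linalg.solve(X.T @ X, X.T @ y)

def neg_log_posterior_grad(w):
    """Gradient of -log p(w | D) under Gaussian noise and a N(0, tau^2 I) prior."""
    return -(X.T @ (y - X @ w)) / sigma**2 + w / tau**2

# If ridge is the MAP estimate, this gradient vanishes at w_ridge.
grad_at_ridge = neg_log_posterior_grad(w_ridge)
```

The prior here is stated entirely in terms of the weights `w`, which is what makes it intensional: it says nothing directly about which input-output functions are expected.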

The extensional prior, however, can be defined for a general learner by specifying our expectations over the types of functions learned. Common techniques for doing this already exist, e.g. kernel methods. These define expectations over functions in terms of the function set itself, not the method by which the function is induced. For example, in a Gaussian process an RBF kernel encodes the expectation that two inputs $x_1$ and $x_2$ will have similar distributions over $y$ if they are near each other in the input space. But there's an additional beauty to using such extensional priors: if we can characterize features and behaviors of the input space, then we can characterize the best distribution over functions a priori.
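As a rough illustration of the RBF case, the sketch below draws function values from a zero-mean Gaussian process prior at three scalar inputs, two close together and one far away. The inputs, the unit length scale, and the small jitter term are my own assumptions.

```python
import numpy as np

def rbf_kernel(xa, xb, length_scale=1.0):
    """RBF kernel k(x, x') = exp(-(x - x')^2 / (2 * length_scale^2)) for scalar inputs."""
    sq_dist = (xa[:, None] - xb[None, :]) ** 2
    return np.exp(-0.5 * sq_dist / length_scale**2)

# Two nearby inputs and one distant input (illustrative values).
xs = np.array([0.0, 0.1, 3.0])
K = rbf_kernel(xs, xs)

# Draw function values f(xs) from the zero-mean GP prior.
# The jitter keeps the covariance numerically positive definite.
rng = np.random.default_rng(0)
cov = K + 1e-9 * np.eye(len(xs))
samples = rng.multivariate_normal(np.zeros(len(xs)), cov, size=5000)

# Empirical correlation of function values across prior draws.
emp_corr = np.corrcoef(samples.T)
```

Function values at the two nearby inputs are almost perfectly correlated across draws, while the distant input is nearly independent of both. Note that no parameterization of the functions appears anywhere; the prior is specified purely by how outputs co-vary across inputs.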

The whole reason I've been thinking about this lately is the NNGP correspondence. It is well known that randomly initialized neural networks are, in the infinite-width limit, Gaussian processes. This means that they are characterizable in extensional terms as well as intensional ones. In the case of neural networks specifically, the kernel treats inputs as nearby when they have high cosine similarity. This seems to be a great reason to expect them to be domain-general learners!
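The correspondence can be checked by simulation in one simple case. The sketch below assumes a one-hidden-layer ReLU network with no biases, standard normal weights, and a $1/\sqrt{\text{width}}$ output scaling; for that architecture the limiting kernel is the degree-1 arc-cosine kernel (up to a factor of one half), which for inputs of fixed norm depends only on the angle between them. Other architectures and initializations give different kernels, so this is one instance rather than the general statement.

```python
import numpy as np

def arccos_kernel(x1, x2):
    """E[relu(w.x1) * relu(w.x2)] for w ~ N(0, I): half the degree-1 arc-cosine kernel."""
    n1, n2 = np.linalg.norm(x1), np.linalg.norm(x2)
    cos = np.clip(x1 @ x2 / (n1 * n2), -1.0, 1.0)
    theta = np.arccos(cos)
    return n1 * n2 / (2 * np.pi) * (np.sin(theta) + (np.pi - theta) * cos)

def sample_outputs(X, n_nets, width, rng):
    """Outputs of n_nets independent random ReLU networks on the rows of X."""
    d = X.shape[1]
    W = rng.standard_normal((n_nets, width, d))  # input-to-hidden weights
    v = rng.standard_normal((n_nets, width))     # hidden-to-output weights
    hidden = np.maximum(W @ X.T, 0.0)            # shape (n_nets, width, n_points)
    return np.einsum("nw,nwp->np", v, hidden) / np.sqrt(width)

# Three inputs (illustrative): x0 and x1 have high cosine similarity, x2 points elsewhere.
X = np.array([[1.0, 0.0, 0.0],
              [0.9, 0.44, 0.0],
              [-0.5, 0.87, 0.0]])
X /= np.linalg.norm(X, axis=1, keepdims=True)  # put all inputs on the unit sphere

rng = np.random.default_rng(0)
outputs = sample_outputs(X, n_nets=5000, width=128, rng=rng)

# Outputs have zero mean by symmetry, so the second moment estimates the covariance.
empirical = outputs.T @ outputs / outputs.shape[0]
analytic = np.array([[arccos_kernel(a, b) for b in X] for a in X])
```

Under these assumptions `empirical` should approximate `analytic` up to Monte Carlo error, and the entry for the high-cosine-similarity pair is much larger than the entry for the dissimilar pair. The prior over functions is thus readable directly off the geometry of the inputs, without ever inspecting the weights.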

References

[1] Kiefer J. On Wald's Complete Class Theorems. The Annals of Mathematical Statistics 1953;24(1):75--83.