What is a conjugate prior?

This also holds in the more general case: the derivation of the marginal likelihood and of the posterior predictive distribution is the same; the only difference is in the values of the parameters of the conjugate prior distribution. This means that every time we can solve the posterior distribution in closed form, we can also solve the posterior predictive distribution! Again, the third observation, which was the outlier, immediately tilts the posterior predictive distribution towards higher values, until it starts to resemble the true generating distribution more or less as more data is generated.
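To make the closed-form updating concrete, here is a minimal sketch of a Normal model with known variance and a conjugate Gaussian prior on the mean (the conjugacy referred to later in the text); the data values, noise level, and hyperparameters below are invented for illustration and are not the original example.

```python
import numpy as np
from scipy import stats

# Stand-in model (the text's original data are not reproduced here):
# y_i ~ Normal(mu, sigma^2) with sigma known, and a conjugate prior mu ~ Normal(mu0, tau0^2).
sigma = 1.0            # assumed known observation noise
mu0, tau0 = 0.0, 2.0   # hypothetical prior mean and prior standard deviation

def posterior_predictive(y, sigma, mu0, tau0):
    """Closed-form posterior for mu and the resulting posterior predictive for a new y."""
    n = len(y)
    # Posterior of mu is Normal(mu_n, tau_n^2): a precision-weighted combination of prior and data.
    prec_n = 1.0 / tau0**2 + n / sigma**2
    tau_n = np.sqrt(1.0 / prec_n)
    mu_n = (mu0 / tau0**2 + np.sum(y) / sigma**2) / prec_n
    # Posterior predictive for a new observation: Normal(mu_n, sigma^2 + tau_n^2).
    return stats.norm(mu_n, np.sqrt(sigma**2 + tau_n**2))

y = np.array([0.1, -0.3, 3.5, 0.4, 0.2])   # made-up data; the third value plays the "outlier"
for i in range(1, len(y) + 1):
    pred = posterior_predictive(y[:i], sigma, mu0, tau0)
    print(f"after {i} obs: predictive mean {pred.mean():.2f}, sd {pred.std():.2f}")
```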

This is a recurring theme in Bayesian inference: when the sample size is small, the prior has more influence on the posterior, but as the sample size grows, the data influences the posterior distribution more and more, until in the limit the posterior is determined purely by the data, at least when certain conditions hold. How complex a model do you want to fit? In general, the more complex the model, the more data you need.
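To see this effect numerically, the small sketch below tracks the posterior mean of a Bernoulli success probability under two different Beta priors as the (simulated) sample grows; all numbers are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
theta_true = 0.7                          # hypothetical "true" success probability
x = rng.binomial(1, theta_true, 1000)     # simulated coin flips

# Two quite different Beta priors: a flat one and one concentrated near small values.
priors = {"flat Beta(1, 1)": (1, 1), "skeptical Beta(2, 8)": (2, 8)}

for n in (5, 50, 1000):
    s = x[:n].sum()
    # Conjugate update: posterior is Beta(a + s, b + n - s), with mean (a + s) / (a + b + n).
    means = {name: (a + s) / (a + b + n) for name, (a, b) in priors.items()}
    print(n, {name: round(m, 3) for name, m in means.items()})
```

With five observations the two posterior means differ noticeably; with a thousand they are nearly identical and close to the value used to simulate the data.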

At what level of resolution do you want to examine your data? You may have enough data to fit your model at the level of the country, but what if you want to model the differences between towns? Or neighborhoods?

We will actually have a concrete example of this exact situation in the exercises later. The most often criticized aspect of the Bayesian approach to statistical inference is the requirement to choose a prior distribution, and especially the subjectivity of this prior selection procedure. Yet even in the most trivial coin-flipping example the choice of the binomial distribution for the outcome of the coin flip can be questioned: if we were truly ignorant about the outcome of the coin flip, would it make sense to model the outcome with a trinomial distribution, where the outcomes are heads, tails, and the coin landing on its side?

It can be argued that we always use our prior knowledge somehow in the modelling process, but the Bayesian framework just makes utilizing this prior knowledge more transparent and easier to quantify.

A less philosophical and more practical example of the inherent subjectivity of the modelling process is any situation in which our observations are continuous instead of discrete. In practice, we have to impose some kind of sampling distribution, for example the normal distribution, on the observations for our inferences to be sensible. Even if we do not want to impose any parametric distribution on the data, we have to choose some nonparametric method to smooth, for example, a height distribution.
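As a small illustration of this point, both options below are subjective modelling choices imposed on the same data: a parametric normal fit and a nonparametric kernel density estimate of a simulated, hypothetical height distribution.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
heights = rng.normal(170, 8, size=200)   # hypothetical height observations in cm

# Choice 1: impose a parametric sampling distribution (here a normal).
mu_hat, sigma_hat = stats.norm.fit(heights)

# Choice 2: impose a nonparametric smoother (a Gaussian kernel density estimate).
kde = stats.gaussian_kde(heights)

grid = np.linspace(150, 190, 5)
print("normal fit density:", np.round(stats.norm(mu_hat, sigma_hat).pdf(grid), 4))
print("KDE density:       ", np.round(kde(grid), 4))
```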

So this is the Bayesian counter-argument: the choice of the sampling distribution is as subjective as the choice of the prior distribution.

Take for instance classical linear regression. It makes huge simplifying assumptions: that the error terms are normally distributed given the predictors, and that the parameters of this normal distribution do not depend on the values of the predictors. The choice of predictors also injects very strong subjective beliefs into the model: if we exclude a predictor from the model, we assume that this predictor has no effect at all on the output variable.

If we do not include any second- or higher-order terms, we make the rather dire assumption that all the relationships between the predictors and the output variable are linear, and so on. Of course, models with different predictors and model structures can be compared, for example by predicting on a test set or by cross-validation, and then the best model can be chosen, but the same thing can also be done for the prior distributions.
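For concreteness, here is a hedged sketch of that kind of model comparison using scikit-learn: a purely linear regression and one with second-order terms are compared by cross-validated prediction error on simulated data (the data-generating process and all settings are invented for illustration).

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(2)
x = rng.uniform(-3, 3, size=(200, 1))
# Simulated truth contains a quadratic term, so the purely linear model is misspecified.
y = 1.0 + 2.0 * x[:, 0] + 0.5 * x[:, 0] ** 2 + rng.normal(0, 1, 200)

models = {
    "linear": LinearRegression(),
    "quadratic": make_pipeline(PolynomialFeatures(degree=2), LinearRegression()),
}

for name, model in models.items():
    scores = cross_val_score(model, x, y, cv=5, scoring="neg_mean_squared_error")
    print(f"{name:10s} mean cross-validated MSE: {-scores.mean():.3f}")
```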

So we do not have to settle for the first prior distribution or hyperparameters that we happen to test: just as with different sampling distributions, we can also test different prior distributions and hyperparameter values to see which of them make sense.

This kind of comparison of the effects of the choice of prior distribution is called sensitivity analysis. Besides being the most criticized aspect of Bayesian inference, the choice of the prior distribution is also one of the hardest.
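A minimal sensitivity analysis can be as simple as refitting the posterior under a handful of candidate priors and comparing the summaries; the sketch below does this for a Beta-Bernoulli model with made-up counts.

```python
from scipy import stats

# Fixed, made-up data: 7 successes in 10 trials.
n, s = 10, 7

# Candidate Beta priors whose influence we want to compare.
priors = {"Beta(1, 1)": (1, 1), "Beta(0.5, 0.5)": (0.5, 0.5), "Beta(10, 10)": (10, 10)}

for name, (a, b) in priors.items():
    posterior = stats.beta(a + s, b + n - s)      # conjugate update
    lo95, hi95 = posterior.ppf([0.025, 0.975])    # central 95% posterior interval
    print(f"{name:15s} mean {posterior.mean():.3f}, 95% interval [{lo95:.3f}, {hi95:.3f}]")
```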

If we have prior knowledge about the plausible parameter values, it often makes sense to limit the sampling to these parameter values. A prior distribution that is designed to encode our prior knowledge of the likely parameter values, and to affect the posterior distribution at small sample sizes, is called an informative prior.

Using an informative prior often makes the solution more stable with smaller sample sizes; on the other hand, sampling from the posterior is often more efficient when an informative prior is used, because we do not waste too much effort sampling highly improbable regions of the parameter space.

Consequently, the Gaussian prior was a conjugate prior for our model above. That is all there is to it, really: if the posterior is from the same family as the prior, the prior is a conjugate prior. In simple cases you can identify a conjugate prior by inspection of the likelihood. The choice of the dominating measure, however, is decisive for the family of priors.

This difficulty is essentially the same as that of choosing a particular parameterisation of the likelihood and opting for the Lebesgue measure for this parameterisation. When faced with a likelihood function, there is no inherent, intrinsic, or reference dominating measure on the parameter space. Outside the exponential family setting, there is no non-trivial family of distributions with a fixed support that allows for conjugate priors.

This is a consequence of the Darmois-Pitman-Koopman lemma. I like using the notion of a "kernel" of a distribution: you keep only the parts that depend on the parameter. A few simple examples: the kernel of a Beta(a, b) density in theta is theta^(a-1) * (1-theta)^(b-1), and the kernel of a normal density in mu (with sigma known) is exp(-(mu - x)^2 / (2 * sigma^2)). When we look at the likelihood function, we can do the same thing and express it in "kernel form".

For example, with iid Bernoulli(theta) data x_1, ..., x_n, the likelihood in kernel form is theta^s * (1-theta)^(n-s), where s is the number of successes; this is exactly the kernel of a Beta distribution, so the Beta prior is conjugate to the Bernoulli likelihood. If we can recognise the likelihood as a kernel in this way, then we can create a conjugate prior for it. In some sense a conjugate prior acts similarly to adding "pseudo data" to the observed data and then estimating the parameters.
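The pseudo-data reading is easy to check directly: with a Beta(a, b) prior and s successes in n Bernoulli trials, the posterior is Beta(a + s, b + n - s), so the prior acts like a pseudo-successes and b pseudo-failures observed before the data. A tiny sketch with invented counts:

```python
from scipy import stats

a, b = 2, 2    # hypothetical prior, acting like 2 pseudo-successes and 2 pseudo-failures
n, s = 12, 9   # made-up data: 9 successes in 12 trials

posterior = stats.beta(a + s, b + n - s)   # conjugate update: the posterior is again a Beta
print("posterior mean:               ", round(posterior.mean(), 3))
print("proportion with pseudo-counts:", round((s + a) / (n + a + b), 3))
```

Both printed values agree, which is exactly the pseudo-count interpretation of the Beta hyperparameters.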

The following diagram summarizes conjugate prior relationships for a number of common sampling distributions. Arrows point from a sampling distribution to its conjugate prior distribution.

The symbol near the arrow indicates which parameter of the sampling distribution the prior is for. These relationships depend critically on the choice of parameterization, some of which are uncommon. This page uses the parameterizations that make the relationships simplest to state, not necessarily the most common ones.


