Laplace's approximation

Laplace's approximation provides an analytical expression for a posterior probability distribution by fitting a Gaussian distribution with a mean equal to the MAP solution and precision equal to the observed Fisher information.[1][2] The approximation is justified by the Bernstein–von Mises theorem, which states that under regularity conditions the posterior converges to a Gaussian in large samples.[3][4]

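For example, a (possibly non-linear) regression or classification model with data set $\{x_n, y_n\}_{n=1,\ldots,N}$ comprising inputs $x$ and outputs $y$ has (unknown) parameter vector $\theta$ of length $D$. The likelihood is denoted $p(y \mid x, \theta)$ and the parameter prior $p(\theta)$. Suppose one wants to approximate the joint density of outputs and parameters $p(y, \theta \mid x)$:

$$p(y, \theta \mid x) = p(y \mid x, \theta)\, p(\theta) = p(y \mid x)\, p(\theta \mid y, x) \simeq \tilde{q}(\theta) = Z q(\theta).$$

The joint is equal to the product of the likelihood and the prior and, by Bayes' rule, equal to the product of the marginal likelihood $p(y \mid x)$ and posterior $p(\theta \mid y, x)$. Seen as a function of $\theta$, the joint is an un-normalised density. In Laplace's approximation we approximate the joint by an un-normalised Gaussian $\tilde{q}(\theta) = Z q(\theta)$, where we use $q$ to denote approximate density, $\tilde{q}$ for un-normalised density and $Z$ is a constant (independent of $\theta$). Since the marginal likelihood $p(y \mid x)$ does not depend on the parameter $\theta$ and the posterior $p(\theta \mid y, x)$ normalises over $\theta$, we can immediately identify them with $Z$ and $q(\theta)$ of our approximation, respectively. Laplace's approximation is

$$p(y, \theta \mid x) \simeq p(y, \hat{\theta} \mid x) \exp\left(-\tfrac{1}{2} (\theta - \hat{\theta})^{\top} S^{-1} (\theta - \hat{\theta})\right) = \tilde{q}(\theta),$$

where we have defined

$$\hat{\theta} = \operatorname{argmax}_{\theta} \log p(y, \theta \mid x), \qquad S^{-1} = -\left. \nabla_{\theta} \nabla_{\theta} \log p(y, \theta \mid x) \right|_{\theta = \hat{\theta}},$$

where $\hat{\theta}$ is the location of a mode of the joint target density, also known as the maximum a posteriori or MAP point, and $S^{-1}$ is the $D \times D$ positive definite matrix of second derivatives of the negative log joint target density at the mode $\theta = \hat{\theta}$. Thus, the Gaussian approximation matches the value and the curvature of the un-normalised target density at the mode. The value of $\hat{\theta}$ is usually found using a gradient-based method, e.g. Newton's method. In summary, we have

$$q(\theta) = \mathcal{N}(\theta \mid \mu = \hat{\theta}, \Sigma = S), \qquad \log Z = \log p(y, \hat{\theta} \mid x) + \tfrac{1}{2} \log |S| + \tfrac{D}{2} \log(2\pi),$$

for the approximate posterior over $\theta$ and the approximate log marginal likelihood, respectively.[5] In the special case of Bayesian linear regression with a Gaussian prior, the approximation is exact. The main weaknesses of Laplace's approximation are that it is symmetric around the mode and that it is very local: the entire approximation is derived from properties at a single point of the target density. Laplace's method is widely used and was pioneered in the context of neural networks by David MacKay,[6] and for Gaussian processes by Williams and Barber.[7]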
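
The procedure can be illustrated with a minimal Python sketch for a scalar parameter ($D = 1$): Newton's method locates the mode of the joint, the second derivative of the negative log joint at the mode gives the precision, and the log marginal likelihood follows from the Gaussian normalising constant. The function name `laplace_1d`, the toy data and the values of `sigma2` and `alpha` are assumptions made for illustration and do not come from any library. The example uses linear regression with known noise variance and a Gaussian prior, so the result can be compared with the closed-form marginal likelihood.

```python
import math


def laplace_1d(neg_log_joint, grad, hess, theta0=0.0, tol=1e-10, max_iter=100):
    """Laplace approximation for a scalar parameter (D = 1).

    neg_log_joint(theta) is -log p(y, theta | x); grad and hess are its first
    and second derivatives. Returns (theta_hat, S, log_Z).
    Plain Newton iterations without a line search: adequate for this
    illustration, but not guaranteed to converge for arbitrary targets.
    """
    theta = theta0
    for _ in range(max_iter):
        step = grad(theta) / hess(theta)  # Newton step on the negative log joint
        theta -= step
        if abs(step) < tol:
            break
    precision = hess(theta)  # plays the role of S^{-1}
    if precision <= 0:
        raise ValueError("second derivative is not positive at the returned point")
    S = 1.0 / precision
    # log Z = log p(y, theta_hat | x) + (1/2) log|S| + (D/2) log(2 pi), with D = 1
    log_Z = -neg_log_joint(theta) + 0.5 * math.log(S) + 0.5 * math.log(2 * math.pi)
    return theta, S, log_Z


# Hypothetical toy data for the model y_n = theta * x_n + Gaussian noise.
x = [0.5, 1.0, 1.5, 2.0]
y = [0.4, 1.1, 1.4, 2.1]
sigma2 = 0.25  # noise variance, assumed known
alpha = 1.0    # prior precision, theta ~ N(0, 1 / alpha)


def neg_log_joint(theta):
    n = len(x)
    sq = sum((yn - theta * xn) ** 2 for xn, yn in zip(x, y))
    neg_log_lik = 0.5 * n * math.log(2 * math.pi * sigma2) + sq / (2 * sigma2)
    neg_log_prior = 0.5 * math.log(2 * math.pi / alpha) + 0.5 * alpha * theta ** 2
    return neg_log_lik + neg_log_prior


def grad(theta):
    resid = sum(xn * (yn - theta * xn) for xn, yn in zip(x, y))
    return -resid / sigma2 + alpha * theta


def hess(theta):
    return sum(xn * xn for xn in x) / sigma2 + alpha


def exact_log_marginal():
    """Closed form of log N(y | 0, sigma2 * I + x x^T / alpha)."""
    n = len(x)
    xx = sum(v * v for v in x)
    xy = sum(a * b for a, b in zip(x, y))
    yy = sum(v * v for v in y)
    log_det = n * math.log(sigma2) + math.log(1 + xx / (alpha * sigma2))
    quad = (yy - xy ** 2 / (alpha * sigma2 + xx)) / sigma2
    return -0.5 * n * math.log(2 * math.pi) - 0.5 * log_det - 0.5 * quad


theta_hat, S, log_Z = laplace_1d(neg_log_joint, grad, hess)
print(theta_hat, S, log_Z, exact_log_marginal())
```

Because the log joint is quadratic in $\theta$ here, the approximate and closed-form log marginal likelihoods agree up to floating-point rounding. For a non-Gaussian likelihood, such as logistic regression, the same function would return only an approximation.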

References

  1. Kass, Robert E.; Tierney, Luke; Kadane, Joseph B. (1991). "Laplace's method in Bayesian analysis". Statistical Multiple Integration. Contemporary Mathematics. Vol. 115. pp. 89–100. doi:10.1090/conm/115/07. ISBN 0-8218-5122-5.
  2. MacKay, David J. C. (2003). "Information Theory, Inference and Learning Algorithms, chapter 27: Laplace's method".
  3. Hartigan, J. A. (1983). "Asymptotic Normality of Posterior Distributions". Bayes Theory. Springer Series in Statistics. New York: Springer. pp. 107–118. doi:10.1007/978-1-4613-8242-3_11. ISBN 978-1-4613-8244-7.
  4. Kass, Robert E.; Tierney, Luke; Kadane, Joseph B. (1990). "The Validity of Posterior Expansions Based on Laplace's Method". In Geisser, S.; Hodges, J. S.; Press, S. J.; Zellner, A. (eds.). Bayesian and Likelihood Methods in Statistics and Econometrics. Elsevier. pp. 473–488. ISBN 0-444-88376-2.
  5. Daxberger, Erik; et al. (2021). "Laplace Redux - Effortless Bayesian Deep Learning". Advances in Neural Information Processing Systems. 34.
  6. MacKay, David J. C. (1992). "Bayesian Interpolation". Neural Computation. 4 (3). MIT Press: 415–447. doi:10.1162/neco.1992.4.3.415. S2CID 1762283.
  7. Williams, Christopher K. I.; Barber, David (1998). "Bayesian classification with Gaussian Processes". PAMI. 20 (12). IEEE: 1342–1351. doi:10.1109/34.735807.