Laplace's approximation

Laplace's approximation provides an analytical expression for a posterior probability distribution by fitting a Gaussian distribution whose mean is the maximum a posteriori (MAP) solution and whose precision is the observed Fisher information at that point.^[1]^[2] The approximation is justified by the Bernstein–von Mises theorem, which states that, under regularity conditions, the posterior converges to a Gaussian in large samples.^[3]^[4]
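As a one-dimensional illustration, the sketch below fits the Gaussian for a coin-flip model. The model, the uniform prior and the function names `laplace_coin` and `exact_beta` are assumptions made for this example only. With $k$ heads in $n$ flips the un-normalised log posterior is $k\log \theta +(n-k)\log(1-\theta )$, so the mode and the curvature at the mode are available in closed form and can be compared with the exact Beta posterior.

```python
import math


def laplace_coin(k, n):
    """Mean and variance of the Gaussian fitted at the mode of theta^k (1-theta)^(n-k)."""
    if not 0 < k < n:
        raise ValueError("need 0 < k < n so that the mode lies inside (0, 1)")
    theta_hat = k / n  # mode of the log posterior (MAP under the assumed uniform prior)
    # Negative second derivative of k*log(theta) + (n-k)*log(1-theta) at the mode.
    precision = k / theta_hat ** 2 + (n - k) / (1 - theta_hat) ** 2
    return theta_hat, 1.0 / precision


def exact_beta(k, n):
    """Mean and variance of the exact posterior Beta(k + 1, n - k + 1)."""
    a, b = k + 1, n - k + 1
    return a / (a + b), a * b / ((a + b) ** 2 * (a + b + 1))


if __name__ == "__main__":
    # The two columns should move closer together as n grows.
    for n in (10, 100, 1000):
        k = 3 * n // 10
        print(n, laplace_coin(k, n), exact_beta(k, n))
```

The Gaussian is centred on the mode rather than the posterior mean, so the two summaries differ for small $n$ and move together as $n$ grows, in line with the large-sample justification above.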

For example, consider a (possibly non-linear) regression or classification model with data set $\{x_{n},y_{n}\}_{n=1,\ldots ,N}$, comprising inputs $x$ and outputs $y$, and an (unknown) parameter vector $\theta$ of length $D$. The likelihood is denoted $p({\bf {y}}|{\bf {x}},\theta )$ and the parameter prior $p(\theta )$, which is assumed not to depend on the inputs, so that $p(\theta |{\bf {x}})=p(\theta )$. Suppose one wants to approximate the joint density of outputs and parameters $p({\bf {y}},\theta |{\bf {x}})$:

$$p({\bf {y}},\theta |{\bf {x}})\;=\;p({\bf {y}}|{\bf {x}},\theta )p(\theta |{\bf {x}})\;=\;p({\bf {y}}|{\bf {x}})p(\theta |{\bf {y}},{\bf {x}})\;\simeq \;{\tilde {q}}(\theta )\;=\;Zq(\theta ).$$

The joint is equal to the product of the likelihood and the prior and, by Bayes' rule, to the product of the marginal likelihood $p({\bf {y}}|{\bf {x}})$ and the posterior $p(\theta |{\bf {y}},{\bf {x}})$. Seen as a function of $\theta$, the joint is an un-normalised density. Laplace's approximation replaces the joint with an un-normalised Gaussian ${\tilde {q}}(\theta )=Zq(\theta )$, where $q$ denotes an approximate density, ${\tilde {q}}$ an un-normalised density, and $Z$ a constant (independent of $\theta$). Since the marginal likelihood $p({\bf {y}}|{\bf {x}})$ does not depend on the parameter $\theta$ and the posterior $p(\theta |{\bf {y}},{\bf {x}})$ normalises over $\theta$, we can immediately identify them with $Z$ and $q(\theta )$ of our approximation, respectively. Laplace's approximation is

$$p({\bf {y}},\theta |{\bf {x}})\;\simeq \;p({\bf {y}},{\hat {\theta }}|{\bf {x}})\exp {\big (}-{\tfrac {1}{2}}(\theta -{\hat {\theta }})^{\top }S^{-1}(\theta -{\hat {\theta }}){\big )}\;=\;{\tilde {q}}(\theta ),$$

where we have defined

$${\begin{aligned}{\hat {\theta }}&\;=\;\operatorname {argmax} _{\theta }\log p({\bf {y}},\theta |{\bf {x}}),\\S^{-1}&\;=\;-\left.\nabla _{\theta }\nabla _{\theta }\log p({\bf {y}},\theta |{\bf {x}})\right|_{\theta ={\hat {\theta }}},\end{aligned}}$$

where ${\hat {\theta }}$ is the location of a mode of the joint target density, also known as the maximum a posteriori or MAP point, and $S^{-1}$ is the $D\times D$ positive definite matrix of second derivatives of the negative log joint target density at the mode $\theta ={\hat {\theta }}$. Because the gradient of the log joint vanishes at the mode, a second-order Taylor expansion of $\log p({\bf {y}},\theta |{\bf {x}})$ around ${\hat {\theta }}$ has no linear term, and exponentiating it gives the un-normalised Gaussian ${\tilde {q}}(\theta )$. Thus, the Gaussian approximation matches the value and the curvature of the un-normalised target density at the mode. The value of ${\hat {\theta }}$ is usually found using a gradient-based method, e.g. Newton's method. In summary, we have

$${\begin{aligned}q(\theta )&\;=\;{\cal {N}}(\theta |\mu ={\hat {\theta }},\Sigma =S),\\\log Z&\;=\;\log p({\bf {y}},{\hat {\theta }}|{\bf {x}})+{\tfrac {1}{2}}\log |S|+{\tfrac {D}{2}}\log(2\pi ),\end{aligned}}$$

for the approximate posterior over $\theta$ and the approximate log marginal likelihood, respectively.^[5] In the special case of Bayesian linear regression with a Gaussian prior, the approximation is exact, because the log joint is then exactly quadratic in $\theta$. The main weaknesses of Laplace's approximation are that it is symmetric around the mode and that it is very local: the entire approximation is derived from properties at a single point of the target density. Laplace's method is widely used; it was pioneered in the context of neural networks by David MacKay^[6] and for Gaussian processes by Williams and Barber.^[7]
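The recipe above (find ${\hat {\theta }}$ by Newton's method, read $S$ off the curvature at the mode, then evaluate $\log Z$) can be sketched for a model with a single parameter ($D=1$). Everything specific here is an assumption made for illustration: the toy data `X` and `Y`, the logistic likelihood, the zero-mean Gaussian prior with precision `ALPHA`, and all function names. A brute-force trapezoidal integral of the joint is included only as a reference value for the marginal likelihood.

```python
import math

# Hypothetical toy data: scalar inputs x_n with binary outputs y_n.
X = [-2.0, -1.5, -1.0, -0.5, 0.5, 1.0, 1.5, 2.0]
Y = [0, 0, 0, 1, 0, 1, 1, 1]
ALPHA = 1.0  # assumed prior precision: theta ~ N(0, 1 / ALPHA)


def softplus(z):
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def sigmoid(z):
    return math.exp(-softplus(-z))


def log_joint(theta):
    """log p(y, theta | x): logistic log-likelihood plus Gaussian log-prior."""
    log_lik = sum(-softplus(-theta * x) if y == 1 else -softplus(theta * x)
                  for x, y in zip(X, Y))
    log_prior = 0.5 * math.log(ALPHA / (2 * math.pi)) - 0.5 * ALPHA * theta ** 2
    return log_lik + log_prior


def grad(theta):
    """First derivative of log_joint with respect to theta."""
    return sum((y - sigmoid(theta * x)) * x for x, y in zip(X, Y)) - ALPHA * theta


def hess(theta):
    """Second derivative of log_joint; always negative for this model."""
    return -sum(sigmoid(theta * x) * (1 - sigmoid(theta * x)) * x * x for x in X) - ALPHA


def laplace(theta=0.0, tol=1e-12, max_iter=50):
    """Return (theta_hat, S, log_Z) for the one-dimensional model above."""
    for _ in range(max_iter):
        step = grad(theta) / hess(theta)  # Newton step towards the mode
        theta -= step
        if abs(step) < tol:
            break
    S = -1.0 / hess(theta)  # S^{-1} is minus the second derivative at the mode
    log_Z = log_joint(theta) + 0.5 * math.log(S) + 0.5 * math.log(2 * math.pi)
    return theta, S, log_Z


def log_Z_quadrature(lo=-10.0, hi=10.0, n=20001):
    """Reference value: trapezoidal integral of the joint over theta."""
    h = (hi - lo) / (n - 1)
    vals = [log_joint(lo + i * h) for i in range(n)]
    m = max(vals)  # subtract the maximum before exponentiating
    total = sum(math.exp(v - m) for v in vals)
    total -= 0.5 * (math.exp(vals[0] - m) + math.exp(vals[-1] - m))
    return m + math.log(total * h)


if __name__ == "__main__":
    theta_hat, S, log_Z = laplace()
    print("MAP:", theta_hat, "variance S:", S)
    print("log Z (Laplace):", log_Z, "log Z (quadrature):", log_Z_quadrature())
```

For $D>1$ the same steps apply with the gradient vector and Hessian matrix in place of the scalar derivatives, and $\log |S|$ in place of $\log S$. The gap between the two $\log Z$ values reflects the skewness that the symmetric Gaussian cannot capture.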

References

[1] Kass, Robert E.; Tierney, Luke; Kadane, Joseph B. (1991). "Laplace's method in Bayesian analysis". Statistical Multiple Integration. Contemporary Mathematics. Vol. 115. pp. 89–100. doi:10.1090/conm/115/07. ISBN 0-8218-5122-5.

[2] MacKay, David J. C. (2003). "Information Theory, Inference and Learning Algorithms, chapter 27: Laplace's method" (PDF).

[3] Hartigan, J. A. (1983). "Asymptotic Normality of Posterior Distributions". Bayes Theory. Springer Series in Statistics. New York: Springer. pp. 107–118. doi:10.1007/978-1-4613-8242-3_11. ISBN 978-1-4613-8244-7.

[4] Kass, Robert E.; Tierney, Luke; Kadane, Joseph B. (1990). "The Validity of Posterior Expansions Based on Laplace's Method". In Geisser, S.; Hodges, J. S.; Press, S. J.; Zellner, A. (eds.). Bayesian and Likelihood Methods in Statistics and Econometrics. Elsevier. pp. 473–488. ISBN 0-444-88376-2.

[5] Daxberger, Erik; et al. (2021). "Laplace Redux - Effortless Bayesian Deep Learning". Advances in Neural Information Processing Systems. 34.

[6] MacKay, David J. C. (1992). "Bayesian Interpolation" (PDF). Neural Computation. 4 (3). MIT Press: 415–447. doi:10.1162/neco.1992.4.3.415. S2CID 1762283.

[7] Williams, Christopher K. I.; Barber, David (1998). "Bayesian classification with Gaussian Processes" (PDF). IEEE Transactions on Pattern Analysis and Machine Intelligence. 20 (12). IEEE: 1342–1351. doi:10.1109/34.735807.
