Lecture 11: Conjugate Priors and Bayesian Regression

STA702

Merlise Clyde

Duke University

Semi-Conjugate Priors in Linear Regression

  • Regression Model (Sampling model) \[\mathbf{Y}\mid \boldsymbol{\beta}, \phi \sim \textsf{N}(\mathbf{X}\boldsymbol{\beta}, \phi^{-1} \mathbf{I}_n) \]

  • Semi-Conjugate Prior: independent Normal and Gamma \[\begin{align*} \boldsymbol{\beta}& \sim \textsf{N}(\mathbf{b}_0, \boldsymbol{\Phi}_0^{-1}) \\ \phi & \sim \textsf{Gamma}(\nu_0/2, \textsf{SS}_0/2) \end{align*}\]

    • Conditional Normal for \(\boldsymbol{\beta}\mid \phi, \mathbf{Y}\) and
    • Conditional Gamma for \(\phi \mid \boldsymbol{\beta}, \mathbf{Y}\)
    • requires Gibbs sampling or other Metropolis-Hastings algorithms (a minimal sampler is sketched below)
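To make the two-block sampler concrete, here is a minimal Python sketch (an illustration, not course code). It assumes data X (\(n \times p\)) and Y and hyperparameters b0, Phi0, nu0, SS0 are already defined, and alternates between the two full conditionals above.

```python
import numpy as np

rng = np.random.default_rng(0)

def gibbs_semi_conjugate(Y, X, b0, Phi0, nu0, SS0, n_iter=5000):
    """Alternate draws from beta | phi, Y and phi | beta, Y."""
    n, p = X.shape
    XtX, XtY = X.T @ X, X.T @ Y
    phi = 1.0                                  # arbitrary starting value
    draws = np.empty((n_iter, p + 1))
    for t in range(n_iter):
        # beta | phi, Y ~ N(m, V) with precision phi * X'X + Phi0
        V = np.linalg.inv(phi * XtX + Phi0)
        m = V @ (phi * XtY + Phi0 @ b0)
        beta = rng.multivariate_normal(m, V)
        # phi | beta, Y ~ Gamma((nu0 + n)/2, (SS0 + ||Y - X beta||^2)/2)
        rss = np.sum((Y - X @ beta) ** 2)
        phi = rng.gamma((nu0 + n) / 2, 2.0 / (SS0 + rss))  # numpy uses scale = 1/rate
        draws[t] = np.append(beta, phi)
    return draws
```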

Conjugate Priors in Linear Regression

  • Regression Model (Sampling model) \[\mathbf{Y}\mid \boldsymbol{\beta}, \phi \sim \textsf{N}(\mathbf{X}\boldsymbol{\beta}, \phi^{-1} \mathbf{I}_n) \]

  • Conjugate Normal-Gamma Model: factor the joint prior \(p(\boldsymbol{\beta}, \phi ) = p(\boldsymbol{\beta}\mid \phi)\,p(\phi)\) \[\begin{align*} \boldsymbol{\beta}\mid \phi & \sim \textsf{N}(\mathbf{b}_0, \phi^{-1}\boldsymbol{\Phi}_0^{-1}) & p(\boldsymbol{\beta}\mid \phi) & = \frac{|\phi \boldsymbol{\Phi}_0|^{1/2}}{(2 \pi)^{p/2}}e^{\left\{- \frac{\phi}{2}(\boldsymbol{\beta}- \mathbf{b}_0)^T \boldsymbol{\Phi}_0 (\boldsymbol{\beta}- \mathbf{b}_0) \right\}}\\ \phi & \sim \textsf{Gamma}(\nu_0/2, \textsf{SS}_0/2) & p(\phi) & = \frac{1}{\Gamma{(\nu_0/2)}} \left(\frac{\textsf{SS}_0}{2} \right)^{\nu_0/2} \phi^{\nu_0/2 - 1} e^{- \phi \frac{\textsf{SS}_0}{2}}\\ \Rightarrow (\boldsymbol{\beta}, \phi) & \sim \textsf{NG}(\mathbf{b}_0, \boldsymbol{\Phi}_0, \nu_0, \textsf{SS}_0) \end{align*}\]

  • Normal-Gamma distribution indexed by 4 hyperparameters

  • Note: the prior covariance of \(\boldsymbol{\beta}\) is scaled by \(\sigma^2 = 1/\phi\)

Finding the Posterior Distribution

  • Likelihood: \({\cal{L}}(\boldsymbol{\beta}, \phi) \propto \phi^{n/2} e^{- \frac{\phi}{2} (\mathbf{Y}- \mathbf{X}\boldsymbol{\beta})^T(\mathbf{Y}- \mathbf{X}\boldsymbol{\beta})}\)
    \[\begin{eqnarray*} p(\boldsymbol{\beta}, \phi \mid \mathbf{Y}) &\propto& \phi^{\frac {n}{2}} e^{- \frac \phi 2 (\mathbf{Y}- \mathbf{X}\boldsymbol{\beta})^T(\mathbf{Y}- \mathbf{X}\boldsymbol{\beta}) } \times \\ & & \phi^{\frac{\nu_0}{2} - 1} e^{- \phi \frac{\textsf{SS}_0}{2} }\times \phi^{\frac{p}{2}} e^{- \frac{\phi}{2} (\boldsymbol{\beta}- \mathbf{b}_0)^T \boldsymbol{\Phi}_0 (\boldsymbol{\beta}- \mathbf{b}_0) } \end{eqnarray*}\]

  • Quadratic in Exponential \[\exp\left\{- \frac{\phi}{2} (\boldsymbol{\beta}- \mathbf{b})^T \boldsymbol{\Phi}(\boldsymbol{\beta}- \mathbf{b}) \right\} = \exp\left\{- \frac{\phi}{2} (\boldsymbol{\beta}^T \boldsymbol{\Phi}\boldsymbol{\beta}- 2 \boldsymbol{\beta}^T \boldsymbol{\Phi}\mathbf{b}+ \mathbf{b}^T\boldsymbol{\Phi}\mathbf{b})\right\}\]

    • Expand quadratics and regroup terms
    • Read off posterior precision from Quadratic in \(\boldsymbol{\beta}\)
    • Read off posterior mean from Linear term in \(\boldsymbol{\beta}\)
    • will need to complete the square in \(\boldsymbol{\beta}\); the leftover constant term cannot be dropped since it is multiplied by \(\phi\)

Expand and Regroup

\[\begin{eqnarray*} p(\boldsymbol{\beta}, \phi \mid \mathbf{Y}) &\propto& \phi^{\frac {n}{2}} e^{- \frac \phi 2 ( \mathbf{Y}^T\mathbf{Y}- 2 \boldsymbol{\beta}^T \mathbf{X}^T \mathbf{Y}+ \boldsymbol{\beta}^T \mathbf{X}^T \mathbf{X}\boldsymbol{\beta})} \times \\ & & \phi^{\frac{\nu_0}{2} - 1} e^{- \phi \frac{\textsf{SS}_0}{2} }\times \phi^{\frac{p}{2}} e^{- \frac{\phi}{2} (\boldsymbol{\beta}^T\boldsymbol{\Phi}_0\boldsymbol{\beta}- 2 \boldsymbol{\beta}^T \boldsymbol{\Phi}_0 \mathbf{b}_0 + \mathbf{b}_0^T \boldsymbol{\Phi}_0 \mathbf{b}_0) } \end{eqnarray*}\]

\[\begin{eqnarray*} p(\boldsymbol{\beta}, \phi \mid \mathbf{Y}) &\propto& \phi^{\frac {n + p + \nu_0}{ 2} - 1} e^{- \frac \phi 2 (\textsf{SS}_0 + \mathbf{Y}^T\mathbf{Y}+ \mathbf{b}_0^T \boldsymbol{\Phi}_0 \mathbf{b}_0) } \times \\ & & e^{-\frac{\phi}{2} (\boldsymbol{\beta}^T(\mathbf{X}^T\mathbf{X})\boldsymbol{\beta}-2 \boldsymbol{\beta}^T\textcolor{red}{\mathbf{X}^T\mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}}\mathbf{X}^T\mathbf{Y}+ \boldsymbol{\beta}^T\boldsymbol{\Phi}_0\boldsymbol{\beta}- 2 \boldsymbol{\beta}^T \boldsymbol{\Phi}_0 \mathbf{b}_0) } \end{eqnarray*}\]

\[\begin{eqnarray*} & = & \phi^{\frac {n + p + \nu_0}{ 2} - 1} e^{- \frac \phi 2 (\textsf{SS}_0 + \mathbf{Y}^T\mathbf{Y}+ \mathbf{b}_0^T \boldsymbol{\Phi}_0 \mathbf{b}_0)} \times \\ & & e^{ -\frac{\phi}{2} \left( \boldsymbol{\beta}^T (\mathbf{X}^T\mathbf{X}+ \boldsymbol{\Phi}_0) \boldsymbol{\beta}\right) } \times \\ & & e^{ -\frac{\phi}{2} \left( -2 \boldsymbol{\beta}^T (\mathbf{X}^T\mathbf{X}\textcolor{red}{\hat{\boldsymbol{\beta}}} + \boldsymbol{\Phi}_0 \mathbf{b}_0) \right)} \end{eqnarray*}\]

Complete the Quadratic

\[\begin{eqnarray*} p(\boldsymbol{\beta}, \phi \mid \mathbf{Y}) &\propto& \phi^{\frac {n + p + \nu_0}{ 2} - 1} e^{- \frac \phi 2 (\textsf{SS}_0 + \mathbf{Y}^T\mathbf{Y}+ \mathbf{b}_0^T \boldsymbol{\Phi}_0 \mathbf{b}_0 )} \times \\ & & e^{ -\frac{\phi}{2} \left( \boldsymbol{\beta}^T \textcolor{red}{(\mathbf{X}^T\mathbf{X}+ \boldsymbol{\Phi}_0)} \boldsymbol{\beta} \right) } \times \qquad \qquad \qquad \qquad \boldsymbol{\Phi}_n \equiv \mathbf{X}^T\mathbf{X}+ \boldsymbol{\Phi}_0 \\ & & e^{ -\frac{\phi}{2} \left( -2 \boldsymbol{\beta}^T \textcolor{red}{\boldsymbol{\Phi}_n \boldsymbol{\Phi}_n^{-1}} (\mathbf{X}^T\mathbf{X}\hat{\boldsymbol{\beta}}+ \boldsymbol{\Phi}_0 \mathbf{b}_0) \right)} \times \qquad \qquad \mathbf{b}_n \equiv \boldsymbol{\Phi}_n^{-1} (\mathbf{X}^T\mathbf{X}\hat{\boldsymbol{\beta}}+ \boldsymbol{\Phi}_0 \mathbf{b}_0) \\ & & e^{ -\frac{\phi}{2} ( \textcolor{red}{\mathbf{b}_n^T \boldsymbol{\Phi}_n \mathbf{b}_n - \mathbf{b}_n^T \boldsymbol{\Phi}_n \mathbf{b}_n}) } \end{eqnarray*}\]

\[\begin{eqnarray*} & = & \phi^{\frac {n + \nu_0}{ 2} - 1} e^{- \frac \phi 2 ( \textsf{SS}_0 + \mathbf{Y}^T\mathbf{Y}+ \mathbf{b}_0^T \boldsymbol{\Phi}_0 \mathbf{b}_0 - \mathbf{b}_n^T \boldsymbol{\Phi}_n \mathbf{b}_n)} \times \\ & & \textcolor{red}{\phi^{\frac p 2}} e^{ -\frac{\phi}{2} \left( (\boldsymbol{\beta}- \mathbf{b}_n)^T \boldsymbol{\Phi}_n (\boldsymbol{\beta}- \mathbf{b}_n) \right) } \end{eqnarray*}\]

\[\begin{eqnarray*} & \propto & \phi^{\frac {n + \nu_0}{ 2} - 1} e^{- \frac \phi 2 ( \textsf{SS}_0 + \mathbf{Y}^T\mathbf{Y}+ \mathbf{b}_0^T \boldsymbol{\Phi}_0 \mathbf{b}_0 - \mathbf{b}_n^T \boldsymbol{\Phi}_n \mathbf{b}_n)} \times \\ & & \textcolor{red}{|\phi \boldsymbol{\Phi}_n |^{\frac 1 2}} e^{ -\frac{\phi}{2} \left( (\boldsymbol{\beta}- \mathbf{b}_n)^T \boldsymbol{\Phi}_n (\boldsymbol{\beta}- \mathbf{b}_n) \right) } \end{eqnarray*}\]

Posterior Distributions

Posterior density (up to normalizing constants) \(p(\boldsymbol{\beta}, \phi \mid \mathbf{Y}) = p(\phi \mid \mathbf{Y})\, p(\boldsymbol{\beta}\mid \phi, \mathbf{Y})\) \[\begin{eqnarray*} p(\phi \mid \mathbf{Y}) p(\boldsymbol{\beta}\mid \phi, \mathbf{Y}) & \propto & \phi^{\frac {n + \nu_0}{ 2} - 1} e^{- \frac \phi 2 ( \textsf{SS}_0 + \mathbf{Y}^T\mathbf{Y}+ \mathbf{b}_0^T \boldsymbol{\Phi}_0 \mathbf{b}_0 - \mathbf{b}_n^T \boldsymbol{\Phi}_n \mathbf{b}_n)} \times \\ & & (2 \pi)^{- \frac p 2} |\phi \boldsymbol{\Phi}_n |^{\frac 1 2}e^{- \frac{\phi}{2} (\boldsymbol{\beta}- \mathbf{b}_n)^T \boldsymbol{\Phi}_n (\boldsymbol{\beta}- \mathbf{b}_n) } \end{eqnarray*}\]

Marginal: the Normal density integrates to 1 \[\begin{eqnarray*} p(\phi \mid \mathbf{Y}) & \propto & \phi^{\frac {n + \nu_0}{ 2} - 1} e^{- \frac \phi 2 ( \textsf{SS}_0 + \mathbf{Y}^T\mathbf{Y}+ \mathbf{b}_0^T \boldsymbol{\Phi}_0 \mathbf{b}_0 - \mathbf{b}_n^T \boldsymbol{\Phi}_n \mathbf{b}_n)} \times \\ & & \int_{\mathbb{R}^p} (2 \pi)^{- \frac p 2} |\phi \boldsymbol{\Phi}_n |^{\frac 1 2}e^{- \frac{\phi}{2} (\boldsymbol{\beta}- \mathbf{b}_n)^T \boldsymbol{\Phi}_n (\boldsymbol{\beta}- \mathbf{b}_n)} \, d\boldsymbol{\beta} \\ & = & \phi^{\frac {n + \nu_0}{ 2} - 1} e^{- \frac \phi 2 ( \textsf{SS}_0 + \mathbf{Y}^T\mathbf{Y}+ \mathbf{b}_0^T \boldsymbol{\Phi}_0 \mathbf{b}_0 - \mathbf{b}_n^T \boldsymbol{\Phi}_n \mathbf{b}_n)} \end{eqnarray*}\]

  • Conditional Normal for \(\boldsymbol{\beta}\mid \phi, \mathbf{Y}\) and marginal Gamma for \(\phi \mid \mathbf{Y}\)

  • No need for Gibbs sampling!

\(\textsf{NG}\) Posterior Distribution

\[\begin{eqnarray*} \boldsymbol{\beta}\mid \phi, \mathbf{Y}& \sim &\textsf{N}(\mathbf{b}_n, (\phi \boldsymbol{\Phi}_n)^{-1}) \\ \phi \mid \mathbf{Y}&\sim &\textsf{Gamma}(\frac{\nu_n}{2}, \frac{\textsf{SS}_n}{2}) \\ (\boldsymbol{\beta}, \phi) \mid \mathbf{Y}& \sim & \textsf{NG}(\mathbf{b}_n, \boldsymbol{\Phi}_n, \nu_n, \textsf{SS}_n) \end{eqnarray*}\]

Hyperparameters: \[\begin{align*} \boldsymbol{\Phi}_n & = \mathbf{X}^T\mathbf{X}+ \boldsymbol{\Phi}_0 & \quad \mathbf{b}_n & = \boldsymbol{\Phi}_n^{-1} (\mathbf{X}^T\mathbf{X}\hat{\boldsymbol{\beta}}+ \boldsymbol{\Phi}_0 \mathbf{b}_0) \\ \nu_n & = n + \nu_0 & \quad \textsf{SS}_n & = \textsf{SS}_0 + \mathbf{Y}^T\mathbf{Y}+ \mathbf{b}_0^T \boldsymbol{\Phi}_0 \mathbf{b}_0 - \mathbf{b}_n^T \boldsymbol{\Phi}_n \mathbf{b}_n \end{align*} \]

\[\begin{align*} \textsf{SS}_n & = \textsf{SS}_0 + \| \mathbf{Y}- \mathbf{X}\mathbf{b}_n \|^2 + (\mathbf{b}_0 - \mathbf{b}_n)^T \boldsymbol{\Phi}_0 (\mathbf{b}_0 - \mathbf{b}_n) \\ & = \textsf{SS}_0 + \| \mathbf{Y}- \mathbf{X}\mathbf{b}_n \|^2 + \| \mathbf{b}_0 - \mathbf{b}_n \|^2_{\boldsymbol{\Phi}_0} \end{align*}\]

  • Inner product induced by prior precision \(\langle u, v \rangle_A \equiv u^TAv\)

  • \(\| \mathbf{b}_0 - \mathbf{b}_n \|^2_{\boldsymbol{\Phi}_0}\) measures the mismatch between the prior mean and the posterior mean in the metric of the prior precision
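In code, the conjugate update is a few lines of linear algebra, and posterior draws come directly from the Gamma and the conditional Normal. A minimal sketch, assuming X, Y, b0, Phi0, nu0, SS0 are defined (the function names are illustrative, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)

def ng_update(Y, X, b0, Phi0, nu0, SS0):
    """Return the NG posterior hyperparameters (b_n, Phi_n, nu_n, SS_n)."""
    XtX = X.T @ X
    beta_hat = np.linalg.solve(XtX, X.T @ Y)          # OLS estimate
    Phi_n = XtX + Phi0
    b_n = np.linalg.solve(Phi_n, XtX @ beta_hat + Phi0 @ b0)
    nu_n = len(Y) + nu0
    SS_n = SS0 + Y @ Y + b0 @ Phi0 @ b0 - b_n @ Phi_n @ b_n
    return b_n, Phi_n, nu_n, SS_n

def ng_sample(b_n, Phi_n, nu_n, SS_n, size=5000):
    """Draw (beta, phi) directly: phi | Y from the Gamma, then beta | phi, Y."""
    phi = rng.gamma(nu_n / 2, 2.0 / SS_n, size)       # numpy uses scale = 1/rate
    cov = np.linalg.inv(Phi_n)
    beta = np.array([rng.multivariate_normal(b_n, cov / f) for f in phi])
    return beta, phi
```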

Marginal Distribution

Theorem: Student-t
Let \(\boldsymbol{\theta}\mid \phi \sim \textsf{N}(m, \frac{1}{\phi} \boldsymbol{\Sigma})\) and \(\phi \sim \textsf{Gamma}(\nu/2, \nu {\hat{\sigma}}^2/2)\).

Then \(\boldsymbol{\theta}\) \((p \times 1)\) has a \(p\) dimensional multivariate \(t\) distribution \[\boldsymbol{\theta}\sim t_\nu( m, {\hat{\sigma}}^2\boldsymbol{\Sigma})\] with location \(m\), scale matrix \({\hat{\sigma}}^2\boldsymbol{\Sigma}\) and density

\[p(\boldsymbol{\theta}) \propto \left[ 1 + \frac{1}{\nu} \frac{ (\boldsymbol{\theta}- m)^T \boldsymbol{\Sigma}^{-1} (\boldsymbol{\theta}- m)}{{\hat{\sigma}}^2} \right]^{- \frac{p + \nu}{2}}\]

Note: this holds for the prior as well as the posterior (conditioning on \(\mathbf{Y}\) throughout)

Derivation

Marginal density \(p(\boldsymbol{\theta}) = \int_0^\infty p(\boldsymbol{\theta}\mid \phi) p(\phi) \, d\phi\)

\[\begin{eqnarray*} p(\boldsymbol{\theta}) & \propto & \int | \boldsymbol{\Sigma}/\phi|^{-1/2} e^{- \frac{\phi}{2} (\boldsymbol{\theta}- m)^T \boldsymbol{\Sigma}^{-1} (\boldsymbol{\theta}- m)} \phi^{\nu/2 - 1} e^{- \phi \frac{\nu {\hat{\sigma}}^2}{2}}\, d \phi \\ & \propto & \int \phi^{p/2} \phi^{\nu/2 - 1} e^{- \phi \frac{(\boldsymbol{\theta}- m)^T \boldsymbol{\Sigma}^{-1} (\boldsymbol{\theta}- m)+ \nu {\hat{\sigma}}^2}{2}}\, d \phi \\ & \propto & \int \phi^{\frac{p +\nu}{2} - 1} e^{- \phi \frac{(\boldsymbol{\theta}- m)^T \boldsymbol{\Sigma}^{-1} (\boldsymbol{\theta}- m)+ \nu {\hat{\sigma}}^2}{2}} \, d \phi \\ & = & \Gamma((p + \nu)/2 ) \left( \frac{(\boldsymbol{\theta}- m)^T \boldsymbol{\Sigma}^{-1} (\boldsymbol{\theta}- m)+ \nu {\hat{\sigma}}^2}{2} \right)^{- \frac{p + \nu}{2}} \\ & \propto & \left( (\boldsymbol{\theta}- m)^T \boldsymbol{\Sigma}^{-1} (\boldsymbol{\theta}- m)+ \nu {\hat{\sigma}}^2\right)^{- \frac{p + \nu}{2}} \\ & \propto & \left( 1 + \frac{1}{\nu} \frac{(\boldsymbol{\theta}- m)^T \boldsymbol{\Sigma}^{-1} (\boldsymbol{\theta}- m)}{{\hat{\sigma}}^2} \right)^{- \frac{p + \nu}{2}} \end{eqnarray*}\]
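The theorem is easy to check numerically in the scalar case: draw \(\phi\) from the Gamma, draw \(\theta \mid \phi\) from the Normal, and compare standardized quantiles with exact \(t_\nu\) quantiles. A sketch with arbitrary illustrative values:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
m, Sigma, nu, sigma2_hat = 2.0, 0.5, 5.0, 1.3     # arbitrary values for the check

phi = rng.gamma(nu / 2, 2.0 / (nu * sigma2_hat), 200_000)   # rate nu*sigma2_hat/2
theta = rng.normal(m, np.sqrt(Sigma / phi))                 # theta | phi
z = (theta - m) / np.sqrt(sigma2_hat * Sigma)

print(np.quantile(z, [0.05, 0.5, 0.95]))          # simulated quantiles
print(stats.t.ppf([0.05, 0.5, 0.95], df=nu))      # exact t_nu quantiles
```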

Marginal Posterior Distribution of \(\boldsymbol{\beta}\)

\[\begin{eqnarray*} \boldsymbol{\beta}\mid \phi, \mathbf{Y}& \sim & \textsf{N}( \mathbf{b}_n, \phi^{-1} \boldsymbol{\Phi}_n^{-1}) \\ \phi \mid \mathbf{Y}& \sim & \textsf{Gamma}\left(\frac{\nu_n}{2}, \frac{\textsf{SS}_n}{ 2} \right) \end{eqnarray*}\]

  • Let \({\hat{\sigma}}^2= \textsf{SS}_n/\nu_n\) (Bayesian MSE)

  • The marginal posterior distribution of \(\boldsymbol{\beta}\) is multivariate Student-t \[ \boldsymbol{\beta}\mid \mathbf{Y}\sim t_{\nu_n} (\mathbf{b}_n, {\hat{\sigma}}^2\boldsymbol{\Phi}_n^{-1}) \]

  • Any linear combination \(\lambda^T\boldsymbol{\beta}\) has a univariate \(t\) distribution with \(\nu_n\) degrees of freedom \[\lambda^T\boldsymbol{\beta}\mid \mathbf{Y}\sim t_{\nu_n} (\lambda^T\mathbf{b}_n, {\hat{\sigma}}^2\lambda^T\boldsymbol{\Phi}_n^{-1}\lambda)\]

  • use for individual coefficients \(\beta_j\), the mean of \(Y\) at \(\mathbf{x}\), \(\mathbf{x}^T \boldsymbol{\beta}\), or predictions \(Y^* = {\mathbf{x}^*}^T \boldsymbol{\beta}+ \epsilon^*\), as in the sketch below
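For example, a credible interval for any \(\lambda^T\boldsymbol{\beta}\) needs only the \(t\) quantile and the location/scale above; a sketch reusing the hyperparameters from the earlier ng_update helper (an assumed name):

```python
import numpy as np
from scipy import stats

def credible_interval(lam, b_n, Phi_n, nu_n, SS_n, level=0.95):
    """Equal-tail credible interval for lam' beta from its Student-t marginal."""
    sigma2_hat = SS_n / nu_n                       # Bayesian MSE
    center = lam @ b_n
    scale = np.sqrt(sigma2_hat * lam @ np.linalg.solve(Phi_n, lam))
    tq = stats.t.ppf(0.5 + level / 2, df=nu_n)
    return center - tq * scale, center + tq * scale
```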

Predictive Distributions

Suppose \(\mathbf{Y}^* \mid \boldsymbol{\beta}, \phi \sim \textsf{N}_s(\mathbf{X}^* \boldsymbol{\beta}, \mathbf{I}_s/\phi)\) and is conditionally independent of \(\mathbf{Y}\) given \(\boldsymbol{\beta}\) and \(\phi\)

  • What is the predictive distribution of \(\mathbf{Y}^* \mid \mathbf{Y}\)?

  • Use the representation that \(\mathbf{Y}^* \mathrel{\mathop{=}\limits^{\rm D}}\mathbf{X}^* \boldsymbol{\beta}+ \boldsymbol{\epsilon}^*\) and \(\boldsymbol{\epsilon}^*\) is independent of \(\mathbf{Y}\) given \(\phi\)

\[\begin{eqnarray*} \mathbf{X}^* \boldsymbol{\beta}+ \boldsymbol{\epsilon}^* \mid \phi, \mathbf{Y}& \sim & \textsf{N}(\mathbf{X}^*\mathbf{b}_n, (\mathbf{X}^{*} \boldsymbol{\Phi}_n^{-1} \mathbf{X}^{*T} + \mathbf{I}_s)/\phi) \\ \mathbf{Y}^* \mid \phi, \mathbf{Y}& \sim & \textsf{N}(\mathbf{X}^*\mathbf{b}_n, (\mathbf{X}^{*} \boldsymbol{\Phi}_n^{-1} \mathbf{X}^{*T} + \mathbf{I}_s)/\phi) \\ \phi \mid \mathbf{Y}& \sim & \textsf{Gamma}\left(\frac{\nu_n}{2}, \frac{{\hat{\sigma}}^2\nu_n}{ 2} \right) \end{eqnarray*}\]

  • Use the Theorem to conclude that \[\mathbf{Y}^* \mid \mathbf{Y}\sim t_{\nu_n}( \mathbf{X}^*\mathbf{b}_n, {\hat{\sigma}}^2(\mathbf{I}_s + \mathbf{X}^* \boldsymbol{\Phi}_n^{-1} \mathbf{X}^{*T}))\]
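Equivalently, predictive draws can be generated by composition: draw \(\phi \mid \mathbf{Y}\), then \(\mathbf{Y}^* \mid \phi, \mathbf{Y}\). A sketch reusing the posterior hyperparameters (names assumed from the earlier sketch):

```python
import numpy as np

rng = np.random.default_rng(2)

def predictive_draws(X_star, b_n, Phi_n, nu_n, SS_n, size=5000):
    """Draws from Y* | Y via the scale mixture of Normals."""
    s = X_star.shape[0]
    base_cov = X_star @ np.linalg.solve(Phi_n, X_star.T) + np.eye(s)
    phi = rng.gamma(nu_n / 2, 2.0 / SS_n, size)
    return np.array([rng.multivariate_normal(X_star @ b_n, base_cov / f)
                     for f in phi])
```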

Choice of Conjugate (or Semi-Conjugate) Prior

  • need to specify Normal prior mean \(\mathbf{b}_0\) and precision \(\boldsymbol{\Phi}_0\)

  • need to specify the Gamma shape (\(\nu_0/2\), where \(\nu_0\) is the prior df) and rate (\(\textsf{SS}_0/2\), based on a prior estimate of \(\sigma^2\))

  • hard in higher dimensions!

  • default choices?

    • Jeffreys’ prior
    • unit-information prior
    • Zellner’s g-prior
    • ridge priors
    • mixtures of conjugate priors

Jeffreys’ Prior

  • Jeffreys prior is invariant to model parameterization of \(\boldsymbol{\theta}= (\boldsymbol{\beta},\phi)\) \[p(\boldsymbol{\theta}) \propto |{\cal{I}}(\boldsymbol{\theta})|^{1/2}\]

  • \({\cal{I}}(\boldsymbol{\theta})\) is the Expected Fisher Information matrix \[{\cal{I}}(\boldsymbol{\theta}) = - \textsf{E}\left[ \frac{\partial^2 \log({\cal{L}}(\boldsymbol{\theta}))}{\partial \theta_i \partial \theta_j} \right]\]

  • log likelihood expressed as function of sufficient statistics

\[\log({\cal{L}}(\boldsymbol{\beta}, \phi)) = \frac{n}{2} \log(\phi) - \frac{\phi}{2} \| (\mathbf{I}_n - \mathbf{P}_{\mathbf{X}}) \mathbf{Y}\|^2 - \frac{\phi}{2}(\boldsymbol{\beta}- \hat{\boldsymbol{\beta}})^T(\mathbf{X}^T\mathbf{X})(\boldsymbol{\beta}- \hat{\boldsymbol{\beta}})\]

  • projection matrix \(\mathbf{P}_{\mathbf{X}} = \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1} \mathbf{X}^T\)

Information matrix

\[\begin{eqnarray*} \frac{\partial^2 \log {\cal{L}}} { \partial \boldsymbol{\theta}\partial \boldsymbol{\theta}^T} & = & \left[ \begin{array}{cc} -\phi (\mathbf{X}^T\mathbf{X}) & -(\mathbf{X}^T\mathbf{X}) (\boldsymbol{\beta}- \hat{\boldsymbol{\beta}}) \\ - (\boldsymbol{\beta}- \hat{\boldsymbol{\beta}})^T (\mathbf{X}^T\mathbf{X}) & -\frac{n}{2} \frac{1}{\phi^2} \\ \end{array} \right] \\ \textsf{E}[\frac{\partial^2 \log {\cal{L}}} { \partial \boldsymbol{\theta}\partial \boldsymbol{\theta}^T}] & = & \left[ \begin{array}{cc} -\phi (\mathbf{X}^T\mathbf{X}) & \mathbf{0}_p \\ \mathbf{0}_p^T & -\frac{n}{2} \frac{1}{\phi^2} \\ \end{array} \right] \\ & & \\ {\cal{I}}((\boldsymbol{\beta}, \phi)^T) & = & \left[ \begin{array}{cc} \phi (\mathbf{X}^T\mathbf{X}) & \mathbf{0}_p \\ \mathbf{0}_p^T & \frac{n}{2} \frac{1}{\phi^2} \end{array} \right] \end{eqnarray*}\]

Jeffreys’ Prior (don’t use!) \[\begin{eqnarray*} p_J(\boldsymbol{\beta}, \phi) & \propto & |{\cal{I}}((\boldsymbol{\beta}, \phi)^T) |^{1/2} = |\phi \mathbf{X}^T\mathbf{X}|^{1/2} \left(\frac{n}{2} \frac{1}{\phi^2} \right)^{1/2} \propto \phi^{p/2 - 1} |\mathbf{X}^T\mathbf{X}|^{1/2} \\ & \propto & \phi^{p/2 - 1} \end{eqnarray*}\]

Formal Posterior Distribution

  • Use the Independent Jeffreys Prior \(p_{IJ}(\boldsymbol{\beta}, \phi) \propto p_{IJ}(\boldsymbol{\beta})\, p_{IJ}(\phi) \propto 1 \cdot \phi^{-1} = \phi^{-1}\)

  • Formal Posterior Distribution, with \({\hat{\sigma}}^2= \| \mathbf{Y}- \mathbf{X}\hat{\boldsymbol{\beta}}\|^2/(n-p)\) \[\begin{eqnarray*} \boldsymbol{\beta}\mid \phi, \mathbf{Y}& \sim & \textsf{N}(\hat{\boldsymbol{\beta}}, (\mathbf{X}^T\mathbf{X})^{-1} \phi^{-1}) \\ \phi \mid \mathbf{Y}& \sim& \textsf{Gamma}((n-p)/2, \| \mathbf{Y}- \mathbf{X}\hat{\boldsymbol{\beta}}\|^2/2) \\ \boldsymbol{\beta}\mid \mathbf{Y}& \sim & t_{n-p}(\hat{\boldsymbol{\beta}}, {\hat{\sigma}}^2(\mathbf{X}^T\mathbf{X})^{-1}) \end{eqnarray*}\]

  • Bayesian Credible Sets \(p(\boldsymbol{\beta}\in C_\alpha \mid \mathbf{Y}) = 1 - \alpha\) correspond to frequentist Confidence Regions \[\frac{\mathbf{x}^T\boldsymbol{\beta}- \mathbf{x}^T \hat{\boldsymbol{\beta}}} {\sqrt{{\hat{\sigma}}^2\mathbf{x}^T(\mathbf{X}^T\mathbf{X})^{-1} \mathbf{x}} }\sim t_{n-p}\]

  • conditional on \(\mathbf{Y}\) for Bayes and conditional on \(\boldsymbol{\beta}\) for frequentist
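A quick numerical illustration (simulated data, not from the lecture): under the independent Jeffreys prior the 95% credible interval for \(\mathbf{x}^T\boldsymbol{\beta}\) is numerically identical to the classical confidence interval, since both are built from the same \(t_{n-p}\) pivot.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n, p = 50, 3
X = rng.normal(size=(n, p))
Y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=n)

XtX = X.T @ X
beta_hat = np.linalg.solve(XtX, X.T @ Y)
sigma2_hat = np.sum((Y - X @ beta_hat) ** 2) / (n - p)

x = np.ones(p)
se = np.sqrt(sigma2_hat * x @ np.linalg.solve(XtX, x))
tq = stats.t.ppf(0.975, df=n - p)
# the same interval, read as credible (Bayes) or confidence (frequentist)
print(x @ beta_hat - tq * se, x @ beta_hat + tq * se)
```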

Other priors next

Invariance and Choice of Mean/Precision

  • the model in vector form \(Y \mid \beta, \phi \sim \textsf{N}_n (X\beta, \phi^{-1} I_n)\)

  • What if we transform the mean \(X\beta = X H H^{-1} \beta\), giving a new design matrix \(\tilde{X} = X H\), where \(H\) is \(p \times p\) and invertible, with coefficients \(\tilde{\beta} = H^{-1} \beta\)?

  • obtain the posterior for \(\tilde{\beta}\) using \(Y\) and \(\tilde{X}\)
    \[ Y \mid \tilde{\beta}, \phi \sim \textsf{N}_n (\tilde{X}\tilde{\beta}, \phi^{-1} I_n)\]

  • since \(\tilde{X} \tilde{\beta} = X H \tilde{\beta} = X \beta\) invariance suggests that the posterior for \(\beta\) and \(H \tilde{\beta}\) should be the same

  • plus the posterior of \(H^{-1} \beta\) and \(\tilde{\beta}\) should be the same

Exercise for the Energetic Student

With some linear algebra, show that this holds for a normal prior if \(b_0 = 0\) and \(\Phi_0 = k X^TX\) for some \(k > 0\)
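Not a substitute for the algebra, but a numerical sanity check of the claim (simulated data; H is any invertible matrix): with \(b_0 = 0\) and \(\Phi_0 = k X^TX\), the posterior mean for \(\beta\) equals \(H\) times the posterior mean for \(\tilde{\beta}\) computed from \(\tilde{X} = XH\).

```python
import numpy as np

rng = np.random.default_rng(5)
n, p, k = 40, 3, 0.5
X = rng.normal(size=(n, p))
Y = X @ np.array([1.0, -1.0, 2.0]) + rng.normal(size=n)
H = rng.normal(size=(p, p))                    # invertible with probability 1

def post_mean(X, Y, k):
    XtX = X.T @ X
    # b0 = 0 and Phi0 = k X'X, so b_n = (XtX + k XtX)^{-1} X'Y
    return np.linalg.solve(XtX + k * XtX, X.T @ Y)

print(post_mean(X, Y, k))
print(H @ post_mean(X @ H, Y, k))              # agrees up to rounding error
```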

Zellner’s g-prior

  • A popular choice is to take \(k = \phi/g\), a special case of Zellner’s g-prior \[\beta \mid \phi, g \sim \textsf{N}\left(0, \frac{g}{\phi} (X^TX)^{-1}\right)\]

  • Full conditional posterior \[\beta \mid \phi, g, Y \sim \textsf{N}\left(\frac{g}{1 + g} \hat{\beta}, \frac{1}{\phi} \frac{g}{1 + g} (X^TX)^{-1}\right)\]

  • one parameter \(g\) controls shrinkage

  • if \(\phi \sim \textsf{Gamma}(\nu_0/2, s_0/2)\) then the posterior is \[\phi \mid y_1, \ldots, y_n \sim \textsf{Gamma}(\nu_n/2, s_n/2)\]

  • Conjugate, so we can skip Gibbs sampling and sample directly: draw \(\phi\) from its Gamma marginal and then \(\beta\) from the conditional Normal! (See the sketch below.)
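A minimal sketch of that direct sampler, assuming X, Y, g, nu0, SS0 are given; the form of SS_n below follows from the conjugate update with \(b_0 = 0\) and \(\Phi_0 = X^TX/g\) (worked out here, not stated on the slide):

```python
import numpy as np

rng = np.random.default_rng(4)

def g_prior_sample(Y, X, g, nu0, SS0, size=5000):
    """Direct draws of (beta, phi) under Zellner's g-prior."""
    n, p = X.shape
    XtX = X.T @ X
    beta_hat = np.linalg.solve(XtX, X.T @ Y)
    sse = np.sum((Y - X @ beta_hat) ** 2)
    nu_n = nu0 + n
    SS_n = SS0 + sse + (beta_hat @ XtX @ beta_hat) / (1 + g)
    shrink = g / (1 + g)                           # shrinkage factor g/(1+g)
    cov = shrink * np.linalg.inv(XtX)
    phi = rng.gamma(nu_n / 2, 2.0 / SS_n, size)    # phi | Y
    beta = np.array([rng.multivariate_normal(shrink * beta_hat, cov / f)
                     for f in phi])
    return beta, phi
```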

Ridge Regression

  • If \(X^TX\) is nearly singular, certain elements of \(\beta\) (or linear combinations of \(\beta\)) may have huge variances under the \(g\)-prior (or flat prior), as the MLEs are highly unstable!

  • Ridge regression protects against the explosion of variances and ill-conditioning with the conjugate priors: \[\beta \mid \phi \sim \textsf{N}(0, \frac{1}{\phi \lambda} I_p)\]

  • Posterior for \(\beta\) (conjugate case) \[\beta \mid \phi, \lambda, y_1, \ldots, y_n \sim \textsf{N}\left((\lambda I_p + X^TX)^{-1} X^T Y, \frac{1}{\phi}(\lambda I_p + X^TX)^{-1} \right)\]
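The posterior mean is a standard ridge computation; a short sketch (with data X, Y assumed given) showing that the solve stays well-conditioned even when \(X^TX\) is nearly singular:

```python
import numpy as np

def ridge_posterior_mean(Y, X, lam):
    """Posterior mean (lambda I + X'X)^{-1} X'Y under the ridge prior."""
    p = X.shape[1]
    return np.linalg.solve(lam * np.eye(p) + X.T @ X, X.T @ Y)

# shrinkage toward zero as lambda grows:
# for lam in (0.01, 1.0, 100.0): print(lam, ridge_posterior_mean(Y, X, lam))
```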

Bayes Regression

  • The posterior mean (or mode) given \(\lambda\) is biased, but one can show that there is always a value of \(\lambda\) for which the ridge estimator has smaller frequentist expected squared error loss than the MLE!

  • related to penalized maximum likelihood estimation

  • Choice of \(\lambda\)

  • Bayes Regression and the choice of \(\Phi_0\) in general is a very important problem and provides the foundation for many variations on shrinkage estimators, variable selection, hierarchical models, nonparametric regression and more!

  • Be sure that you can derive the full conditional posteriors for \(\beta\) and \(\phi\) as well as the joint posterior in the conjugate case!