STA702
Duke University
Regression Model (Sampling model) \[\mathbf{Y}\mid \boldsymbol{\beta}, \phi \sim \textsf{N}(\mathbf{X}\boldsymbol{\beta}, \phi^{-1} \mathbf{I}_n) \]
Conjugate Normal-Gamma Model: factor joint prior \(p(\boldsymbol{\beta}, \phi ) = p(\boldsymbol{\beta}\mid \phi)p(\phi)\) \[\begin{align*} \boldsymbol{\beta}\mid \phi & \sim \textsf{N}(\mathbf{b}_0, \phi^{-1}\boldsymbol{\Phi}_0^{-1}) & p(\boldsymbol{\beta}\mid \phi) & = \frac{|\phi \boldsymbol{\Phi}_0|^{1/2}}{(2 \pi)^{p/2}}e^{\left\{- \frac{\phi}{2}(\boldsymbol{\beta}- \mathbf{b}_0)^T \boldsymbol{\Phi}_0 (\boldsymbol{\beta}- \mathbf{b}_0) \right\}}\\ \phi & \sim \textsf{Gamma}(\nu_0/2, \textsf{SS}_0/2) & p(\phi) & = \frac{1}{\Gamma{(\nu_0/2)}} \left(\frac{\textsf{SS}_0}{2} \right)^{\nu_0/2} \phi^{\nu_0/2 - 1} e^{- \phi \frac{\textsf{SS}_0}{2}}\\ \Rightarrow (\boldsymbol{\beta}, \phi) & \sim \textsf{NG}(\mathbf{b}_0, \boldsymbol{\Phi}_0, \nu_0, \textsf{SS}_0) \end{align*}\]
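A minimal Python sketch of drawing \((\boldsymbol{\beta}, \phi)\) from this Normal-Gamma prior; the hyperparameter values below are purely illustrative, not prescribed by these notes.

```python
import numpy as np

rng = np.random.default_rng(42)

p = 3
b0 = np.zeros(p)       # prior mean b_0
Phi0 = np.eye(p)       # prior precision factor Phi_0 (scaled by phi)
nu0, SS0 = 2.0, 2.0    # illustrative Gamma(nu_0/2, SS_0/2) hyperparameters

# phi ~ Gamma(nu_0/2, rate = SS_0/2); numpy parameterizes by scale = 1/rate
phi = rng.gamma(shape=nu0 / 2, scale=2 / SS0)

# beta | phi ~ N(b_0, (phi * Phi_0)^{-1})
beta = rng.multivariate_normal(b0, np.linalg.inv(phi * Phi0))
```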
Need to specify the 4 hyperparameters of the Normal-Gamma distribution!
hard in higher dimensions!
Seek default choices
Which? Why?
Jeffreys prior is invariant to model parameterization of \(\boldsymbol{\theta}= (\boldsymbol{\beta},\phi)\) \[p(\boldsymbol{\theta}) \propto |{\cal{I}}(\boldsymbol{\theta})|^{1/2}\]
\({\cal{I}}(\boldsymbol{\theta})\) is the Expected Fisher Information matrix \[{\cal{I}}(\boldsymbol{\theta}) = - \textsf{E}\left[ \frac{\partial^2 \log({\cal{L}}(\boldsymbol{\theta}))}{\partial \theta_i \partial \theta_j} \right]\]
log likelihood expressed as function of sufficient statistics
\[\log({\cal{L}}(\boldsymbol{\beta}, \phi)) = \frac{n}{2} \log(\phi) - \frac{\phi}{2} \| (\mathbf{I}_n - \mathbf{P}_\mathbf{x}) \mathbf{Y}\|^2 - \frac{\phi}{2}(\boldsymbol{\beta}- \hat{\boldsymbol{\beta}})^T(\mathbf{X}^T\mathbf{X})(\boldsymbol{\beta}- \hat{\boldsymbol{\beta}})\]
\[\begin{eqnarray*} \frac{\partial^2 \log {\cal{L}}} { \partial \boldsymbol{\theta}\partial \boldsymbol{\theta}^T} & = & \left[ \begin{array}{cc} -\phi (\mathbf{X}^T\mathbf{X}) & -(\mathbf{X}^T\mathbf{X}) (\boldsymbol{\beta}- \hat{\boldsymbol{\beta}}) \\ - (\boldsymbol{\beta}- \hat{\boldsymbol{\beta}})^T (\mathbf{X}^T\mathbf{X}) & -\frac{n}{2} \frac{1}{\phi^2} \\ \end{array} \right] \\ \textsf{E}[\frac{\partial^2 \log {\cal{L}}} { \partial \boldsymbol{\theta}\partial \boldsymbol{\theta}^T}] & = & \left[ \begin{array}{cc} -\phi (\mathbf{X}^T\mathbf{X}) & \mathbf{0}_p \\ \mathbf{0}_p^T & -\frac{n}{2} \frac{1}{\phi^2} \\ \end{array} \right] \\ & & \\ {\cal{I}}((\boldsymbol{\beta}, \phi)^T) & = & \left[ \begin{array}{cc} \phi (\mathbf{X}^T\mathbf{X}) & \mathbf{0}_p \\ \mathbf{0}_p^T & \frac{n}{2} \frac{1}{\phi^2} \end{array} \right] \end{eqnarray*}\]
\[\begin{eqnarray*} p_J(\boldsymbol{\beta}, \phi) & \propto & |{\cal{I}}((\boldsymbol{\beta}, \phi)^T) |^{1/2} = |\phi \mathbf{X}^T\mathbf{X}|^{1/2} \left(\frac{n}{2} \frac{1}{\phi^2} \right)^{1/2} \propto \phi^{p/2 - 1} |\mathbf{X}^T\mathbf{X}|^{1/2} \\ & \propto & \phi^{p/2 - 1} \end{eqnarray*}\]
Jeffreys himself did not recommend this prior: the marginal for \(\phi\) does not account for the dimension \(p\)
\[ {\cal{I}}((\boldsymbol{\beta}, \phi)^T) = \left[ \begin{array}{cc} \phi (\mathbf{X}^T\mathbf{X}) & \mathbf{0}_p \\ \mathbf{0}_p^T & \frac{n}{2} \frac{1}{\phi^2} \end{array} \right] \]
\[\begin{align*} p_{IJ}(\boldsymbol{\beta}) & \propto |\phi \mathbf{X}^T\mathbf{X}|^{1/2} \propto 1 \\ p_{IJ}(\phi) & \propto \phi^{-1} \\ p_{IJ}(\boldsymbol{\beta}, \phi) & \propto p_{IJ}(\boldsymbol{\beta}) p_{IJ}(\phi) = \phi^{-1} \end{align*}\]
Two group reference prior
Use Independent Jeffreys Prior \(p_{IJ}(\boldsymbol{\beta}, \phi) \propto p_{IJ}(\boldsymbol{\beta}) p_{IJ}(\phi) = \phi^{-1}\)
Formal Posterior Distribution \[\begin{eqnarray*} \boldsymbol{\beta}\mid \phi, \mathbf{Y}& \sim & \textsf{N}(\hat{\boldsymbol{\beta}}, (\mathbf{X}^T\mathbf{X})^{-1} \phi^{-1}) \\ \phi \mid \mathbf{Y}& \sim& \textsf{Gamma}((n-p)/2, \| \mathbf{Y}- \mathbf{X}\hat{\boldsymbol{\beta}}\|^2/2) \\ \boldsymbol{\beta}\mid \mathbf{Y}& \sim & t_{n-p}(\hat{\boldsymbol{\beta}}, {\hat{\sigma}}^2(\mathbf{X}^T\mathbf{X})^{-1}) \end{eqnarray*}\] where \({\hat{\sigma}}^2 = \| \mathbf{Y}- \mathbf{X}\hat{\boldsymbol{\beta}}\|^2/(n-p)\)
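A minimal Python sketch of exact simulation from this posterior; the data \(\mathbf{X}\), \(\mathbf{Y}\) below are simulated purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 3
X = rng.normal(size=(n, p))                       # illustrative design matrix
Y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=n)

XtX = X.T @ X
beta_hat = np.linalg.solve(XtX, X.T @ Y)          # MLE
SSE = np.sum((Y - X @ beta_hat) ** 2)             # ||Y - X beta_hat||^2

S = 10_000
# phi | Y ~ Gamma((n - p)/2, rate = SSE/2)
phi = rng.gamma(shape=(n - p) / 2, scale=2 / SSE, size=S)

# beta | phi, Y ~ N(beta_hat, (X^T X)^{-1}/phi), drawn via a Cholesky factor
L = np.linalg.cholesky(np.linalg.inv(XtX))
beta = beta_hat + (rng.normal(size=(S, p)) @ L.T) / np.sqrt(phi)[:, None]
# marginally, each draw of beta is t_{n-p}(beta_hat, sigma2_hat (X^T X)^{-1})
```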
Bayesian Credible Sets \(p(\boldsymbol{\beta}\in C_\alpha\mid \mathbf{Y}) = 1- \alpha\) correspond to frequentist Confidence Regions \[\frac{\mathbf{x}^T\boldsymbol{\beta}- \mathbf{x}^T \hat{\boldsymbol{\beta}}} {\sqrt{{\hat{\sigma}}^2\mathbf{x}^T(\mathbf{X}^T\mathbf{X})^{-1} \mathbf{x}} }\sim t_{n-p}\]
conditional on \(\mathbf{Y}\) for Bayes and conditional on \(\boldsymbol{\beta}\) for frequentist
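Continuing the sketch above, a quick numerical check that the 95% posterior credible interval for \(\mathbf{x}^T\boldsymbol{\beta}\) matches the classical \(t\) interval; the choice \(\mathbf{x} = X[0]\) is arbitrary.

```python
from scipy import stats

sigma2_hat = SSE / (n - p)
x = X[0]                                          # any fixed covariate vector
se = np.sqrt(sigma2_hat * x @ np.linalg.solve(XtX, x))
t_crit = stats.t.ppf(0.975, df=n - p)

ci = (x @ beta_hat - t_crit * se, x @ beta_hat + t_crit * se)
mc_ci = np.quantile(beta @ x, [0.025, 0.975])     # Monte Carlo credible interval
# ci and mc_ci should agree up to Monte Carlo error
```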
Unit information prior \(\boldsymbol{\beta}\mid \phi \sim \textsf{N}(\hat{\boldsymbol{\beta}}, n (\mathbf{X}^T\mathbf{X})^{-1}/\phi)\)
\[\log(p(\boldsymbol{\beta}, \phi)) \propto \frac{1}{n}\frac{n}{2} \log(\phi) - \frac{\phi}{2} \frac{\| (\mathbf{I}_n - \mathbf{P}_\mathbf{x}) \mathbf{Y}\|^2}{n} - \frac{\phi}{2}(\boldsymbol{\beta}- \hat{\boldsymbol{\beta}})^T\frac{(\mathbf{X}^T\mathbf{X})}{n}(\boldsymbol{\beta}- \hat{\boldsymbol{\beta}})\]
“average information” in one observation is \(\phi \mathbf{X}^T\mathbf{X}/n\) or “unit information”
Posterior mean: since the prior is centered at the MLE, \(\frac{n}{1 + n} \hat{\boldsymbol{\beta}}+ \frac{1}{1 + n}\hat{\boldsymbol{\beta}}= \hat{\boldsymbol{\beta}}\)
Posterior Distribution \[\boldsymbol{\beta}\mid \mathbf{Y}, \phi \sim \textsf{N}\left( \hat{\boldsymbol{\beta}}, \frac{n}{1 + n} (\mathbf{X}^T\mathbf{X})^{-1} \phi^{-1} \right)\]
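Reusing X, Y, and beta_hat from the earlier sketch, a short check of the unit-information posterior: the posterior mean is still \(\hat{\boldsymbol{\beta}}\) and the conditional covariance factor shrinks by \(n/(1+n)\).

```python
prior_prec = XtX / n                        # unit information precision (phi factored out)
post_prec = XtX + prior_prec                # = (1 + 1/n) X^T X
post_mean = np.linalg.solve(post_prec, X.T @ Y + prior_prec @ beta_hat)

assert np.allclose(post_mean, beta_hat)     # (n/(1+n)) beta_hat + (1/(1+n)) beta_hat
assert np.allclose(np.linalg.inv(post_prec),
                   (n / (1 + n)) * np.linalg.inv(XtX))
```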
Advantages:
Disadvantages:
Exercise for the Energetic Student
the model in vector form \(Y \mid \beta, \phi \sim \textsf{N}_n (X\beta, \phi^{-1} I_n)\)
What if we transform the mean \(X\beta = X H H^{-1} \beta\), with new design matrix \(\tilde{X} = X H\), where \(H\) is \(p \times p\) and invertible, and coefficients \(\tilde{\beta} = H^{-1} \beta\)?
obtain the posterior for \(\tilde{\beta}\) using \(Y\) and \(\tilde{X}\)
\[ Y \mid \tilde{\beta}, \phi \sim \textsf{N}_n (\tilde{X}\tilde{\beta}, \phi^{-1} I_n)\]
since \(\tilde{X} \tilde{\beta} = X H \tilde{\beta} = X \beta\) invariance suggests that the posterior for \(\beta\) and \(H \tilde{\beta}\) should be the same
plus the posterior of \(H^{-1} \beta\) and \(\tilde{\beta}\) should be the same
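Again continuing the earlier simulation, a numerical sketch of this invariance under the reference prior: the posterior of \(H\tilde{\beta}\) matches the posterior of \(\beta\) (any invertible \(H\) will do).

```python
H = rng.normal(size=(p, p))                 # a random invertible p x p matrix
X_tilde = X @ H

bt_hat = np.linalg.solve(X_tilde.T @ X_tilde, X_tilde.T @ Y)  # MLE of tilde_beta
assert np.allclose(H @ bt_hat, beta_hat)                      # posterior means agree

cov_t = np.linalg.inv(X_tilde.T @ X_tilde)  # posterior covariance factor of tilde_beta
assert np.allclose(H @ cov_t @ H.T, np.linalg.inv(XtX))       # covariances agree
```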
Exercise for the Energetic Student
With some linear algebra, show that this is true for a normal prior if \(b_0 = 0\) and \(\Phi_0\) is \(k X^TX\) for some \(k\)
A popular choice takes the prior precision to be \((\phi/g)\, X^TX\) (i.e. \(k = \phi/g\)), a special case of Zellner’s g-prior \[\beta \mid \phi, g \sim \textsf{N}\left(\mathbf{b}_0, \frac{g}{\phi} (X^TX)^{-1}\right)\]
Full conditional (see the sketch after this list) \[\beta \mid \phi, g, \mathbf{Y} \sim \textsf{N}\left(\frac{g}{1 + g} \hat{\beta} + \frac{1}{1 + g}\mathbf{b}_0, \frac{1}{\phi} \frac{g}{1 + g} (X^TX)^{-1}\right)\]
one parameter \(g\) controls shrinkage
invariance under linear transformations of \(\mathbf{X}\) with \(\mathbf{b}_0 = 0\) or transform mean \(\tilde{\mathbf{b}}_0 = H^{-1}\mathbf{b}_0\)
often paired with Jeffreys’ reference prior for \(\phi\)
allows an informative mean, but keeps the same correlation structure as the MLE
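The sketch referenced above: computing the g-prior full conditional moments, reusing beta_hat and XtX from the earlier simulation. The values \(g = n\) and \(\phi = 1\) are illustrative (\(g = n\) matches the unit information prior).

```python
g = float(n)                                # illustrative; g = n gives unit information
b0 = np.zeros(p)
phi_fixed = 1.0                             # an illustrative fixed value of phi

shrink = g / (1 + g)                        # single shrinkage factor controlled by g
post_mean_g = shrink * beta_hat + (1 - shrink) * b0
post_cov_g = (shrink / phi_fixed) * np.linalg.inv(XtX)
```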
Zellner also realized that different blocks might have different degrees of prior information
Two blocks \(\mathbf{X}_1\) and \(\mathbf{X}_2\) with \(\mathbf{X}_1^T \mathbf{X}_2 = 0\) so Fisher Information is block diagonal
Model \(\mathbf{Y}= \mathbf{X}_1 \boldsymbol{\alpha}+ \mathbf{X}_2 \boldsymbol{\beta}+ \boldsymbol{\epsilon}\)
Priors \[\begin{align} \boldsymbol{\alpha}\mid \phi & \sim \textsf{N}(\boldsymbol{\alpha}_1, \frac{g_{\boldsymbol{\alpha}}}{\phi} (\mathbf{X}_1^T\mathbf{X}_1)^{-1})\\ \boldsymbol{\beta}\mid \phi & \sim \textsf{N}(\mathbf{b}_0, \frac{g_{\boldsymbol{\beta}}}{\phi} (\mathbf{X}_2^T\mathbf{X}_2)^{-1}) \end{align}\]
Important case \(\mathbf{X}_1 = \mathbf{1}_n\) corresponding to intercept with limiting case \(g_{\boldsymbol{\alpha}} \to \infty\) \[p(\boldsymbol{\alpha}) \propto 1\]
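A brief numerical sketch of the block-orthogonal case with \(\mathbf{X}_1 = \mathbf{1}_n\): centering the predictors makes them orthogonal to the intercept column, and under the flat limit \(p(\boldsymbol{\alpha}) \propto 1\) the intercept posterior centers at \(\bar{Y}\) (again reusing the simulated \(X\), \(Y\)).

```python
ones = np.ones((n, 1))
Xc = X - X.mean(axis=0)                     # centered predictors
assert np.allclose(ones.T @ Xc, 0)          # X_1^T X_2 = 0: blocks are orthogonal

alpha_hat = Y.mean()                        # posterior center for the intercept
# the g-prior block for beta then uses Xc exactly as in the one-block case
```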
The posteriors under Jeffreys’ prior(s), the unit information prior, and Zellner’s g-priors all depend on \((\mathbf{X}^T\mathbf{X})^{-1}\) and the MLE \(\hat{\boldsymbol{\beta}}\)
If \(\mathbf{X}^T\mathbf{X}= \mathbf{U}\boldsymbol{\Lambda}\mathbf{U}^T\) is nearly singular (\(\lambda_j \approx 0\) for one or more eigenvalues), certain elements of \(\boldsymbol{\beta}\) (or linear combinations of \(\boldsymbol{\beta}\)) may have huge posterior variances, and the MLEs (and posterior means) are highly unstable!
there is no unique posterior distribution if any \(\lambda_j = 0\)! (\(p > n\) or \(\mathbf{X}\) not of full rank)
Posterior Precision and Mean in the conjugate prior \[\begin{align} \boldsymbol{\Phi}_n & = \mathbf{X}^T\mathbf{X}+ \boldsymbol{\Phi}_0 \\ \mathbf{b}_n & = \boldsymbol{\Phi}_n^{-1} (\mathbf{X}^T\mathbf{Y}+ \boldsymbol{\Phi}_0 \mathbf{b}_0) \end{align}\]
Need a proper prior with \(\boldsymbol{\Phi}_0 > 0\) (taking \(\mathbf{b}_0 = \mathbf{0}\) is fine)
Simplest case: take \(\boldsymbol{\Phi}_0 = \kappa \mathbf{I}_p\) so that \(\boldsymbol{\Phi}_n = \mathbf{X}^T\mathbf{X}+ \kappa \mathbf{I}_p = \mathbf{U}(\boldsymbol{\Lambda}+ \kappa \mathbf{I}_p) \mathbf{U}^T > 0\)
Model: \(\mathbf{Y}= \mathbf{1}_n \alpha + \mathbf{X}\boldsymbol{\beta}+ \boldsymbol{\epsilon}\)
WLOG assume that \(\mathbf{X}\) has been centered and scaled so that \(\mathbf{X}^T\mathbf{X}= \textsf{corr}(\mathbf{X})\)
typically expect the intercept \(\alpha\) to be of a different order of magnitude than the other coefficients.
Prior \(\boldsymbol{\beta}\mid \phi \sim \textsf{N}(\mathbf{0}_p, \frac{1}{\phi \kappa} \mathbf{I}_p)\) implies the \(\beta_j\) are exchangeable a priori (i.e. the distribution is invariant under permuting the labels, with a common mean and scale)
Posterior for \(\boldsymbol{\beta}\) \[\boldsymbol{\beta}\mid \phi, \kappa, \mathbf{Y}\sim \textsf{N}\left((\kappa I_p + X^TX)^{-1} X^T Y, \frac{1}{\phi}(\kappa I_p + X^TX)^{-1} \right)\]
The posterior mean (or mode) given \(\kappa\) is biased, but one can show that there is always a value of \(\kappa\) for which the frequentist expected squared error loss of the ridge estimator is smaller than that of the MLE!
Unfortunately the optimal choice depends on “true” \(\boldsymbol{\beta}\)!
related to penalized maximum likelihood estimation (ridge regression), which maximizes \[-\frac{\phi}{2}\left(\|\mathbf{Y}- \mathbf{X}\boldsymbol{\beta}\|^2 + \kappa \| \boldsymbol{\beta}\|^2 \right) \] (see the sketch below)
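A minimal sketch of this connection, reusing the centered Xc from above (Y is centered here too, though Xc is not rescaled to correlation form): the ridge posterior mean equals the penalized least-squares solution, computable directly or via the eigendecomposition of \(X^TX\); \(\kappa = 1\) is illustrative.

```python
kappa = 1.0                                 # illustrative penalty / prior precision
Yc = Y - Y.mean()

# posterior mean = penalized least-squares (ridge) estimate
ridge_mean = np.linalg.solve(Xc.T @ Xc + kappa * np.eye(p), Xc.T @ Yc)

# the same estimate via the eigendecomposition X^T X = U Lambda U^T
lam, U = np.linalg.eigh(Xc.T @ Xc)
ridge_mean2 = U @ ((U.T @ (Xc.T @ Yc)) / (lam + kappa))
assert np.allclose(ridge_mean, ridge_mean2)
```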
Choice of \(\kappa\) ?
Should there be a common \(\kappa\)? Or a \(\kappa_j\) per variable? (or shared in a group?)
can place a prior on \(\kappa\) or \(\kappa_j\) for fully Bayes
similar issue for \(g\) in the \(g\) priors
often improves robustness over fixed choices of the hyperparameters
may not have a closed-form posterior, but sampling is still often easy!
Examples: Bayesian Lasso, Double Laplace, Horseshoe prior, mixtures of \(g\)-priors