STA702
Duke University
Metropolis requires that the proposal distribution be symmetric
Hastings (1970) generalizes the Metropolis algorithm to allow asymmetric proposals, aka Metropolis-Hastings or MH: \(q(\theta^* \mid \theta^{(s)})\) does not need to be the same as \(q(\theta^{(s)} \mid \theta^*)\)
propose \(\theta^* \mid \theta^{(s)} \sim q(\theta^* \mid \theta^{(s)})\)
Acceptance probability \[\min \left\{ 1, \frac{\pi(\theta^*) \cal{L}(\theta^*)/q(\theta^* \mid \theta^{(s)})} {\pi(\theta^{(s)}) \cal{L}(\theta^{(s)})/q( \theta^{(s)} \mid \theta^*)} \right\}\]
adjustment for asymmetry in acceptance ratio is key to ensuring convergence to stationary distribution!
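As a concrete reference, here is a minimal Python sketch of a single MH update with a possibly asymmetric proposal; `log_post`, `q_sample`, and `q_logpdf` are placeholder functions the user would supply (they are not part of the slides).

```python
import numpy as np

rng = np.random.default_rng(0)

def mh_step(theta, log_post, q_sample, q_logpdf):
    """One Metropolis-Hastings update.

    log_post(theta): log pi(theta) + log L(theta) (unnormalized log posterior)
    q_sample(theta): draw theta* ~ q(. | theta)
    q_logpdf(new, old): log q(new | old); needed because q may be asymmetric
    """
    theta_star = q_sample(theta)
    log_r = (log_post(theta_star) - q_logpdf(theta_star, theta)) \
          - (log_post(theta)      - q_logpdf(theta, theta_star))
    if np.log(rng.uniform()) < min(0.0, log_r):
        return theta_star          # accept with probability min{1, r}
    return theta                   # reject, keep the current value
```

If the proposal happens to be symmetric, the two `q_logpdf` terms cancel and this reduces to the plain Metropolis update.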
Metropolis
Independence chain
Gibbs samplers
Metropolis-within-Gibbs
combinations of the above!
suppose we have a good approximation \(\tilde{\pi}(\theta \mid y)\) to \(\pi(\theta \mid y)\)
Draw \(\theta^* \sim \tilde{\pi}(\theta \mid y)\) without conditioning on \(\theta^{(s)}\)
acceptance probability \[\min \left\{ 1, \frac{\pi(\theta^*) \cal{L}(\theta^*)/\tilde{\pi}(\theta^* \mid y)} {\pi(\theta^{(s)}) \cal{L}(\theta^{(s)})/\tilde{\pi}( \theta^{(s)} \mid y)} \right\}\]
what happens if the approximation is really accurate?
probability of acceptance is \(\approx 1\)
Important caveat for convergence: the tails of the proposal should be at least as heavy as the tails of the posterior (Tweedie 1994)
Replace Gaussian by a Student-t with low degrees of freedom
transformations of \(\theta\) to improve the approximation
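A sketch of an independence-chain sampler along these lines, using a Student-t proposal centered at an assumed approximation; `loc`, `scale`, and `df` would come from, e.g., a Laplace approximation, and all names here are illustrative.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def independence_mh(log_post, loc, scale, df, n_iter=5000, theta0=0.0):
    """Independence-chain MH: proposals do not depend on the current state.

    The proposal is a Student-t centered at an approximate posterior mode;
    low df gives heavier tails than a Gaussian proposal.
    """
    prop = stats.t(df=df, loc=loc, scale=scale)
    theta = theta0
    draws = np.empty(n_iter)
    for s in range(n_iter):
        theta_star = prop.rvs(random_state=rng)
        log_r = (log_post(theta_star) - prop.logpdf(theta_star)) \
              - (log_post(theta)      - prop.logpdf(theta))
        if np.log(rng.uniform()) < min(0.0, log_r):
            theta = theta_star
        draws[s] = theta
    return draws
```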
So far all algorithms update all of the parameters simultaneously
convenient to break the problem into \(K\) blocks and update them separately
\(\theta = (\theta_{[1]}, \ldots, \theta_{[K]}) = (\theta_1, \ldots, \theta_p)\)
At iteration \(s\), for \(k = 1, \ldots, K\), cycle through the blocks (in fixed or random order)
propose \(\theta^*_{[k]} \sim q_k(\theta_{[k]} \mid \theta_{[<k]}^{(s)}, \theta_{[>k]}^{(s-1)})\)
set \(\theta_{[k]}^{(s)} = \theta^*_{[k]}\) with probability \[\min \left\{ 1, \frac{ \pi(\theta_{[<k]}^{(s)},\theta_{[k]}^*, \theta_{[>k]}^{(s-1)}) \cal{L}(\theta_{[<k]}^{(s)},\theta_{[k]}^*, \theta_{[>k]}^{(s-1)})/ q_k(\theta_{[k]}^* \mid \theta_{[<k]}^{(s)}, \theta_{[>k]}^{(s-1)})} {\pi(\theta_{[<k]}^{(s)},\theta_{[k]}^{(s-1)}, \theta_{[>k]}^{(s-1)}) \cal{L}(\theta_{[<k]}^{(s)},\theta_{[k]}^{(s-1)}, \theta_{[>k]}^{(s-1)})/ q_k(\theta_{[k]}^{(s-1)} \mid \theta_{[<k]}^{(s)}, \theta_{[>k]}^{(s-1)})} \right\}\]
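One common concrete choice is a symmetric random-walk proposal within each block, in which case the \(q_k\) ratio cancels from the acceptance probability; a minimal sketch under that assumption (the block index sets and step sizes are user-supplied, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(2)

def blocked_mh_sweep(theta, blocks, log_post, step_sizes):
    """One sweep of blocked (Metropolis-within-Gibbs) updates.

    theta: 1-d parameter vector
    blocks: list of index arrays defining the K blocks
    log_post: unnormalized log posterior log pi(theta) + log L(theta)
    step_sizes: random-walk sd for each block (symmetric proposal, so the
                proposal ratio drops out of the acceptance probability)
    """
    for k, idx in enumerate(blocks):
        proposal = theta.copy()
        proposal[idx] = theta[idx] + step_sizes[k] * rng.standard_normal(len(idx))
        log_r = log_post(proposal) - log_post(theta)
        if np.log(rng.uniform()) < min(0.0, log_r):
            theta = proposal           # accept the update for block k
    return theta
```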
The Gibbs Sampler is a special case of Blocked MH
proposal distribution \(q_k\) for the \(k\)th block is the full conditional distribution for \(\theta_{[k]}\) \[\begin{split} \pi(\theta_{[k]} \mid \theta_{[-k]}, y) & = \frac{\pi(\theta_{[k]} , \theta_{[-k]} \mid y)}{ \pi(\theta_{[-k]} \mid y)} \propto \pi(\theta_{[k]} , \theta_{[-k]} \mid y)\\ \ & \propto \cal{L}(\theta_{[k]} , \theta_{[-k]})\pi(\theta_{[k]} , \theta_{[-k]}) \end{split}\]
Acceptance probability \[\min \left\{ 1, \frac{ \pi(\theta_{[<k]}^{(s)},\theta_{[k]}^*, \theta_{[>k]}^{(s-1)}) \cal{L}(\theta_{[<k]}^{(s)},\theta_{[k]}^*, \theta_{[>k]}^{(s-1)})/ q_k(\theta_{[k]}^* \mid \theta_{[<k]}^{(s)}, \theta_{[>k]}^{(s-1)})} {\pi(\theta_{[<k]}^{(s)},\theta_{[k]}^{(s-1)}, \theta_{[>k]}^{(s-1)}) \cal{L}(\theta_{[<k]}^{(s)},\theta_{[k]}^{(s-1)}, \theta_{[>k]}^{(s-1)})/ q_k(\theta_{[k]}^{(s-1)} \mid \theta_{[<k]}^{(s)}, \theta_{[>k]}^{(s-1)})} \right\}\]
Note: the normalizing constant in the proposal ratio cancels out and the remaining terms simplify so that the acceptance probability is always 1!
even though joint distribution is messy, full conditionals may be (conditionally) conjugate and easy to sample from!
Model \[\begin{align*} Y_i \mid \mu, \phi & \overset{iid}{\sim} \textsf{N}(\mu, 1/\phi) \\ \mu & \sim \textsf{N}(\mu_0, 1/\tau_0) \\ \phi & \sim \textsf{Gamma}(a/2, b/2) \end{align*}\]
Joint prior is a product of independent Normal and Gamma distributions
Is \(\pi(\mu, \phi \mid y_1, \ldots, y_n)\) also a Normal-Gamma family?
The full conditional distributions \(\mu \mid \phi, y_1, \ldots, y_n\) \[\begin{align*} \mu & \mid \phi, y_1, \ldots, y_n \sim \textsf{N}(\hat{\mu}, 1/\tau_n) \\ \hat{\mu} & = \frac{\tau_0 \mu_0 + n \phi \bar{y}}{\tau_0 + n \phi} \\ \tau_n & = \tau_0 + n \phi \end{align*}\]
\[\phi \mid \mu, y_1, \ldots, y_n \sim \textsf{Gamma}\left(\frac{a + n}{2}, \frac{b + \sum_i (y_i - \mu)^2}{2}\right) \quad \text{so that} \quad \textsf{E}[\phi \mid \mu, y_1, \ldots, y_n] = \frac{(a + n)/2}{(b + \sum_i (y_i - \mu)^2 )/2}\]
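A minimal Gibbs sampler alternating between these two full conditionals might look like the following sketch; starting values and the number of iterations are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(3)

def gibbs_normal(y, mu0, tau0, a, b, n_iter=5000):
    """Gibbs sampler for the Normal model with unknown mean and precision,
    alternating draws from the two full conditionals above."""
    n, ybar = len(y), np.mean(y)
    mu, phi = ybar, 1.0 / np.var(y)          # crude starting values
    draws = np.empty((n_iter, 2))
    for s in range(n_iter):
        # mu | phi, y ~ N(mu_hat, 1/tau_n)
        tau_n = tau0 + n * phi
        mu_hat = (tau0 * mu0 + n * phi * ybar) / tau_n
        mu = rng.normal(mu_hat, np.sqrt(1.0 / tau_n))
        # phi | mu, y ~ Gamma((a + n)/2, (b + sum (y_i - mu)^2)/2)  (rate form)
        rate = (b + np.sum((y - mu) ** 2)) / 2.0
        phi = rng.gamma(shape=(a + n) / 2.0, scale=1.0 / rate)
        draws[s] = mu, phi
    return draws
```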
Proper full conditionals with improper priors do not ensure proper joint posterior!
Model \[\begin{align*} Y_i \mid \beta, \phi & \overset{iid}{\sim} \textsf{N}(x_i^T\beta, 1/\phi) \\ Y \mid \beta, \phi & \sim \textsf{N}(X \beta, \phi^{-1} I_n) \\ \beta & \sim \textsf{N}(b_0, \Phi_0^{-1}) \\ \phi & \sim \textsf{Gamma}(v_0/2, s_0/2) \end{align*}\]
\(x_i\) is a \(p \times 1\) vector of predictors and \(X\) is the \(n \times p\) design matrix
\(\beta\) is a \(p \times 1\) vector of coefficients
\(\Phi_0\) is a \(p \times p\) prior precision matrix
Multivariate Normal density for \(\beta\) \[\pi(\beta \mid b_0, \Phi_0) = \frac{|\Phi_0|^{1/2}}{(2 \pi)^{p/2}}\exp\left\{- \frac{1}{2}(\beta - b_0)^T \Phi_0 (\beta - b_0) \right\}\]
\[\begin{align*} \beta & \mid \phi, y_1, \ldots, y_n \sim \textsf{N}(b_n, \Phi_n^{-1}) \\ b_n & = (\Phi_0 + \phi X^TX)^{-1}(\Phi_0 b_0 + \phi X^TX \hat{\beta})\\ \Phi_n & = \Phi_0 + \phi X^TX \end{align*}\]
\[\phi \mid \beta, y_1, \ldots, y_n \sim \textsf{Gamma}\left(\frac{v_0 + n}{2}, \frac{s_0 + \sum_i(y_i - x^T_i \beta)^2}{2}\right)\]
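Putting the two full conditionals together, a sketch of the resulting Gibbs sampler for the regression model (hyperparameters and starting values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)

def gibbs_regression(y, X, b0, Phi0, v0, s0, n_iter=5000):
    """Gibbs sampler for the Normal linear model with an independent
    Normal prior on beta and a Gamma prior on the precision phi."""
    n, p = X.shape
    XtX, Xty = X.T @ X, X.T @ y
    beta = np.linalg.solve(XtX, Xty)        # start at the least-squares fit
    phi = 1.0
    out_beta, out_phi = np.empty((n_iter, p)), np.empty(n_iter)
    for s in range(n_iter):
        # beta | phi, y ~ N(b_n, Phi_n^{-1}) with Phi_n = Phi0 + phi X'X
        Phi_n = Phi0 + phi * XtX
        b_n = np.linalg.solve(Phi_n, Phi0 @ b0 + phi * Xty)
        beta = rng.multivariate_normal(b_n, np.linalg.inv(Phi_n))
        # phi | beta, y ~ Gamma((v0 + n)/2, (s0 + SSE)/2)  (rate form)
        sse = np.sum((y - X @ beta) ** 2)
        phi = rng.gamma(shape=(v0 + n) / 2.0, scale=2.0 / (s0 + sse))
        out_beta[s], out_phi[s] = beta, phi
    return out_beta, out_phi
```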
Non-Informative \(\Phi_0 \to 0\)
Formal Posterior given \(\phi\) \[\beta \mid \phi, y_1, \ldots, y_n \sim \textsf{N}(\hat{\beta}, \phi^{-1} (X^TX)^{-1})\]
needs \(X^TX\) to be full rank for the distribution to be unique
the model in vector form \(Y \mid \beta, \phi \sim \textsf{N}_n (X\beta, \phi^{-1} I_n)\)
What if we transform the mean \(X\beta = X H H^{-1} \beta\) with new \(X\) matrix \(\tilde{X} = X H\) where \(H\) is \(p \times p\) and invertible and coefficients \(\tilde{\beta} = H^{-1} \beta\).
obtain the posterior for \(\tilde{\beta}\) using \(Y\) and \(\tilde{X}\)
\[ Y \mid \tilde{\beta}, \phi \sim \textsf{N}_n (\tilde{X}\tilde{\beta}, \phi^{-1} I_n)\]
since \(\tilde{X} \tilde{\beta} = X H \tilde{\beta} = X \beta\), invariance suggests that the posterior for \(\beta\) and \(H \tilde{\beta}\) should be the same
plus the posterior of \(H^{-1} \beta\) and \(\tilde{\beta}\) should be the same
Exercise for the Energetic Student
With some linear algebra, show that this is true for a normal prior if \(b_0 = 0\) and \(\Phi_0\) is \(k X^TX\) for some \(k\)
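A quick numerical sanity check of the claim (not a proof, and it does not replace the algebra): with hypothetical \(X\), \(H\), \(k\), and \(\phi\), compare the posterior mean and covariance of \(\beta\) under \(b_0 = 0\), \(\Phi_0 = k X^TX\) with the transformed versions.

```python
import numpy as np

rng = np.random.default_rng(5)

# hypothetical data, transformation H, prior scale k, and precision phi
n, p = 50, 3
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)
H = rng.standard_normal((p, p)) + 5 * np.eye(p)   # an invertible p x p matrix
k, phi = 0.5, 2.0

def posterior(X, y, k, phi):
    """Posterior mean and covariance of beta with b0 = 0, Phi0 = k X'X."""
    XtX = X.T @ X
    Phi_n = (k + phi) * XtX
    b_n = np.linalg.solve(Phi_n, phi * X.T @ y)
    return b_n, np.linalg.inv(Phi_n)

b, V = posterior(X, y, k, phi)               # posterior of beta
b_t, V_t = posterior(X @ H, y, k, phi)       # posterior of beta-tilde using X-tilde
print(np.allclose(H @ b_t, b), np.allclose(H @ V_t @ H.T, V))   # both True
```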
Popular choice is to take \(k = \phi/g\) which is a special case of Zellner’s g-prior \[\beta \mid \phi, g \sim \textsf{N}\left(0, \frac{g}{\phi} (X^TX)^{-1}\right)\]
Full conditional \[\beta \mid \phi, g, y_1, \ldots, y_n \sim \textsf{N}\left(\frac{g}{1 + g} \hat{\beta}, \frac{1}{\phi} \frac{g}{1 + g} (X^TX)^{-1}\right)\]
one parameter \(g\) controls shrinkage
if \(\phi \sim \textsf{Gamma}(v_0/2, s_0/2)\), then the posterior is \[\phi \mid y_1, \ldots, y_n \sim \textsf{Gamma}(v_n/2, s_n/2)\]
Conjugate so we could skip Gibbs sampling and sample directly from gamma and then conditional normal!
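A sketch of that direct Monte Carlo scheme under the g-prior; here \(v_n\) and \(s_n\) are assumed to have been derived separately (they are not spelled out above), so they enter as inputs.

```python
import numpy as np

rng = np.random.default_rng(6)

def gprior_direct(X, y, g, v_n, s_n, n_draws=5000):
    """Direct (non-MCMC) sampling: draw phi ~ Gamma(v_n/2, s_n/2), then
    beta | phi from its conditional Normal under the g-prior."""
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y
    shrink = g / (1.0 + g)                     # shrinkage factor g/(1+g)
    phis = rng.gamma(shape=v_n / 2.0, scale=2.0 / s_n, size=n_draws)
    betas = np.empty((n_draws, p))
    for s, phi in enumerate(phis):
        cov = (shrink / phi) * XtX_inv
        betas[s] = rng.multivariate_normal(shrink * beta_hat, cov)
    return betas, phis
```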
If \(X^TX\) is nearly singular, certain elements of \(\beta\) (or linear combinations of \(\beta\)) may have huge variances under the \(g\)-prior (or flat prior) as the MLEs are highly unstable!
Ridge regression protects against the explosion of variances and ill-conditioning with the conjugate prior: \[\beta \mid \phi \sim \textsf{N}(0, \frac{1}{\phi \lambda} I_p)\]
Posterior for \(\beta\) (conjugate case) \[\beta \mid \phi, \lambda, y_1, \ldots, y_n \sim \textsf{N}\left((\lambda I_p + X^TX)^{-1} X^T Y, \frac{1}{\phi}(\lambda I_p + X^TX)^{-1} \right)\]
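A small helper computing this posterior mean and covariance directly from the formula above (the data and the values of \(\lambda\) and \(\phi\) are assumed inputs):

```python
import numpy as np

def ridge_posterior(X, y, lam, phi):
    """Posterior mean and covariance of beta under the ridge prior
    beta | phi ~ N(0, (phi * lam)^{-1} I_p), per the formula above."""
    p = X.shape[1]
    A = lam * np.eye(p) + X.T @ X
    mean = np.linalg.solve(A, X.T @ y)        # (lam I + X'X)^{-1} X'y
    cov = np.linalg.inv(A) / phi              # (1/phi)(lam I + X'X)^{-1}
    return mean, cov
```

The posterior mean coincides with the classical ridge estimator, which is why this prior is tied to penalized maximum likelihood.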
Posterior mean (or mode) given \(\lambda\) is biased, but one can show that there is always a value of \(\lambda\) where the frequentist's expected squared error loss is smaller for the Ridge estimator than for the MLE!
related to penalized maximum likelihood estimation
Choice of \(\lambda\)
Bayes Regression and the choice of \(\Phi_0\) in general is a very important problem and provides the foundation for many variations on shrinkage estimators, variable selection, hierarchical models, nonparametric regression and more!
Be sure that you can derive the full conditional posteriors for \(\beta\) and \(\phi\) as well as the joint posterior in the conjugate case!
Comments
Why don’t we treat each individual \(\beta_j\) as a separate block?
Gibbs always accepts, but can mix slowly if parameters in different blocks are highly correlated!
Use block sizes in Gibbs that are as big as possible to improve mixing (proven faster convergence)
Collapse the sampler by integrating out as many parameters as possible (as long as resulting sampler has good mixing)
can use Gibbs steps and (adaptive) Metropolis-Hastings steps together
Introduce latent variables (data augmentation) to allow Gibbs steps (Next class)
https://sta702-F23.github.io/website/