Lecture 7: MCMC Diagnostics & Adaptive Metropolis

STA702

Merlise Clyde

Duke University

Example from Last Class

  • Marginal Likelihood \[\cal{L}(\mu, \sigma^2, \sigma^2_\mu) \propto (\sigma^2 + \sigma^2_\mu)^{-n/2} \exp \left\{ - \frac{1}{2} \frac{\sum_{i=1}^n\left(y_i - \mu \right)^2}{\sigma^2 + \sigma^2_\mu }\right\}\]

  • Priors with \(\sigma^2 = 1\): \(p(\mu) \propto 1\) and \(\sigma_\mu \sim \textsf{Cauchy}^+(0,1)\) independent of \(\mu\)

  • Symmetric proposal for \(\mu\) and \(\sigma_\mu\)

  • Independent normals centered at current values of \(\mu\) and \(\sigma_\mu\) with covariance \(\frac{2.38^2}{d} \textsf{Cov}(\theta)\) where \(d = 2\) (the dimension of \(\theta\) )

  • \(\delta^2 = 2.38^2/d\) is optimal for a multivariate normal target (Roberts, Gelman, and Gilks, 1997), with optimal acceptance rates ranging from about 40% down to 23.4% (as \(d \to \infty\))
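
Below is a minimal random-walk Metropolis sketch for this example, not the exact sampler from last class; it assumes a data vector y is available, fixes \(\sigma^2 = 1\), and uses a rough guess Sigma.prop for \(\textsf{Cov}(\theta)\).

library(MASS)    # for mvrnorm

# log posterior (up to a constant): marginal likelihood times the half-Cauchy prior
log.post <- function(mu, sigma.mu, y) {
  if (sigma.mu < 0) return(-Inf)                    # Cauchy^+(0,1) support
  -length(y)/2 * log(1 + sigma.mu^2) -
    sum((y - mu)^2)/(2*(1 + sigma.mu^2)) +          # marginal likelihood with sigma^2 = 1
    dcauchy(sigma.mu, 0, 1, log = TRUE)             # half-Cauchy prior (flat prior on mu adds a constant)
}

S <- 10000; d <- 2; delta2 <- 2.38^2/d
Sigma.prop <- diag(c(0.01, 0.02))                   # rough guess at Cov(theta) -- an assumption
theta <- matrix(NA, S, d, dimnames = list(NULL, c("mu", "sigma_mu")))
cur <- c(mean(y), 0.5)                              # starting values
for (s in 1:S) {
  prop <- cur + mvrnorm(1, rep(0, d), delta2*Sigma.prop)  # symmetric random-walk proposal
  log.r <- log.post(prop[1], prop[2], y) - log.post(cur[1], cur[2], y)
  if (log(runif(1)) < log.r) cur <- prop            # Metropolis accept/reject
  theta[s, ] <- cur
}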

Convergence diagnostics

  • Diagnostics are available to help decide on the number of burn-in and collected samples.

  • Note: there are no definitive tests of convergence, but you should run as many diagnostics as you can, on all parameters in your model.

  • With “experience”, visual inspection of trace plots is perhaps the most useful approach.

  • There are a number of useful automated tests in R.

  • CAUTION: diagnostics cannot guarantee that a chain has converged, but they can indicate it has not converged.

Diagnostics in R

  • The most popular package for MCMC diagnostics in R is coda.

  • coda uses a special mcmc format, so you must first convert your matrix of posterior samples into an mcmc object.

  • For the example, we have the following in R.

library(coda)
theta.mcmc <- mcmc(theta,start=1) #no burn-in (simple problem!)

Diagnostics in R

summary(theta.mcmc)

Iterations = 1:10000
Thinning interval = 1 
Number of chains = 1 
Sample size per chain = 10000 

1. Empirical mean and standard deviation for each variable,
   plus standard error of the mean:

             Mean     SD Naive SE Time-series SE
mu       -0.07977 0.1046 0.001046       0.002839
sigma_mu  0.17550 0.1273 0.001273       0.004397

2. Quantiles for each variable:

              2.5%     25%      50%      75%  97.5%
mu       -0.283420 -0.1508 -0.08193 -0.00848 0.1337
sigma_mu  0.007995  0.0758  0.15024  0.25228 0.4693
  • The naive SE is the standard error of the mean, which captures the simulation (Monte Carlo) error of the mean rather than the posterior uncertainty; it ignores the autocorrelation in the draws.

  • The time-series SE adjusts the naive SE for autocorrelation.
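
For intuition, the naive SE can be reproduced directly from the sample matrix (a small illustration, assuming theta is the \(S \times 2\) matrix of draws from the sketch above):

apply(theta, 2, sd)/sqrt(nrow(theta))   # naive SE = posterior SD / sqrt(S)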

Effective Sample Size

  • The effective sample size translates the number of MCMC samples \(S\) into an equivalent number of independent samples.

  • It is defined as \[\textrm{ESS} = \dfrac{S}{1 + 2 \sum_{k=1}^\infty \rho_k},\]

  • where \(S\) is the sample size and \(\rho_k\) is the lag-\(k\) autocorrelation.

  • For our data, we have

effectiveSize(theta.mcmc)
       mu  sigma_mu 
1356.6495  838.2613 
  • So our 10,000 samples are equivalent to 1356.6 independent samples for \(\mu\) and 838.3 independent samples for \(\sigma_\mu\).
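
A rough hand computation of the ESS from the definition above (truncating the autocorrelation sum at the first non-positive lag is an arbitrary choice here; coda's effectiveSize uses a spectral-density estimate, so the numbers will differ slightly):

ess.approx <- function(x, max.lag = 200) {
  rho <- acf(x, lag.max = max.lag, plot = FALSE)$acf[-1]   # lags 1, ..., max.lag
  K <- which(rho <= 0)[1]                                  # first non-positive autocorrelation
  if (is.na(K)) K <- max.lag
  length(x)/(1 + 2*sum(rho[seq_len(K - 1)]))
}
apply(theta, 2, ess.approx)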

Trace plot for \(\mu\)

Trace plot for \(\sigma_\mu\)

OK (be careful of scaling in plots!)
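
The plotting code is not shown on the slides; one way to produce trace plots like these with coda is:

plot(theta.mcmc, density = FALSE)   # trace plots only (suppress the density panels)
traceplot(theta.mcmc)               # coda's trace-plot shortcut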

Autocorrelation

  • Another way to evaluate convergence is to look at the autocorrelation between draws of our Markov chain.

  • The lag \(k\) autocorrelation, \(\rho_k\), is the correlation between each draw and its \(k\)th lag, defined as \[\rho_k = \dfrac{\sum_{s=1}^{S-k}(\theta_s - \bar{\theta})(\theta_{s+k} - \bar{\theta})}{\sum_{s=1}^{S-k}(\theta_s - \bar{\theta})^2}\]

  • We expect the autocorrelation to decrease as \(k\) increases.

  • If autocorrelation remains high as \(k\) increases, we have slow mixing due to the inability of the sampler to move around the space well.
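
coda provides both plots and numerical summaries of the autocorrelations, for example:

autocorr.plot(theta.mcmc)   # ACF plot for each parameter
autocorr.diag(theta.mcmc)   # autocorrelations at a few selected lags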

Autocorrelation for \(\mu\)

So-So

Autocorrelation for \(\sigma_\mu\)

worse

Gelman-Rubin

Gelman & Rubin suggested a diagnostic \(R\) based on taking separate chains with dispersed initial values to test convergence

Gelman-Rubin Diagnostic

  • Run \(m \ge 2\) chains of length \(2S\) from overdispersed starting values.

  • Discard the first \(S\) draws in each chain.

  • Calculate the pooled within-chain variance \(W\) and between-chain variance \(B\). \[R = \frac{\frac{S-1}{S} W + \frac{1}{S} B }{W}\]

  • numerator and denominator are both unbiased estimates of the variance if the two chains have converged

    • otherwise \(W\) is an underestimate (hasn’t explored enough)
    • numerator will overestimate as \(B\) is too large (overdispersed starting points)
  • As \(S \to \infty\) and \(B \to 0\), \(R \to 1\)

  • the version implemented in R (coda's gelman.diag) is slightly different
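
The simple form of \(R\) above can be computed by hand; a sketch for a single parameter from \(m = 2\) chains, assuming theta1.mu and theta2.mu are the retained draws of \(\mu\) from the two chains (hypothetical names):

S <- length(theta1.mu)
W <- mean(c(var(theta1.mu), var(theta2.mu)))      # pooled within-chain variance
B <- S*var(c(mean(theta1.mu), mean(theta2.mu)))   # between-chain variance
R <- ((S - 1)/S*W + B/S)/W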

Gelman-Rubin Diagnostic

theta.mcmc = mcmc.list(mcmc(theta1, start=5000), mcmc(theta2, start=5000))
gelman.diag(theta.mcmc)
Potential scale reduction factors:

         Point est. Upper C.I.
mu                1          1
sigma_mu          1          1

Multivariate psrf

1
  • Values of \(R > 1.1\) suggest lack of convergence

  • Looks OK

  • See also gelman.plot

Geweke statistic

  • Geweke proposed taking two non-overlapping parts of a single Markov chain (usually the first 10% and the last 50%) and comparing the means of the two parts using a difference-of-means test.

  • The null hypothesis would be that the two parts of the chain are from the same distribution.

  • The test statistic is a z-score with standard errors adjusted for autocorrelation; if the test rejects (small p-value) for a variable, you need more draws.

  • The output in R is the z-score.

Geweke Diagnostic

geweke.diag(theta.mcmc)
[[1]]

Fraction in 1st window = 0.1
Fraction in 2nd window = 0.5 

      mu sigma_mu 
 -0.7779   0.7491 


[[2]]

Fraction in 1st window = 0.1
Fraction in 2nd window = 0.5 

      mu sigma_mu 
  0.4454   0.6377 
  • The output is the z-score itself (not the p-value).
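
If you prefer two-sided p-values, they are easy to get from the z-scores (a small illustration):

z <- geweke.diag(theta.mcmc)[[1]]$z   # z-scores for the first chain
2*pnorm(-abs(z))                      # two-sided p-values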

Practical advice on diagnostics

  • There are more tests we can use: Raftery and Lewis diagnostic, Heidelberger and Welch, etc.

  • The Gelman-Rubin approach is quite appealing in its use of multiple chains.

  • Geweke (and Heidelberger and Welch) sometimes reject even when the trace plots look good.

  • They can be overly sensitive to minor departures from stationarity that do not impact inferences.

  • The most common method of assessing convergence is visual examination of trace plots.

Improving Results

  • more iterations and multiple chains

  • thinning to reduce autocorrelation among the stored draws and increase ESS per stored draw (see the sketch after this list)

  • change the proposal distribution \(q\)

  • adaptive Metropolis to tune \(q\)
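
For example, thinning with coda's window() keeps every \(k\)th draw (every 10th here, an arbitrary choice):

theta.thin <- window(theta.mcmc, thin = 10)   # keep every 10th draw
autocorr.diag(theta.thin)                     # autocorrelation among stored draws should drop
effectiveSize(theta.thin)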

Proposal Distribution

  • Common choice \[\textsf{N}(\theta^\star; \theta^{(s)}, \delta^2 \Sigma)\]

  • rough estimate of \(\Sigma\) based on the asymptotic Gaussian approximation \(\textsf{Cov}(\theta \mid y)\) and \(\delta = 2.38/\sqrt{\text{dim}(\theta)}\)

  • find the MAP estimate (posterior mode) \(\hat{\theta}\)

  • take \[\Sigma = \left[- \frac{\partial^2 \left\{\log \cal{L}(\theta) + \log \pi(\theta)\right\}} {\partial \theta \, \partial \theta^T} \right]^{-1}_{\theta = \hat{\theta}}\]

  • or ignore the prior and use the inverse of the Fisher information (the asymptotic covariance of the MLE)
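
A sketch of this construction using numerical optimization; it reuses the hypothetical log.post() and data y from the earlier sketch:

neg.log.post <- function(th) -log.post(th[1], th[2], y)
fit <- optim(c(0, 0.5), neg.log.post, hessian = TRUE)   # MAP estimate via Nelder-Mead
theta.hat <- fit$par                                    # posterior mode
Sigma <- solve(fit$hessian)                             # inverse Hessian of the negative log posterior
delta <- 2.38/sqrt(length(theta.hat))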

Learn Covariance in Proposal?

  • Can we learn the proposal distribution?

  • ad hoc?

    • run an initial MCMC for an initial tuning phase (e.g. 1000 samples) with a fixed \(\delta\) and estimate \(\Sigma(\theta)\) from samples.
    • run more to tweak \(\delta\) to get acceptance rate between \(23\%-40\%\).
    • fix the kernel for final run
  • MCMC doesn’t allow you to use the full history of the chain \(\theta^{(1)}, \ldots, \theta^{(s)}\) in constructing the proposal distributions as it violates the Markov assumption

  • even with no further “learning”, there is no guarantee the chain will converge to the posterior!

  • more elegant approach - formal adaptive Metropolis

    • keep adapting the entire time!

Ad hoc adaptation may mess up convergence!

Adaptive MCMC

  • run random-walk Metropolis (RWM) with a Gaussian proposal for a fixed number of iterations \(s < s_0\)

  • estimate of covariance at state \(s\) \[\Sigma^{(s)} = \frac{1}{s}\left(\sum_{i=1}^s \theta^{(i)} {\theta^{(i)}}^T - s \bar{\theta}^{(s)} {\bar{\theta}^{(s)}}^T\right)\]

  • proposal for \(s > s_0\) with \(\delta = 2.38/\sqrt{d}\) \[\theta^* \sim \textsf{N}(\theta^{(s)}, \delta^2 (\Sigma^{(s)} + \epsilon I_d))\]

  • \(\epsilon > 0\) ensures the covariance is positive definite

  • if \(s_0\) is too large, it will take longer for the adaptation to take effect

  • need conditions for vanishing adaptation, e.g. that the proposal depends less and less on recent states of the chain; see Roberts & Rosenthal (2009) for examples and other conditions
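
A sketch of the adaptive scheme above, again reusing the hypothetical log.post() and y from the earlier sketch; \(s_0\), \(\epsilon\), and the initial covariance are illustrative choices:

S <- 10000; d <- 2; s0 <- 1000
delta2 <- 2.38^2/d; eps <- 1e-6
Sigma0 <- diag(0.01, d)                        # fixed proposal covariance for s <= s0
theta <- matrix(NA, S, d); cur <- c(0, 0.5)
for (s in 1:S) {
  Sig <- if (s <= s0) Sigma0 else delta2*(cov(theta[1:(s - 1), ]) + eps*diag(d))
  prop <- cur + mvrnorm(1, rep(0, d), Sig)     # Gaussian random-walk proposal
  if (log(runif(1)) < log.post(prop[1], prop[2], y) - log.post(cur[1], cur[2], y))
    cur <- prop
  theta[s, ] <- cur
}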

Example again

Acceptance rate is now around 30-35% over 10,000 iterations!