Lecture 17: Bayesian Variable Selection and Model Averaging

STA702

Merlise Clyde

Duke University

Diabetes Example

set.seed(8675309)
source("yX.diabetes.train.txt")
diabetes.train = as.data.frame(diabetes.train)
source("yX.diabetes.test.txt")
diabetes.test = as.data.frame(diabetes.test)
colnames(diabetes.test)[1] = "y"

str(diabetes.train)
'data.frame':   342 obs. of  65 variables:
 $ y      : num  -0.0147 -1.0005 -0.1444 0.6987 -0.2222 ...
 $ age    : num  0.7996 -0.0395 1.7913 -1.8703 0.113 ...
 $ sex    : num  1.064 -0.937 1.064 -0.937 -0.937 ...
 $ bmi    : num  1.296 -1.081 0.933 -0.243 -0.764 ...
 $ map    : num  0.459 -0.553 -0.119 -0.77 0.459 ...
 $ tc     : num  -0.9287 -0.1774 -0.9576 0.256 0.0826 ...
 $ ldl    : num  -0.731 -0.402 -0.718 0.525 0.328 ...
 $ hdl    : num  -0.911 1.563 -0.679 -0.757 0.171 ...
 $ tch    : num  -0.0544 -0.8294 -0.0544 0.7205 -0.0544 ...
 $ ltg    : num  0.4181 -1.4349 0.0601 0.4765 -0.6718 ...
 $ glu    : num  -0.371 -1.936 -0.545 -0.197 -0.979 ...
 $ age^2  : num  -0.312 -0.867 1.925 2.176 -0.857 ...
 $ bmi^2  : num  0.4726 0.1185 -0.0877 -0.6514 -0.2873 ...
 $ map^2  : num  -0.652 -0.573 -0.815 -0.336 -0.652 ...
 $ tc^2   : num  -0.091 -0.6497 -0.0543 -0.6268 -0.6663 ...
 $ ldl^2  : num  -0.289 -0.521 -0.3 -0.45 -0.555 ...
 $ hdl^2  : num  -0.0973 0.8408 -0.3121 -0.2474 -0.5639 ...
 $ tch^2  : num  -0.639 -0.199 -0.639 -0.308 -0.639 ...
 $ ltg^2  : num  -0.605 0.78 -0.731 -0.567 -0.402 ...
 $ glu^2  : num  -0.578 1.8485 -0.4711 -0.6443 -0.0258 ...
 $ age:sex: num  0.69 -0.139 1.765 1.609 -0.284 ...
 $ age:bmi: num  0.852 -0.142 1.489 0.271 -0.271 ...
 $ age:map: num  0.0349 -0.3346 -0.5862 1.1821 -0.3025 ...
 $ age:tc : num  -0.978 -0.246 -1.927 -0.72 -0.244 ...
 $ age:ldl: num  -0.803 -0.203 -1.504 -1.2 -0.182 ...
 $ age:hdl: num  -0.7247 0.0147 -1.2661 1.6523 0.1046 ...
 $ age:tch: num  -0.254 -0.176 -0.31 -1.598 -0.216 ...
 $ age:ltg: num  0.0644 -0.2142 -0.163 -1.1657 -0.3474 ...
 $ age:glu: num  -0.636 -0.239 -1.359 0.071 -0.438 ...
 $ sex:bmi: num  1.304 0.935 0.915 0.142 0.635 ...
 $ sex:map: num  0.258 0.289 -0.381 0.5 -0.697 ...
 $ sex:tc : num  -1.02 0.131 -1.051 -0.274 -0.112 ...
 $ sex:ldl: num  -0.927 0.236 -0.913 -0.638 -0.452 ...
 $ sex:hdl: num  -0.647 -1.188 -0.377 1.189 0.238 ...
 $ sex:tch: num  -0.411 0.47 -0.411 -1.062 -0.296 ...
 $ sex:ltg: num  0.2988 1.2093 -0.0866 -0.6032 0.4857 ...
 $ sex:glu: num  -0.6171 1.6477 -0.8069 -0.0239 0.7283 ...
 $ bmi:map: num  0.189 0.191 -0.477 -0.195 -0.702 ...
 $ bmi:tc : num  -1.5061 -0.0595 -1.1853 -0.3231 -0.3239 ...
 $ bmi:ldl: num  -1.267 0.183 -0.976 -0.407 -0.536 ...
 $ bmi:hdl: num  -0.869 -1.41 -0.286 0.586 0.251 ...
 $ bmi:tch: num  -0.505 0.505 -0.484 -0.614 -0.388 ...
 $ bmi:ltg: num  0.1014 1.1613 -0.4085 -0.5893 0.0716 ...
 $ bmi:glu: num  -0.862 1.693 -0.89 -0.337 0.358 ...
 $ map:tc : num  -0.687 -0.148 -0.131 -0.451 -0.21 ...
 $ map:ldl: num  -0.5407 0.0388 -0.1034 -0.6114 -0.036 ...
 $ map:hdl: num  -0.235 -0.672 0.254 0.745 0.252 ...
 $ map:tch: num  -0.29 0.207 -0.258 -0.835 -0.29 ...
 $ map:ltg: num  -0.214 0.428 -0.427 -0.811 -0.748 ...
 $ map:glu: num  -0.541 0.659 -0.314 -0.23 -0.812 ...
 $ tc:ldl : num  -0.144 -0.551 -0.139 -0.509 -0.581 ...
 $ tc:hdl : num  0.8363 -0.3457 0.6304 -0.2579 -0.0392 ...
 $ tc:tch : num  -0.405 -0.326 -0.404 -0.295 -0.451 ...
 $ tc:ltg : num  -0.901 -0.259 -0.571 -0.392 -0.569 ...
 $ tc:glu : num  0.0202 0.0196 0.2073 -0.396 -0.4283 ...
 $ ldl:hdl: num  0.889 -0.446 0.705 -0.207 0.26 ...
 $ ldl:tch: num  -0.463 -0.243 -0.463 -0.21 -0.506 ...
 $ ldl:ltg: num  -0.6536 0.2724 -0.3783 -0.0708 -0.5638 ...
 $ ldl:glu: num  -0.0194 0.4995 0.1032 -0.4013 -0.6234 ...
 $ hdl:tch: num  0.703 -0.5 0.692 0.171 0.651 ...
 $ hdl:ltg: num  0.0179 -1.9846 0.3839 0.0399 0.3043 ...
 $ hdl:glu: num  0.654 -2.948 0.689 0.452 0.113 ...
 $ tch:ltg: num  -0.592 0.531 -0.574 -0.253 -0.537 ...
 $ tch:glu: num  -0.371 1.114 -0.362 -0.522 -0.34 ...
 $ ltg:glu: num  -0.584 2.184 -0.468 -0.526 0.183 ...

MCMC with BAS

library(BAS)
diabetes.bas = bas.lm(y ~ ., data=diabetes.train,
                      prior = "JZS",
                      method="MCMC",
                      n.models = 10000,
                      MCMC.iterations=500000,
                      thin = 10,
                      initprobs="eplogp",
                      force.heredity=FALSE)
   user  system elapsed 
 19.523   0.951  20.530 
[1] "number of unique models 5008"
  • increase MCMC.iterations?

  • check diagnostics

Estimates of Posterior Probabilities

  • relative frequencies \(\hat{P}_{RF}(\boldsymbol{\gamma}\mid \mathbf{Y}) = \frac{\text{# times } \boldsymbol{\gamma}\text{ is visited in the } S \text{ draws}}{S}\)

    • ergodic average converges to \(p(\boldsymbol{\gamma}\mid \mathbf{Y})\) as \(S \to \infty\)

    • asymptotically unbiased

  • renormalized posterior probabilities \(\hat{P}_{RN}(\boldsymbol{\gamma}\mid \mathbf{Y}) = \frac{p(\mathbf{Y}\mid \boldsymbol{\gamma}) p(\boldsymbol{\gamma})} {\sum_{\boldsymbol{\gamma}\in S} p(\mathbf{Y}\mid \boldsymbol{\gamma}) p(\boldsymbol{\gamma})}\)

    • also asymptotically unbiased

    • Fisher consistent (e.g., if we happen to enumerate all models in the \(S\) iterations, we recover the exact posterior probabilities)

  • if we run the sampler long enough, the two estimates should agree

  • also look at other summaries, e.g., posterior inclusion probabilities \[\hat{p}(\gamma_j = 1 \mid \mathbf{Y}) = \sum_{\boldsymbol{\gamma}\in S} \gamma_j \hat{P}(\boldsymbol{\gamma}\mid \mathbf{Y})\]
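
To make the two estimators concrete, here is a minimal sketch computing both from raw MCMC output. The inputs gamma.draws (an \(S \times p\) binary matrix of sampled models) and logpost (the log marginal likelihood plus log prior for each draw) are hypothetical names, not BAS internals:

# gamma.draws: S x p binary matrix of sampled models (hypothetical input)
# logpost:     log p(Y | gamma^(s)) + log p(gamma^(s)) for each draw s
model.id = apply(gamma.draws, 1, paste, collapse = "")

# relative-frequency estimate: Monte Carlo frequency of each visited model
P.RF = table(model.id) / length(model.id)

# renormalized estimate: marginal likelihood x prior, normalized over the
# unique models visited (one representative draw per model)
first = !duplicated(model.id)
w = exp(logpost[first] - max(logpost[first]))  # subtract max for stability
P.RN = setNames(w / sum(w), model.id[first])

# frequency-based posterior inclusion probabilities
pip.RF = colMeans(gamma.draws)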

Diagnostic Plot

diagnostics(diabetes.bas, type="pip")
  • model probabilities converge much more slowly!
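
The analogous check for the model probabilities themselves uses type="model" in the same diagnostics function:

diagnostics(diabetes.bas, type="model")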

Out of Sample Prediction

  • What is the optimal point prediction of \(\mathbf{Y}^{\text{test}}\) given \(\mathbf{Y}\) under squared error loss?

  • Iterated expectations leads to BMA for \(\textsf{E}[\mathbf{Y}^{\text{test}} \mid \mathbf{Y}]\)

  • Prediction under model averaging \[\hat{Y} = \sum_S (\hat{\alpha}_\boldsymbol{\gamma}+ \mathbf{X}_{\boldsymbol{\gamma}}^{\text{test}} \hat{\boldsymbol{\beta}}_{\boldsymbol{\gamma}}) \hat{p}(\boldsymbol{\gamma}\mid \mathbf{Y})\]

pred.bas = predict(diabetes.bas,
                   newdata=diabetes.test,
                   estimator="BMA",
                   se=TRUE)
mean((pred.bas$fit- diabetes.test$y)^2)
[1] 0.4558026

Credible Intervals & Coverage

  • posterior predictive distribution \[ p(\mathbf{y}^{\text{test}} \mid \mathbf{y}) = \sum_\boldsymbol{\gamma}p(\mathbf{y}^{\text{test}} \mid \mathbf{y}, \boldsymbol{\gamma})p(\boldsymbol{\gamma}\mid \mathbf{y}) \]

  • integrate out \(\alpha\) and \(\boldsymbol{\beta_\gamma}\) to get a normal predictive given \(\phi\) and \(\boldsymbol{\gamma}\) (and \(\mathbf{y}\))

  • integrate out \(\phi\) to get a t distribution given \(\boldsymbol{\gamma}\) and \(\mathbf{y}\)

  • credible intervals via sampling

    • sample a model from \(p(\boldsymbol{\gamma}\mid \mathbf{y})\)
    • conditional on that model, sample \(y \sim p(\mathbf{y}^{\text{test}} \mid \mathbf{y}, \boldsymbol{\gamma})\)
    • compute quantiles from the sampled \(y\)
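
A minimal sketch of this mixture sampling for a single test case, with hypothetical inputs (postprob, fit, scale, df describe the per-model t predictives; these are not BAS internals, and confint below handles all of this for us):

# hypothetical inputs: postprob[m] = p(gamma_m | y) for the sampled models;
# fit[m], scale[m], df give the location, scale, and degrees of freedom of
# the Student-t predictive for one test case under model m
B = 10000
m = sample(length(postprob), B, replace = TRUE, prob = postprob)  # sample models
ytest = fit[m] + scale[m] * rt(B, df = df)                        # sample y
quantile(ytest, c(0.025, 0.975))                                  # 95% interval
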
ci.bas = confint(pred.bas)
coverage = mean(diabetes.test$y > ci.bas[,1] & diabetes.test$y < ci.bas[,2])
coverage
[1] 0.99

95% Prediction Intervals

plot(ci.bas)
points(diabetes.test$y, col=2, pch=15)

Selection and Prediction

  • BMA is the optimal (Bayes) prediction under squared error loss, since \[\textsf{E}[\| \mathbf{Y}^{\text{test}} - a\|^2 \mid \mathbf{y}] = \textsf{E}[\| \mathbf{Y}^{\text{test}} - \textsf{E}[\mathbf{Y}^{\text{test}}\mid \mathbf{y}] \|^2 \mid \mathbf{y}] + \| \textsf{E}[\mathbf{Y}^{\text{test}}\mid \mathbf{y}] - a\|^2 \] is minimized at \(a = \textsf{E}[\mathbf{Y}^{\text{test}}\mid \mathbf{y}]\), the BMA prediction

  • What if we want to use only a single model for prediction under squared error loss?

  • HPM: Highest Posterior Probability model; optimal for selection under 0-1 loss, but not for prediction

  • MPM: Median Probability Model (include predictors with PIP > 0.5); optimal for prediction under certain conditions, e.g., orthogonal designs or a nested sequence of models

  • BPM: Best Predictive Model - the single model whose predictions are closest to BMA under squared error loss (usually includes more predictors than HPM or MPM)

Example

pred.bas = predict(diabetes.bas,
                   newdata=diabetes.test,
                   estimator="BPM",
                   se=TRUE)
#MSE
mean((pred.bas$fit- diabetes.test$y)^2)
[1] 0.4740667
#Coverage
ci.bas = confint(pred.bas)
mean(diabetes.test$y > ci.bas[,1] &
     diabetes.test$y < ci.bas[,2])
[1] 0.98
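
For comparison, the same predict call works with the other single-model estimators; here is the analogous check for HPM and MPM (the numbers will vary with the MCMC run):

# compare out-of-sample MSE for the other single-model estimators
for (est in c("HPM", "MPM")) {
  pred = predict(diabetes.bas, newdata = diabetes.test, estimator = est)
  cat(est, "MSE:", mean((pred$fit - diabetes.test$y)^2), "\n")
}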

Theory - Consistency of g-priors

  • desire that the posterior probability of the true model goes to 1 as \(n \to \infty\)

    • does not always hold if the null model is true (the null may remain the highest posterior probability model, but its probability need not converge to 1)
    • need the prior on \(g\) to depend on \(n\) (this rules out EB and fixed g-priors with \(g \ne n\))
    • asymptotically BMA collapses to the true model
  • other quantities may still converge, e.g., the posterior mean

  • what if the true model \(\boldsymbol{\gamma}_T\) is not in \(\Gamma\)? What can we say?

    • \(\boldsymbol{{\cal M}}\)-complete: we realize that \(p(\boldsymbol{\gamma}) = 0\) for all \(\boldsymbol{\gamma}\in \Gamma\) (so model probabilities are, strictly speaking, nonsense), but we can conceptualize \(\boldsymbol{\gamma}_T\) and still want to use only the models in \(\Gamma\); here BMA converges to the model that is “closest” to the truth in Kullback-Leibler divergence
    • \(\boldsymbol{{\cal M}}\)-open: again \(p(\boldsymbol{\gamma}) = 0\) for all \(\boldsymbol{\gamma}\in \Gamma\), but now we cannot even specify \(\boldsymbol{\gamma}_T\)
    • the latter is related to “stacking”, a frequentist method of ensemble learning using cross-validation; see Clyde & Iversen (2013) for the curious

Summary

  • Choice of prior on \(\boldsymbol{\beta_\gamma}\)
    • orthogonally invariant priors - multivariate Spike & Slab
    • products of independent Spike & Slab priors (a toy sketch follows this list)
    • non-conjugate and semi-conjugate priors
  • priors on the models (sensitivity)
  • computation (MCMC, “stochastic search”, variational, orthogonal data augmentation, reversible jump-MCMC)
  • posterior summaries - select a model or “average” over all models
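
As a toy illustration of the independent spike-and-slab structure mentioned above, here is a single draw of \(p\) coefficients from such a prior (the hyperparameter values are arbitrary):

# each beta_j is exactly 0 with probability 1 - p.incl (the "spike") or
# drawn from a N(0, tau^2) "slab" with probability p.incl
p = 10; p.incl = 0.5; tau = 2
gamma = rbinom(p, 1, p.incl)        # inclusion indicators
beta  = gamma * rnorm(p, 0, tau)    # exact zeros where gamma_j = 0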

Other aspects of model selection?

  • transformations of \(Y\)
  • functions of \(X\): interactions or nonlinear functions such as splines or kernels
  • choice of error distribution