STA702
Duke University
Model: \(\mathbf{Y}= \mathbf{1}_n \alpha + \mathbf{X}\boldsymbol{\beta}+ \boldsymbol{\epsilon}\)
typically expect the intercept \(\alpha\) to be on a different order of magnitude from the other coefficients; adopt a two-block prior with \(p(\alpha) \propto 1\)
Prior \(\boldsymbol{\beta}\mid \phi, \kappa \sim \textsf{N}(\mathbf{0}_p, \frac{1}{\phi \kappa} \mathbf{I}_p)\) implies the \(\beta_j\) are exchangeable a priori (i.e. the distribution is invariant under permutations of the labels, with a common mean and scale)
Posterior for \(\boldsymbol{\beta}\) \[\boldsymbol{\beta}\mid \phi, \kappa, \mathbf{Y}\sim \textsf{N}\left((\kappa \mathbf{I}_p + \mathbf{X}^T\mathbf{X})^{-1} \mathbf{X}^T \mathbf{Y}, \frac{1}{\phi}(\kappa \mathbf{I}_p + \mathbf{X}^T\mathbf{X})^{-1} \right)\]
assume that \(\mathbf{X}\) has been centered and scaled so that \(\mathbf{X}^T\mathbf{X}= \textsf{corr}(\mathbf{X})\) and \(\mathbf{1}_n^T \mathbf{X}= \mathbf{0}_p\)
related to penalized maximum likelihood estimation \[-\frac{\phi}{2}\left(\|\mathbf{Y}- \mathbf{X}\boldsymbol{\beta}\|^2 + \kappa \| \boldsymbol{\beta}\|^2 \right) \]
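As a rough numerical sketch (the toy data and the values of `kappa` and `phi` below are assumptions for illustration, not from the notes), the posterior mean and covariance above can be computed directly after centering and scaling \(\mathbf{X}\):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 5
X = rng.normal(size=(n, p))
y = 1.0 + X @ np.array([2.0, -1.0, 0.0, 0.5, 0.0]) + rng.normal(size=n)

# center/scale X so that X^T X = corr(X) and 1_n^T X = 0_p; center y
Xc = (X - X.mean(axis=0)) / (X.std(axis=0) * np.sqrt(n))
yc = y - y.mean()

kappa, phi = 2.0, 1.0                      # fixed hyperparameters for this sketch
A = kappa * np.eye(p) + Xc.T @ Xc
post_mean = np.linalg.solve(A, Xc.T @ yc)  # (kappa I + X^T X)^{-1} X^T Y
post_cov = np.linalg.inv(A) / phi          # (1/phi)(kappa I + X^T X)^{-1}
print(post_mean)
```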
frequentist’s expected mean squared error loss for using \(\mathbf{b}_n\) \[\textsf{E}_{\mathbf{Y}\mid \boldsymbol{\beta}_*}[\| \mathbf{b}_n - \boldsymbol{\beta}_* \|^2] = \sigma^2 \sum_{j = 1}^p \frac{\lambda_j}{(\lambda_j + \kappa)^2} + \kappa^2 \boldsymbol{\beta}_*^T(\mathbf{X}^T\mathbf{X}+ \kappa \mathbf{I}_p)^{-2} \boldsymbol{\beta}_*\]
the \(\lambda_j\) are the eigenvalues from the decomposition \(\mathbf{X}^T\mathbf{X}= \mathbf{V}\boldsymbol{\Lambda}\mathbf{V}^T\) with \([\boldsymbol{\Lambda}]_{jj} = \lambda_j\)
can show that there is always a value of \(\kappa\) for which the expected mean squared error of the (Bayes) ridge estimator is smaller than that of the MLE
Unfortunately the optimal choice depends on “true” \(\boldsymbol{\beta}_*\)!
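A quick numerical illustration of the risk formula above (the toy design and \(\boldsymbol{\beta}_*\) below are assumed examples): sweeping \(\kappa\) shows a range where the ridge risk falls below the MLE risk at \(\kappa = 0\).

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 50, 4
X = rng.normal(size=(n, p))
X = (X - X.mean(axis=0)) / (X.std(axis=0) * np.sqrt(n))  # X^T X = corr(X)
beta_star = np.array([1.0, -0.5, 0.25, 0.0])
sigma2 = 1.0

lam = np.linalg.eigvalsh(X.T @ X)            # eigenvalues lambda_j of X^T X

def expected_mse(kappa):
    var = sigma2 * np.sum(lam / (lam + kappa) ** 2)
    B = np.linalg.inv(X.T @ X + kappa * np.eye(p))
    bias2 = kappa ** 2 * beta_star @ B @ B @ beta_star
    return var + bias2

kappas = np.linspace(0.0, 2.0, 41)
risks = np.array([expected_mse(k) for k in kappas])
print(expected_mse(0.0), risks.min())        # MLE risk vs best ridge risk
```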
orthogonal \(\mathbf{X}\) leads to James-Stein solution related to Empirical Bayes
fix \(\kappa\) a priori (Bayes), but how do we choose it?
Cross-validation (frequentist)
Empirical Bayes? (frequentist/Bayes)
Should there be a common \(\kappa\)? (same shrinkage across all variables?)
Or a \(\kappa_j\) per variable? (or shared among a group of variables (e.g. factors)?)
Treat as unknown!
can place a prior on \(\kappa\) or \(\kappa_j\) for fully Bayes
similar option for \(g\) in the \(g\) priors
often improved robustness over fixed choices of hyperparameter
may not have a closed form posterior but sampling is still often easy! (see the conjugate example below)
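For instance (an illustrative assumption, not part of the notes), with a conjugate \(\textsf{Gamma}(a, b)\) prior on \(\kappa\) under the ridge prior above, the full conditional stays in closed form and is easy to sample in a Gibbs step:

\[\begin{align}
\kappa & \sim \textsf{Gamma}(a, b) \\
\kappa \mid \boldsymbol{\beta}, \phi, \mathbf{Y} & \sim \textsf{Gamma}\left(a + \frac{p}{2},\; b + \frac{\phi \, \boldsymbol{\beta}^T\boldsymbol{\beta}}{2}\right)
\end{align}\]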
Examples:
Tibshirani (JRSS B 1996) proposed estimating coefficients through \(L_1\) constrained least squares: the “Least Absolute Shrinkage and Selection Operator”
Control how large coefficients may grow \[\begin{align} \min_{\boldsymbol{\beta}} & \| \mathbf{Y}- \mathbf{1}_n \alpha - \mathbf{X}\boldsymbol{\beta}\|^2 \\ & \textsf{ subject to } \sum |\beta_j| \le t \end{align}\]
Equivalent Quadratic Programming Problem for the “penalized” Likelihood \[\min_{\boldsymbol{\beta}} \| \mathbf{Y}- \mathbf{1}_n \alpha - \mathbf{X}\boldsymbol{\beta}\|^2 + \lambda \|\boldsymbol{\beta}\|_1\]
Equivalent to finding posterior mode \[ \max_{\boldsymbol{\beta}} -\frac{\phi}{2} \{ \| \mathbf{Y}- \mathbf{1}_n \alpha - \mathbf{X}\boldsymbol{\beta}\|^2 + \lambda \|\boldsymbol{\beta}\|_1 \} \]
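As a concrete sketch (the data below are made up; scikit-learn is just one of several solvers), note that scikit-learn's `Lasso` minimizes \(\frac{1}{2n}\|\mathbf{Y}- \mathbf{X}\boldsymbol{\beta}\|^2 + \alpha \|\boldsymbol{\beta}\|_1\), so its \(\alpha\) corresponds to \(\lambda/(2n)\) in the penalized form above:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
n, p = 100, 10
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:3] = [3.0, -2.0, 1.5]
y = 0.5 + X @ beta_true + rng.normal(size=n)

lam = 20.0                                  # lambda in ||Y - 1 a - X b||^2 + lam ||b||_1
fit = Lasso(alpha=lam / (2 * n), fit_intercept=True).fit(X, y)
print(fit.intercept_, fit.coef_)            # several coefficients are exactly zero
```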
Park & Casella (JASA 2008) and Hans (Biometrika 2010) propose Bayesian versions of the Lasso
\[\begin{eqnarray*}
\mathbf{Y}\mid \alpha, \boldsymbol{\beta}, \phi & \sim & \textsf{N}(\mathbf{1}_n \alpha + \mathbf{X}\boldsymbol{\beta}, \mathbf{I}_n/\phi) \\
\boldsymbol{\beta}\mid \alpha, \phi, \boldsymbol{\tau}& \sim & \textsf{N}(\mathbf{0}, \textsf{diag}(\boldsymbol{\tau}^2)/\phi) \\
\tau_1^2, \ldots, \tau_p^2 \mid \alpha, \phi & \mathrel{\mathop{\sim}\limits^{\rm iid}}& \textsf{Exp}(\lambda^2/2) \\
p(\alpha, \phi) & \propto& 1/\phi \\
\end{eqnarray*}\]
Can show that \(\beta_j \mid \phi, \lambda \mathrel{\mathop{\sim}\limits^{\rm iid}}DE(\lambda \sqrt{\phi})\) \[\int_0^\infty \sqrt{\frac{\phi}{2 \pi s}} e^{-\frac{1}{2} \phi \frac{\beta^2}{s}} \, \frac{\lambda^2}{2} e^{- \frac{\lambda^2 s}{2}}\, ds = \frac{\lambda \phi^{1/2}}{2} e^{-\lambda \phi^{1/2} |\beta|} \]
equivalent to penalized regression with \(\lambda^* = \lambda/\phi^{1/2}\)
Scale Mixture of Normals (Andrews and Mallows 1974)
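A quick numerical check of the mixture identity above (the values of \(\lambda\), \(\phi\), \(\beta\) below are arbitrary assumptions): integrating the normal density against the exponential mixing density recovers the double exponential.

```python
import numpy as np
from scipy.integrate import quad

lam, phi, beta = 1.3, 2.0, 0.7      # arbitrary illustrative values

def integrand(s):
    # N(beta; 0, s/phi) density times Exp(lam^2/2) density for the mixing variable s
    normal = np.sqrt(phi / (2 * np.pi * s)) * np.exp(-0.5 * phi * beta ** 2 / s)
    expo = (lam ** 2 / 2) * np.exp(-lam ** 2 * s / 2)
    return normal * expo

mixture, _ = quad(integrand, 0, np.inf)
double_exp = (lam * np.sqrt(phi) / 2) * np.exp(-lam * np.sqrt(phi) * abs(beta))
print(mixture, double_exp)          # the two values agree
```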
Integrate out \(\alpha\): \(\alpha \mid \mathbf{Y}, \phi \sim \textsf{N}(\bar{y}, 1/(n \phi))\)
\(\boldsymbol{\beta}\mid \boldsymbol{\tau}, \phi, \lambda, \mathbf{Y}\sim \textsf{N}(, )\)
\(\phi \mid \boldsymbol{\tau}, \boldsymbol{\beta}, \lambda, \mathbf{Y}\sim \mathbf{G}( , )\)
\(1/\tau_j^2 \mid \boldsymbol{\beta}, \phi, \lambda, \mathbf{Y}\sim \textsf{InvGaussian}( , )\)
For \(X \sim \textsf{InvGaussian}(\mu, \lambda)\), the density is \[ f(x) = \sqrt{\frac{\lambda^2}{2 \pi}} x^{-3/2} e^{- \frac{1}{2} \frac{ \lambda^2( x - \mu)^2} {\mu^2 x}} \qquad x > 0 \]
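As a sketch of how to work with this distribution in practice (the parameter translation below is my assumption and worth double checking), the density above, with mean \(\mu\) and shape \(\lambda^2\), matches `scipy.stats.invgauss` when its `mu` argument is \(\mu/\lambda^2\) and `scale` is \(\lambda^2\):

```python
import numpy as np
from scipy.stats import invgauss

mu, lam = 0.8, 1.5                  # arbitrary illustrative values

def slide_density(x):
    # density as written above: mean mu, shape lam^2
    return np.sqrt(lam ** 2 / (2 * np.pi)) * x ** (-1.5) * \
        np.exp(-0.5 * lam ** 2 * (x - mu) ** 2 / (mu ** 2 * x))

x = np.linspace(0.05, 5.0, 50)
print(np.allclose(slide_density(x), invgauss.pdf(x, mu / lam ** 2, scale=lam ** 2)))

# a draw, e.g. for 1/tau_j^2 in a Gibbs step, would then be
draw = invgauss.rvs(mu / lam ** 2, scale=lam ** 2, random_state=0)
```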
Homework
Derive the full conditionals for \(\boldsymbol{\beta}\), \(\phi\), \(1/\tau^2\) for the model in Park & Casella
Posterior mode (like in the LASSO) may set some coefficients exactly to zero leading to variable selection - optimization problem (quadratic programming)
Posterior distribution for \(\beta_j\) does not assign any probability to \(\beta_j = 0\), so the posterior mean results in no selection, but shrinkage of coefficients toward the prior mean of zero
In both cases, large coefficients may be over-shrunk (true for LASSO too)!
Issue is that the tails of the double exponential prior are not heavier than those of the normal likelihood
Only one parameter \(\lambda\) that controls shrinkage and selection (with the mode)
Need priors with heavier tails than the normal!!!
HS - Horseshoe of Carvalho, Polson & Scott (slight difference in CPS notation)
\[\begin{align} \boldsymbol{\beta}\mid \phi, \boldsymbol{\tau}& \sim \textsf{N}(\mathbf{0}_p, \frac{\textsf{diag}(\boldsymbol{\tau}^2)}{ \phi }) \\ \tau_j \mid \lambda & \mathrel{\mathop{\sim}\limits^{\rm iid}}\textsf{C}^+(0, \lambda^2) \\ \lambda & \sim \textsf{C}^+(0, 1) \\ p(\alpha, \phi) & \propto 1/\phi \end{align}\]
canonical representation (normal means problem) \(\mathbf{Y}= \mathbf{I}_p \boldsymbol{\beta}+ \boldsymbol{\epsilon}\) so \(\hat{\beta}_i = y_i\) \[ \textsf{E}[\beta_i \mid \mathbf{Y}] = \int_0^1 (1 - \psi_i) y_i \, p(\psi_i \mid \mathbf{Y})\ d\psi_i = (1 - \textsf{E}[\psi_i \mid y_i]) y_i\]
\(\psi_i = 1/(1 + \tau_i^2)\) shrinkage factor
Posterior mean \(E[\beta \mid y] = y + \frac{d} {d y} \log m(y)\) where \(m(y)\) is the predictive density under the prior (known \(\lambda\))
Bounded Influence: if \(\lim_{|y| \to \infty} \frac{d}{dy} \log m(y) = c\) (for some constant \(c\))
HS has bounded influence with \(c = 0\), so \[ E[\beta \mid y] \to y \quad \text{as } |y| \to \infty \]
DE has bounded influence but \(c \ne 0\); the bound does not decay to zero, so there is bias for large \(|y_i|\)
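A numerical sketch of the bounded-influence point (a unit-variance normal means likelihood and \(\lambda = 1\) are assumed): the score \(\frac{d}{dy} \log m(y)\) under the DE prior settles near \(-\lambda\) as \(y\) grows, so the posterior mean stays biased by about \(\lambda\).

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

lam = 1.0

def m(y):
    # predictive density m(y) = int N(y; beta, 1) DE(beta; lam) dbeta
    f = lambda b: norm.pdf(y - b) * (lam / 2) * np.exp(-lam * abs(b))
    lo, hi = min(0.0, y) - 15.0, max(0.0, y) + 15.0
    val, _ = quad(f, lo, hi, points=[0.0])
    return val

for y in [2.0, 5.0, 10.0, 20.0]:
    h = 1e-4
    score = (np.log(m(y + h)) - np.log(m(y - h))) / (2 * h)
    print(y, score)   # approaches -lam, so E[beta | y] is about y - lam for large y
```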
Fan & Li (JASA 2001) discuss Variable Selection via Nonconcave Penalties and Oracle Properties
Model \(Y = \mathbf{1}_n \alpha + \mathbf{X}\boldsymbol{\beta}+ \boldsymbol{\epsilon}\) with \(\mathbf{X}^T\mathbf{X}= \mathbf{I}_p\) (orthonormal) and \(\boldsymbol{\epsilon}\sim N(0, \mathbf{I}_n)\)
Penalized negative log likelihood (penalized least squares) \[\frac 1 2 \|\mathbf{Y}- \hat{\mathbf{Y}}\|^2 +\frac 1 2 \sum_j(\beta_j - \hat{\beta}_j)^2 + \sum_j \text{ pen}_\lambda(|\beta_j|)\]
duality: \(\text{pen}_\lambda(|\beta_j|) \equiv - \log p(\beta_j)\) (negative log prior)
Objectives:
Derivative of \(\frac 1 2 \sum_j(\beta_j - \hat{\beta}_j)^2 + \sum_j \text{pen}_\lambda(|\beta_j|)\) is \(\mathop{\mathrm{sgn}}(\beta_j)\left\{|\beta_j| + \text{pen}^\prime_\lambda(|\beta_j|) \right\} - \hat{\beta}_j\)
Conditions:
unbiasedness: if \(\text{pen}^\prime_\lambda(|\beta|) = 0\) for large \(|\beta|\), the estimator equals \(\hat{\beta}_j\) for large \(|\hat{\beta}_j|\)
thresholding: \(\min \left\{ |\beta_j| + \text{pen}^\prime_\lambda(|\beta_j|)\right\} > 0\) then estimator is 0 if \(|\hat{\beta}_j| < \min \left\{ |\beta_j| + \text{pen}^\prime_\lambda(|\beta_j|) \right\}\)
continuity: the minimum of \(|\beta_j| + \text{pen}^\prime_\lambda(|\beta_j|)\) is attained at zero
Can show that the LASSO/Bayesian Lasso fails the condition for unbiasedness (see the sketch below)
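A small sketch of why (orthonormal design as in the setup above; the numbers are made up): the \(L_1\) penalty gives soft thresholding, which keeps shrinking large coefficients by \(\lambda\) no matter how large they are.

```python
import numpy as np

def soft_threshold(b_hat, lam):
    # lasso solution under an orthonormal design: sign(b)(|b| - lam)_+
    return np.sign(b_hat) * np.maximum(np.abs(b_hat) - lam, 0.0)

b_hat = np.array([0.3, 1.0, 5.0, 10.0])   # least squares estimates (made up)
print(soft_threshold(b_hat, lam=0.8))
# small coefficients are zeroed out (selection), but the large ones are
# still shifted by lam, so the bias never vanishes (fails unbiasedness)
```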
What about other Bayes methods?
Homework
Check the conditions for the DE, Generalized Double Pareto and Cauchy priors
Only get variable selection if we use the posterior mode
If selection is a goal of the analysis, build it into the model/analysis/post-analysis:
prior belief that coefficient is zero
selection solved as a post-analysis decision problem
Even if selection is not an objective, account for the uncertainty that some predictors may be unrelated