Lecture 13: Ridge Regression, Lasso and Mixture Priors

STA702

Merlise Clyde

Duke University

Ridge Regression

Model: \(\mathbf{Y}= \mathbf{1}_n \alpha + \mathbf{X}\boldsymbol{\beta}+ \boldsymbol{\epsilon}\)

  • typically expect the intercept \(\alpha\) to be of a different order of magnitude than the other coefficients. Adopt a two-block prior with \(p(\alpha) \propto 1\)

  • Prior \(\boldsymbol{\beta}\mid \phi \sim \textsf{N}\left(\mathbf{0}_p, \frac{1}{\phi \kappa} \mathbf{I}_p\right)\) implies the \(\beta_j\) are exchangeable a priori (i.e. the distribution is invariant under permutation of the labels, with a common mean and scale)

  • Posterior for \(\boldsymbol{\beta}\) \[\boldsymbol{\beta}\mid \phi, \kappa, \mathbf{Y}\sim \textsf{N}\left((\kappa \mathbf{I}_p + \mathbf{X}^T\mathbf{X})^{-1} \mathbf{X}^T\mathbf{Y}, \frac{1}{\phi}(\kappa \mathbf{I}_p + \mathbf{X}^T\mathbf{X})^{-1} \right)\]

  • assume that \(\mathbf{X}\) has been centered and scaled so that \(\mathbf{X}^T\mathbf{X}= \textsf{corr}(\mathbf{X})\) and \(\mathbf{1}_n^T \mathbf{X}= \mathbf{0}_p\)

X = scale(X) / sqrt(nrow(X) - 1)   # so that t(X) %*% X = corr(X)
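A minimal sketch of computing the posterior mean above for a fixed \(\kappa\); the function name is illustrative, and since the columns of X are centered, centering Y is optional:

# sketch: posterior mean of beta from the formula above
# (assumes X centered/scaled as above; kappa fixed)
ridge_post_mean = function(X, Y, kappa) {
  solve(crossprod(X) + kappa * diag(ncol(X)), crossprod(X, Y))
}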

Bayes Ridge Regression

  • related to penalized maximum likelihood estimation: the log posterior for \(\boldsymbol{\beta}\) is, up to a constant, \[-\frac{\phi}{2}\left(\|\mathbf{Y}- \mathbf{X}\boldsymbol{\beta}\|^2 + \kappa \| \boldsymbol{\beta}\|^2 \right) \]

  • frequentist’s expected mean squared error for using the ridge estimator \(\mathbf{b}_n\) (the posterior mean above) \[\textsf{E}_{\mathbf{Y}\mid \boldsymbol{\beta}_*}[\| \mathbf{b}_n - \boldsymbol{\beta}_* \|^2] = \sigma^2 \sum_{j = 1}^p \frac{\lambda_j}{(\lambda_j + \kappa)^2} + \kappa^2 \boldsymbol{\beta}_*^T(\mathbf{X}^T\mathbf{X}+ \kappa \mathbf{I}_p)^{-2} \boldsymbol{\beta}_*\]

  • eigendecomposition \(\mathbf{X}^T\mathbf{X}= \mathbf{V}\boldsymbol{\Lambda}\mathbf{V}^T\) with eigenvalues \([\boldsymbol{\Lambda}]_{jj} = \lambda_j\)

  • can show that there is always a value of \(\kappa > 0\) for which the expected mean squared error is smaller for the (Bayes) ridge estimator than for the MLE (see the sketch after this list)

  • Unfortunately the optimal choice depends on “true” \(\boldsymbol{\beta}_*\)!

  • orthogonal \(\mathbf{X}\) leads to James-Stein solution related to Empirical Bayes
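A sketch evaluating the risk expression above on a grid of \(\kappa\); beta_star and sigma are hypothetical inputs, and \(\kappa = 0\) recovers the MLE risk:

# sketch: frequentist risk of the (Bayes) ridge estimator as a function of kappa
ridge_risk = function(kappa, XtX, beta_star, sigma = 1) {
  lambda = eigen(XtX, symmetric = TRUE)$values   # eigenvalues of X^T X
  M = solve(XtX + kappa * diag(nrow(XtX)))       # (X^T X + kappa I)^{-1}
  sigma^2 * sum(lambda / (lambda + kappa)^2) +   # variance term
    kappa^2 * drop(t(beta_star) %*% M %*% M %*% beta_star)  # squared-bias term
}
# e.g. sapply(seq(0, 5, by = 0.1), ridge_risk, XtX = crossprod(X), beta_star = beta_star)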

Choice of \(\kappa\)?

  • fixed a priori (Bayes; but how to choose?)

  • Cross-validation (frequentist; see the sketch after this list)

  • Empirical Bayes? (frequentist/Bayes)

  • Should there be a common \(\kappa\)? (same shrinkage across all variables?)

  • Or a \(\kappa_j\) per variable? Or \(\kappa\) shared among a group of variables (e.g. factors)?

  • Treat \(\kappa\) as unknown!
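One frequentist option from the list above, sketched with the glmnet package (assuming it is installed, with hypothetical X and Y); glmnet's lambda plays the role of \(\kappa\) up to the package's internal scaling of the penalty:

# sketch: choose the ridge penalty by cross-validation
library(glmnet)
cvfit = cv.glmnet(X, Y, alpha = 0)   # alpha = 0 selects the ridge penalty
cvfit$lambda.min                     # penalty value minimizing CV error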

Mixture of Conjugate Priors

  • can place a prior on \(\kappa\) or \(\kappa_j\) for fully Bayes

  • similar option for \(g\) in the \(g\) priors

  • often improves robustness over fixed choices of the hyperparameter

  • may not have a closed-form posterior, but sampling is still often easy!

  • Examples:

    • Bayesian Lasso (Park & Casella, Hans)
    • Generalized Double Pareto (Armagan, Dunson & Lee)
    • Horseshoe (Carvalho, Polson & Scott)
    • Normal-Exponential-Gamma (Griffin & Brown)
    • mixtures of \(g\)-priors (Liang et al)

Lasso

Tibshirani (JRSS B 1996) proposed estimating coefficients through \(L_1\)-constrained least squares: the ``Least Absolute Shrinkage and Selection Operator’’

  • Control how large coefficients may grow \[\begin{align} \min_{\boldsymbol{\beta}} & \| \mathbf{Y}- \mathbf{1}_n \alpha - \mathbf{X}\boldsymbol{\beta}\|^2 \\ & \textsf{ subject to } \sum |\beta_j| \le t \end{align}\]

  • Equivalent Quadratic Programming Problem for the ``penalized’’ likelihood (see the sketch after this list) \[\min_{\boldsymbol{\beta}} \| \mathbf{Y}- \mathbf{1}_n \alpha - \mathbf{X}\boldsymbol{\beta}\|^2 + \lambda \|\boldsymbol{\beta}\|_1\]

  • Equivalent to finding posterior mode \[ \max_{\boldsymbol{\beta}} -\frac{\phi}{2} \{ \| \mathbf{Y}- \mathbf{1}_n \alpha - \mathbf{X}\boldsymbol{\beta}\|^2 + \lambda \|\boldsymbol{\beta}\|_1 \} \]
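The penalized form above is what standard software solves; a minimal sketch with the glmnet package (assuming it is installed, with hypothetical X and Y):

# sketch: lasso fit over a path of penalty values
library(glmnet)
fit = glmnet(X, Y, alpha = 1)   # alpha = 1 gives the L1 (lasso) penalty
coef(fit, s = 0.1)              # coefficients at one (hypothetical) lambda value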

Bayesian Lasso

Park & Casella (JASA 2008) and Hans (Biometrika 2010) propose Bayesian versions of the Lasso
\[\begin{eqnarray*} \mathbf{Y}\mid \alpha, \boldsymbol{\beta}, \phi & \sim & \textsf{N}(\mathbf{1}_n \alpha + \mathbf{X}\boldsymbol{\beta}, \mathbf{I}_n/\phi) \\ \boldsymbol{\beta}\mid \alpha, \phi, \boldsymbol{\tau}& \sim & \textsf{N}(\mathbf{0}, \textsf{diag}(\boldsymbol{\tau}^2)/\phi) \\ \tau_1^2, \ldots, \tau_p^2 \mid \alpha, \phi & \mathrel{\mathop{\sim}\limits^{\rm iid}}& \textsf{Exp}(\lambda^2/2) \\ p(\alpha, \phi) & \propto& 1/\phi \end{eqnarray*}\]

  • Can show that \(\beta_j \mid \phi, \lambda \mathrel{\mathop{\sim}\limits^{\rm iid}}\textsf{DE}(\lambda \sqrt{\phi})\) \[\int_0^\infty \sqrt{\frac{\phi}{2 \pi s}} e^{-\frac{1}{2} \phi \frac{\beta^2}{s }} \, \frac{\lambda^2}{2} e^{- \frac{\lambda^2 s}{2}}\, ds = \frac{\lambda \phi^{1/2}}{2} e^{-\lambda \phi^{1/2} |\beta|} \]

  • equivalent to penalized regression with \(\lambda^* = \lambda/\phi^{1/2}\)

  • Scale Mixture of Normals representation (Andrews and Mallows 1974); see the simulation sketch below
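A quick Monte Carlo check of the scale-mixture identity above, with illustrative values \(\phi = 1\) and \(\lambda = 2\):

# sketch: simulate from the hierarchy and compare to the DE(lambda * sqrt(phi)) density
set.seed(42)
lam = 2; phi = 1
s = rexp(1e5, rate = lam^2 / 2)                  # tau^2 ~ Exp(lambda^2 / 2)
beta = rnorm(1e5, mean = 0, sd = sqrt(s / phi))  # beta | tau^2 ~ N(0, tau^2 / phi)
hist(beta, breaks = 200, freq = FALSE)
curve(lam * sqrt(phi) / 2 * exp(-lam * sqrt(phi) * abs(x)), add = TRUE, col = "red")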

Gibbs Sampling

  • Integrate out \(\alpha\): \(\alpha \mid \mathbf{Y}, \phi \sim \textsf{N}(\bar{y}, 1/(n \phi))\)

  • \(\boldsymbol{\beta}\mid \boldsymbol{\tau}, \phi, \lambda, \mathbf{Y}\sim \textsf{N}(, )\)

  • \(\phi \mid \boldsymbol{\tau}, \boldsymbol{\beta}, \lambda, \mathbf{Y}\sim \textsf{G}( , )\)

  • \(1/\tau_j^2 \mid \boldsymbol{\beta}, \phi, \lambda, \mathbf{Y}\sim \textsf{InvGaussian}( , )\)

  • For \(X \sim \textsf{InvGaussian}(\mu, \lambda)\), the density is \[ f(x) = \sqrt{\frac{\lambda^2}{2 \pi}} x^{-3/2} e^{- \frac{1}{2} \frac{ \lambda^2( x - \mu)^2} {\mu^2 x}} \qquad x > 0 \]
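Base R has no inverse Gaussian sampler; one standard option is the Michael, Schucany & Haas (1976) transformation, sketched here with a mean/shape parameterization (in the slide's notation the shape parameter is \(\lambda^2\)):

# sketch: inverse Gaussian sampler via Michael, Schucany & Haas
rinvgaussian = function(n, mu, shape) {
  y = rnorm(n)^2
  x = mu + mu^2 * y / (2 * shape) -
    mu / (2 * shape) * sqrt(4 * mu * shape * y + mu^2 * y^2)
  ifelse(runif(n) <= mu / (mu + x), x, mu^2 / x)   # accept x, else return mu^2 / x
}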

Homework

Derive the full conditionals for \(\boldsymbol{\beta}\), \(\phi\), and \(1/\tau_j^2\) for the model in Park & Casella

Choice of Estimator

  • Posterior mode (as in the LASSO) may set some coefficients exactly to zero, leading to variable selection; found by optimization (quadratic programming)

  • Posterior distribution for \(\beta_j\) does not assign any probability to \(\beta_j = 0\), so the posterior mean yields no selection, only shrinkage of coefficients toward the prior mean of zero

  • In both cases, large coefficients may be over-shrunk (true for LASSO too)!

  • Issue is that the tails of the double exponential prior are not heavier than those of the normal likelihood

  • Only one parameter \(\lambda\) controls both shrinkage and selection (with the mode)

  • Need priors with heavier tails than the normal!!!

Shrinkage Comparison with Posterior Mean

HS: the Horseshoe prior of Carvalho, Polson & Scott (notation here differs slightly from CPS)

\[\begin{align} \boldsymbol{\beta}\mid \phi, \boldsymbol{\tau}& \sim \textsf{N}(\mathbf{0}_p, \frac{\textsf{diag}(\boldsymbol{\tau}^2)}{ \phi }) \\ \tau_j \mid \lambda & \mathrel{\mathop{\sim}\limits^{\rm iid}}\textsf{C}^+(0, \lambda^2) \\ \lambda & \sim \textsf{C}^+(0, 1) \\ p(\alpha, \phi) & \propto 1/\phi \end{align}\]

  • resulting prior on \(\boldsymbol{\beta}\) has heavy tails like a Cauchy! (see the simulation sketch below)
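A small simulation from the hierarchy above (taking \(\phi = 1\) and using \(|\textsf{Cauchy}|\) draws for the half-Cauchy) illustrates the heavy tails:

# sketch: draws from the horseshoe prior (phi = 1)
set.seed(42)
lambda = abs(rcauchy(1))                   # global scale: lambda ~ C+(0, 1)
tau = abs(rcauchy(1e4, scale = lambda^2))  # local scales: tau_j | lambda ~ C+(0, lambda^2)
beta = rnorm(1e4, sd = tau)                # beta_j | tau_j ~ N(0, tau_j^2)
quantile(abs(beta), c(0.5, 0.99, 0.999))   # extreme quantiles reveal Cauchy-like tails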

Bounded Influence for Mean

  • canonical representation (normal means problem) \(\mathbf{Y}= \mathbf{I}_p \boldsymbol{\beta}+ \boldsymbol{\epsilon}\) so \(\hat{\beta}_i = y_i\) \[ \textsf{E}[\beta_i \mid \mathbf{Y}] = \int_0^1 (1 - \psi_i) y_i \, p(\psi_i \mid \mathbf{Y})\ d\psi_i = (1 - \textsf{E}[\psi_i \mid y_i]) y_i\]

  • \(\psi_i = 1/(1 + \tau_i^2)\) is the shrinkage factor

  • Posterior mean \(\textsf{E}[\beta \mid y] = y + \frac{d}{dy} \log m(y)\), where \(m(y)\) is the marginal (predictive) density of \(y\) under the prior (known \(\lambda\))

  • Bounded Influence: if \(\lim_{|y| \to \infty} \frac{d}{dy} \log m(y) = c\) (for some constant \(c\))

  • HS has bounded influence with \(c = 0\), so \[\textsf{E}[\beta \mid y] \to y \text{ as } |y| \to \infty \]

  • DE has bounded influence but \(c \ne 0\); the bound does not decay to zero, so there is bias for large \(|y_i|\) (see the numerical check below)
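A numerical check for the DE prior, assuming a unit-variance likelihood and an illustrative rate of 1: the score \(\frac{d}{dy} \log m(y)\) levels off at \(c = -1\) instead of decaying to zero.

# sketch: d/dy log m(y) under a DE(1) prior, by quadrature
m = function(y) integrate(function(b) dnorm(y - b) * 0.5 * exp(-abs(b)),
                          lower = y - 10, upper = y + 10)$value
score = function(y, h = 1e-4) (log(m(y + h)) - log(m(y - h))) / (2 * h)
sapply(c(2, 5, 10, 20), score)   # approaches c = -1: a constant bias, not zero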

Properties for Shrinkage and Selection

Fan & Li (JASA 2001) discuss Variable Selection via Nonconcave Penalties and Oracle Properties

  • Model \(\mathbf{Y}= \mathbf{1}_n \alpha + \mathbf{X}\boldsymbol{\beta}+ \boldsymbol{\epsilon}\) with \(\mathbf{X}^T\mathbf{X}= \mathbf{I}_p\) (orthonormal) and \(\boldsymbol{\epsilon}\sim \textsf{N}(\mathbf{0}, \mathbf{I}_n)\)

  • Penalized (negative) log likelihood: since \(\mathbf{X}^T\mathbf{X}= \mathbf{I}_p\), \(\|\mathbf{Y}- \mathbf{X}\boldsymbol{\beta}\|^2 = \|\mathbf{Y}- \hat{\mathbf{Y}}\|^2 + \sum_j(\beta_j - \hat{\beta}_j)^2\), so minimize \[\frac 1 2 \|\mathbf{Y}- \hat{\mathbf{Y}}\|^2 +\frac 1 2 \sum_j(\beta_j - \hat{\beta}_j)^2 + \sum_j \text{pen}_\lambda(|\beta_j|)\]

  • duality: \(\text{pen}_\lambda(|\beta_j|) \equiv -\log p(\beta_j)\) (penalty is the negative log prior)

  • Objectives:

    • Unbiasedness: (nearly) unbiased for large \(|\beta_j|\)
    • Sparsity: a thresholding rule sets small coefficients to 0
    • Continuity: the estimator is continuous in \(\hat{\beta}_j\)

Conditions on Prior/Penalty

The derivative with respect to \(\beta_j\) of \(\frac 1 2 \sum_j(\beta_j - \hat{\beta}_j)^2 + \sum_j \text{pen}_\lambda(|\beta_j|)\) is \(\mathop{\mathrm{sgn}}(\beta_j)\left\{|\beta_j| + \text{pen}^\prime_\lambda(|\beta_j|) \right\} - \hat{\beta}_j\)

  • Conditions:

    • unbiasedness: if \(\text{pen}^\prime_\lambda(|\beta|) = 0\) for large \(|\beta|\), the estimator is \(\hat{\beta}_j\)

    • thresholding: if \(\min \left\{ |\beta_j| + \text{pen}^\prime_\lambda(|\beta_j|)\right\} > 0\), then the estimator is 0 when \(|\hat{\beta}_j| < \min \left\{ |\beta_j| + \text{pen}^\prime_\lambda(|\beta_j|) \right\}\)

    • continuity: the minimum of \(|\beta_j| + \text{pen}^\prime_\lambda(|\beta_j|)\) is attained at zero

  • Can show that the LASSO / Bayesian Lasso fails the condition for unbiasedness (see the check below)

  • What about other Bayes methods?
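As a worked check of these conditions for the LASSO: \(\text{pen}_\lambda(|\beta|) = \lambda |\beta|\) gives \(\text{pen}^\prime_\lambda(|\beta|) = \lambda\) for all \(\beta \ne 0\), which does not vanish for large \(|\beta|\), and the resulting estimator is soft thresholding \[\hat{\beta}_j^{\textsf{lasso}} = \mathop{\mathrm{sgn}}(\hat{\beta}_j)\left(|\hat{\beta}_j| - \lambda\right)_+\] so even very large coefficients are shrunk by \(\lambda\): the thresholding and continuity conditions hold, but unbiasedness fails.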

Homework

Check the conditions for the DE, Generalized Double Pareto and Cauchy priors

Selection

  • Only get variable selection if we use the posterior mode

  • If selection is a goal of the analysis, build it into the model/analysis/post-analysis

    • prior belief that a coefficient is zero

    • selection solved as a post-analysis decision problem

  • Even if selection is not an objective, account for the uncertainty that some predictors may be unrelated to the response