Lecture 23: Bayesian Adaptive Regression Kernels

STA702

Merlise Clyde

Duke University

Nonparametric Regression

  • Consider the model \(Y_i \mathrel{\mathop{\sim}\limits^{\rm ind}}\textsf{N}\left(\mu(\mathbf{x}_i), \sigma^2 \right)\), \(i = 1, \dots, n\)

  • Mean function represented via a Stochastic Expansion \[\mu(\mathbf{x}_i) = \sum_{j \le J} b_j(\mathbf{x}_i, \boldsymbol{\omega}_j) \beta_j\]

  • Multivariate Gaussian Kernel \(g\) with parameters \(\boldsymbol{\omega}= (\boldsymbol{\chi}, \boldsymbol{\Lambda})\) \[ b_j(\mathbf{x}, \boldsymbol{\omega}_j) = g( \boldsymbol{\Lambda}_j^{1/2} (\mathbf{x}- \boldsymbol{\chi}_j)) = \exp\left\{-\frac{1}{2}(\mathbf{x}- \boldsymbol{\chi}_j)^T \boldsymbol{\Lambda}_j (\mathbf{x}- \boldsymbol{\chi}_j)\right\} \]

  • introduce a Lévy measure \(\nu( d\beta, d \boldsymbol{\omega})\)

  • Poisson distribution \(J \sim \textsf{Poi}(\nu_+)\) where \(\nu_+\equiv \nu(\mathbb{R}\times\boldsymbol{\Omega}) = \iint \nu(\beta, \boldsymbol{\omega}) d \beta \, d\boldsymbol{\omega}\) \[\beta_j,\boldsymbol{\omega}_j \mid J \mathrel{\mathop{\sim}\limits^{\rm iid}}\pi(\beta, \boldsymbol{\omega}) \propto \nu(\beta,\boldsymbol{\omega})\]
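To make the generative model concrete, here is a minimal sketch of one prior draw of \(\mu\) in one dimension (the values for \(\nu_+\) and the priors on \(\boldsymbol{\omega}\) and \(\beta\) are illustrative, not from the slides):

# one draw of mu(x) from the Levy prior (1-d sketch, illustrative hyperparameters)
set.seed(1)
nu.plus = 10                       # total mass nu_+
J = rpois(1, nu.plus)              # number of kernels
stopifnot(J > 0)                   # sketch ignores the J = 0 corner case
chi    = runif(J)                  # kernel locations chi_j
lambda = rgamma(J, 2, 1)           # kernel precisions (hypothetical prior)
beta   = rcauchy(J)                # heavy-tailed coefficients (hypothetical prior)
x  = seq(0, 1, length.out = 200)
mu = numeric(length(x))
for (j in 1:J)
  mu = mu + beta[j] * exp(-0.5 * lambda[j] * (x - chi[j])^2)
plot(x, mu, type = "l")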

Function Spaces

  • Conditions on \(\nu\)

    • need to have that \(|\beta_j|\) are absolutely summable
    • finite number of large coefficients (in absolute value)
    • allows an infinite number of small \(\beta_j \in [-\epsilon, \epsilon]\)
  • satisfied if \[\iint_{\mathbb{R}\times \boldsymbol{\Omega}}( 1 \wedge |\beta|) \nu(\beta, \boldsymbol{\omega}) d\beta \, d\boldsymbol{\omega}< \infty\] (a quick check for the \(\alpha\)-stable case follows this list)

  • Mean function \(\textsf{E}[Y_i \mid \boldsymbol{\theta}] = \mu(\mathbf{x}_i, \boldsymbol{\theta})\) falls in some class of nonlinear functions based on \(g\) and prior on \(\boldsymbol{\Lambda}\)

    • Besov Space
    • Sobolev Space
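As a quick check (not on the slides): for the \(\alpha\)-stable measure used later, \(\nu(\beta, \boldsymbol{\omega}) \propto |\beta|^{-(\alpha + 1)}\), \[\int_{\mathbb{R}} (1 \wedge |\beta|)\, |\beta|^{-(\alpha + 1)} d\beta = \int_{|\beta| \le 1} |\beta|^{-\alpha} d\beta + \int_{|\beta| > 1} |\beta|^{-(\alpha + 1)} d\beta,\] where the second term is finite for all \(\alpha > 0\) but the first is finite only for \(\alpha < 1\); for \(1 \le \alpha < 2\) one instead relies on symmetry of \(\nu\) and the weaker condition \(\iint (1 \wedge \beta^2)\, \nu(\beta, \boldsymbol{\omega})\, d\beta \, d\boldsymbol{\omega}< \infty\).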

Inference via Reversible Jump MCMC

  • number of support points \(J\) varies from iteration to iteration via the following moves (a schematic sweep follows this list)

    • add a new point (birth)
    • delete an existing point (death)
    • combine two points (merge)
    • split one point into two (split)
  • update existing point(s)

  • can be much faster than shrinkage or BMA with a fixed but large \(J\)
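A schematic of one RJ-MCMC sweep over the point configuration; the move probabilities and the propose.* / accept.prob helpers are hypothetical placeholders (the stubs below only make the skeleton runnable; real proposals would change \((\beta_j, \boldsymbol{\omega}_j)\) and \(J\), and accept.prob would be the Metropolis-Hastings-Green ratio):

# placeholder stubs so the skeleton runs; not real proposal mechanisms
propose.birth = propose.death = propose.merge = propose.split =
  propose.update = function(state) state
accept.prob = function(state, proposal, move) 1   # stands in for the MH-Green ratio

one.sweep = function(state) {
  move = sample(c("birth", "death", "merge", "split", "update"), 1,
                prob = c(.2, .2, .1, .1, .4))     # hypothetical move probabilities
  proposal = switch(move,
    birth  = propose.birth(state),    # add a new (beta, omega) pair
    death  = propose.death(state),    # delete an existing support point
    merge  = propose.merge(state),    # combine two points into one
    split  = propose.split(state),    # split one point into two
    update = propose.update(state))   # fixed-dimension update of existing point(s)
  # accept or reject with the appropriate Metropolis-Hastings-Green ratio
  if (runif(1) < accept.prob(state, proposal, move)) proposal else state
}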

So far

  • more parsimonious than “shrinkage” priors or SVM with fixed \(J\)

  • allows for increasing number of support points as \(n\) increases (adapts to smoothness)

  • no problem with non-normal data, non-negative functions or even discontinuous functions

  • credible and prediction intervals; uncertainty quantification

  • robust alternative to Gaussian Process Priors

  • But: hard to scale up random scales & locations as the dimension of \(\mathbf{x}\) increases

  • Alternative: the Approximate Lévy Prior II, described next

Higher Dimensional \(\mathbf{X}\)

MCMC is (currently) too slow in higher dimensional spaces to allow \(\boldsymbol{\chi}\) to be completely arbitrary, so:

  • restrict the support of \(\boldsymbol{\chi}\) to the observed \(\{\mathbf{x}_i\}\) as in SVM (or to observed quantiles)

  • use a common diagonal \(\boldsymbol{\Lambda}\) for all kernels

  • Kernels take the form (see the sketch after this list): \[\begin{aligned} b_j(\mathbf{x}, \boldsymbol{\omega}_j) &= \prod_d \exp\left\{ -\frac{1}{2} \lambda_d (x_d - \chi_{dj})^2 \right\} \\ \mu(\mathbf{x}) &= \sum_j b_j(\mathbf{x}, \boldsymbol{\omega}_j) \beta_j \end{aligned}\]

  • accommodates nonlinear interactions among variables

  • ensemble model like random forests, boosting, BART, SVM
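A direct transcription of the product kernel and mean function into R (a sketch; the function names are illustrative):

# b_j(x, omega_j) = prod_d exp(-lambda_d (x_d - chi_dj)^2 / 2)
prod.kernel = function(x, chi.j, lambda)
  exp(-0.5 * sum(lambda * (x - chi.j)^2))

# mu(x) = sum_j beta_j b_j(x, omega_j); chi is a J x d matrix of locations
mu.fun = function(x, beta, chi, lambda)
  sum(beta * apply(chi, 1, function(chi.j) prod.kernel(x, chi.j, lambda)))

# example: J = 3 kernels in d = 2 dimensions
set.seed(1)
chi = matrix(runif(6), 3, 2); beta = rcauchy(3); lambda = c(1, 2)
mu.fun(c(0.5, 0.5), beta, chi, lambda)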

Approximate Lévy Prior II

  • \(\alpha\)-Stable process: \(\nu(d\beta, d \boldsymbol{\omega}) = \gamma c_\alpha |\beta|^{-(\alpha + 1)} d\beta \ \pi(d \boldsymbol{\omega})\)

  • Continuous Approximation to an \(\alpha\)-Stable process via a Student \(t(\alpha, 0, \epsilon)\): \[\nu_\epsilon(d\beta, d \boldsymbol{\omega}) = \gamma c_\alpha (\beta^2 + \alpha \epsilon^2)^{-(\alpha + 1)/2} d\beta \ \pi(d \boldsymbol{\omega})\]

  • Based on the following hierarchical prior

    \[\begin{aligned} \beta_j \mid \varphi_j & \mathrel{\mathop{\sim}\limits^{\rm ind}}\textsf{N}(0, \varphi_j^{-1}) & & \varphi_j \mathrel{\mathop{\sim}\limits^{\rm iid}}\textsf{Gamma}(\alpha/2, \alpha \epsilon^2/2) \\ J & \sim \textsf{Poi}(\nu^+_\epsilon) & & \nu^+_\epsilon = \nu_\epsilon(\mathbb{R}, \boldsymbol{\Omega}) = \gamma/\epsilon^{\alpha} \end{aligned}\]

Key Idea: need to have variance/scale of coefficients decrease as \(J\) increases
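A quick Monte Carlo check (not on the slides) that the scale mixture above reproduces the Student \(t\) kernel: mixing \(\textsf{N}(0, \varphi^{-1})\) over \(\varphi \sim \textsf{Gamma}(\alpha/2, \alpha\epsilon^2/2)\) gives \(\beta \sim t(\alpha, 0, \epsilon)\), i.e. \(\beta/\epsilon \sim t_\alpha\):

# beta | phi ~ N(0, 1/phi), phi ~ Gamma(alpha/2, alpha eps^2/2) => beta/eps ~ t_alpha
set.seed(1)
alpha = 1; eps = 0.1
phi  = rgamma(1e5, shape = alpha/2, rate = alpha * eps^2/2)
beta = rnorm(1e5, 0, 1/sqrt(phi))
qqplot(qt(ppoints(1e5), df = alpha), beta/eps)   # should follow the line y = x
abline(0, 1)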

Limiting Case

\[\begin{aligned} \beta_j \mid \varphi_j & \mathrel{\mathop{\sim}\limits^{\rm ind}}&\textsf{N}(0, 1/\varphi_j) \\ \varphi_j & \mathrel{\mathop{\sim}\limits^{\rm iid}}& \textsf{Gamma}(\alpha/2, 0) \end{aligned}\]

Notes:

  • Require \(0 < \alpha < 2\); additional restrictions on \(\boldsymbol{\omega}\) apply

  • Tipping’s Relevance Vector Machine corresponds to \(\alpha = 0\) (improper posterior!)

  • Provides an extension of Generalized Ridge Priors to infinite dimensional case

  • Cauchy process corresponds to \(\alpha = 1\)

  • Infinite dimensional analog of Cauchy priors

Simplification with \(\alpha = 1\)

  • Poisson number of points \(J \sim \textsf{Poi}(\gamma/\epsilon)\)

  • Given \(J\), the counts \((n_0, \dots, n_n) \sim \textsf{MultNom}(J, 1/(n+1))\) give the number of points supported at the kernel located at each \(\mathbf{x}_i\)

  • Aggregating, the regression mean function can be rewritten as \[\mu(\mathbf{x}) = \sum_{i=0}^n \tilde{\beta}_i b_i(\mathbf{x}, \boldsymbol{\omega}_i), \quad \tilde{\beta}_i = \sum_{\{j \, \mid \, \boldsymbol{\chi}_j = \mathbf{x}_i\}} \beta_j\]

If \(\alpha = 1\), not only is the Cauchy process infinitely divisible, but the approximating Cauchy prior distributions for \(\beta_j\) are also infinitely divisible!

\[ \tilde{\beta}_i \mathrel{\mathop{\sim}\limits^{\rm ind}} \textsf{N}(0, n_i^2 \tilde{\varphi}_i^{-1}), \qquad \tilde{\varphi}_i \mathrel{\mathop{\sim}\limits^{\rm iid}}\textsf{Gamma}( 1/2, \epsilon^2/2) \]

At most \(n\) non-zero coefficients!
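A sketch of one draw from this aggregated prior (the values of \(\gamma\) and \(\epsilon\) are illustrative):

# one prior draw under the alpha = 1 simplification (illustrative gamma, eps)
set.seed(1)
gam = 5; eps = 0.5; n = 100
J = rpois(1, gam/eps)                                 # J ~ Poi(gamma/eps)
counts = c(rmultinom(1, J, rep(1/(n + 1), n + 1)))    # n_0, ..., n_n
phi.tilde  = rgamma(n + 1, shape = 1/2, rate = eps^2/2)
beta.tilde = rnorm(n + 1, 0, counts/sqrt(phi.tilde))  # sd = 0 when n_i = 0, so beta = 0
sum(counts > 0)                                       # number of active kernels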

Inference for Normal Model

  • integrate out \(\tilde{\boldsymbol{\beta}}\) for marginal likelihood \(\cal{L}(J, \{n_i\}, \{\tilde{\varphi}_i\}, \sigma^2, \boldsymbol{\lambda})\) \[\mathbf{Y}\mid \sigma^2, \{n_i\}, \{\tilde{\varphi}_i\}, \boldsymbol{\lambda}\sim \textsf{N}\left(\mathbf{0}_n, \sigma^2 \mathbf{I}_n + \mathbf{b} \, \textsf{diag}\left(\frac{n_i^2}{\tilde{\varphi}_i} \right)\mathbf{b}^T\right)\]

  • if \(n_i = 0\) then the kernel located at \(\mathbf{x}_i\) drops out so we still need birth/death steps via RJ-MCMC for \(\{n_i, \tilde{\varphi}_i\}\)

  • for \(J < n\), take advantage of the Woodbury matrix identity for the matrix inversion in the likelihood (see the sketch after this list) \[(A + U C V)^{-1} = A^{-1} - A^{-1}U(C^{-1} + VA^{-1}U)^{-1} V A^{-1}\]

  • update \(\sigma^2, \boldsymbol{\lambda}\) via usual MCMC

  • for fixed \(J\) and \(\{n_i\}\), can update \(\{\tilde{\varphi}_i\}, \sigma^2, \boldsymbol{\lambda}\) via usual MCMC (fixed dimension)
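A sketch of the resulting log marginal likelihood computation via Woodbury and the matching matrix determinant lemma, with b the \(n \times k\) matrix of kernel evaluations at the active locations and d the vector of \(n_i^2/\tilde{\varphi}_i\) (names are illustrative):

# log marginal likelihood of Y ~ N(0, sigma2 I + b D b'), D = diag(d)
log.marg.lik = function(Y, b, d, sigma2) {
  n = length(Y); k = ncol(b)
  # Woodbury: Sigma^{-1} = I/sigma2 - b M^{-1} b'/sigma2^2, M = D^{-1} + b'b/sigma2
  M   = diag(1/d, k) + crossprod(b)/sigma2
  btY = crossprod(b, Y)                              # b' Y
  quad = sum(Y^2)/sigma2 - crossprod(btY, solve(M, btY))/sigma2^2
  # determinant lemma: log|sigma2 I + b D b'| = n log sigma2 + sum(log d) + log|M|
  logdet = n*log(sigma2) + sum(log(d)) + determinant(M)$modulus
  as.numeric(-0.5*(n*log(2*pi) + logdet + quad))
}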

Feature Selection in Kernel

  • Product structure allows interactions between variables
  • Many input variables may be irrelevant
  • Feature selection; if \(\lambda_d = 0\) variable \(\mathbf{x}_d\) is removed from all kernels
  • Allow a point mass at \(\lambda_d = 0\) with probability \(p_\lambda \sim \textsf{Beta}(a,b)\) (in practice \(a = b = 1\) has been used); a sketch of one draw follows this list
  • can also constrain all \(\lambda_d\) that are non-zero to be equal across dimensions
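A sketch of one draw of \(\boldsymbol{\lambda}\) under this spike-at-zero prior (the Gamma prior on the non-zero scales is a hypothetical placeholder):

# point mass at lambda_d = 0 with probability p.lambda, p.lambda ~ Beta(1, 1)
set.seed(1)
d = 10
p.lambda = rbeta(1, 1, 1)
zero = rbinom(d, 1, p.lambda) == 1
lambda = ifelse(zero, 0, rgamma(d, 2, 1))   # hypothetical Gamma prior on non-zero scales
lambda                                      # variables with lambda_d = 0 drop out of all kernels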

Binary Regression

  • add latent Gaussian variable as in Albert & Chib
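Concretely, the augmentation introduces \(Z_i = \mu(\mathbf{x}_i) + \varepsilon_i\) with \(\varepsilon_i \mathrel{\mathop{\sim}\limits^{\rm iid}}\textsf{N}(0, 1)\) and sets \(Y_i = \mathbf{1}(Z_i > 0)\), so \(\Pr(Y_i = 1) = \Phi(\mu(\mathbf{x}_i))\); the full conditionals for the \(Z_i\) are truncated normals, \[Z_i \mid Y_i = 1 \sim \textsf{N}(\mu(\mathbf{x}_i), 1)\,\mathbf{1}(Z_i > 0), \qquad Z_i \mid Y_i = 0 \sim \textsf{N}(\mu(\mathbf{x}_i), 1)\,\mathbf{1}(Z_i \le 0),\] and given \(\mathbf{Z}\) the updates for \(\mu\) proceed exactly as in the Gaussian case.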

bark package

library(bark)
set.seed(42)
n = 500
circle2 = data.frame(sim_circle(n, dim = 2))
plot(x.1 ~ x.2, data=circle2, col=y+1)

Circle Data Classification

BARK Classification

set.seed(42)
train = sample(1:n, size = floor(n/2), rep=FALSE)
circle2.bark = bark(y ~ . , data = circle2,
                    subset = train,
                    testdata = circle2[-train,],
                    classification = TRUE,
                    printevery = 10000,
                    selection = TRUE, 
                    common_lambdas = TRUE)
  • classification = TRUE for probit regression

  • selection = TRUE allows some of the \(\lambda_d\) to be 0

  • common_lambdas = TRUE sets all (non-zero) \(\lambda_d\) to a common \(\lambda\)

Misclassification

misscl = (circle2.bark$yhat.test.mean > 0) != circle2[-train, "y"]
plot(x.1 ~ x.2, data=circle2[-train,], pch=circle2[-train, "y"]+1, col=misscl + 1)
title(paste("Misclassification Rate", round(mean(misscl), 4)))

Support Vector Machines (SVM) & BART

library(e1071)
circle2.svm = svm(y ~ x.1 + x.2, data=circle2[train,],
                  type="C")
pred.svm = predict(circle2.svm, circle2[-train,])
mean(pred.svm != circle2[-train, "y"])
[1] 0.048
suppressMessages(library(BART))
circle.bart = pbart(x.train = circle2[train, 1:2],
                    y.train = circle2[train, "y"])
pred.bart = predict(circle.bart, circle2[-train, 1:2])
misscl.bart = mean((pred.bart$prob.test.mean > .5) != 
              circle2[-train, "y"])
misscl.bart
[1] 0.036

Comparisons

Data Set      n    p  BARK-D  BARK-SE  BARK-SD     SVM    BART
Circle 2    200    2   4.91%    1.88%    1.93%   5.03%   3.97%
Circle 5    200    5   4.70%    1.47%    1.65%  10.99%   6.51%
Circle 20   200   20   4.84%    2.09%    3.69%  44.10%  15.10%
Bank        200    6   1.25%    0.55%    0.88%   1.12%   0.50%
BC          569   30   4.02%    2.49%    6.09%   2.70%   3.36%
Ionosphere  351   33   8.59%    5.78%   10.87%   5.17%   7.34%
  • BARK-D: different \(\lambda_d\) for each dimension
  • BARK-SE: selection and equal \(\lambda_d\) for non-zero \(\lambda_d\)
  • BARK-SD: selection and different \(\lambda_d\) for non-zero \(\lambda_d\)

Needs & Limitations

  • NP Bayes of many flavors (BARK, BART, Treed GP, and more) often does better than frequentist methods
  • Hyper-parameter specification: theory & computational approximation
  • asymptotic theory (rates of convergence)
  • need faster code for BARK that is easier for users (BART & TGP are great!)
  • Can these models be added to JAGS, Stan, etc., instead of stand-alone R packages?
  • With code readily available, what are the caveats for users?

Summary

Lévy Random Field Priors & LARK/BARK models:

  • Provide the limit of finite dimensional priors (GRP & SSVS) in the infinite dimensional setting
  • Adaptive bandwidth for kernel regression
  • Allow flexible generating functions
  • Provide sparser representations than SVM & RVM, with a coherent Bayesian interpretation
  • Incorporation of prior knowledge if available
  • Relax assumptions of equally spaced data and Gaussian likelihood
  • Hierarchical Extensions
  • Formulation allows one to define stochastic processes on arbitrary spaces (spheres, manifolds)