Lecture 22: Nonparametric Regression

STA702

Merlise Clyde

Duke University

Semi-parametric Regression

  • Consider the model \[Y_i \mathrel{\mathop{\sim}\limits^{\rm ind}}\textsf{N}\left(\mu(\mathbf{x}_i, \boldsymbol{\theta}), \sigma^2 \right), \qquad i = 1, \ldots, n\]

  • Mean function \(\textsf{E}[Y_i \mid \boldsymbol{\theta}] = \mu(\mathbf{x}_i, \boldsymbol{\theta})\) falls in some class of nonlinear functions

  • Basis Function Expansion \[\mu(\mathbf{x}, \boldsymbol{\theta}) = \sum_{j = 1}^{J} \beta_j b_j(\mathbf{x})\]

  • \(b_j(\mathbf{x})\) is a pre-specified set of basis functions and \(\boldsymbol{\beta}= (\beta_1, \ldots, \beta_J)^T\) is a vector of coefficients or coordinates with respect to the basis
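
Conditional on a fixed set of basis functions, this is ordinary Bayesian linear regression in the features \(b_j(\mathbf{x})\). A minimal Python sketch (all names, data, and hyperparameter values are hypothetical):

```python
import numpy as np

def basis_matrix(x, basis_fns):
    """Evaluate each basis function b_j at every input, giving the n x J design matrix."""
    return np.column_stack([b(x) for b in basis_fns])

# hypothetical basis: intercept, linear, quadratic
basis_fns = [lambda x: np.ones_like(x), lambda x: x, lambda x: x**2]

rng = np.random.default_rng(42)
x = rng.uniform(-2, 2, size=100)
y = np.sin(x) + rng.normal(scale=0.1, size=x.size)

B = basis_matrix(x, basis_fns)

# conditional on the basis, the conjugate posterior mean of beta under a
# N(0, tau^2 I) prior (sigma^2 and tau^2 fixed here purely for illustration)
sigma2, tau2 = 0.1**2, 10.0
beta_hat = np.linalg.solve(B.T @ B + (sigma2 / tau2) * np.eye(B.shape[1]), B.T @ y)
mu_hat = B @ beta_hat  # fitted mean function at the observed inputs
```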

Examples

  • Taylor Series expansion of \(\mu(\mathbf{x})\) about point \(\chi\) \[\begin{aligned} \mu(x) & = \sum_k \frac{\mu^{(k)}(\chi)}{k!} (x - \chi)^k \\ & = \sum_k \beta_k (x - \chi)^k \end{aligned} \]

  • polynomial basis

  • can require a large number of terms to model a function globally

  • can have really poor behavior in regions without data

  • each basis function has a “global” impact
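
A quick numerical illustration of the last two points (hypothetical data): a high-degree polynomial fit tracks the data where there is data, but can behave wildly just outside it, because every polynomial term contributes everywhere.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=30)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.1, size=x.size)

# a degree-9 polynomial tracks the data inside [-1, 1] ...
coef = np.polynomial.polynomial.polyfit(x, y, deg=9)
print(np.polynomial.polynomial.polyval(0.9, coef))   # close to sin(2*pi*0.9)

# ... but typically takes wildly large values just outside the data range
print(np.polynomial.polynomial.polyval(1.5, coef))
```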

Other Basis Functions

  • cubic splines \[ b_j(x, \chi_j) = (x - \chi_j)^3_+\]

  • Gaussian Radial Basis \[ b_j(x, \chi_j) = \exp{\left(-\frac{(x - \chi_j)^2}{l^2}\right)}\]

  • centers of basis functions \(\chi_j\)

  • width parameter \(l\) controls the scale at which the basis function dies out as \(x\) moves away from the center \(\chi_j\)

  • localized basis elements
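
Both families are simple to evaluate in code; a minimal sketch with hypothetical centers and width:

```python
import numpy as np

def cubic_spline_basis(x, knot):
    """Truncated power basis element (x - knot)_+^3."""
    return np.maximum(x - knot, 0.0) ** 3

def gaussian_rbf(x, center, width):
    """Gaussian radial basis exp(-(x - center)^2 / width^2), localized near the center."""
    return np.exp(-((x - center) ** 2) / width**2)

# hypothetical centers chi_j on a grid; each column of B is one basis element
x = np.linspace(0, 1, 200)
centers = np.linspace(0.1, 0.9, 9)
B_spline = np.column_stack([cubic_spline_basis(x, c) for c in centers])
B_rbf = np.column_stack([gaussian_rbf(x, c, width=0.1) for c in centers])
```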

Local Models

  • Multivariate Gaussian Kernel \(g\) with parameters \(\boldsymbol{\omega}= (\boldsymbol{\chi}, \boldsymbol{\Lambda})\) \[ b_j(\mathbf{x}, \boldsymbol{\omega}_j) = g( \boldsymbol{\Lambda}_j^{1/2} (\mathbf{x}- \boldsymbol{\chi}_j)) = \exp\left\{-\frac{1}{2}(\mathbf{x}- \boldsymbol{\chi}_j)^T \boldsymbol{\Lambda}_j (\mathbf{x}- \boldsymbol{\chi}_j)\right\} \]

  • Gaussian, Cauchy, Exponential, Double Exponential kernels (can be asymmetric)

  • translations and scalings of wavelet families

  • basis functions formed from a generator function \(g\) with location and scaling parameters
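
The generator-function view translates directly into code; a minimal sketch for the multivariate Gaussian kernel (all names hypothetical):

```python
import numpy as np

def g(z):
    """Multivariate Gaussian generator g(z) = exp(-z'z / 2)."""
    return np.exp(-0.5 * np.sum(z * z, axis=-1))

def kernel_basis(x, chi, Lam_half):
    """b_j(x, omega_j) = g(Lambda_j^{1/2} (x - chi_j)): a translated, rescaled copy of g."""
    return g((x - chi) @ Lam_half.T)

# hypothetical example in R^2: one basis element with anisotropic scaling
rng = np.random.default_rng(1)
x = rng.normal(size=(5, 2))        # five input points
chi = np.zeros(2)                  # location parameter chi_j
Lam_half = np.diag([2.0, 0.5])     # Lambda_j^{1/2}
print(kernel_basis(x, chi, Lam_half))
```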

Bayesian Nonparametric Model

Mean function \[\mu(\mathbf{x}_i) = \sum_{j=1}^{J} b_j(\mathbf{x}_i, \boldsymbol{\omega}_j) \beta_j\]

  • conditional on the basis elements, we are back to our Bayesian regression model

  • there is usually uncertainty about the number of basis elements needed

  • could use BMA or other shrinkage priors

  • how should coefficients scale as \(J\) increases?

  • choice of \(J\)?

  • what about uncertainty in \(\boldsymbol{\omega}\) (locations and scales)?

  • priors on the unknowns (\(J\), \(\{\beta_j\}\), \(\{\boldsymbol{\omega}_j\}\)) induce a prior on functions!

Stochastic Expansions

\[\mu(\mathbf{x}) = \sum_{j=1}^{J} b_j(\mathbf{x}, \boldsymbol{\omega}_j) \beta_j = \sum_{j=1}^{J} g(\boldsymbol{\Lambda}_j^{1/2}(\mathbf{x}- \boldsymbol{\chi}_j)) \beta_j \]

  • introduce a Lévy measure \(\nu( d\beta, d \boldsymbol{\omega})\)

  • Poisson distribution for the number of terms, \(J \sim \textsf{Poi}(\nu_+)\), where \(\nu_+\equiv \nu(\mathbb{R}\times\boldsymbol{\Omega}) = \iint \nu(\beta, \boldsymbol{\omega})\, d \beta \, d\boldsymbol{\omega}\) (well-defined when \(\nu_+ < \infty\))

  • conditional prior on \(\beta_j,\boldsymbol{\omega}_j \mid J \mathrel{\mathop{\sim}\limits^{\rm iid}}\pi(\beta, \boldsymbol{\omega}) \propto \nu(\beta,\boldsymbol{\omega})\)

  • Conditions on \(\nu\) (and \(g\))

    • need the coefficients to be absolutely summable, \(\sum_j |\beta_j| < \infty\)
    • finite number of large coefficients (in absolute value)
    • allows an infinite number of small \(\beta_j \in [-\epsilon, \epsilon]\)
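
When \(\nu_+ < \infty\) this is a compound Poisson process, and a draw from the induced prior on functions is easy to simulate. A toy sketch assuming a hypothetical finite Lévy measure \(\nu(d\beta, d\chi) = \nu_+ \, \textsf{N}(\beta; 0, 1)\, \textsf{Unif}(\chi; 0, 1)\, d\beta \, d\chi\) and a Gaussian generator (truncation, covered under Prior Approximation I, handles the infinite case):

```python
import numpy as np

rng = np.random.default_rng(7)
nu_plus = 10.0   # total mass of the (finite) toy Levy measure

def draw_mean_function(xgrid, width=0.05):
    """One draw of mu from the induced prior on functions."""
    J = rng.poisson(nu_plus)          # random number of basis elements
    beta = rng.normal(size=J)         # iid jumps from nu / nu_plus
    chi = rng.uniform(size=J)         # iid locations
    mu = np.zeros_like(xgrid)
    for b, c in zip(beta, chi):       # mu(x) = sum_j beta_j g((x - chi_j) / width)
        mu += b * np.exp(-0.5 * ((xgrid - c) / width) ** 2)
    return mu

xgrid = np.linspace(0, 1, 200)
draws = [draw_mean_function(xgrid) for _ in range(3)]   # three prior realizations
```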

Gamma Process Example

The Lévy measure \(\nu(d\beta, d\chi) = \beta^{-1} e^{- \beta \eta} \gamma(\chi)\, d \beta \, d \chi\) gives \(\mu\) as a kernel convolution of a gamma process.

Stochastic Integral Representation

\[\mu(\mathbf{x}) = \sum_{j=1}^{J} b_j(\mathbf{x}, \boldsymbol{\omega}_j) \beta_j = \sum_{j=1}^{J} g(\boldsymbol{\Lambda}_j^{1/2}(\mathbf{x}- \boldsymbol{\chi}_j)) \beta_j = \int_{\boldsymbol{\Omega}} b(\mathbf{x}, \boldsymbol{\omega}) {\cal L}(d\boldsymbol{\omega})\]

  • \({\cal L}\) is a random signed measure (generalization of Completely Random Measures) \[ {\cal L}\sim \textsf{Lévy}(\nu) \qquad \qquad {\cal L}(d \boldsymbol{\omega}) = \sum_{j \le J}\beta_j \delta_{\boldsymbol{\omega}_j} (d \boldsymbol{\omega})\]

  • Lévy-Khinchine Poisson Representation of \({\cal L}\)

  • Poisson number of support points (possibly infinite!)

  • random support points of discrete measure \(\{ \boldsymbol{\omega}_j\}\)

  • random “jumps” \(\beta_j\)

  • Convenient to think of a random measure as a stochastic process where \({\cal L}\) assigns random variables to sets \(A \subseteq \boldsymbol{\Omega}\)
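
Concretely, \({\cal L}(A) = \sum_{j:\, \boldsymbol{\omega}_j \in A} \beta_j\), and these values add across disjoint sets. A tiny sketch for a hypothetical one-dimensional \(\boldsymbol{\Omega}= [0,1)\):

```python
import numpy as np

rng = np.random.default_rng(5)

# a realization of the discrete random measure: jumps beta_j at support points omega_j
J = rng.poisson(10.0)
beta = rng.normal(size=J)
omega = rng.uniform(size=J)

def L(lo, hi):
    """L(A) for A = [lo, hi): the sum of jumps whose support points fall in A."""
    return beta[(omega >= lo) & (omega < hi)].sum()

# additivity over disjoint sets: L([0, .5)) + L([.5, 1)) == L([0, 1))
print(L(0.0, 0.5) + L(0.5, 1.0), L(0.0, 1.0))
```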

Examples

  • gamma process \[\begin{aligned} \nu(\beta, \boldsymbol{\omega}) & = \beta^{-1} e^{- \beta \eta} \pi(\boldsymbol{\omega}) d \beta \, d \boldsymbol{\omega}\\ {\cal L}(A) & \sim \textsf{Gamma}(\pi(A), \eta) \end{aligned}\]

  • non-negative coefficients combined with non-negative basis functions give priors on non-negative functions without transformations

  • \(\alpha\)-Stable process (Cauchy process is \(\alpha = 1\)) \[\nu(\beta, \boldsymbol{\omega}) = c_\alpha |\beta|^{-(\alpha +1)}\ \pi(\boldsymbol{\omega}) \qquad 0 < \alpha < 2 \]

  • \(\nu_+ = \nu(\mathbb{R}\times \boldsymbol{\Omega}) = \infty\) for both the Gamma and \(\alpha\)-Stable processes

  • Fine in theory, but problematic for MCMC!
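
To see why \(\nu_+\) is infinite, integrate the gamma-process measure; assuming \(\pi(\boldsymbol{\omega})\) integrates to one,

\[\nu_+ = \iint \beta^{-1} e^{-\beta \eta}\, \pi(\boldsymbol{\omega})\, d\beta \, d\boldsymbol{\omega}= \int_0^\infty \beta^{-1} e^{-\beta \eta} \, d\beta = \infty\]

since \(\beta^{-1} e^{-\beta \eta} \approx \beta^{-1}\) near the origin: the prior puts infinitely many arbitrarily small jumps, but only finitely many with \(|\beta| > \epsilon\) for any \(\epsilon > 0\). The \(\alpha\)-Stable measure diverges the same way because \(|\beta|^{-(\alpha+1)}\) is not integrable at the origin.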

Prior Approximation I

Truncate measure \(\nu\) to obtain a finite expansion:

  • Finite number of support points \(\boldsymbol{\omega}\) with \(\beta\) in \([-\epsilon, \epsilon]^c\)
  • Fix \(\epsilon\) (for given prior approximation error)
  • Use approximate Lévy measure \(\nu_{\epsilon}(\beta, \boldsymbol{\omega}) \equiv \nu(\beta, \boldsymbol{\omega})\mathbf{1}(|\beta| > \epsilon)\)
  • \(\Rightarrow\) \(J \sim \textsf{Poi}(\nu_{\epsilon}^+)\) where \(\nu^+_{\epsilon} = \nu([-\epsilon, \epsilon]^c \times \boldsymbol{\Omega})\)
  • \(\Rightarrow\) \(\beta_j, \boldsymbol{\omega}_j \mathrel{\mathop{\sim}\limits^{\rm iid}}\pi(d\beta, d\boldsymbol{\omega}) \equiv \nu_\epsilon(d\beta , d \boldsymbol{\omega})/\nu^+_{\epsilon}\)
  • for the \(\alpha\)-Stable process, the truncation leads to a double Pareto distribution for \(\beta\) \[\pi(\beta_j) = \frac{\alpha}{2} \left(\frac{\epsilon}{\eta}\right)^{\alpha} |\beta_j|^{- \alpha - 1}\, \mathbf{1}\left(|\beta_j| > \frac{\epsilon}{\eta}\right)\]
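
A sketch of one draw from the truncated \(\alpha\)-Stable prior (hypothetical settings; \(c_\alpha\) set to 1 and an \(\eta\)-scaled parametrization of the stable Lévy measure assumed purely for illustration). The double Pareto jumps come from a random sign times a Pareto magnitude via inverse-CDF sampling:

```python
import numpy as np

rng = np.random.default_rng(3)

def rdpareto(size, alpha, t):
    """Double Pareto draws: random sign times a Pareto(alpha) magnitude above threshold t."""
    sign = rng.choice([-1.0, 1.0], size=size)
    return sign * t * rng.uniform(size=size) ** (-1.0 / alpha)

alpha, eta, eps = 1.0, 1.0, 0.01   # alpha = 1: truncated Cauchy
c_alpha = 1.0                      # illustrative constant
t = eps / eta                      # truncation threshold for |beta|

# nu_eps^+ : total mass of the truncated Levy measure -- finite!
nu_eps = 2.0 * c_alpha * eta**alpha / (alpha * t**alpha)

J = rng.poisson(nu_eps)            # Poisson, finite number of support points
beta = rdpareto(J, alpha, t)       # jumps with |beta_j| > eps / eta
chi = rng.uniform(size=J)          # locations, uniform over a hypothetical Omega = [0, 1]
```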

Truncated Cauchy Process Prior

[Figure: draws from the truncated Cauchy process prior]

Simulation

[Figure: kernels used in the simulation]

Comparison of Lévy Adaptive Regression Kernels

[Figure: comparison of fits across kernels]

Inference via Reversible Jump MCMC

Trans-dimensional MCMC:

  • number of support points \(J\) varies from iteration to iteration
    • add a new point (birth)
    • delete an existing point (death)
    • combine two points (merge)
    • split a point into two
  • update existing point(s)
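
A heavily simplified birth/death sketch (all names hypothetical), reusing the toy compound-Poisson prior from the earlier sketch (\(\beta_j \sim \textsf{N}(0,1)\), \(\chi_j \sim \textsf{Unif}(0,1)\), fixed kernel width and \(\sigma^2\)) and omitting the split/merge and update moves. When births are proposed from the prior and birth/death are equally probable, the Metropolis-Hastings ratio reduces to the likelihood ratio times \(\nu_+/(J+1)\) for a birth and \(J/\nu_+\) for a death:

```python
import numpy as np

rng = np.random.default_rng(11)

def mu(x, beta, chi, width=0.05):
    """Kernel expansion mu(x) = sum_j beta_j g((x - chi_j) / width)."""
    out = np.zeros_like(x)
    for b, c in zip(beta, chi):
        out += b * np.exp(-0.5 * ((x - c) / width) ** 2)
    return out

def loglik(y, x, beta, chi, sigma2):
    """Gaussian log-likelihood up to a constant."""
    resid = y - mu(x, beta, chi)
    return -0.5 * resid @ resid / sigma2

def rj_step(y, x, beta, chi, nu_plus, sigma2):
    """One birth/death move of the trans-dimensional sampler."""
    J = len(beta)
    ll = loglik(y, x, beta, chi, sigma2)
    if rng.uniform() < 0.5:
        # birth: draw (beta*, chi*) from the prior; accept w.p. min(1, LR * nu_plus / (J + 1))
        b_new, c_new = rng.normal(), rng.uniform()
        ll_new = loglik(y, x, beta + [b_new], chi + [c_new], sigma2)
        if np.log(rng.uniform()) < ll_new - ll + np.log(nu_plus / (J + 1)):
            beta, chi = beta + [b_new], chi + [c_new]
    elif J > 0:
        # death: delete a uniformly chosen point; accept w.p. min(1, LR * J / nu_plus)
        i = rng.integers(J)
        beta_prop, chi_prop = beta[:i] + beta[i + 1:], chi[:i] + chi[i + 1:]
        ll_new = loglik(y, x, beta_prop, chi_prop, sigma2)
        if np.log(rng.uniform()) < ll_new - ll + np.log(J / nu_plus):
            beta, chi = beta_prop, chi_prop
    return beta, chi

# usage: run a short chain on synthetic data; J varies from iteration to iteration
x = np.linspace(0, 1, 50)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.1, size=x.size)
beta, chi = [], []
for _ in range(5000):
    beta, chi = rj_step(y, x, beta, chi, nu_plus=10.0, sigma2=0.01)
```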

Motorcycle Acceleration

[Figure: motorcycle acceleration data]

Mass Spectroscopy

[Figure: mass spectroscopy data]

Summary

  • more parsimonious than “shrinkage” priors or SVM

  • allows the number of support points to increase as \(n\) increases

  • control MSE a priori through choice of \(\epsilon\)

  • handles non-normal data, non-negative functions, and even discontinuous functions

  • credible and prediction intervals

  • robust alternative to Gaussian Process Priors

  • random scales and locations are hard to scale up as the dimension of \(\mathbf{x}\) increases

  • next: Prior Approximation II