STA702
Duke University
Consider model \[Y_1, \dots, Y_n \sim \textsf{N}\left(\mu(\mathbf{x}_i, \boldsymbol{\theta}), \sigma \right)\]
Mean function \(\textsf{E}[Y_i \mid \boldsymbol{\theta}] = \mu(\mathbf{x}_i, \boldsymbol{\theta})\) falls in some class of nonlinear functions
Basis Function Expansion \[\mu(\mathbf{x}, \boldsymbol{\theta}) = \sum_{j = 1}^{J} \beta_j b_j(\mathbf{x})\]
\(b_j(\mathbf{x})\) is a pre-specified set of basis functions and \(\boldsymbol{\beta}= (\beta_1, \ldots, \beta_J)^T\) is a vector of coefficients or coordinates wrt to the basis
Taylor Series expansion of \(\mu(\mathbf{x})\) about point \(\chi\) \[\begin{aligned} \mu(x) & = \sum_k \frac{\mu^{(k)}(\chi)}{k!} (x - \chi)^k \\ & = \sum_k \beta_k (x - \chi)^k \end{aligned} \]
polynomial basis
can require a large number of terms to model globally
can have really poor behavior in regions without data
each basis function has a “global” impact
cubic splines \[ b_j(x, \chi_j) = (x - \chi_j)^3_+\]
Gaussian Radial Basis \[ b_j(x, \chi_j) = \exp{\left(\frac{(x - \chi_j)^2}{l^2}\right)}\]
centers of basis functions \(\chi_j\)
width parameter \(l\) controls the scale at which the mean function dies out as a function of \(\mathbf{x}\) from the center
localized basis elements
Multivariate Gaussian Kernel \(g\) with parameters \(\boldsymbol{\omega}= (\boldsymbol{\chi}, \boldsymbol{\Lambda})\) \[ b_j(\mathbf{x}, \boldsymbol{\omega}_j) = g( \boldsymbol{\Lambda}_j^{1/2} (\mathbf{x}- \boldsymbol{\chi}_j)) = \exp\left\{-\frac{1}{2}(\mathbf{x}- \boldsymbol{\chi}_j)^T \boldsymbol{\Lambda}_j (\mathbf{x}- \boldsymbol{\chi}_j)\right\} \]
Gaussian, Cauchy, Exponential, Double Exponential kernels (can be asymmetric)
translation and scaling of wavelet families
basis functions formed from a generator function \(g\) with location and scaling parameters
Mean function \[\mu(\mathbf{x}_i) = \sum_{j}^J b_j(\mathbf{x}_i, \boldsymbol{\omega}_j) \beta_j\]
conditional on the basis elements back to our Bayesian regression model
usually uncertainty about number of basis elements needed
could use BMA or other shrinkage priors
how should coefficients scale as \(J\) increases?
choice of \(J\)?
what about uncertainty in \(\boldsymbol{\omega}\) (locations and scales)?
priors on unknowns (\(J\), \(\{\beta_j\}\), \(\{\boldsymbol{\omega}_j\}\)) induces a prior on functions!
\[\mu(\mathbf{x}) = \sum_{j=0}^{J} b_j(\mathbf{x}, \boldsymbol{\omega}_j) \beta_j = \sum_{j=0}^{J} g(\boldsymbol{\Lambda}^{1/2}(\mathbf{x}- \boldsymbol{\omega}_j)) \beta_j \]
introduce a Lévy measure \(\nu( d\beta, d \boldsymbol{\omega})\)
Poisson distribution \(J \sim \textsf{Poi}(\nu_+)\) where \(\nu_+\equiv \nu(\mathbb{R}\times\boldsymbol{\Omega}) = \iint \nu(\beta, \boldsymbol{\omega}) d \beta \, d\boldsymbol{\omega}\)
conditional prior on \(\beta_j,\boldsymbol{\omega}_j \mid J \mathrel{\mathop{\sim}\limits^{\rm iid}}\pi(\beta, \boldsymbol{\omega}) \propto \nu(\beta,\boldsymbol{\omega})\)
Conditions on \(\nu\) (and \(g\))
See Wolpert, Clyde and Tu (2011) AoS
\(\nu(\beta, \chi) = \beta^{-1} e^{- \beta \eta} \gamma(\chi) d \beta \, d \chi\)
\[\mu(\mathbf{x}) = \sum_{j=0}^{J} b_j(\mathbf{x}, \boldsymbol{\omega}_j) \beta_j = \sum_{j=0}^{J} g(\boldsymbol{\Lambda}^{1/2}(\mathbf{x}- \boldsymbol{\omega}_j)) \beta_j = \int_{\boldsymbol{\Omega}} b(\mathbf{x}, \boldsymbol{\omega}) {\cal L}(d\boldsymbol{\omega})\]
\({\cal L}\) is a random signed measure (generalization of Completely Random Measures) \[ {\cal L}\sim \textsf{Lévy}(\nu) \qquad \qquad {\cal L}(d \boldsymbol{\omega}) = \sum_{j \le J}\beta_j \delta_{\boldsymbol{\omega}_j} (d \boldsymbol{\omega})\]
Lévy-Khinchine Poisson Representation of \({\cal L}\)
Poisson number of support points (possibly infinite!)
random support points of discrete measure \(\{ \boldsymbol{\omega}_j\}\)
random “jumps” \(\beta_j\)
Convenient to think of a random measure as stochastic process where \({\cal L}\) assigns random variables to sets \(A \in \boldsymbol{\Omega}\)
gamma process \[\begin{aligned} \nu(\beta, \boldsymbol{\omega}) & = \beta^{-1} e^{- \beta \eta} \pi(\boldsymbol{\omega}) d \beta \, d \boldsymbol{\omega}\\ {\cal L}(A) & \sim \textsf{Gamma}(\pi(A), \eta) \end{aligned}\]
non-negative coefficients plus non-negative basis functions allows priors on non-negative functions without transformations
\(\alpha\)-Stable process (Cauchy process is \(\alpha = 1\)) \[\nu(\beta, \boldsymbol{\omega}) = c_\alpha |\beta|^{-(\alpha +1)}\ \pi(\boldsymbol{\omega}) \qquad 0 < \alpha < 2 \]
\(\nu^+(\mathbb{R}, \boldsymbol{\Omega}) = \infty\) for both the Gamma and \(\alpha\)-Stable processes
Fine in theory, but problematic for MCMC!
Truncate measure \(\nu\) to obtain a finite expansion:
trans-dimensional MCMC
more parsimonious than “shrinkage” priors or SVM
allows for increasing number of support points as \(n\) increases
control MSE a priori through choice of \(\epsilon\)
no problem with non-normal data, non-negative functions or even discontinuous functions
credible and prediction intervals
robust alternative to Gaussian Process Priors
hard to scale up random scales, locations as dimension of \(\mathbf{x}\) increases
next - Prior Approximation II