STA702 Lecture 5
Duke University
Suppose we have normal data with \[Y_i \mid \mu_i, \sigma^2 \overset{ind}{\sim} \textsf{N}(\mu_i, \sigma^2)\]
separate mean for each observation!
Question: How can we possibly hope to estimate all these \(\mu_i\)? One \(y_i\) per \(\mu_i\) and \(n\) observations!
Naive estimator: just consider only using \(y_i\) in estimating and not the other observations.
MLE \(\hat{\mu}_i = y_i\)
Hierarchical Viewpoint: Let’s borrow information from other observations!
Example \(y_i\) is difference in gene expression for the \(i^{\text{th}}\) gene between cancer and control lines
may be natural to think that the \(\mu_i\) arise from some common distribution, \(\mu_i \overset{iid}{\sim} g\)
unbiased but high variance estimators of \(\mu_i\) based on one observation!
little variation in \(\mu_i\)s so a better estimate might be \(\bar{y}\)
Not forced to choose either - what about some weighted average between \(y_i\) and \(\bar{y}\)?
Data Model \[Y_i \mid \mu_i, \sigma^2 \overset{ind}{\sim} \textsf{N}(\mu_i, \sigma^2)\]
Means Model \[\mu_i \mid \mu, \sigma^2_\mu \overset{iid}{\sim} \textsf{N}(\mu, \sigma^2_{\mu})\]
not necessarily a prior!
Now estimate \(\mu_i\) (let \(\phi = 1/\sigma^2\) and \(\phi_{\mu} = 1/\sigma^2_\mu\))
Calculate the “posterior” \(\mu_i \mid y_i, \mu, \phi, \phi_\mu\)
Posterior: \(\mu_i \mid y_i, \mu, \phi, \phi_\mu \overset{ind}{\sim} \textsf{N}(\tilde{\mu}_i, 1/\tilde{\phi}_\mu)\)
estimator of \(\mu_i\) weighted average of data and population parameter \(\mu\) \[\tilde{\mu}_i = \frac{\phi_\mu \mu + \phi y_i} {\phi_\mu + \phi} \qquad \qquad \tilde{\phi}_\mu = \phi + \phi_\mu\]
if \(\phi_\mu\) is large relative to \(\phi\) all of the \(\mu_i\) are close together and benefit by borrowing information
in limit as \(\sigma^2_\mu \to 0\) or \(\phi_\mu \to \infty\) we have \(\tilde{\mu}_i = \mu\) (all means are the same)
if \(\phi_\mu\) is small relative to \(\phi\) little borrowing of information
in the limit as \(\phi_\mu \to 0\) we have \(\tilde{\mu}_i = y_i\)
Note: you often benefit from a hierarchical model, even if its not obvious that the \(\mu_i\) are related!
The MLE for the \(\mu_i\) is just the sample \(y_i\).
\(y_i\) is unbiased for \(\mu_i\) but can have high variability!
the posterior mean is actually biased.
Usually through the weighting of the sample data and prior, Bayes procedures have the tendency to pull the estimate of \(\mu_i\) toward the prior or provide shrinkage to the mean.
Question
Why would we ever want to do this? Why not just stick with the MLE?
The fact that a biased estimator would do a better job in many estimation/prediction problems can be proven rigorously, and is referred to as Stein’s paradox.
Stein’s result implies, in particular, that the sample mean is an inadmissible estimator of the mean of a multivariate normal distribution in more than two dimensions i.e. there are other estimators that will come closer to the true value in expectation.
In fact, these are Bayes point estimators (the posterior expectation of the parameter \(\mu_i\)).
Most of what we do now in high-dimensional statistics is develop biased estimators that perform better than unbiased ones.
Examples: lasso regression, ridge regression, various kinds of hierarchical Bayesian models, etc.
we don’t know \(\mu\) (or \(\sigma^2\) and \(\sigma^2_\mu\) for that matter)
Find marginal likelihood \(\cal{L}(\mu, \sigma^2, \sigma^2_\mu)\) by integrating out \(\mu_i\) with respect to \(g\) \[\cal{L}(\mu, \sigma^2, \sigma^2_\mu) \propto \prod_{i = 1}^n \int \textsf{N}(y_i; \mu_i, \sigma^2) \textsf{N}(\mu_i; \mu, \sigma^2_\mu) \, d \mu_i\]
Product of predictive distributions for \(Y_i \mid \mu, \sigma^2, \sigma^2_\mu \overset{iid}{\sim} \textsf{N}(\mu, \sigma^2 + \sigma^2_\mu)\)
\[\cal{L}(\mu, \sigma^2, \sigma^2_\mu) \propto \prod_{i = 1}^n (\sigma^2 + \sigma^2_\mu)^{-1/2} \exp \left\{ - \frac{1}{2} \frac{\left(y_i - \mu \right)^2}{\sigma^2 + \sigma^2_\mu }\right\}\]
Find MLE’s
\[\cal{L}(\mu, \sigma^2, \sigma^2_\mu) \propto (\sigma^2 + \sigma^2_\mu)^{-n/2} \exp\left\{ - \frac{1}{2} \sum_{i=1}^n\frac{\left(y_i - \mu \right)^2}{\sigma^2 + \sigma^2_\mu }\right\}\]
MLE of \(\mu\): \(\hat{\mu} = \bar{y}\)
Can we say anything about \(\sigma^2_\mu\)? or \(\sigma^2\) individually?
MLE of \(\sigma^2 + \sigma^2_\mu\) is \[\widehat{\sigma^2 + \sigma^2_\mu} = \frac{\sum(y_i - \bar{y})^2}{n}\]
Assume \(\sigma^2\) is known (say 1) \[\hat{\sigma}^2_\mu = \frac{\sum(y_i - \bar{y})^2}{n} - 1\]
plug in estimates of hyperparameters into the prior and pretend they are known
resulting estimates are known as Empirical Bayes
Estimates of variances may be negative - constrain to 0 on the boundary
underestimates uncertainty
Fully Bayes would put a prior on the unknowns
We know the conditional posterior distribution of \(\mu_i\) given the other parameters, lets work with the marginal likelihood \(\cal{L}(\theta)\)
need a prior \(\pi(\theta)\) for unknown parameters are \(\theta = (\mu, \sigma^2, \sigma^2_\mu)\) (details later)
Posterior \[\pi(\theta \mid y) = \frac{\pi(\theta) \cal{L}(\theta)} {\int_\Theta \pi(\theta) \cal{L}(\theta) \, d\theta} = \frac{\pi(\theta) \cal{L}(\theta)} {m(y)}\]
Problems: Except for simple cases (conjugate models) \(m(y)\) is not available analytically
Appeal to BvM (Bayesian Central Limit Theorem) and approximate \(\pi(\theta \mid y)\) with a Gaussian distribution centered at the posterior mode \(\hat{\theta}\) and asymptotic covariance matrix \[V_\theta = \left[- \frac{\partial^2}{\partial \theta \partial \theta^T} \left\{\log(\pi(\theta)) + \log(\cal{L}(\theta)) \right\} \right]^{-1}\]
related to Laplace approximation to integral (also large sample)
Use normal approximation to find \(\textsf{E}[h(\theta) \mid y]\)
Integral may not exist in closed form (non-linear functions)
use numerical quadrature (doesn’t scale up)
Stochastic methods of integration
Stochastic integration \[\textsf{E}[h(\theta) \mid y] = \int_\Theta h(\theta) \pi(\theta \mid y) \, d\theta \approx \frac{1}{T}\sum_{t=1}^{T} h(\theta^{(t)}) \qquad \theta^{(t)} \sim \pi(\theta \mid y)\]
what if we can’t sample from the \(\pi(\theta \mid y)\) but can sample from some distribution \(q()\) \[\textsf{E}[h(\theta) \mid y] = \int_\Theta h(\theta) \frac{\pi(\theta \mid y)}{q(\theta)} q(\theta)\, d\theta \approx \frac{1}{T}\sum_{t=1}^{T} h(\theta^{(t)}) \frac{\pi(\theta^{(t)} \mid y)} {q(\theta^{(t)})} \qquad\] where \(\theta^{(t)} \sim q(\theta)\)
Without the \(m(y)\) in \(\pi(\theta \mid y)\) we just have \(\pi(\theta \mid y) \propto \pi(\theta) \cal{L}(\theta)\)
use twice for numerator and denominator
Estimate of \(m(y)\) \[m(y) \approx \frac{1}{T} \sum_{t=1}^{T} \frac{\pi(\theta^{(t)}) \cal{L}(\theta^{(t)})}{q(\theta^{(t)})} \qquad \theta^{(t)} \sim q(\theta)\]
Ratio estimator of \(\textsf{E}[h(\theta) \mid y]\) \[\textsf{E}[h(\theta) \mid y] \approx \frac{\sum_{t=1}^{T} h(\theta^{(t)}) \frac{\pi(\theta^{(t)}) \cal{L}(\theta^{(t)})}{q(\theta^{(t)})}} { \sum_{t=1}^{T} \frac{\pi(\theta^{(t)}) \cal{L}(\theta^{(t)})}{q(\theta^{(t)})}} \qquad \theta^{(t)} \sim q(\theta)\]
Weighted average with importance weights \(w(\theta^{(t)}) \propto \frac{\pi(\theta^{(t)}) \cal{L}(\theta^{(t)})}{q(\theta^{(t)})}\) \[\textsf{E}[h(\theta) \mid y] \approx \sum_{t=1}^{T} h(\theta^{(t)}) w(\theta^{(t)})/\sum_{t=1}^T w(\theta^{(t)}) \qquad \theta^{(t)} \sim q(\theta)\]
if \(q()\) puts too little mass in regions with high posterior density, we can have some extreme weights
optimal case is that \(q()\) is as close as possible to the posterior so that all weights are constant
Estimate may have large variance
Problems with finding a good \(q()\) in high dimensions \((d > 20)\) or with skewed distributions
Question
How do we draw samples only using evaluations of the prior and likelihood in higher dimensional settings?
construct a Markov chain \(\theta^{(t)}\) in such a way the the stationary distribution of the Markov chain is the posterior distribution \(\pi(\theta \mid y)\)! \[\theta^{(0)} \overset{k}{\longrightarrow} \theta^{(1)} \overset{k}{\longrightarrow} \theta^{(2)} \cdots\]
\(k_t(\theta^{(t-1)} ; \theta^{(t)})\) transition kernel
initial state \(\theta^{(0)}\)
choose some nice \(k_t\) such that \(\theta^{(t)} \to \pi(\theta \mid y)\) as \(t \to \infty\)
biased samples initially but get closer to the target
Metropolis Algorithm (1950’s)
From a sampling perspective, we need to have a large sample or group of values, \(\theta^{(1)}, \ldots, \theta^{(S)}\) from \(\pi(\theta \mid y)\) whose empirical distribution approximates \(\pi(\theta \mid y)\).
for any two sets \(A\) and \(B\), we want \[\frac{\dfrac{\# \theta^{(s)} \in A}{S}}{\dfrac{\# \theta^{(s)} \in B}{S} } = \dfrac{\# \theta^{(s)} \in A}{\# \theta^{(s)} \in B} \approx \dfrac{\pi(\theta \in A \mid y)}{\pi(\theta \in B \mid y)}\]
Suppose we have a working group \(\theta^{(1)}, \ldots, \theta^{(s)}\) at iteration \(s\), and need to add a new value \(\theta^{(s+1)}\).
Consider a candidate value \(\theta^\star\) that is close to \(\theta^{(s)}\)
Should we set \(\theta^{(s+1)} = \theta^\star\) or not?
look at the ratio \[ \begin{split} M & = \dfrac{\pi(\theta^\star \mid y)}{\pi(\theta^{(s)} \mid y)} = \frac{\dfrac{p(y \mid \theta^\star) \pi(\theta^\star)}{p(y)} } {\dfrac{p(y \mid \theta^{(s)}) \pi(\theta^{(s)})}{p(y)}}\\ \\ & = \dfrac{p(y \mid \theta^\star) \pi(\theta^\star)}{p(y \mid \theta^{(s)}) \pi(\theta^{(s)})} \end{split} \]
Intuition: \(\theta^{(s)}\) is already a part of the density we desire and the density at \(\theta^\star\) is even higher than the density at \(\theta^{(s)}\).
Action: set \(\theta^{(s+1)} = \theta^\star\)
Intuition: relative frequency of values in our group \(\theta^{(1)}, \ldots, \theta^{(s)}\) “equal” to \(\theta^\star\) should be \(\approx M = \dfrac{\pi(\theta^\star \mid y)}{\pi(\theta^{(s)} \mid y)}\).
For every \(\theta^{(s)}\), include only a fraction of an instance of \(\theta^\star\).
Action: set \(\theta^{(s+1)} = \theta^\star\) with probability \(M\) and \(\theta^{(s+1)} = \theta^{(s)}\) with probability \(1-M\).
Where should the proposed value \(\theta^\star\) come from?
Sample \(\theta^\star\) close to the current value \(\theta^{(s)}\) using a symmetric proposal distribution \(\theta^\star \sim q(\theta^\star \mid \theta^{(s)})\)
\(q()\) is actually a “family of proposal distributions”, indexed by the specific value of \(\theta^{(s)}\).
Here, symmetric means that \(q(\theta^\star \mid \theta^{(s)}) = q(\theta^{(s)} \mid \theta^\star)\).
Common choice \[\textsf{N}(\theta^\star; \theta^{(s)}, \delta^2 \Sigma)\] with \(\Sigma\) based on the approximate \(\textsf{Cov}(\theta \mid y)\) and \(\delta = 2.44/\text{dim}(\theta)\) or \[\text{Unif}(\theta^\star; \theta^{(s)} - \delta, \theta^{(s)} + \delta)\]
The algorithm proceeds as follows:
Given \(\theta^{(1)}, \ldots, \theta^{(s)}\), generate a candidate value \(\theta^\star \sim q(\theta^\star \mid \theta^{(s)})\).
Compute the acceptance ratio \[\begin{split} M & = \dfrac{\pi(\theta^\star \mid y)}{\pi(\theta^{(s)} \mid y)} = \dfrac{p(y \mid \theta^\star) \pi(\theta^\star)}{p(y \mid \theta^{(s)}) \pi(\theta^{(s)})}. \end{split}\]
Set \[\begin{eqnarray*} \theta^{(s+1)} = \left\{ \begin{array}{ll} \theta^\star & \quad \text{with probability} \quad \text{min}(M,1) \\ \theta^{(s)} & \quad \text{with probability} \quad 1 - \text{min}(M,1) \\ \end{array} \right. \end{eqnarray*}\] equivalent to sampling \(u \sim U(0,1)\) independently and setting \[\begin{eqnarray*} \theta^{(s+1)} = \left\{ \begin{array}{ll} \theta^\star & \quad \text{if} \quad u < M \\ \theta^{(s)} & \quad \text{if} \quad \text{otherwise} \\ \end{array} \right. . \end{eqnarray*} \]
Acceptance probability is \[M = \min \left\{ 1, \frac{\pi(\theta^\star) \cal{L}(\theta^\star)} {\pi(\theta^{(s)}) \cal{L}(\theta^{(s)})}\right\}\]
ratio of posterior densities where normalizing constant cancels!
The Metropolis chain ALWAYS moves to the proposed \(\theta^\star\) at iteration \(s+1\) if \(\theta^\star\) has higher target density than the current \(\theta^{(s)}\).
Sometimes, it also moves to a \(\theta^\star\) value with lower density in proportion to the density value itself.
This leads to a random, Markov process that naturally explores the space according to the probability defined by \(\pi(\theta \mid y)\), and hence generates a sequence that, while dependent, eventually represents draws from \(\pi(\theta \mid y)\) (stationary distribution of the Markov Chain).