STA 702: Lecture 3
Duke University
Normal Model
Predictive Distributions
Prior Predictive; useful for prior elicitation
Posterior Predictive; predicting/forecasting future events
Comparing Estimators
Suppose we have independent observations \[\mathbf{y} = (y_1,y_2,\ldots,y_n)^T\] where each \(Y_i \mid \theta, \sigma^2 \stackrel{iid}{\sim} \textsf{N}(\theta, \sigma^2)\)
We will see that it is more convenient to work with \(\tau = 1/\sigma^2\) (precision)
reparameterizing, the model for the data becomes \[Y_i \mid \theta, \tau \stackrel{iid}{\sim} \textsf{N}(\theta, \tau^{-1})\]
for simplicity we will treat \(\tau\) as known initially.
Need to specify a prior for \(\theta\) on \(\mathbb{R}\)
Natural choice is a Normal/Gaussian distribution (Conjugate prior) \[\theta \sim \textsf{N}(\theta_0, 1/\tau_0)\]
\(\theta_0\) is the prior mean - best guess for \(\theta\) using information other than \(\mathbf{y}\)
\(\tau_0\) is the prior precision and expresses our certainty about this guess
one notion of non-informative is to let \(\tau_0 \to 0\)
better justification is as Jeffreys’ prior (uniform measure)
\(\pi(\theta) \propto 1\)
parameterization invariant and invariant to location/scale changes in the data (group invariance)
Exercise for the Energetic Student
You should be able to derive Jeffreys prior!
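For reference, a sketch of the derivation: with \(\tau\) known, the Fisher information is free of \(\theta\), so Jeffreys' rule gives a flat prior, \[\mathcal{I}(\theta) = -\textsf{E}\left[\frac{\partial^2}{\partial \theta^2} \log p(\mathbf{y} \mid \theta, \tau) \,\Big|\, \theta \right] = n \tau \qquad \Longrightarrow \qquad \pi_J(\theta) \propto \mathcal{I}(\theta)^{1/2} \propto 1\]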
Posterior \[p(\theta \mid y) \propto \exp\left\{- \frac 1 2 \left[\tau (y - \theta)^2 + \tau_0(\theta - \theta_0)^2 \right]\right\}\]
Quadratic in exponential term: \(\tau_0(\theta - \theta_0)^2 = \tau_0 \theta^2 - 2 \tau_0 \theta_0 \theta + \tau_0 \theta_0^2\)
posterior precision is the sum of prior precision and data precision \(\tau_0 + \tau\)
posterior mean \(\hat{\theta} = \frac{\tau_0} {\tau_0 + \tau} \theta_0 + \frac{\tau}{\tau_0 + \tau} y\); precision weighted average of prior mean and MLE
conjugate family updating \(\theta \mid y \sim \textsf{N} \left(\hat{\theta}, \frac{1}{\tau_0 + \tau} \right)\)
Recall that the marginal distribution is \[p({y}) = p(y_1,\ldots,y_n) = \int_\Theta p(y_1,\ldots,y_n \mid \theta) \pi(\theta)\, d\theta\]
this is also called the prior predictive distribution; it does not depend on any unknown parameters
We may care about making predictions before we even see any data.
This is often useful as a way to see if the sampling distribution or prior we have chosen is appropriate, after integrating out all unknown parameters.
\[\begin{split} p(y) & \propto \int_\mathbb{R} p(y \mid \theta) \pi(\theta) \, d\theta \\ & \propto \int_\mathbb{R}\exp\left\{- \frac 1 2 \tau (y - \theta) ^2 \right\} \exp\left\{- \frac 1 2 \tau_0(\theta - \theta_0) ^2 \right\} \, d\theta \end{split}\]
Integration
Expand quadratics in exp terms
Group terms with \(\theta^2\) and \(\theta\)
Read off posterior precision and posterior mean
Complete the square
Integrate out \(\theta\) to obtain marginal!
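Carrying out these steps for a single observation \(y\): \[\begin{split} \tau(y - \theta)^2 + \tau_0 (\theta - \theta_0)^2 & = (\tau_0 + \tau)\,\theta^2 - 2(\tau_0 \theta_0 + \tau y)\,\theta + \tau y^2 + \tau_0 \theta_0^2 \\ & = (\tau_0 + \tau)(\theta - \hat{\theta})^2 + c(y), \qquad \hat{\theta} = \frac{\tau_0 \theta_0 + \tau y}{\tau_0 + \tau} \end{split}\]
where \(c(y) = \tau y^2 + \tau_0 \theta_0^2 - (\tau_0 + \tau)\hat{\theta}^2\) is free of \(\theta\). The \(\theta\) term is the kernel of \(\textsf{N}(\hat{\theta}, 1/(\tau_0 + \tau))\) and integrates to a constant, leaving \(p(y) \propto \exp\{-c(y)/2\}\); expanding \(c(y)\) shows it is \((y - \theta_0)^2/(1/\tau_0 + 1/\tau)\) plus a constant, i.e. the \(\textsf{N}(\theta_0, 1/\tau_0 + 1/\tau)\) kernel.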
Linear combinations of independent Normals are Normal! \[Y \stackrel{D}{=} \theta + \epsilon, \quad \epsilon \sim \textsf{N}(0, 1/\tau) \text{ independent of } \theta \sim \textsf{N}(\theta_0, 1/\tau_0)\]
Mean of sum: \(\textsf{E}[Y] = \textsf{E}[\theta] + \textsf{E}[\epsilon] = \theta_0\)
Variance of sum (by independence): \(\textsf{Var}[Y] = \textsf{Var}[\theta] + \textsf{Var}[\epsilon] = 1/\tau_0 + 1/\tau\)
Marginal \(Y \sim \textsf{N}(\theta_0, 1/\tau_0 + 1/\tau)\)
marginal distribution for \(Y\) (prior predictive) \[Y \sim \textsf{N}\left(\theta_0, \frac{1}{\tau_0} + \frac{1}{\tau}\right) \text{ or } \textsf{N}(\theta_0, \sigma^2 + \sigma^2_0)\]
two sources of variability: data variability and prior variability
useful to think about observable quantities when choosing the prior
sample directly from the prior predictive and assess whether the samples are consistent with our prior knowledge
if not, go back and modify the prior & repeat
sequential substitution sampling (repeat \(T\) times)
draw \(\theta^{(t)} \sim \pi(\theta)\)
draw \(y^{(t)} \mid \theta^{(t)} \sim p(y \mid \theta^{(t)})\)
takes into account uncertainty about \(\theta\) and variability in \(Y\)! (see the sketch below)
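A minimal sketch in Python (the values of `theta0`, `tau0`, and `tau` below are illustrative placeholders, not values from the lecture):

```python
import numpy as np

rng = np.random.default_rng(42)

# illustrative hyperparameters (placeholders)
theta0, tau0 = 0.0, 1.0   # prior mean and prior precision
tau = 4.0                 # known data precision, sigma^2 = 1/tau

T = 100_000
theta = rng.normal(theta0, 1 / np.sqrt(tau0), size=T)  # theta^(t) ~ pi(theta)
y = rng.normal(theta, 1 / np.sqrt(tau))                # y^(t) ~ p(y | theta^(t))

# y now holds draws from the prior predictive N(theta0, 1/tau0 + 1/tau)
print(y.mean(), y.var())  # roughly 0.0 and 1/1 + 1/4 = 1.25
```

Plotting a histogram of `y` and checking it against prior knowledge of the observable is exactly the prior predictive check described above.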
Sequential updating using the previous result as our prior!
New prior after seeing 1 observation is \[\textsf{N}(\theta_1, 1/\tau_1)\]
prior mean is a precision-weighted average \[\theta_1 \equiv \frac{\tau_0 \theta_0 + \tau y_1}{\tau_0 + \tau}\]
prior precision after 1 observation \[\tau_1 \equiv \tau_0 + \tau\]
prior variance is now \(\sigma^2_1 = 1/\tau_1\)
Conditional \(p(y_2 \mid y_1) = p(y_2, y_1)/p(y_1)\) (Hard way!)
Use latent variable representation \[p(y_2 \mid y_1) = \int_\Theta \frac{p(y_2 \mid \theta)\, p(y_1 \mid \theta)\, \pi(\theta)}{p(y_1)} \, d\theta\]
simplify to previous problem and use results \[p(y_2 \mid y_1) = \int_\Theta p(y_2 \mid \theta) \pi(\theta \mid y_1) \, d\theta\]
(Posterior) Predictive \[Y_2 \mid y_1 \sim \textsf{N}(\theta_1, \sigma^2 + \sigma^2_1)\]
Based on these expressions we have an exponential of a quadratic in \(y_2\), so we know the distribution is Gaussian
Find the mean and variance using iterated expectations:
mean \[\textsf{E}[Y_2 \mid y_1] = \textsf{E}_{\theta \mid y_1}[\textsf{E}_{Y_2 \mid y_1, \theta} [Y_2 \mid y_1, \theta] \mid y_1]\]
Conditional Variance \(\textsf{Var}[Y_2 \mid y_1]\)
Iterated expectations (prove!) \[\textsf{Var}[Y_2 \mid y_1] = \textsf{E}_{\theta \mid y_1}[\textsf{Var}_{Y_2 \mid y_1, \theta} [Y_2 \mid y_1, \theta] \mid y_1] + \textsf{Var}_{\theta \mid y_1}[\textsf{E}_{Y_2 \mid y_1, \theta} [Y_2 \mid y_1, \theta] \mid y_1]\]
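Evaluating the two pieces: the inner mean is \(\theta\) and the inner variance is \(1/\tau\), so \[\textsf{E}[Y_2 \mid y_1] = \textsf{E}[\theta \mid y_1] = \theta_1, \qquad \textsf{Var}[Y_2 \mid y_1] = \textsf{E}\left[\frac{1}{\tau} \,\Big|\, y_1 \right] + \textsf{Var}[\theta \mid y_1] = \frac{1}{\tau} + \frac{1}{\tau_1} = \sigma^2 + \sigma^2_1\]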
\[p(\theta \mid y_1, y_2) \propto p(y_2 \mid \theta) p(y_1 \mid \theta) \pi(\theta)\]
\[p(\theta \mid y_1, y_2) \propto p(y_2 \mid \theta) p(\theta \mid y_1)\]
Apply previous updating rules
new posterior mean \[\theta_2 = \frac{\tau_1 \theta_1 + \tau y_2}{\tau_1 + \tau} = \frac{\tau_0 \theta_0 + 2 \tau \bar{y}} {\tau_0 + 2 \tau}\]
new precision \[ \tau_2 = \tau_1 + \tau = \tau_0 + 2 \tau\]
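A quick numerical check that sequential and batch updating agree (a sketch; the hyperparameters and the observations `y` are made up):

```python
import numpy as np

def update(theta_prior, tau_prior, y, tau):
    """One conjugate Normal-Normal update; returns (posterior mean, posterior precision)."""
    tau_post = tau_prior + tau
    theta_post = (tau_prior * theta_prior + tau * y) / tau_post
    return theta_post, tau_post

theta0, tau0, tau = 0.0, 1.0, 4.0   # illustrative hyperparameters
y = np.array([0.8, 1.3])            # made-up observations y_1, y_2

# sequential: feed observations one at a time
theta_n, tau_n = theta0, tau0
for yi in y:
    theta_n, tau_n = update(theta_n, tau_n, yi, tau)

# batch: mean (tau0*theta0 + n*tau*ybar)/(tau0 + n*tau), precision tau0 + n*tau
n, ybar = len(y), y.mean()
print(theta_n, (tau0 * theta0 + n * tau * ybar) / (tau0 + n * tau))  # same mean
print(tau_n, tau0 + n * tau)                                         # same precision
```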
Posterior for \(\theta\) \[\theta \mid y_1, \ldots, y_n \sim \textsf{N}\left( \frac{\tau_0 \theta_0 + n \tau \bar{y}} {\tau_0 + n \tau}, \frac{1}{ \tau_0 + n \tau} \right)\]
Posterior Predictive Distribution for \(Y_{n+1}\) \[Y_{n+1} \mid y_1, \ldots, y_n \sim \textsf{N}\left( \frac{\tau_0 \theta_0 + n \tau \bar{y}} {\tau_0 + n \tau}, \frac{1}{\tau} + \frac{1}{ \tau_0 + n \tau} \right)\]
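The same two-stage draw, now starting from the posterior, yields posterior predictive samples; a sketch reusing the made-up numbers from the block above:

```python
import numpy as np

rng = np.random.default_rng(0)
tau = 4.0                      # known data precision
theta_n, tau_n = 8.4 / 9, 9.0  # posterior mean/precision from the n = 2 example above

T = 100_000
theta_draws = rng.normal(theta_n, 1 / np.sqrt(tau_n), size=T)  # theta | y_1,...,y_n
y_next = rng.normal(theta_draws, 1 / np.sqrt(tau))             # Y_{n+1} | theta

# Monte Carlo moments match the closed form N(theta_n, 1/tau + 1/tau_n)
print(y_next.mean(), theta_n)             # both ~ 0.933
print(y_next.var(), 1 / tau + 1 / tau_n)  # both ~ 0.361
```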
Shrinkage of the MLE to the prior mean
More accurate estimation of \(\theta\) as \(n \to \infty\) (reducible error)
Cannot reduce the error for prediction \(Y_{n+1}\) due to \(\sigma^2\)
predictive distribution for the next observation given everything we know - prior and likelihood
What if \(\tau_0 \to 0\)? (or \(\sigma^2_0 \to \infty\))
Prior predictive \(\textsf{N}(\theta_0, \sigma^2_0 + \sigma^2 )\) (not proper in the limit)
Posterior for \(\theta\) (formal posterior) \[\theta \mid y_1, \ldots, y_n \sim \textsf{N}\left( \frac{\tau_0 \theta_0 + n \tau \bar{y}} {\tau_0 + n \tau}, \frac{1}{ \tau_0 + n \tau} \right)\]
\[\to \qquad \theta \mid y_1, \ldots, y_n \sim \textsf{N}\left( \bar{y}, \frac{1}{n \tau} \right)\]
Recovers the MLE as the posterior mode!
Posterior variance of \(\theta = \sigma^2/n\) (same as variance of the MLE)
Posterior predictive distribution for \(Y_{n+1}\) \[Y_{n+1} \mid y_1, \ldots, y_n \sim \textsf{N}\left( \frac{\tau_0 \theta_0 + n \tau \bar{y}} {\tau_0 + n \tau}, \frac{1}{\tau} + \frac{1}{ \tau_0 + n \tau} \right)\]
Under Jeffreys’ prior \[Y_{n+1} \mid y_1, \ldots, y_n \sim \textsf{N}\left( \bar{y}, \sigma^2 (1 + \frac{1}{n} )\right)\]
Captures extra uncertainty due to not knowing \(\theta\) (compared to the plug-in approach, where we plug the MLE into the sampling model!)
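A small simulation sketch of this effect (the settings are illustrative): plug-in predictive intervals are too narrow and undercover, while intervals from the posterior predictive \(\textsf{N}(\bar{y}, \sigma^2(1 + 1/n))\) calibrate:

```python
import numpy as np

rng = np.random.default_rng(1)
theta_true, sigma, n = 0.0, 1.0, 5   # illustrative settings
z = 1.96                             # 97.5% standard normal quantile

reps = 100_000
ybar = rng.normal(theta_true, sigma / np.sqrt(n), size=reps)  # sampling dist of ybar
y_next = rng.normal(theta_true, sigma, size=reps)             # future observation

# plug-in interval ybar +/- z*sigma vs Bayes ybar +/- z*sigma*sqrt(1 + 1/n)
cover_plugin = np.abs(y_next - ybar) <= z * sigma
cover_bayes = np.abs(y_next - ybar) <= z * sigma * np.sqrt(1 + 1 / n)

print(cover_plugin.mean())  # below 0.95: plug-in intervals are too narrow
print(cover_bayes.mean())   # approximately 0.95
```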
Expected loss (from frequentist perspective) of using Bayes Estimator
Posterior mean is optimal under squared error loss (minimizes Bayes risk) [the posterior median is optimal under absolute error loss; the two coincide here since the posterior is Normal]
Compute Mean Square Error (or Expected Average Loss) \[\textsf{E}_{\bar{y} \mid \theta}\left[\left(\hat{\theta} - \theta \right)^2 \mid \theta \right]\]
\[ = \textsf{Bias}(\hat{\theta})^2 + \textsf{Var}(\hat{\theta})\]
Bias of Bayes Estimate \[\textsf{E}_{\bar{Y} \mid \theta}\left[ \frac{\tau_0 \theta_0 + \tau n \bar{Y}} {\tau_0 + \tau n} - \theta \,\Big|\, \theta \right] = \frac{\tau_0(\theta_0 - \theta)}{\tau_0 + \tau n}\]
Variance \[\textsf{Var}\left(\frac{\tau_0 \theta_0 + \tau n \bar{Y}}{\tau_0 + \tau n} - \theta \mid \theta \right) = \frac{\tau n}{(\tau_0 + \tau n)^2}\]
(Frequentist) expected Loss when truth is \(\theta\) \[\textsf{MSE} = \frac{\tau_0^2(\theta - \theta_0)^2 + \tau n}{(\tau_0 + \tau n)^2}\]
Behavior? As \(n \to \infty\), \(\textsf{MSE} \to 0\); compare to the MLE \(\bar{y}\), whose risk is \(1/(n\tau)\) for every \(\theta\) (see the sketch below)
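A numerical sketch (hyperparameters are illustrative): the Bayes estimator has smaller MSE than the MLE near the prior mean \(\theta_0\) and larger MSE far from it:

```python
import numpy as np

theta0, tau0, tau, n = 0.0, 1.0, 1.0, 10   # illustrative values
theta = np.linspace(-3, 3, 7)              # candidate true values of theta

# frequentist risk (MSE) of the Bayes estimator vs the MLE ybar
mse_bayes = (tau0**2 * (theta - theta0)**2 + tau * n) / (tau0 + tau * n)**2
mse_mle = np.full_like(theta, 1 / (n * tau))

for t, b, m in zip(theta, mse_bayes, mse_mle):
    print(f"theta = {t:+.1f}: Bayes {b:.4f} vs MLE {m:.4f}")
```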
Can update sequentially as before -or-
We can use the \(\cal{L}(\theta)\) based on \(n\) observations and repeat completing the square with the original prior \(\theta \sim \textsf{N}(\theta_0, 1/\tau_0)\)
same answer!
The likelihood for \(\theta\) is proportional to the sampling model \[p(y \mid \theta,\tau) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi}} \tau^{\frac{1}{2}} \exp{\left\{-\frac{1}{2} \tau (y_i-\theta)^2\right\}}\]
Exercise
Rewrite in terms of sufficient statistics!
Exercise 1
Use \(\cal{L}(\theta)\) based on \(n\) observations and \(\pi(\theta)\) to find \(\pi(\theta \mid y_1, \ldots, y_n)\) based on the sufficient statistics
Exercise 2
Use \(\pi(\theta \mid y_1, \ldots, y_n)\) to find the posterior predictive distribution for \(Y_{n+1}\)
\[\begin{split} \cal{L}(\theta) & \propto \tau^{\frac{n}{2}} \ \exp\left\{-\frac{1}{2} \tau \sum_{i=1}^n (y_i-\theta)^2\right\}\\ & \propto \tau^{\frac{n}{2}} \ \exp\left\{-\frac{1}{2} \tau \sum_{i=1}^n \left[ (y_i-\bar{y}) - (\theta - \bar{y}) \right]^2 \right\}\\ & \propto \tau^{\frac{n}{2}} \ \exp\left\{-\frac{1}{2} \tau \left[ \sum_{i=1}^n (y_i-\bar{y})^2 + n(\theta - \bar{y})^2 \right] \right\} \qquad \text{(cross term vanishes: } \textstyle\sum_{i=1}^n (y_i - \bar{y}) = 0\text{)}\\ & \propto \tau^{\frac{n}{2}} \ \exp\left\{-\frac{1}{2} \tau (n-1) s^2 \right\} \ \exp\left\{-\frac{1}{2} \tau n(\theta - \bar{y})^2 \right\} \end{split}\]
with \(s^2 \equiv \sum_{i=1}^n (y_i - \bar{y})^2/(n-1)\): the data enter only through the sufficient statistics \((\bar{y}, s^2)\)