STA 702: Lecture 3
Duke University
Normal Model
Predictive Distributions
Prior Predictive; useful for prior elicitation
Posterior Predictive; predicting/forecasting future events
Comparing Estimators
Suppose we have independent observations \[\mathbf{y} = (y_1,y_2,\ldots,y_n)^T\] where each \(Y_i \mid \theta, \sigma^2 \stackrel{iid}{\sim} \textsf{N}(\theta, \sigma^2)\)
We will see that it is more convenient to work with \(\tau = 1/\sigma^2\) (precision)
reparameterizing, the model for the data becomes \[Y_i \mid \theta, \tau \stackrel{iid}{\sim} \textsf{N}(\theta, \tau^{-1})\]
for simplicity we will treat \(\tau\) as known initially.
Need to specify a prior for \(\theta\) on \(\mathbb{R}\)
Natural choice is a Normal/Gaussian distribution (Conjugate prior) \[\theta \sim \textsf{N}(\theta_0, 1/\tau_0)\]
\(\theta_0\) is the prior mean - best guess for \(\theta\) using information other than \(\mathbf{y}\)
\(\tau_0\) is the prior precision and expresses our certainty about this guess
one notion of non-informative is to let \(\tau_0 \to 0\)
better justification is as Jeffreys’ prior (uniform measure)
\(\pi(\theta) \propto 1\)
parameterization invariant and invariant to location/scale changes in the data (group invariance)
Exercise for the Energetic Student
You should be able to derive Jeffreys prior!
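For reference, a sketch of the derivation: with \(\tau\) known, the Fisher information is free of \(\theta\), so Jeffreys' rule gives a flat prior, \[\mathcal{I}(\theta) = -\textsf{E}\left[\frac{\partial^2}{\partial \theta^2} \log p(\mathbf{y} \mid \theta, \tau) \,\Big|\, \theta \right] = n \tau \qquad \Longrightarrow \qquad \pi_J(\theta) \propto \mathcal{I}(\theta)^{1/2} \propto 1\]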
Posterior \[p(\theta \mid y) \propto \exp\left\{- \frac 1 2 \left[\tau (y - \theta)^2 + \tau_0(\theta - \theta_0)^2 \right]\right\}\]
Quadratic in exponential term: \(\tau_0(\theta - \theta_0)^2 = \tau_0 \theta^2 - 2 \tau_0 \theta_0 \theta + \tau_0 \theta_0^2\)
posterior precision is the sum of prior precision and data precision \(\tau_0 + \tau\)
posterior mean \(\hat{\theta} = \frac{\tau_0} {\tau_0 + \tau} \theta_0 + \frac{\tau}{\tau_0 + \tau} y\); precision weighted average of prior mean and MLE
conjugate family updating \(\theta \mid y \sim \textsf{N} \left(\hat{\theta}, \frac{1}{\tau_0 + \tau} \right)\)
Recall that the marginal distribution is \[p({y}) = p(y_1,\ldots,y_n) = \int_\Theta p(y_1,\ldots,y_n \mid \theta) \pi(\theta)\, d\theta\]
this is also called the prior predictive distribution; it does not depend on any unknown parameters
We may care about making predictions before we even see any data.
This is often useful as a way to see if the sampling distribution or prior we have chosen is appropriate, after integrating out all unknown parameters.
\[\begin{split} p(y) & \propto \int_\mathbb{R} p(y \mid \theta) \pi(\theta) \, d\theta \\ & \propto \int_\mathbb{R}\exp\left\{- \frac 1 2 \tau (y - \theta) ^2 \right\} \exp\left\{- \frac 1 2 \tau_0(\theta - \theta_0) ^2 \right\} \, d\theta \end{split}\]
Integration
Expand quadratics in exp terms
Group terms with \(\theta^2\) and \(\theta\)
Read off posterior precision and posterior mean
Complete the square
Integrate out \(\theta\) to obtain marginal!
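Carrying out these steps for a single observation \(y\): \[\begin{split} \tau(y - \theta)^2 + \tau_0 (\theta - \theta_0)^2 & = (\tau_0 + \tau)\,\theta^2 - 2(\tau_0 \theta_0 + \tau y)\,\theta + \tau y^2 + \tau_0 \theta_0^2 \\ & = (\tau_0 + \tau)(\theta - \hat{\theta})^2 + c(y), \qquad \hat{\theta} = \frac{\tau_0 \theta_0 + \tau y}{\tau_0 + \tau} \end{split}\]
where \(c(y) = \tau y^2 + \tau_0 \theta_0^2 - (\tau_0 + \tau)\hat{\theta}^2\) is free of \(\theta\). The \(\theta\) term is the kernel of \(\textsf{N}(\hat{\theta}, 1/(\tau_0 + \tau))\) and integrates to a constant, leaving \(p(y) \propto \exp\{-c(y)/2\}\); expanding \(c(y)\) shows it is \((y - \theta_0)^2/(1/\tau_0 + 1/\tau)\) plus a constant, i.e. the \(\textsf{N}(\theta_0, 1/\tau_0 + 1/\tau)\) kernel.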
Linear combinations of independent Normals are Normal! \[Y \stackrel{D}{=} \theta + \epsilon, \quad \epsilon \sim \textsf{N}(0, 1/\tau) \text{ independent of } \theta \sim \textsf{N}(\theta_0, 1/\tau_0)\]
Mean of sum: \(\textsf{E}[Y] = \textsf{E}[\theta] + \textsf{E}[\epsilon] = \theta_0\)
Variance of sum (by independence): \(\textsf{Var}[Y] = \textsf{Var}[\theta] + \textsf{Var}[\epsilon] = 1/\tau_0 + 1/\tau\)
Marginal \(Y \sim \textsf{N}(\theta_0, 1/\tau_0 + 1/\tau)\)
marginal distribution for \(Y\) (prior predictive) \[Y \sim \textsf{N}\left(\theta_0, \frac{1}{\tau_0} + \frac{1}{\tau}\right) \text{ or } \textsf{N}(\theta_0, \sigma^2 + \sigma^2_0)\]
two sources of variability: data variability and prior variability
useful to think about observable quantities when choosing the prior
sample directly from the prior predictive and assess whether the samples are consistent with our prior knowledge
if not, go back and modify the prior & repeat
sequential substitution sampling (repeat \(T\) times)
draw \(\theta^{(t)} \sim \pi(\theta)\)
draw \(y^{(t)} \mid \theta^{(t)} \sim p(y \mid \theta^{(t)})\)
takes into account uncertainty about \(\theta\) and variability in \(Y\)! (see the sketch below)
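A minimal sketch in Python (the values of `theta0`, `tau0`, and `tau` below are illustrative placeholders, not values from the lecture):

```python
import numpy as np

rng = np.random.default_rng(42)

# illustrative hyperparameters (placeholders)
theta0, tau0 = 0.0, 1.0   # prior mean and prior precision
tau = 4.0                 # known data precision, sigma^2 = 1/tau

T = 100_000
theta = rng.normal(theta0, 1 / np.sqrt(tau0), size=T)  # theta^(t) ~ pi(theta)
y = rng.normal(theta, 1 / np.sqrt(tau))                # y^(t) ~ p(y | theta^(t))

# y now holds draws from the prior predictive N(theta0, 1/tau0 + 1/tau)
print(y.mean(), y.var())  # roughly 0.0 and 1/1 + 1/4 = 1.25
```

Plotting a histogram of `y` and checking it against prior knowledge of the observable is exactly the prior predictive check described above.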
Sequential updating using the previous result as our prior!
New prior after seeing 1 observation is \[\textsf{N}(\theta_1, 1/\tau_1)\]
prior mean is a precision-weighted average \[\theta_1 \equiv \frac{\tau_0 \theta_0 + \tau y_1}{\tau_0 + \tau}\]
prior precision after 1 observation \[\tau_1 \equiv \tau_0 + \tau\]
prior variance is now \(\sigma^2_1 = 1/\tau_1\)
Conditional \(p(y_2 \mid y_1) = p(y_2, y_1)/p(y_1)\) (Hard way!)
Use latent variable representation \[p(y_2 \mid y_1) = \int_\Theta \frac{p(y_2 \mid \theta)\, p(y_1 \mid \theta)\, \pi(\theta)}{p(y_1)} \, d\theta\]
simplify to previous problem and use results \[p(y_2 \mid y_1) = \int_\Theta p(y_2 \mid \theta) \pi(\theta \mid y_1) \, d\theta\]
(Posterior) Predictive \[Y_2 \mid y_1 \sim \textsf{N}(\theta_1, \sigma^2 + \sigma^2_1)\]
Based on these expressions we have an exponential of a quadratic in \(y_2\), so we know the distribution is Gaussian
Find the mean and variance using iterated expectations:
mean \[\textsf{E}[Y_2 \mid y_1] = \textsf{E}_{\theta \mid y_1}[\textsf{E}_{Y_2 \mid y_1, \theta} [Y_2 \mid y_1, \theta] \mid y_1]\]
Conditional Variance \(\textsf{Var}[Y_2 \mid y_1]\)
Iterated expectations (prove!) \[\textsf{Var}[Y_2 \mid y_1] = \textsf{E}_{\theta \mid y_1}[\textsf{Var}_{Y_2 \mid y_1, \theta} [Y_2 \mid y_1, \theta] \mid y_1] + \textsf{Var}_{\theta \mid y_1}[\textsf{E}_{Y_2 \mid y_1, \theta} [Y_2 \mid y_1, \theta] \mid y_1]\]
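Evaluating the two pieces: the inner mean is \(\theta\) and the inner variance is \(1/\tau\), so \[\textsf{E}[Y_2 \mid y_1] = \textsf{E}[\theta \mid y_1] = \theta_1, \qquad \textsf{Var}[Y_2 \mid y_1] = \textsf{E}\left[\frac{1}{\tau} \,\Big|\, y_1 \right] + \textsf{Var}[\theta \mid y_1] = \frac{1}{\tau} + \frac{1}{\tau_1} = \sigma^2 + \sigma^2_1\]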
\[p(\theta \mid y_1, y_2) \propto p(y_2 \mid \theta) p(y_1 \mid \theta) \pi(\theta)\]
\[p(\theta \mid y_1, y_2) \propto p(y_2 \mid \theta) p(\theta \mid y_1)\]
Apply previous updating rules
new posterior mean \[\theta_2 = \frac{\tau_1 \theta_1 + \tau y_2}{\tau_1 + \tau} = \frac{\tau_0 \theta_0 + 2 \tau \bar{y}} {\tau_0 + 2 \tau}\]
new precision \[ \tau_2 = \tau_1 + \tau = \tau_0 + 2 \tau\]
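A quick numerical check that sequential and batch updating agree (a sketch; the hyperparameters and the observations `y` are made up):

```python
import numpy as np

def update(theta_prior, tau_prior, y, tau):
    """One conjugate Normal-Normal update; returns (posterior mean, posterior precision)."""
    tau_post = tau_prior + tau
    theta_post = (tau_prior * theta_prior + tau * y) / tau_post
    return theta_post, tau_post

theta0, tau0, tau = 0.0, 1.0, 4.0   # illustrative hyperparameters
y = np.array([0.8, 1.3])            # made-up observations y_1, y_2

# sequential: feed observations one at a time
theta_n, tau_n = theta0, tau0
for yi in y:
    theta_n, tau_n = update(theta_n, tau_n, yi, tau)

# batch: mean (tau0*theta0 + n*tau*ybar)/(tau0 + n*tau), precision tau0 + n*tau
n, ybar = len(y), y.mean()
print(theta_n, (tau0 * theta0 + n * tau * ybar) / (tau0 + n * tau))  # same mean
print(tau_n, tau0 + n * tau)                                         # same precision
```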
Posterior for \(\theta\) \[\theta \mid y_1, \ldots, y_n \sim \textsf{N}\left( \frac{\tau_0 \theta_0 + n \tau \bar{y}} {\tau_0 + n \tau}, \frac{1}{ \tau_0 + n \tau} \right)\]
Posterior Predictive Distribution for \(Y_{n+1}\) \[Y_{n+1} \mid y_1, \ldots, y_n \sim \textsf{N}\left( \frac{\tau_0 \theta_0 + n \tau \bar{y}} {\tau_0 + n \tau}, \frac{1}{\tau} + \frac{1}{ \tau_0 + n \tau} \right)\]
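The same two-stage draw, now starting from the posterior, yields posterior predictive samples; a sketch reusing the made-up numbers from the block above:

```python
import numpy as np

rng = np.random.default_rng(0)
tau = 4.0                      # known data precision
theta_n, tau_n = 8.4 / 9, 9.0  # posterior mean/precision from the n = 2 example above

T = 100_000
theta_draws = rng.normal(theta_n, 1 / np.sqrt(tau_n), size=T)  # theta | y_1,...,y_n
y_next = rng.normal(theta_draws, 1 / np.sqrt(tau))             # Y_{n+1} | theta

# Monte Carlo moments match the closed form N(theta_n, 1/tau + 1/tau_n)
print(y_next.mean(), theta_n)             # both ~ 0.933
print(y_next.var(), 1 / tau + 1 / tau_n)  # both ~ 0.361
```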
Shrinkage of the MLE to the prior mean
More accurate estimation of \(\theta\) as \(n \to \infty\) (reducible error)
Cannot reduce the error for prediction \(Y_{n+1}\) due to \(\sigma^2\)
predictive distribution for the next observation given everything we know - prior and likelihood
What if \(\tau_0 \to 0\)? (or \(\sigma^2_0 \to \infty\))
Prior predictive \(\textsf{N}(\theta_0, \sigma^2_0 + \sigma^2 )\) (not proper in the limit)
Posterior for \(\theta\) (formal posterior) \[\theta \mid y_1, \ldots, y_n \sim \textsf{N}\left( \frac{\tau_0 \theta_0 + n \tau \bar{y}} {\tau_0 + n \tau}, \frac{1}{ \tau_0 + n \tau} \right)\]
\[\to \qquad \theta \mid y_1, \ldots, y_n \sim \textsf{N}\left( \bar{y}, \frac{1}{n \tau} \right)\]
Recovers the MLE as the posterior mode!
Posterior variance of \(\theta = \sigma^2/n\) (same as variance of the MLE)
Posterior predictive distribution for \(Y_{n+1}\) \[Y_{n+1} \mid y_1, \ldots, y_n \sim \textsf{N}\left( \frac{\tau_0 \theta_0 + n \tau \bar{y}} {\tau_0 + n \tau}, \frac{1}{\tau} + \frac{1}{ \tau_0 + n \tau} \right)\]
Under Jeffreys’ prior \[Y_{n+1} \mid y_1, \ldots, y_n \sim \textsf{N}\left( \bar{y}, \sigma^2 (1 + \frac{1}{n} )\right)\]
Captures extra uncertainty due to not knowing \(\theta\) (compared to the plug-in approach, where we plug the MLE into the sampling model!)
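A small simulation sketch of this effect (the settings are illustrative): plug-in predictive intervals are too narrow and undercover, while intervals from the posterior predictive \(\textsf{N}(\bar{y}, \sigma^2(1 + 1/n))\) calibrate:

```python
import numpy as np

rng = np.random.default_rng(1)
theta_true, sigma, n = 0.0, 1.0, 5   # illustrative settings
z = 1.96                             # 97.5% standard normal quantile

reps = 100_000
ybar = rng.normal(theta_true, sigma / np.sqrt(n), size=reps)  # sampling dist of ybar
y_next = rng.normal(theta_true, sigma, size=reps)             # future observation

# plug-in interval ybar +/- z*sigma vs Bayes ybar +/- z*sigma*sqrt(1 + 1/n)
cover_plugin = np.abs(y_next - ybar) <= z * sigma
cover_bayes = np.abs(y_next - ybar) <= z * sigma * np.sqrt(1 + 1 / n)

print(cover_plugin.mean())  # below 0.95: plug-in intervals are too narrow
print(cover_bayes.mean())   # approximately 0.95
```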
Expected loss (from frequentist perspective) of using Bayes Estimator
Posterior mean is optimal under squared error loss (minimizes Bayes risk) [the posterior median is optimal under absolute error loss; the two coincide here since the posterior is Normal]
Compute Mean Square Error (or Expected Average Loss) \[\textsf{E}_{\bar{y} \mid \theta}\left[\left(\hat{\theta} - \theta \right)^2 \mid \theta \right]\]
\[ = \textsf{Bias}(\hat{\theta})^2 + \textsf{Var}(\hat{\theta})\]
Bias of Bayes Estimate \[\textsf{E}_{\bar{Y} \mid \theta}\left[ \frac{\tau_0 \theta_0 + \tau n \bar{Y}} {\tau_0 + \tau n} - \theta \,\Big|\, \theta \right] = \frac{\tau_0(\theta_0 - \theta)}{\tau_0 + \tau n}\]
Variance \[\textsf{Var}\left(\frac{\tau_0 \theta_0 + \tau n \bar{Y}}{\tau_0 + \tau n} - \theta \mid \theta \right) = \frac{\tau n}{(\tau_0 + \tau n)^2}\]
(Frequentist) expected Loss when truth is \(\theta\) \[\textsf{MSE} = \frac{\tau_0^2(\theta - \theta_0)^2 + \tau n}{(\tau_0 + \tau n)^2}\]
Behavior? As \(n \to \infty\), \(\textsf{MSE} \to 0\); compare to the MLE \(\bar{y}\), whose risk is \(1/(n\tau)\) for every \(\theta\) (see the sketch below)
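A numerical sketch (hyperparameters are illustrative): the Bayes estimator has smaller MSE than the MLE near the prior mean \(\theta_0\) and larger MSE far from it:

```python
import numpy as np

theta0, tau0, tau, n = 0.0, 1.0, 1.0, 10   # illustrative values
theta = np.linspace(-3, 3, 7)              # candidate true values of theta

# frequentist risk (MSE) of the Bayes estimator vs the MLE ybar
mse_bayes = (tau0**2 * (theta - theta0)**2 + tau * n) / (tau0 + tau * n)**2
mse_mle = np.full_like(theta, 1 / (n * tau))

for t, b, m in zip(theta, mse_bayes, mse_mle):
    print(f"theta = {t:+.1f}: Bayes {b:.4f} vs MLE {m:.4f}")
```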
Can update sequentially as before -or-
We can use the \(\cal{L}(\theta)\) based on \(n\) observations and repeat completing the square with the original prior \(\theta \sim \textsf{N}(\theta_0, 1/\tau_0)\)
same answer!
The likelihood for \(\theta\) is proportional to the sampling model \[p(y \mid \theta,\tau) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi}} \tau^{\frac{1}{2}} \exp{\left\{-\frac{1}{2} \tau (y_i-\theta)^2\right\}}\]
Exercise
Rewrite in terms of sufficient statistics!
Exercise 1
Use \(\cal{L}(\theta)\) based on \(n\) observations and \(\pi(\theta)\) to find \(\pi(\theta \mid y_1, \ldots, y_n)\) based on the sufficient statistics
Exercise 2
Use \(\pi(\theta \mid y_1, \ldots, y_n)\) to find the posterior predictive distribution for \(Y_{n+1}\)
\[\begin{split} \cal{L}(\theta) & \propto \tau^{\frac{n}{2}} \ \exp\left\{-\frac{1}{2} \tau \sum_{i=1}^n (y_i-\theta)^2\right\}\\ & \propto \tau^{\frac{n}{2}} \ \exp\left\{-\frac{1}{2} \tau \sum_{i=1}^n \left[ (y_i-\bar{y}) - (\theta - \bar{y}) \right]^2 \right\}\\ & \propto \tau^{\frac{n}{2}} \ \exp\left\{-\frac{1}{2} \tau \left[ \sum_{i=1}^n (y_i-\bar{y})^2 + n(\theta - \bar{y})^2 \right] \right\} \qquad \text{(cross term vanishes: } \textstyle\sum_{i=1}^n (y_i - \bar{y}) = 0\text{)}\\ & \propto \tau^{\frac{n}{2}} \ \exp\left\{-\frac{1}{2} \tau (n-1) s^2 \right\} \ \exp\left\{-\frac{1}{2} \tau n(\theta - \bar{y})^2 \right\} \end{split}\]
with \(s^2 \equiv \sum_{i=1}^n (y_i - \bar{y})^2/(n-1)\): the data enter only through the sufficient statistics \((\bar{y}, s^2)\)