University of Toronto
July 26, 2023
Recall: The Central Limit Theorem states that the sampling distribution of the sample mean is approximately normal for large \(n\). That is,
\[ \bar{X}_n \overset{D}{\longrightarrow}Y \;\; \text{where} \;\; Y\sim \text{N}\left( \mu,\frac{\sigma^2}{n} \right) \]
If \(\widehat\mu = \bar{X}_n\) where \(\mu= \mathbb{E}[X_i]\) and \(\sigma^2=\text{Var}(X_i)\), then the sampling distribution of \(\widehat\mu\) is given (approximately, for large \(n\)) by the CLT.
\[ \widehat\mu \sim \text{N}\left( \mu, \frac{\sigma^2}{n} \right) \]
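As a quick sanity check, we can simulate the CLT directly. The sketch below draws many samples from an Exponential(1) distribution (an illustrative choice, not tied to the course data) and compares the mean and spread of the sample means to what the CLT predicts.

```r
# Sketch: simulating the sampling distribution of the sample mean.
# The distribution, n, and number of replications are illustrative choices.
set.seed(1)
n <- 100       # sample size
reps <- 5000   # number of simulated datasets
# Exponential(rate = 1) has mu = 1 and sigma^2 = 1
means <- replicate(reps, mean(rexp(n, rate = 1)))
mean(means)    # close to mu = 1
sd(means)      # close to sigma / sqrt(n) = 0.1
```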
So far, we’ve focused on what are called point estimates, our “best guesses” for a parameter value, based on the data we have. But how confident are we in those estimates?
Recall: Estimates are functions of the observed data; the same functions applied to the random sample are called estimators.
The square root of the variance of an estimator is known as the standard error.
If we know the sampling distribution of an estimator, we can use it for more than finding an estimate.
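For the sample mean, for example, the standard error can be estimated directly from the data as \(s/\sqrt{n}\). A minimal sketch, using a made-up sample for illustration:

```r
# Sketch: estimated standard error of the sample mean, s / sqrt(n).
# x is a made-up sample for illustration.
x <- c(4.1, 5.9, 6.3, 7.2, 5.5, 6.8)
se_mean <- sd(x) / sqrt(length(x))
se_mean
```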
The bootstrap method is a way to learn more about the sampling distribution of an estimator, while making fewer assumptions.
Approximate the sampling distribution by simulating many datasets from a good model of the distribution of the data, calculating a new estimate of the statistic for each simulated dataset.
A bootstrap algorithm has the following form:
Result: Many values of the bootstrap statistic. The distribution of the bootstrap statistic is the bootstrap distribution.
Input
The observed data \(x_1, \dots, x_n\) and the number of bootstrap repetitions \(B\)
Repeat the following for each of \(b=1,...,B\):
Draw a bootstrap sample \(x_1^*, \dots, x_n^*\) of size \(n\), with replacement, from the data, and compute the bootstrap statistic \(\widehat{\mu}_b^*\) from the bootstrap sample
Output
\(\widehat{\mu}_1^*, ..., \widehat{\mu}_B^*\), which are a sample from the sampling distribution of \(\widehat\mu\)
This algorithm is equivalent to sampling from the empirical cumulative distribution function (eCDF). Recall that the eCDF is defined by
\[ \widehat{F}_n(x)=\frac{\big\lvert \{x_i \vert x_i\leq x \} \big\rvert}{n} \]
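For instance, with a small made-up sample, base R's ecdf() implements exactly this definition:

```r
# Sketch: the eCDF is the proportion of observations at or below x.
x <- c(2, 5, 5, 7, 9)
Fhat <- ecdf(x)           # base R's empirical CDF
Fhat(5)                   # |{x_i : x_i <= 5}| / n = 3/5
mean(x <= 5)              # the same count done "by hand"
```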
Notice that no parametric family needs to be specified for the distribution.
[1] 6.00168
Is that what we expected?
[1] 6.162984
Do we know what the sampling distribution of the sample mean should be for a \(\Gamma\) distribution?
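Yes: sums of independent Gamma variables with a common rate are again Gamma, so if \(X_i \sim \Gamma(\alpha, \beta)\) then \(\bar{X}_n \sim \Gamma(n\alpha, n\beta)\) exactly. A simulation sketch (the shape, rate, and \(n\) below are illustrative choices, not necessarily the values behind samp):

```r
# Sketch: for X_i iid Gamma(shape = alpha, rate = beta), the sum is
# Gamma(n * alpha, beta), so Xbar ~ Gamma(shape = n * alpha, rate = n * beta).
set.seed(1)
alpha <- 3; beta <- 0.5; n <- 30     # illustrative parameter choices
means <- replicate(5000, mean(rgamma(n, shape = alpha, rate = beta)))
mean(means)               # close to alpha / beta = 6
var(means)                # close to alpha / (n * beta^2) = 0.4
```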
set.seed(238)
B <- 1000 # number of bootstrap samples
n <- length(samp)
bootmeans <- numeric(B) # vector where we'll store B bootstrap means
for (i in 1:B){
bootsamp <- sample(samp, n, replace = TRUE) # sample from the data
bootmeans[i] <- mean(bootsamp) # compute bootstrap stat
}
# output
ggplot(tibble(x = bootmeans), aes(x = x)) +
geom_histogram(aes(y = after_stat(density)), bins = 23, colour = "black", fill = "grey") +
theme_bw() +
labs(title = "Bootstrap distribution of sample means", y = "density") +
geom_vline(aes(xintercept = mu), colour = "blue") + # true mean (theoretical world)
geom_vline(aes(xintercept = muhat), colour = "purple") # sample mean (data world)
⚠️ The bootstrap does not give a better estimate than the original data because the bootstrap distribution is centered around a statistic calculated from the data. Drawing thousands of bootstrap observations from the original data is not like drawing observations from the underlying population (i.e. the theoretical world). It does not create new data.
The bootstrap distribution has approximately the same shape and spread as the sampling distribution, but the center of the bootstrap distribution is the center of the original data (not the center of the theoretical world). So we often use the bootstrap to estimate the sampling distribution of \(\widehat{\theta}-\theta\), the error in our estimate.
Input
The observed data \(x_1, \dots, x_n\) and the estimate \(\widehat{\mu}\)
Repeat the following for each of \(b=1,...,B\):
Draw a bootstrap sample of size \(n\), with replacement, from the data, and compute the centered bootstrap statistic \((\widehat{\mu}-\mu)_b^* = \widehat{\mu}_b^* - \widehat{\mu}\)
Output
\((\widehat{\mu}-\mu)_1^*, ..., (\widehat{\mu}-\mu)_B^*\), which are a sample from the sampling distribution of \((\widehat{\mu}-\mu)\)
We computed \((\widehat{\mu}-\mu)_b^*\) without knowing \(\mu\) (!!)
set.seed(238)
B <- 1000 # number of bootstrap samples
n <- length(samp)
bootcmeans_emp <- 1:B %>%
map(~sample(samp, n, replace = TRUE)) %>% # sample from the data
map(~mean(.x) - muhat) %>% # compute bootstrap stat & center
reduce(c)
ggplot(tibble(x = bootcmeans_emp), aes(x = x)) +
geom_histogram(aes(y = after_stat(density)), bins = 23, colour = "black", fill = "grey") +
theme_bw() +
labs(title = "Centered bootstrap distribution of sample means", y = "density")
In this course, please do not use the boot package or any other similar packages.
Suppose we can make a reasonable assumption about the shape of the distribution \(F\). Even if it’s not totally correct, it allows the bootstrap to be used in situations where it might otherwise fail, particularly for smaller sample sizes.
Input
The observed data \(x_1, \dots, x_n\) and a parametric family \(F_{\theta}\), with \(\widehat{\theta}\) estimated from the data
Repeat the following for each of \(b=1,...,B\):
Draw a bootstrap sample of size \(n\) from the fitted distribution \(F_{\widehat{\theta}}\), and compute the bootstrap statistic \(\widehat{\theta}_b^*\) from the bootstrap sample
Output
\(\widehat{\theta}_1^*, ..., \widehat{\theta}_B^*\), which are a sample from the sampling distribution of \(\widehat\theta\)
set.seed(238)
alphahat <- (mean(samp))^2 / var(samp) # method-of-moments estimate of alpha
bootcmeans_par <- numeric(B) # vector where we'll store B bootstrap means
for (i in 1:B){
bootsamp <- rgamma(n, shape = alphahat, rate = beta) # sample from the fitted Gamma
bootcmeans_par[i] <- mean(bootsamp) - alphahat/beta # centered bootstrap statistic
}
What can we do besides plot the sampling distributions?
Estimate \(\widehat\mu-\mu\):
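A self-contained sketch of one approach (samp here is a made-up Gamma sample standing in for the course data, which is not shown): the quantiles of the centered bootstrap statistics \(\widehat{\mu}^*_b - \widehat{\mu}\) estimate the corresponding quantiles of the error \(\widehat{\mu} - \mu\).

```r
# Sketch: summarizing the centered bootstrap distribution to estimate the
# error muhat - mu. samp is a made-up sample standing in for the course data.
set.seed(238)
samp <- rgamma(100, shape = 3, rate = 0.5)
muhat <- mean(samp)
n <- length(samp)
boot_err <- replicate(1000, mean(sample(samp, n, replace = TRUE)) - muhat)
quantile(boot_err, c(0.025, 0.975))   # plausible range for the error
mean(abs(boot_err))                   # typical magnitude of the error
```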