# Scatterplot of flight distance against launch velocity
# (assumes a data frame paperflights with columns v = velocity and d = distance)
library(ggplot2)
plot_flights <- ggplot(data = paperflights, aes(x = v, y = d)) +
  theme_bw() +
  coord_cartesian(xlim = c(0, 8), ylim = c(0, 12)) +
  geom_point() +
  labs(x = "Velocity (m/s)", y = "Distance (m)")
plot_flights
University of Toronto
July 12, 2023
The sciences do not try to explain, they hardly even try to interpret, they mainly make models. By a model is meant a mathematical construct which, with the addition of certain verbal interpretations, describes observed phenomena. The justification of such a mathematical construct is solely and precisely that it is expected to work - that is, correctly to describe phenomena from a reasonably wide area. Furthermore, it must satisfy certain esthetic criteria - that is, in relation to how much it describes, it must be rather simple.
~ John von Neumann
This is a deterministic model. Nothing is random.
A statistical model is an idealization of the data-generating process, or an abstraction describing how our data came about.
Statistical models always involve stochasticity, that is, an element of randomness.
All models are wrong, but some are useful.
~ George Box
We make repeated measurements of the same quantity and collect the observations in a dataset, denoted \(x_1, x_2, ...,x_n\). The observations are modelled as the realization of the random sample \(X_1, X_2, ...,X_n\).
A random sample is a collection of random variables \(X_1, X_2, ...,X_n\) that have the same probability distribution and are mutually independent. That is, \[X_1, X_2,..., X_n \overset{iid}{\sim} F\]
The statistical model is \(F\), the probability distribution of each \(X_i\).
The unique distribution that the sample is drawn from is called the “true” distribution.
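As a concrete sketch (purely illustrative, with \(F\) taken to be \(N(\mu = 5, \sigma^2 = 4)\)), one realization of a random sample can be simulated in R:
# Hypothetical illustration: one realization x_1, ..., x_n of an iid sample,
# taking F to be the N(5, 2^2) distribution just for this example
set.seed(238)                    # for reproducibility
n <- 50
x <- rnorm(n, mean = 5, sd = 2)  # sd is the standard deviation, sqrt(sigma^2)
head(x)                          # the first few observations x_1, x_2, ...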
A parameter is an unknown element of a statistical model for observable data. Parameters represent unobservable or unknown properties that we want to learn about.
A parametric model gives a partial specification of the probability distribution \(F\): it is a family of distributions indexed by a parameter \(\theta\), \[ \{F_{\theta }:\theta \in \Theta \} \]
The parameter value(s) that specify the true distribution are called the “true” parameter(s).
Consider a statistical model \(X_i \overset{iid}{\sim} N(\mu,\sigma^2)\).
Recall the Law of Large Numbers (LLN):
The sample mean \(\bar{X}_n\) converges in probability to \(\mathbb{E}[X_i]=\mu\), the true mean, provided that \(\text{Var}(X_i) = \sigma^2 < \infty\). This is written as \[ \bar{X}_n \overset{p}{\longrightarrow} \mu \]
The LLN shows that the sample mean gets close to the expectation of a probability distribution.
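A small simulation can make this concrete. A sketch, again taking the distribution to be Normal with true mean \(\mu = 5\):
# LLN illustration: running sample means settle down near the true mean mu = 5
set.seed(238)
x <- rnorm(10000, mean = 5, sd = 2)
running_mean <- cumsum(x) / seq_along(x)
running_mean[c(10, 100, 1000, 10000)]   # gets closer to 5 as n grows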
The sample mean, \(\bar{X}_n\), is an example of a sample statistic.
A sample statistic is an object which depends only on the random sample \(X_1, X_2, ...,X_n\): \[ h(X_1, X_2, ...,X_n) \]
A realization of this sample statistic would be \(h(x_1, x_2, ...,x_n)\).
For a large sample size \(n\), the sample mean of most realizations of the random sample is close to the expectation of the corresponding distribution.
Exercise 1 Let \(X_i \overset{iid}{\sim} F_{\mu ,\sigma^2} , \;i = 1, ..., n\) be a random sample from an unknown distribution \(F_{\mu ,\sigma^2}\) with parameters \(\mu = \mathbb{E}[X_i]\) and \(\sigma^2 = \text{Var}(X_i)\). Suppose we estimate the population variance \(\sigma^2\) using the “known mean” sample variance:
\[ \widetilde{s}_n^2 = \frac{1}{n} \sum_{i=1}^{n} (X_i - \mu)^2 \]
Show that \(\widetilde{s}^2_n \overset{p}{\rightarrow} \sigma^2\). What conditions are required on \(X_i\) for this to be true?
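(Not a solution, but a quick numerical check of the claim: a hedged sketch taking \(F\) to be Normal with known \(\mu = 5\) and \(\sigma^2 = 4\).)
# Numerical check: the "known mean" sample variance settles near sigma^2 = 4
set.seed(238)
mu <- 5
sigma2 <- 4
for (n in c(10, 100, 10000)) {
  x <- rnorm(n, mean = mu, sd = sqrt(sigma2))
  print(c(n = n, s2_tilde = mean((x - mu)^2)))
}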
Say we wanted to predict the horizontal distance a paper airplane flies based on initial horizontal velocity. We could write this as:
\[ \text{distance}=g(\text{velocity}) \]
But there are other possible factors. We add a random component to account for these unknown influences:
\[ \text{distance}=g(\text{velocity})+\text{random fluctuation} \]
Which plot would be useful here? A scatterplot of distance against velocity, as produced by the plotting code above.
The data look reasonably linear. So we can represent the model as:
\[ \text{distance}= \alpha + \beta \times \text{velocity} + \text{random fluctuation} \]
This is a simple linear regression model.
Consider bivariate data \((x_1,y_1), ...,(x_n,y_n)\). In simple linear regression, we assume \(x_1,...,x_n\) are fixed (not random) and that \(y_1,...,y_n\) are realizations of random variables \(Y_1,...,Y_n\) such that
\[ Y_i= \alpha + \beta x_i + U_i \;\; \text{for} \;\; i=1,...,n \]
where \(U_1, ...,U_n\) are independent random variables with \(\mathbb{E}[U_i]=0\) and \(\text{Var}(U_i)=\sigma^2\).
\(y=\alpha + \beta x\) is called the regression line.
\(\alpha\) and \(\beta\) represent the intercept and slope, respectively.
\(\alpha\) and \(\beta\) are the model parameters.
\(y\) is the response variable or the dependent variable.
\(x\) is the explanatory variable or the independent variable.
Simple linear regression means the model has one explanatory variable. If there are multiple explanatory variables, the model is called multiple linear regression.
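In R, a simple linear regression like this can be fit with lm(). A minimal sketch, assuming the paperflights data frame used in the plotting code above (velocity v, distance d); printing the fit yields output of the form shown next:
# Fit the simple linear regression of distance (d) on velocity (v)
fit_flights <- lm(d ~ v, data = paperflights)
fit_flights   # printing the fitted object shows the estimated coefficients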
Call:
lm(formula = d ~ v, data = paperflights)

Coefficients:
(Intercept)            v
    -0.3167       1.4428
Independent & \(\mathbb{E}[U_i]=0\)
⇒ the \(Y_i\)’s are centred on the line and are scattered about it in a random way
\(\text{Var}(U_i)=\sigma^2\)
⇒ same amount of scatter or variability about the line everywhere
What would the scatterplot look like if the conditions on \(\{U_i\}\) were violated?
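One way to build intuition is to simulate data where an assumption fails. A hedged sketch in which \(\text{Var}(U_i)\) grows with \(x_i\), so the constant-variance condition is violated and the scatter fans out as \(x\) increases:
# Simulate from Y_i = alpha + beta * x_i + U_i, but with Var(U_i) growing in x_i
# (alpha and beta roughly match the fitted line above; chosen only for illustration)
set.seed(238)
alpha <- -0.3
beta <- 1.4
x <- runif(100, min = 0, max = 8)
u <- rnorm(100, mean = 0, sd = 0.3 * x)   # non-constant variance violates Var(U_i) = sigma^2
y <- alpha + beta * x + u
plot(x, y)   # the vertical scatter about the line widens as x increases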
\(Y_1,...,Y_n\) are independent, but not identically distributed.
\(U_1, ...,U_n\) are independent \(\implies\) \(Y_1,...,Y_n\) are independent
To verify, consider \(\mathbb{E}[Y_i Y_j]\).
How do we know that \(Y_1,...,Y_n\) are not identically distributed?
\[ \mathbb{E}[Y_i]=\alpha + \beta x_i + \mathbb{E}[U_i]=\alpha + \beta x_i \] They have different means!
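For completeness, the variances are the same for every \(i\), since the \(x_i\) are fixed constants: \[ \text{Var}(Y_i)=\text{Var}(\alpha + \beta x_i + U_i)=\text{Var}(U_i)=\sigma^2 \] So the \(Y_i\) differ in distribution only through their means.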