University of Toronto
July 6, 2023
The data do not speak for themselves.
\[ \bar{x}_n = \frac{x_1+x_2+...+x_n}{n} \]
The arithmetic mean, or simply, the average. Denoted \(\bar{x}_n\) or, simply \(\bar{x}\).
The sample median is the middle value of a sorted dataset. Denoted \(Med_n=Med(x_1,x_2,...,x_n)\).
Find the mean and median of the following sets:
Ex. 1: Find the mean and median of \(\{ 6,4,7,5,3\}\)
Ex. 2: Find the mean and median of \(\{3,5,9,2,4,7\}\)
Ex. 3: Find the mean and median of \(\{7,8,8,1\}\)
The dataset arranged in ascending order is called the order statistics, denoted \(\{x_{(1)},x_{(2)},...,x_{(n)}\}\), such that \[x_{(k-1)}\leq x_{(k)}\leq x_{(k+1)} \;\; \forall \;\; k\in\{2,...,n-1\}\]
\[ \min (x_1, ..., x_n) =x_{(1)} \]
\[ \max (x_1, ..., x_n)=x_{(n)} \]
\[ {Med}_n = \begin{cases} x_{\left(\frac{n+1}{2}\right)} &\text{if } n \text{ is odd} \\ \frac{1}{2} \left( x_{\left(\frac{n}{2}\right)} + x_{\left(\frac{n}{2}+1\right)} \right) &\text{if } n \text{ is even} \end{cases} \]
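A minimal Python sketch of the mean and of the median computed from the order statistics. The function names and the two small datasets are hypothetical, chosen only for illustration.

```python
def sample_mean(data):
    """Arithmetic mean: the sum of the observations divided by n."""
    return sum(data) / len(data)


def sample_median(data):
    """Median computed from the order statistics (the sorted data)."""
    xs = sorted(data)            # x_(1) <= x_(2) <= ... <= x_(n)
    n = len(xs)
    if n == 0:
        raise ValueError("median of an empty dataset is undefined")
    mid = n // 2
    if n % 2 == 1:
        return xs[mid]           # x_((n+1)/2); its 0-based index is n // 2
    return (xs[mid - 1] + xs[mid]) / 2   # average of x_(n/2) and x_(n/2 + 1)


odd_data = [10, 2, 8, 4, 1]      # hypothetical data, n = 5
even_data = [10, 2, 8, 4]        # hypothetical data, n = 4

print(sample_mean(odd_data), sample_median(odd_data))    # 5.0 4
print(sample_mean(even_data), sample_median(even_data))  # 6.0 6.0
```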
The \(p^{th}\) empirical quantile, denoted \(q_n(p)\), is the number such that the proportion of the dataset below \(q_n(p)\) is \(p\).
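Several conventions exist for computing \(q_n(p)\) on a finite dataset. The sketch below assumes one common choice, linear interpolation between order statistics at position \(p(n+1)\); this convention, the function name, and the dataset are assumptions, not something fixed by the definition above.

```python
import math


def empirical_quantile(data, p):
    """Empirical quantile by linear interpolation at position p * (n + 1).

    Assumed convention: with k = floor(p * (n + 1)) and
    alpha = p * (n + 1) - k, return x_(k) + alpha * (x_(k+1) - x_(k)).
    """
    xs = sorted(data)            # order statistics
    n = len(xs)
    pos = p * (n + 1)
    k = math.floor(pos)
    alpha = pos - k
    if k < 1:                    # position falls before x_(1): clamp to the minimum
        return xs[0]
    if k >= n:                   # position falls at or past x_(n): clamp to the maximum
        return xs[-1]
    return xs[k - 1] + alpha * (xs[k] - xs[k - 1])


data = [2, 4, 4, 5, 7, 9, 10, 12, 15]    # hypothetical dataset, n = 9
print(empirical_quantile(data, 0.25))    # 4.0
print(empirical_quantile(data, 0.50))    # 7.0
print(empirical_quantile(data, 0.75))    # 11.0
```

Other software defaults use different positions, so quartiles computed elsewhere may differ slightly from these values.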
For the set of integers \([1,11]\), compute the following:
The range of a dataset is the difference between the maximum and minimum values. \[ Max-Min=x_{(n)}-x_{(1)} \]
The interquartile range is the difference between the upper and lower quartiles. \[ IQR = q_n(0.75) - q_n(0.25) \]
Median absolute deviation (MAD)
\[ MAD(x_1,...,x_n)= Med\left( |x_1-Med_n|,...,|x_n-Med_n| \right) \]
Sample variance
\[ s_n^2 = \frac{1}{n-1}\sum_{i=1}^n (x_i-\bar{x}_n)^2 \] The sample standard deviation is \(s_n\), the square root of the sample variance.
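The four measures of variability can be sketched in Python with the standard library. The dataset is hypothetical, and the quartiles assume the `"exclusive"` method of `statistics.quantiles`, which interpolates at position \(p(n+1)\); other quantile conventions give a different IQR.

```python
import math
import statistics

data = [1, 2, 4, 7, 11]          # hypothetical dataset
n = len(data)

# Range: maximum minus minimum.
data_range = max(data) - min(data)

# Interquartile range (assumed quantile convention: "exclusive").
q1, _, q3 = statistics.quantiles(data, n=4, method="exclusive")
iqr = q3 - q1

# MAD: median of the absolute deviations from the median.
med = statistics.median(data)
mad = statistics.median([abs(x - med) for x in data])

# Sample variance with the 1 / (n - 1) factor, and its square root.
mean = sum(data) / n
variance = sum((x - mean) ** 2 for x in data) / (n - 1)
std_dev = math.sqrt(variance)

print(data_range, iqr, mad, variance)   # 10 7.5 3 16.5
```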
Compute the range, interquartile range, MAD, and standard deviation for \(100\) observations from the following distributions:
\(W \sim U(0,1)\)
\(X \sim N(0,1)\)
\(Y \sim Exp(1)\)
One method of defining outliers is to use the IQR as the measure of variability. An observation \(x_i\) is flagged as an outlier if
\[ x_i\leq q_n(0.25) - 1.5\times IQR \]
or
\[ x_i \geq q_n(0.75) + 1.5\times IQR \]
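A short sketch of the IQR rule in Python. The function name `iqr_outliers`, the dataset, and the quartile convention (the `"exclusive"` method of `statistics.quantiles`) are assumptions made for illustration.

```python
import statistics


def iqr_outliers(data, k=1.5):
    """Return the observations at or beyond the fences q1 - k*IQR and q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(data, n=4, method="exclusive")
    iqr = q3 - q1
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    return [x for x in data if x <= lower_fence or x >= upper_fence]


data = [2, 4, 4, 5, 7, 9, 10, 12, 40]   # hypothetical dataset
print(iqr_outliers(data))               # [40]
```

Here the quartiles are 4 and 11, so the fences sit at -6.5 and 21.5 and only the value 40 is flagged.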
What do boxplots not show?
Put data into \(m\) bins, \(B_1, ..., B_m\), \[ B_i=\left[x_0+(i-1)b, \;x_0+ib\right) \] where \(b\) is the bin width.
Bin width can impact what we can learn from a histogram.
With appropriate scaling of the height of the bins, histograms can be a crude estimate of an unknown probability density \(f\) under which the data was generated.
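One way to make the scaling concrete: if a bin of width \(b\) contains \(c\) of the \(n\) observations, giving it height \(c/(nb)\) makes the total area of the bars equal to 1. The sketch below assumes every observation falls inside one of the \(m\) bins; the function name and data are hypothetical.

```python
import math


def histogram_heights(data, x0, b, m):
    """Counts and density-scaled heights for bins [x0 + (i-1)b, x0 + ib), i = 1..m."""
    counts = [0] * m
    for x in data:
        i = math.floor((x - x0) / b)    # 0-based index of the bin containing x
        if 0 <= i < m:                  # observations outside the bins are ignored
            counts[i] += 1
    n = len(data)
    heights = [c / (n * b) for c in counts]
    return counts, heights


data = [0.1, 0.4, 0.5, 0.9, 1.2, 1.9]   # hypothetical dataset
counts, heights = histogram_heights(data, x0=0.0, b=0.5, m=4)
print(counts)                 # [2, 2, 1, 1]
print(sum(heights) * 0.5)     # total area of the bars, approximately 1
```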
Similar to a (scaled relative frequency) histogram, a kernel density estimate, \(f_{n,h}(t)\), is an estimate of an unknown probability density \(f\) under which the data was generated.
Unlike a histogram, a KDE is smooth. Instead of weighting each datapoint by \(1/n\), a density function called a kernel is used to weight each datapoint.
Intuition: A histogram is like stacking bricks. A KDE is like piling up sand. (Helpful demo 🔗)
Reflects the shape of the sand pile. Choose a shape.
\(K(t)\) is a probability density, that is, \(K(t)\geq 0\) and \(\int_{-\infty}^{\infty} K(t) dt =1\).
\(K(t)\) is symmetric about zero, that is, \(K(t)=K(-t)\).
[often] \(K(t)=0\) for \(|t|>1\).
Reflects the width of the pile of sand.
Scale the kernel by \(h\). \[ t \mapsto \frac{1}{h}K\left( \frac{t}{h}\right) \]
Bandwidth controls the smoothness.
To compute \(f_{n,h}(t)\), shift the scaled kernel to each data point and take the average:
\[ f_{n,h}(t)=\frac{1}{n}\left\{ \frac{1}{h} K\left( \frac{t-x_1}{h}\right) + ... + \frac{1}{h} K\left( \frac{t-x_n}{h}\right) \right\} \]
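The formula can be sketched directly in Python. The Epanechnikov kernel is used here only as one example of a kernel that is a density, is symmetric about zero, and vanishes for \(|t|>1\); the kernel choice, the function names, and the dataset are assumptions for illustration.

```python
def epanechnikov(t):
    """Epanechnikov kernel: K(t) = 3/4 * (1 - t^2) for |t| <= 1, and 0 otherwise."""
    return 0.75 * (1 - t * t) if abs(t) <= 1 else 0.0


def kde(t, data, h, kernel=epanechnikov):
    """Kernel density estimate f_{n,h}(t): average of the scaled, shifted kernels."""
    n = len(data)
    return sum(kernel((t - x) / h) / h for x in data) / n


data = [1.0, 2.0, 4.0]            # hypothetical dataset
print(kde(2.0, data, h=1.0))      # 0.25

# Midpoint Riemann sum over [-1, 6]: the estimate should integrate to about 1.
step = 0.001
area = sum(kde(-1 + (j + 0.5) * step, data, h=1.0) for j in range(7000)) * step
print(area)
```

With \(h=1\), only the data point at 2 contributes at \(t=2\), so the estimate there is \(K(0)/3 = 0.25\). A larger \(h\) lets more data points contribute to each \(t\) and gives a smoother curve.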
The empirical (cumulative) distribution function, or eCDF, is the proportion of observations less than or equal to \(x\): \[ F_n(x)=\frac{\big\lvert \{i : x_i\leq x \} \big\rvert}{n} \]
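A minimal sketch of the eCDF in Python, counting observations (so repeated values are counted each time they occur). The function name and the dataset are hypothetical.

```python
def ecdf(data, x):
    """Proportion of observations less than or equal to x."""
    return sum(1 for xi in data if xi <= x) / len(data)


data = [2, 2, 5, 8]     # hypothetical dataset with a repeated value
print([ecdf(data, x) for x in (1, 2, 4.9, 5, 8)])   # [0.0, 0.5, 0.5, 0.75, 1.0]
```

The function is a step function: it jumps at each observed value by the fraction of observations equal to that value, and is flat in between.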
Consider the dataset: \(\{4,3,9,1,7\}\)
What is the relationship between the eCDF and kernel density estimator?
Suppose you have a histogram and an empirical distribution function \(F_n\) for the same dataset. Derive an expression for the height of the histogram on the bin \((a,b]\) in terms of \(F_n\), \(a\), and \(b\).
So far, we’ve only considered univariate data, that is, data with only one variable.
If we have pairs of datapoints, \(\left\{(x_1,y_1),...,(x_n,y_n)\right\}\), the data is called bivariate.
Scatterplots are used to investigate the relationship between variables.