Probability and Statistics for Computer Science

Author: David Forsyth

ISBN: 978-3-319-64410-3

Notation

  • $\{x\}$ - Dataset
  • $x_i$ - $i$'th Data Item
  • $x_i^{(j)}$ - $j$'th Component of $i$'th Data Item
  • $\text{mean}(\{x\})$ - Mean
  • $\text{std}(\{x\})$ - Standard Deviation
  • $\text{var}(\{x\})$ - Variance
  • $\text{median}(\{x\})$ - Median
  • $\text{percentile}(\{x\}, k)$ - $k\%$ Percentile
  • $\text{iqr}(\{x\})$ - Interquartile Range
  • $\{\hat{x}\}$ - Dataset Transformed to Standard Coordinates
  • $\text{corr}(\{(x, y)\})$ - Correlation
  • $\emptyset$ - Empty Set
  • $\Omega$ - Set of All Possible Experiment Outcomes
  • $\mathcal{A}$ - Set
  • $\mathcal{A}^c = \Omega - \mathcal{A}$ - Set Complement
  • $\mathcal{E}$ - Event
  • $P(\mathcal{E})$ - Probability of Event $\mathcal{E}$
  • $P(\mathcal{E} \vert \mathcal{F})$ - Probability of Event $\mathcal{E}$, Conditioned on Event $\mathcal{F}$
  • $p(x)$ - Probability That Random Variable $X$ Equals Value $x$
  • $p(x, y)$ - Probability That Random Variable $X$ Equals Value $x$ And Random Variable $Y$ Equals Value $y$
  • $\text{argmax}_{x} f(x)$ - Value of $x$ That Maximizes $f(x)$
  • $\text{argmin}_{x} f(x)$ - Value of $x$ That Minimizes $f(x)$
  • $\hat{\theta}$ - Estimated Value of $\theta$

1. First Tools for Looking at Data

Datasets

  • Dataset: A collection of descriptions (or $d$-tuples) of different instances of the same phenomenon.
    • Categorical
    • Continuous

Summarizing 1D Data

  • A location parameter tells where the data lies along a number line.
  • A scale parameter tells how wide the spread of data is.

Mean

  • Assume we have a dataset $\{x\}$ of $N$ data items, $x_1, ..., x_N$. The mean of this dataset is: $$\text{mean}(\{x\}) = \frac{1}{N} \sum_{i = 1}^{N} x_i$$

Properties of Mean

  • Scaling: $\text{mean}(\{k \cdot x_i\}) = k \cdot \text{mean}(\{x_i\})$
    • Yes Effect
  • Translation: $\text{mean}(\{x_i + c\}) = \text{mean}(\{x_i\}) + c$
    • Yes Effect
  • Sum of Signed Differences: $\sum_{i = 1}^{N} \left( x_i - \text{mean}(\{x_i\}) \right) = 0$
  • Sum of Squared Distances to $\mu$: $\text{argmin}_{\mu} \sum_{i} \left( x_i - \mu \right)^2 = \text{mean}(\{x_i\})$
    • i.e., the mean is the value of $\mu$ that minimizes the sum of squared distances to the data.

Interpretation of Mean

  • The mean is a location parameter that summarizes the dataset with a value that is as close as possible to each datum.

Standard Deviation

  • Assume we have a dataset $\{x\}$ of $N$ data items, $x_1, ..., x_N$. The standard deviation of this dataset is: $$ \begin{align} \text{std}(\{x\}) & = \sqrt{\frac{1}{N} \sum_{i = 1}^{N} \left(x_i - \text{mean}(\{x\}) \right)^2} \\ & = \sqrt{\text{mean}\left( \{ \left( x_i - \text{mean}(\{x\}) \right)^2 \} \right)} \end{align} $$

Properties of Standard Deviation

  • Scaling: $\text{std}(\{k \cdot x_i\}) = k \cdot \text{std}(\{x_i\})$
    • Yes Effect
  • Translation: $\text{std}(\{x_i + c\}) = \text{std}(\{x_i\})$
    • No Effect
  • For any dataset, there can be only a few items that are many standard deviations away from the mean. For $N$ data items, $x_i$, whose standard deviation is $\sigma$, there are at most $\frac{N}{k^2}$ data items lying $k$ or more standard deviations away from the mean.
  • For any dataset, there must be at least one data item that is at least one standard deviation away from the mean.

Interpretation of Standard Deviation

  • The standard deviation is the root mean square of the offsets from the mean.
  • The standard deviation is a scale parameter that measures the size of the average deviation from the mean for a dataset.
  • When the standard deviation is large, there are many items with values much larger than, or much smaller than, the mean.
  • When the standard deviation is small, most data items have values close to the mean.

Variance

  • Assume we have a dataset $\{x\}$ of $N$ data items, $x_1, ..., x_N$. The variance of this dataset is: $$ \begin{align} \text{var}(\{x\}) & = \frac{1}{N} \sum_{i = 1}^{N} \left( x_i - \text{mean}(\{x\}) \right)^2 \\ & = \text{mean}\left( \{ \left( x_i - \text{mean}(\{x\}) \right)^2 \} \right) \end{align} $$

Properties of Variance

  • Scaling: $\text{var}(\{k \cdot x_i\}) = k^2 \cdot \text{var}(\{x_i\})$
    • Yes Effect
  • Translation: $\text{var}(\{x_i + c\}) = \text{var}(\{x_i\})$
    • No Effect

Interpretation of Variance

  • Variance is the mean-square error you would incur if you replaced each data item with the mean.

Median

  • Assume we have a dataset $\{x\}$ of $N$ data items, $x_1, ..., x_N$.
    • If $N$ is odd, the median of this dataset is the middle element of the sorted data: $$\text{median}(\{x\}) = \text{sort}(\{x\})\left[ \frac{N + 1}{2} \right]$$
    • If $N$ is even, the median of this dataset is the average of the two middle elements of the sorted data: $$\text{median}(\{x\}) = \frac{1}{2} \left( \text{sort}(\{x\})\left[ \frac{N}{2} \right] + \text{sort}(\{x\})\left[ \frac{N}{2} + 1 \right] \right)$$

Properties of Median

  • Scaling: $\text{median}(\{k \cdot x_i\}) = k \cdot \text{median}(\{x_i\})$
    • Yes Effect
  • Translation: $\text{median}(\{x_i + c\}) = \text{median}(\{x_i\}) + c$
    • Yes Effect

Interpretation of Median

  • Generally, approximately half the data is smaller than the median, and approximately half the data is larger than the median.
  • The median is an alternative to the mean because it is also a location parameter.

Interquartile Range

  • Percentile: The $k$'th percentile is the value such that $k\%$ of the data is less than or equal to that value.
    • $\text{percentile}(\{x\}, k)$
  • Quartile:
    • The first quartile is the value such that $25\%$ of the data is less than or equal to that value: $\text{percentile}(\{x\}, 25)$.
    • The second quartile is the value such that $50\%$ of the data is less than or equal to that value: $\text{percentile}(\{x\}, 50)$.
    • The third quartile is the value such that $75\%$ of the data is less than or equal to that value: $\text{percentile}(\{x\}, 75)$.
  • Interquartile Range: The interquartile range of a dataset $\{x\}$ is: $$\text{iqr}(\{x\}) = \text{percentile}(\{x\}, 75) - \text{percentile}(\{x\}, 25)$$

Properties of Interquartile Range

  • Scaling: $\text{iqr}(\{k \cdot x_i\}) = k \cdot \text{iqr}(\{x_i\})$
    • Yes Effect
  • Translation: $\text{iqr}(\{x_i + c\}) = \text{iqr}(\{x_i\})$
    • No Effect

Interpretation of Interquartile Range

  • The interquartile range is an alternative to the standard deviation because it is also a scale parameter.
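
A minimal NumPy sketch of these 1D summaries (mean, standard deviation, variance, median, percentiles, interquartile range); the `heights` array is an invented example dataset, and the assertions check the scaling and translation properties listed above.

```python
import numpy as np

# A small, made-up dataset for illustration.
heights = np.array([64.0, 66.0, 68.0, 69.0, 71.0, 72.0, 75.0])

mean = np.mean(heights)                    # location parameter
std = np.std(heights)                      # scale parameter (divides by N)
var = np.var(heights)                      # std squared
median = np.median(heights)                # robust location parameter
q1, q3 = np.percentile(heights, [25, 75])  # first and third quartiles
iqr = q3 - q1                              # robust scale parameter

print(mean, std, var, median, iqr)

# Scaling and translation properties: the mean shifts and scales,
# while the standard deviation ignores translation.
assert np.isclose(np.mean(2 * heights + 3), 2 * mean + 3)
assert np.isclose(np.std(heights + 3), std)
```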

Online Algorithms for Mean and Standard Deviation

Mean

  • Let $\hat{\mu}_{k}$ be an estimate for the mean of the dataset after seeing $k$ elements.
$$ \begin{align} \hat{\mu}_{1} &= x_1 \\ \hat{\mu}_{k + 1} &= \frac{k \cdot \hat{\mu}_{k} + x_{k + 1}}{k + 1} \end{align} $$

Standard Deviation

  • Let $\hat{\sigma}_{k}$ be an estimate for the standard deviation of the dataset after seeing $k$ elements.
$$ \begin{align} \hat{\sigma}_{1} &= 0 \\ \hat{\sigma}_{k + 1} &= \sqrt{\frac{(k \cdot \hat{\sigma}_{k}^2) + (x_{k + 1} - \hat{\mu}_{k + 1})^2}{k + 1}} \end{align} $$
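
A minimal sketch of the update rules above, applied to a stream one item at a time; the `data` list is an arbitrary example. The running mean matches the batch mean exactly, and the batch values from NumPy are printed for comparison.

```python
import numpy as np

def online_mean_std(data):
    """Update running estimates of the mean and standard deviation,
    one data item at a time, using the recursions above."""
    mu, sigma = data[0], 0.0          # estimates after seeing the first item
    for k, x in enumerate(data[1:], start=1):
        mu_next = (k * mu + x) / (k + 1)
        sigma = np.sqrt((k * sigma ** 2 + (x - mu_next) ** 2) / (k + 1))
        mu = mu_next
    return mu, sigma

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
print(online_mean_std(data))
print(np.mean(data), np.std(data))    # batch values for comparison
```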

Mean vs. Median and Standard Deviation vs. Interquartile Range

Mean and Standard Deviation

  • The mean and the standard deviation are strongly affected by outliers.
  • The mean and the standard deviation are inexpensive to calculate exactly.
  • Generally, the mean and the standard deviation are sensible for continuous data.

Median and Interquartile Range

  • The median and the interquartile range are weakly affected by outliers.
  • The median and the interquartile range are expensive to calculate exactly.
  • Generally, the median and the interquartile range are sensible for categorical data.

Histograms

  • Bar Chart: A set of bars, one per category, where the height of each bar is proportional to the number of items in that category.
  • Histogram: A generalization of a bar chart for continuous-valued data.
    1. Divide the range of data into even or uneven intervals.
    2. Associate each interval with a pigeonhole.
    3. Associate each datum with a pigeonhole.
    4. Visualize the histogram as a set of boxes, one per interval, in which each box sits on its interval on the horizontal axis, and its height is determined by the amount of data in the corresponding pigeonhole.
  • Conditional Histogram: A histogram that only plots part of a data set.

Modes and Histograms

  • A histogram is unimodal if there is only one peak.
  • A histogram is bimodal if there are two peaks.
  • A histogram is multimodal if there are many peaks.


Skew and Histograms

  • The tails of a histogram are the relatively uncommon values that are significantly larger or smaller than the value at the peak.
  • If the histogram is not symmetric, then the histogram is skewed.


Standard Coordinates and Normal Data

  • Assume we have a dataset $\{x\}$ of $N$ data items, $x_1, ..., x_N$. The standard coordinates of this dataset are: $$\hat{x}_i = \frac{x_i - \text{mean}(\{x\})}{\text{std}(\{x\})}$$
  • Data is standard normal data if, when we have a lot of data, the histogram of the data in standard coordinates is a close approximation to the standard normal curve: $$y(x) = \frac{1}{\sqrt{2\pi}}e^{-x^2 / 2}$$
  • Data is normal data if, when we subtract the mean and divide by the standard deviation, it becomes standard normal data.

Interpretation of Standard Coordinates

  • A dataset expressed in standard coordinates is unitless with a mean of $0$ and a standard deviation of $1$.
    • This allows different datasets to be compared if they are expressed in standard coordinates.
  • Many datasets expressed in standard coordinates are symmetric and unimodal, so they tend to be normal data.

Standard Normal Curve


Properties of Normal Data

  • Approximately 68% of data lie within one standard deviation of the mean.
  • Approximately 95% of data lie within two standard deviations of the mean.
  • Approximately 99% of data lie within three standard deviations of the mean.

Box Plots

  • A box plot is a way to plot data that simplifies comparison.
    • Dataset = Vertical Display.
    • Vertical Box = Interquartile Range.
    • Horizontal Line = Median.
    • Whiskers = Range of Non-Outlier Data.
    • Crosses = Outliers.


2. Looking at Relationships

Plotting 2D Data

  • Pie Chart
  • Heat Map
  • Stacked Bar Chart
  • 3D Bar Chart
  • Series

Scatter Plots

  • Scatter Plots: The most effective tool for geographic data and for 2D data in general.
    • A scatter plot should be your first step with a new 2D dataset.
    • The plot scale can mask effects in scatter plots, and it's usually a good idea to plot in standard coordinates.
    • Any $d$-dimensional vector can be projected into 2D.

Correlation

  • Correlation: A relationship between the $\hat{x}$ and $\hat{y}$ values, visible in a scatter plot of the data in standard coordinates.
    • Positive: When larger $\hat{x}$ values tend to appear with larger $\hat{y}$ values, and vice versa.
    • Negative: When larger $\hat{x}$ values tend to appear with smaller $\hat{y}$ values, and vice versa.
    • Zero: When there is no relationship between $\hat{x}$ values and $\hat{y}$ values.
  • In standard coordinates (whatever the correlation):
    • Mean: $\text{mean}(\{\hat{x}\}) = 0$ and $\text{mean}(\{\hat{y}\}) = 0$
    • Variance: $\text{var}(\{\hat{x}\}) = 1$ and $\text{var}(\{\hat{y}\}) = 1$


Correlation Coefficient

  • Assume we have $N$ data items which are $2$-vectors $(x_1, y_1), ..., (x_N, y_N)$. The correlation coefficient is the mean value of $\hat{x} \hat{y}$: $$\text{corr}(\{(x, y)\}) = \frac{\sum_{i} \hat{x}_i \hat{y}_i}{N}$$
    • Where $\hat{x}_i$ and $\hat{y}_i$ are in standard coordinates.

Properties of Correlation Coefficient

  • Symmetric: $\text{corr}(\{(x, y)\}) = \text{corr}(\{(y, x)\})$
  • Scaling and Translation: $\text{corr}(\{(ax + b, cy + d)\}) = \text{sign}(ac)\,\text{corr}(\{(x, y)\})$
    • Scaling: Can Flip the Sign (when $ac < 0$).
    • Translation: No Effect.
  • Maximum Value: $1$ when $\hat{x} = \hat{y}$.
  • Minimum Value: $-1$ when $\hat{x} = -\hat{y}$.
  • If $\hat{y}$ tends to be large for large values of $\hat{x}$, then the correlation coefficient will be positive, and vice versa.
  • If $\hat{y}$ tends to be small for large values of $\hat{x}$, then the correlation coefficient will be negative, and vice versa.
  • If $\hat{y}$ does not depend on $\hat{x}$, then the correlation coefficient is close to zero.

Interpretation of Correlation Coefficient

  • The correlation coefficient is a measure of our ability to predict $y$ given $x$, and vice versa.
  • The correlation coefficient ranges from $-1$ to $1$.
  • Magnitude Near $1$: Strong Predictions.
  • Magnitude Below About $0.5$: Weak Predictions.

Predicting a Value Using Correlation

  • Assume we have $N$ data items which are $2$-vectors $(x_1, y_1), ..., (x_N, y_N)$.

Predicting $y_0$ Given $x_0$

  1. Transform the dataset into standard coordinates: $\hat{x}_i$, $\hat{y}_i$, and $\hat{x}_0$.
  2. Compute the correlation coefficient: $r = \text{corr}(\{(x, y)\}) = \text{mean}(\{\hat{x} \hat{y}\})$
  3. Predict $\hat{y}_0 = r \cdot \hat{x}_0$.
  4. Transform this prediction into the original coordinate system: $y_0 = \text{std}(\{y\}) \cdot r \cdot \hat{x}_0 + \text{mean}(\{y\})$

Predicting $x_0$ Given $y_0$

  1. Transform the dataset into standard coordinates: $\hat{x}_i$, $\hat{y}_i$, and $\hat{y}_0$.
  2. Compute the correlation coefficient: $r = \text{corr}(\{(x, y)\}) = \text{mean}(\{\hat{x} \hat{y}\})$
  3. Predict $\hat{x}_0 = r \cdot \hat{y}_0$.
  4. Transform this prediction into the original coordinate system: $x_0 = \text{std}(\{x\}) \cdot r \cdot \hat{y}_0 + \text{mean}(\{x\})$

Notes on Predictions with Correlation

  • Root Mean Square Prediction Error (in standard coordinates): $\sqrt{1 - r^2}$.
  • If $x_0$ is $k$ standard deviations from the mean of $x$, then the predicted value of $y$ will be $rk$ standard deviations away from the mean of $y$, and the sign of $r$ tells whether $y$ increases or decreases.
  • The predicted value of $y$ increases by $r$ standard deviations when the value of $x$ increases by one standard deviation.
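
A sketch of correlation-based prediction following the procedures above; the paired data `x`, `y` and the query point `x0 = 1.5` are synthetic, chosen only to illustrate the steps.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 0.7 * x + rng.normal(scale=0.5, size=200)   # synthetic, positively correlated data

def predict_y_from_x(x, y, x0):
    """Predict y0 for a new x0 using the correlation coefficient."""
    xhat = (x - x.mean()) / x.std()              # standard coordinates
    yhat = (y - y.mean()) / y.std()
    r = np.mean(xhat * yhat)                     # correlation coefficient
    x0hat = (x0 - x.mean()) / x.std()
    y0hat = r * x0hat                            # prediction in standard coordinates
    return y.std() * y0hat + y.mean()            # back to original coordinates

r = np.corrcoef(x, y)[0, 1]
print("correlation:", r)
print("prediction at x0 = 1.5:", predict_y_from_x(x, y, 1.5))
print("RMS prediction error (standard coords):", np.sqrt(1 - r ** 2))
```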

3. Basic Ideas in Probability

Counting

Product Rule

  • Suppose that a procedure can be broken down into a sequence of two tasks.
  • If there are $n_1$ ways to do the first task and for each of these ways of doing the first task, there are $n_2$ ways to do the second task, then there are $n_1 \cdot n_2$ ways to do the procedure.

Sum Rule

  • If a task can be done in either one of $n_1$ ways or in one of $n_2$ ways such that the set of $n_1$ ways and the set of $n_2$ ways are disjoint, then there are $n_1 + n_2$ ways to do the task.

Permutations (Ordering)

  • Elements without Repetition: $\text{permutation}(n, r) = \frac{n!}{(n - r)!}$
  • Elements with Repetition: $\text{permutation}(n, r) = n^r$

Combinations (No Ordering)

  • $\text{combinations}(n, r) = \frac{n!}{r! (n - r)!}$
  • $\text{combinations}(n, r) = \text{combinations}(n, n - r)$
  • $\text{combinations}(n, r) = \text{combinations}(n - 1, r - 1) + \text{combinations}(n - 1, r)$

Binomial Theorem

  • $(x + y)^n = \sum_{i = 0}^{n} \text{combinations}(n, i) \cdot x^{n - i} \cdot y^i$
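
These counting rules map directly onto Python's standard library; a small sketch using `math.perm` and `math.comb`, with the binomial theorem checked numerically for arbitrary example values $x = 2$, $y = 3$.

```python
from math import comb, perm

n, r = 7, 3
print(perm(n, r))        # permutations without repetition: n! / (n - r)!
print(n ** r)            # arrangements with repetition
print(comb(n, r))        # combinations: n! / (r! (n - r)!)

# Symmetry and Pascal's rule.
assert comb(n, r) == comb(n, n - r)
assert comb(n, r) == comb(n - 1, r - 1) + comb(n - 1, r)

# Binomial theorem, checked numerically for example values x = 2, y = 3.
x, y = 2, 3
assert (x + y) ** n == sum(comb(n, i) * x ** (n - i) * y ** i for i in range(n + 1))
```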

Outcomes

  • Outcome: The result from a run of an experiment.
  • Sample Space: The set of all outcomes, denoted by $\Omega$.
  • A sample space is required for any probability model, and it need not be finite.
  • The probability of an outcome is the frequency of that outcome in a very large number of repeated experiments. The sum of probabilities over all outcomes must be one.

Events and Probability

  • Event: A set of outcomes, denoted by $\mathcal{E}$. When all outcomes are equally likely, $$P(\mathcal{E}) = \frac{\lvert \mathcal{E} \rvert}{\lvert \Omega \rvert}$$

Properties of Events

  • Union: $\mathcal{E} \cup \mathcal{F}$
  • Intersection: $\mathcal{E} \cap \mathcal{F}$
  • Complement: $\mathcal{E}^c$
  • Subset: $\mathcal{E} \subset \mathcal{F}$
  • Commutative: $\mathcal{E} \cup \mathcal{F} = \mathcal{F} \cup \mathcal{E}$ and $\mathcal{E} \cap \mathcal{F} = \mathcal{F} \cap \mathcal{E}$
  • Associative: $(\mathcal{E} \cup \mathcal{F}) \cup \mathcal{G} = \mathcal{E} \cup (\mathcal{F} \cup \mathcal{G})$ and $(\mathcal{E} \cap \mathcal{F}) \cap \mathcal{G} = \mathcal{E} \cap (\mathcal{F} \cap \mathcal{G})$
  • Distributive: $(\mathcal{E} \cup \mathcal{F}) \cap \mathcal{G} = (\mathcal{E} \cap \mathcal{G}) \cup (\mathcal{F} \cap \mathcal{G})$ and $(\mathcal{E} \cap \mathcal{F}) \cup \mathcal{G} = (\mathcal{E} \cup \mathcal{G}) \cap (\mathcal{F} \cup \mathcal{G})$
  • De Morgan's Laws: $(\cup_{i = 1}^{n} \mathcal{E}_i)^c = \cap_{i = 1}^{n} \mathcal{E}_i^c$ and $(\cap_{i = 1}^{n} \mathcal{E}_i)^c = \cup_{i = 1}^{n} \mathcal{E}_i^c$

Properties of the Probability of Events

  • The probability of every event is between zero and one: $0 \le P(\mathcal{A}) \le 1$
  • Every experiment has an outcome: $P(\Omega) = 1$
  • The probability of disjoint events is additive: $P(\cup_{i} \mathcal{A}_i) = \sum_{i} P(\mathcal{A}_i)$
  • Complement: $P(\mathcal{A}^c) = 1 - P(\mathcal{A})$
  • Empty Set: $P(\emptyset) = 0$
  • Set Difference: $P(\mathcal{A} - \mathcal{B}) = P(\mathcal{A}) - P(\mathcal{A} \cap \mathcal{B})$
  • Set Union: $P(\mathcal{A} \cup \mathcal{B}) = P(\mathcal{A}) + P(\mathcal{B}) - P(\mathcal{A} \cap \mathcal{B})$
  • Generic Union: $P(\cup_{i} \mathcal{A}_i) = \sum_{i} P(\mathcal{A}_i) - \sum_{i < j} P(\mathcal{A}_i \cap \mathcal{A}_j) + \sum_{i < j < k} P(\mathcal{A}_i \cap \mathcal{A}_j \cap \mathcal{A}_k) - \dots + (-1)^{n + 1} P(\mathcal{A}_1 \cap \mathcal{A}_2 \cap ... \cap \mathcal{A}_n)$

Conditional Probability

  • Assume we have a space of outcomes and a collection of events. The conditional probability of $\mathcal{B}$, conditioned on $\mathcal{A}$, is the probability that $\mathcal{B}$ occurs given that $\mathcal{A}$ has definitely occurred. $$ \begin{align} P(\mathcal{B} \vert \mathcal{A}) &= \frac{P(\mathcal{B} \cap \mathcal{A})}{P(\mathcal{A})} \\ &= \frac{P(\mathcal{A} \vert \mathcal{B}) \cdot P(\mathcal{B})}{P(\mathcal{A})} \end{align} $$

Conditional Probability Formulas

  • $P(\mathcal{A}) = P(\mathcal{A} \cap \mathcal{B}) + P(\mathcal{A} \cap \mathcal{B}^c)$
  • $P(\mathcal{A}) = P(\mathcal{A} \vert \mathcal{B}) \cdot P(\mathcal{B}) + P(\mathcal{A} \vert \mathcal{B}^c) \cdot P(\mathcal{B}^c)$
  • If $\mathcal{B}_1, \mathcal{B}_2, ..., \mathcal{B}_n$ are mutually exclusive events and $\mathcal{A} = \cup_{i = 1}^{n} (\mathcal{A} \cap \mathcal{B}_i)$,
    • $P(\mathcal{A}) = \sum_{i = 1}^{n} P(\mathcal{A} \vert \mathcal{B}_i) \cdot P(\mathcal{B}_i)$
    • $P(\mathcal{B}_i \vert \mathcal{A}) = \frac{P(\mathcal{A} \vert \mathcal{B}_i) \cdot P(\mathcal{B}_i)}{\sum_{i = 1}^{n} P(\mathcal{A} \vert \mathcal{B}_i) \cdot P(\mathcal{B}_i)}$
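
A small numerical sketch of the total-probability and Bayes-rule formulas above, for two mutually exclusive events $\mathcal{B}_1, \mathcal{B}_2$ that cover $\mathcal{A}$; all of the probabilities are invented for illustration.

```python
# Made-up example: B_1, B_2 partition the outcomes, A is an observed event.
p_B = [0.3, 0.7]            # P(B_1), P(B_2) -- priors (illustrative numbers)
p_A_given_B = [0.9, 0.2]    # P(A | B_1), P(A | B_2) -- likelihoods

# Total probability: P(A) = sum_i P(A | B_i) P(B_i)
p_A = sum(pa * pb for pa, pb in zip(p_A_given_B, p_B))

# Bayes' rule: P(B_i | A) = P(A | B_i) P(B_i) / P(A)
posterior = [pa * pb / p_A for pa, pb in zip(p_A_given_B, p_B)]

print("P(A) =", p_A)                 # 0.9*0.3 + 0.2*0.7 = 0.41
print("P(B_i | A) =", posterior)     # sums to 1
```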

Independence

  • Two events $\mathcal{A}$ and $\mathcal{B}$ are independent if and only if $$ \begin{align} P(\mathcal{A} \cap \mathcal{B}) &= P(\mathcal{A})P(\mathcal{B}) \\ P(\mathcal{A} \vert \mathcal{B}) &= P(\mathcal{A}) \\ P(\mathcal{B} \vert \mathcal{A}) &= P(\mathcal{B}) \end{align} $$
  • Generally, the probability of a sequence of independent events can become very small, very quickly.
  • Therefore, modelling events that are not independent as independent can be a mistake.

Pairwise Independence

  • Events $\mathcal{A}_1, ..., \mathcal{A}_n$ are pairwise independent if each pair is independent.

Conditional Independence

  • Events $\mathcal{A}_1, ..., \mathcal{A}_n$ are conditionally independent conditioned on event $\mathcal{B}$ if $$P(\mathcal{A}_1 \cap ... \cap \mathcal{A}_n \vert \mathcal{B}) = P(\mathcal{A}_1 \vert \mathcal{B}) ... P(\mathcal{A}_n \vert \mathcal{B})$$

Fallacies

Gambler's Fallacy

  • When you reason that the probability of an independent event has been changed by previous outcomes.
    • e.g., If a fair coin is tossed $25$ times, and the $25$ outcomes are heads, the probability that the next toss will result in a head has not changed at all.

Prosecutor's Fallacy

  1. A prosecutor has evidence $\mathcal{E}$ against a suspect.
  2. Let $\mathcal{I}$ be the event that the suspect is innocent.
  3. When $P(\mathcal{E} \vert \mathcal{I})$ is small, the prosecutor argues, incorrectly, that the suspect must be guilty, because $P(\mathcal{E} \vert \mathcal{I})$ is so small.
  • The argument is incorrect because $P(\mathcal{E} \vert \mathcal{I})$ is irrelevant to the issue; instead, $P(\mathcal{I} \vert \mathcal{E})$ is relevant.
  • Note: $P(\mathcal{I} \vert \mathcal{E})$ can be large even if $P(\mathcal{E} \vert \mathcal{I})$ is small.

4. Random Variables and Expectations

Random Variables

  • Given a sample space $\Omega$, a set of events $\mathcal{F}$, a probability function $P$, and a countable set of real numbers $D$, a discrete random variable is a function with domain $\Omega$ and range $D$.
  • A function that maps a discrete random variable to a set of numbers is also a discrete random variable.

Probability Distribution of a Discrete Random Variable

  • The probability distribution (or probability mass function) of a discrete random variable is the set of numbers $P(\{X = x\})$ for each value $x$ that $X$ can take.

Cumulative Distribution of a Discrete Random Variable

  • The cumulative distribution (or cumulative distribution function) of a discrete random variable is the set of numbers $P(\{X \le x\})$ for each value $x$ that $X$ can take.

Joint and Conditional Probability for Random Variables

  • Assume we have two random variables $X$ and $Y$. The joint probability distribution gives the probability that $X$ takes the value $x$ and $Y$ takes the value $y$: $P(x, y) = P(\{X = x\} \cap \{Y = y\})$.
    • i.e., Table of Probabilities per $x$ and $y$ Pairs

Bayes' Rule

$$P(x \vert y) = \frac{P(y \vert x) \cdot P(x)}{P(y)}$$

Marginal Probability of a Random Variable

$$P(x) = \sum_{y} P(x, y) = \sum_{y} P(\{X = x\} \cap \{Y = y\}) = P(\{X = x\})$$

Independent Random Variables

  • The random variables $X$ and $Y$ are independent if the events $\{X = x\}$ and $\{Y = y\}$ are independent for all values $x$ and $y$. $$P(x, y) = P(x)P(y)$$
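
A sketch using a small joint probability table for discrete $X$ and $Y$ (the numbers are invented) to compute marginals, a conditional distribution via the definition of conditional probability, and to check independence.

```python
import numpy as np

# Joint probabilities P(X = x, Y = y); rows index x, columns index y.
# The numbers are invented and sum to 1.
P = np.array([[0.10, 0.20],
              [0.30, 0.40]])

P_x = P.sum(axis=1)            # marginal P(x) = sum_y P(x, y)
P_y = P.sum(axis=0)            # marginal P(y)

# Conditional P(x | y = 0): P(x, y=0) / P(y=0).
P_x_given_y0 = P[:, 0] / P_y[0]

# Independence check: does P(x, y) == P(x) P(y) for all x, y?
independent = np.allclose(P, np.outer(P_x, P_y))

print(P_x, P_y, P_x_given_y0, independent)
```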

Continuous Probability

  • A continuous random variable has a probability density function: $p(x)$. $$ \begin{align} P(\{X \in [a, b]\}) &= \int_{a}^{b} p(x) dx \\ P(\{X \in [-\infty, +\infty]\}) &= \int_{-\infty}^{+\infty} p(x) dx = 1 \end{align} $$
  • If $g(x)$ is a non-negative function that is proportional to the probability density function $p(x)$, $p(x)$ can be recovered by normalization: $$p(x) = \frac{1}{\int_{-\infty}^{+\infty} g(x) dx} g(x)$$

Interpretation of Continuous Probability

  • A probability density function can be interpreted as the limit of a histogram whose intervals are arbitrarily narrow and whose area is one.

Expected Values

Discrete Expected Value

  • Given a discrete random variable $X$ which takes values in the set $\mathcal{D}$ and which has probability distribution $P$, the expected value is: $$\mathbb{E}[X] = \sum_{x \in \mathcal{D}} x \cdot P(X = x)$$

Discrete Expectation

  • Assume we have a function $f$ that maps a discrete random variable $X$ into a set of numbers $\mathcal{D}_f$. Then $F = f(X)$ is a discrete random variable. The expected value of $F$ is: $$\mathbb{E}[f] = \sum_{u \in \mathcal{D}_f} u \cdot P(F = u) = \sum_{x \in \mathcal{D}} f(x) \cdot P(X = x)$$

Continuous Expected Value

  • Given a continuous random variable $X$ which takes values in the set $\mathcal{D}$ and which has probability density function $p$, the expected value is: $$\mathbb{E}[X] = \int_{x \in \mathcal{D}} x \cdot p(x) dx$$

Continuous Expectation

  • Assume we have a function $f$ that maps a continuous random variable $X$ into a set of numbers $\mathcal{D}_f$. Then $F = f(X)$ is also a random variable. The expected value of $F$ is: $$\mathbb{E}[f] = \int_{x \in \mathcal{D}} f(x) \cdot p(x) dx$$

Properties of Expectations

  • Let $f$ and $g$ be functions of random variables, and $k$ be a constant. Expectations have linearity.
    • $\mathbb{E}[0] = 0$
    • $\mathbb{E}[k \cdot f] = k \cdot \mathbb{E}[f]$
    • $\mathbb{E}[f + g] = \mathbb{E}[f] + \mathbb{E}[g]$

Expectations with Mean, Variance and Covariance

  • Mean: $\text{mean}(X) = \mathbb{E}[X]$
  • Variance: $\text{var}(X) = \mathbb{E}[(X - \mathbb{E}[X])^2] = \mathbb{E}[X^2] - (\mathbb{E}[X])^2$
  • Standard Deviation: $\text{std}(X) = \sqrt{\text{var}(X)}$
  • Covariance: $\text{cov}(X, Y) = \mathbb{E}[(X - \mathbb{E}[X])(Y - \mathbb{E}[Y])] = \mathbb{E}[XY] - \mathbb{E}[X]\mathbb{E}[Y]$
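
A sketch computing these quantities directly from a small discrete joint distribution; the value sets and joint PMF are invented for illustration.

```python
import numpy as np

x_vals = np.array([0.0, 1.0])           # values X can take (arbitrary example)
y_vals = np.array([0.0, 10.0])          # values Y can take
P = np.array([[0.10, 0.20],             # joint PMF P(X = x, Y = y), rows = x
              [0.30, 0.40]])

P_x, P_y = P.sum(axis=1), P.sum(axis=0)

E_X = np.sum(x_vals * P_x)                          # E[X]
E_Y = np.sum(y_vals * P_y)                          # E[Y]
E_XY = np.sum(np.outer(x_vals, y_vals) * P)         # E[XY]

var_X = np.sum((x_vals - E_X) ** 2 * P_x)           # E[(X - E[X])^2]
cov_XY = E_XY - E_X * E_Y                           # E[XY] - E[X]E[Y]

print(E_X, E_Y, var_X, cov_XY)
```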

Properties of Variance

  • For any constant $k$, $\text{var}(k) = 0$
  • $\text{var}(X) \ge 0$
  • $\text{var}(k \cdot X) = k^2 \cdot \text{var}(X)$
  • If $X$ and $Y$ are independent, then $\text{var}(X + Y) = \text{var}(X) + \text{var}(Y)$

Facts of Covariance

  • If $X$ and $Y$ are independent, then $\mathbb{E}[XY] = \mathbb{E}[X]\mathbb{E}[Y]$
  • If $X$ and $Y$ are independent, then $\text{cov}(X, Y) = 0$
  • $\text{var}(X) = \text{cov}(X, X)$

IID Samples

  • Assume we have a set of data items $x_i$ such that:
    1. They are independent;
    2. The histogram of a very large set of data items looks increasingly like the probability distribution $P(X)$ as the number of data items increases.
  • Then, these data items are independent identically distributed samples of $P(X)$.

Markov's Inequality

$$P(\{\lvert X \rvert \ge a\}) \le \frac{\mathbb{E}[\lvert X \rvert]}{a}$$

Interpretation of Markov's Inequality

  • Markov's Inequality relates tail probabilities to expectations by establishing an upper bound: the probability that $\lvert X \rvert$ is at least $a$ is bounded by $\mathbb{E}[\lvert X \rvert] / a$.

Chebyshev's Inequality

$$ \begin{align} P(\{\lvert X - \mathbb{E}[X] \rvert \ge a\}) &\le \frac{\text{var}(X)}{a^2} \\ P(\{\lvert X - \mathbb{E}[X] \rvert \ge k\sigma\}) &\le \frac{1}{k^2} \end{align} $$

Interpretation of Chebyshev's Inequality

  • Chebyshev's Inequality states that the probability of a random variable being at least $k$ standard deviations from the mean must be at most $\frac{1}{k^2}$.

Indicator Functions

  • An indicator function for an event is a function that takes the value zero for values of $x$ where the event does not occur, and one where the event occurs. $$\mathbb{I}_{[\mathcal{E}]}$$
  • An indicator function for an event has an expectation equivalent to the probability of the event. $$\mathbb{E}[\mathbb{I}_{[\mathcal{E}]}] = P(\mathcal{E})$$

Weak Law of Large Numbers

  • Assume a set of $N$ IID samples $x_i$ of a probability distribution $P(X)$. Let $X_N$ be the random variable formed by averaging the samples: $$X_N = \frac{\sum_{i = 1}^{N} x_i}{N}$$
  • If $P(X)$ has finite variance, then for any positive number $\epsilon$, $$ \begin{align} \lim_{N \to \infty} P(\{\lvert X_N - \mathbb{E}[X] \rvert \ge \epsilon \}) &= 0 \\ \lim_{N \to \infty} P(\{\lvert X_N - \mathbb{E}[X] \rvert < \epsilon \}) &= 1 \end{align} $$

Implications of Weak Law of Large Numbers

  • Assume a random variable $X$. The weak law of large numbers states that if a large number of IID samples of the random variable is observed, the average of those samples will be very close to $\mathbb{E}[X]$.
  • Because the weak law of large numbers lets us estimate expectations, it lets us estimate probabilities too, since a probability is the expectation of an indicator function.
  • Thus, expectations can be used to build a theory of decision making.
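
A short simulation of the weak law: as $N$ grows, the average of IID samples concentrates around $\mathbb{E}[X]$. The fair-die example and the sample sizes are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(1)
expected = 3.5                                    # E[X] for a fair six-sided die

for n in [10, 100, 10_000, 1_000_000]:
    samples = rng.integers(1, 7, size=n)          # n IID rolls of a fair die
    print(n, samples.mean(), abs(samples.mean() - expected))
# The gap between the sample average and E[X] shrinks as n grows.
```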

5. Useful Probability Distributions

Motivations of Probability Distributions

  • Model Building
    1. What process produced the data?
    2. What sort of data can we expect in the future?
    3. What labels should we attach to unlabelled data?
    4. Is an effect easily explained by chance variations, or is it real?

See Also

Walck, Christian. “Hand-book on statistical distributions for experimentalists.” (2007).

Leemis, Lawrence M and Jacquelyn T. Mcqueston. “Univariate Distribution Relationships.” (2008).

Discrete Uniform Distribution

See Also: Wikipedia

  • A discrete uniform random variable, $X$, takes each integer value from $a$ to $b$ with the same probability $\frac{1}{b - a + 1}$.

Properties of Discrete Uniform Distribution

  • Parameters: $a$, $b$
    • Where $a < b$
  • PMF: $$P(\{X = x\}) = \frac{1}{b - a + 1}$$
  • CDF: $$P(\{X \le x\}) = \frac{\lfloor x \rfloor - a + 1}{b - a + 1}$$
  • Mean: $$\text{mean}(X) = \frac{a + b}{2}$$
  • Variance: $$\text{var}(X) = \frac{(b - a + 1)^2 - 1}{12}$$

Discrete Bernoulli Distribution

See Also: Wikipedia

  • A discrete Bernoulli random variable, $X$, is the outcome from a single experiment from which this outcome is classified as either a success, $X = 1$ with probability $p$, or a failure, $X = 0$ with probability $1 - p$.

Properties of Discrete Bernoulli Distribution

  • Parameters: $p$
    • Where $0 \le p \le 1$
    • Where $p$ is the probability of the trial's success
  • PMF: $$ P(\{X = x\}) = \begin{cases} 1 - p & \text{if } x = 0 \\ p & \text{if } x = 1 \end{cases} $$
  • CDF: $$ P(\{X \le x\}) = \begin{cases} 0 & \text{if } x < 0 \\ 1 - p & \text{if } 0 \le x < 1 \\ 1 & \text{if } x \ge 1 \end{cases} $$
  • Mean: $$\text{mean}(X) = p$$
  • Variance: $$\text{var}(X) = p(1 - p)$$

Discrete Binomial Distribution

See Also: Wikipedia

  • A discrete binomial random variable, $X$, is the number of successful outcomes from a sequence of $n$ independent experiments in which each experiment has an outcome classified as either a success with probability $p$ or a failure with probability $1 - p$

Properties of Discrete Binomial Distribution

  • Parameters: $n$, $p$
    • Where $n \ge 0$
    • Where $0 \le p \le 1$
    • Where $n$ is the number of trials
    • Where $p$ is the probability of each trial's success
  • PMF: $$P(\{X = x\}) = \binom{n}{x} p^x (1 - p)^{n - x}$$
  • CDF: $$P(\{X \le x\}) = \sum_{i = 0}^{x} \binom{n}{i} p^i (1 - p)^{n - i}$$
  • Mean: $$\text{mean}(X) = np$$
  • Variance: $$\text{var}(X) = np(1 - p)$$

Discrete Poisson Distribution

See Also: Wikipedia

  • A discrete Poisson random variable, $X$, is the number of events occurring in a fixed interval of time or space, when events occur independently at a constant rate.
  • A discrete Poisson random variable with $\lambda = np$ approximates a discrete binomial random variable when $n$ is large and $p$ is small, so that $\lambda = np$ is moderate.

Properties of Discrete Poisson Distribution

  • Parameters: $\lambda$
    • Where $\lambda > 0, \lambda \in \mathbb{R}$
    • Where $\lambda$ is the occurrence rate
  • PMF: $$P(\{X = x\}) = e^{-\lambda} \frac{\lambda^x}{x!}$$
  • CDF: $$P(\{X \le x\}) = e^{-\lambda} \sum_{i = 0}^{x} \frac{\lambda^i}{i!}$$
  • Mean: $$\text{mean}(X) = \lambda$$
  • Variance: $$\text{var}(X) = \lambda$$
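
A sketch (using scipy.stats) of the approximation noted above: a binomial with large $n$ and small $p$ is close to a Poisson with $\lambda = np$. The particular $n$ and $p$ are arbitrary example values.

```python
import numpy as np
from scipy.stats import binom, poisson

n, p = 1000, 0.003          # large n, small p (arbitrary example values)
lam = n * p                 # lambda = np = 3

ks = np.arange(0, 11)
binom_pmf = binom.pmf(ks, n, p)
poisson_pmf = poisson.pmf(ks, lam)

print(np.max(np.abs(binom_pmf - poisson_pmf)))   # small: the PMFs nearly agree
print(binom.mean(n, p), poisson.mean(lam))       # both have mean np = lambda
```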

Discrete Geometric Distribution

See Also: Wikipedia

  • A discrete geometric random variable, $X$, is the number of Bernoulli trials with probability $p$ needed to get one success.

Properties of Discrete Geometric Distribution

  • Parameters: $p$
    • Where $0 \le p \le 1$
    • Where $p$ is the probability of each trial's success
  • PMF: $$P(\{X = x\}) = (1 - p)^{x - 1} p$$
  • CDF: $$P(\{X \le x\}) = 1 - (1 - p)^x$$
  • Mean: $$\text{mean}(X) = \frac{1}{p}$$
  • Variance: $$\text{var}(X) = \frac{1 - p}{p^2}$$

Discrete Negative Binomial Distribution

See Also: Wikipedia

  • A discrete negative binomial random variable, $X$, is the number of successes in a sequence of independent and identically distributed Bernoulli trials before a specified (non-random) number of failures.

Properties of Discrete Negative Binomial Distribution

  • Parameters: $r$, $p$
    • Where $r > 0$
    • Where $0 \le p \le 1$
    • Where $r$ is the number of failures until the trials are stopped.
    • Where $p$ is the probability of the trial's success
  • PMF: $$P(\{X = x\}) = \binom{x + r - 1}{x} p^x (1 - p)^r$$
  • CDF: $$P(\{X \le x\}) = \sum_{i = 0}^{x} \binom{i + r - 1}{i} p^i (1 - p)^r$$
  • Mean: $$\text{mean}(X) = \frac{pr}{1 - p}$$
  • Variance: $$\text{var}(X) = \frac{pr}{(1 - p)^2}$$

Discrete Hypergeometric Distribution

See Also: Wikipedia

  • A discrete hypergeometric random variable, $X$, is the number of successes (or random draws for which the object drawn has a specified feature) in $n$ draws, without replacement, from a finite population of size $N$ that contains exactly $K$ objects with that specified feature.

Properties of Discrete Hypergeometric Distribution

  • Parameters: $N$, $n$, $K$
    • Where $N \ge 0$
    • Where $0 \le n \le N$
    • Where $0 \le K \le N$
    • Where $N$ is the size of the population
    • Where $n$ is the number of objects drawn
    • Where $K$ is the number of objects with the specified feature
  • PMF: $$P(\{X = x\}) = \frac{\binom{K}{x} \binom{N - K}{n - x}}{\binom{N}{n}}$$
  • Mean: $$\text{mean}(X) = n \frac{K}{N}$$
  • Variance: $$\text{var}(X) = n \frac{K}{N} \frac{N - K}{N} \frac{N - n}{N - 1}$$

Continuous Uniform Distribution

See Also: Wikipedia


Properties of Continuous Uniform Distribution

  • Parameters: $a$, $b$
    • Where $a < b$
  • PDF: $$p(x) = \frac{1}{b - a}$$
  • CDF: $$F(x) = \frac{x - a}{b - a}$$
  • Mean: $$\text{mean}(X) = \frac{a + b}{2}$$
  • Variance: $$\text{var}(X) = \frac{(b - a)^2}{12}$$

Continuous Normal Distribution

See Also: Wikipedia


Properties of Continuous Normal Distribution

  • Parameters: $\mu$, $\sigma^2$
    • Where $\mu$ is the mean
    • Where $\sigma^2$ is the variance
  • PDF: $$p(x) = \frac{1}{\sqrt{2 \pi \sigma^2}} e^{-\frac{(x - \mu)^2}{2 \sigma^2}}$$
  • CDF: $$ \begin{align} F(x) &= \Phi\left(\frac{x - \mu}{\sigma}\right) \\ \Phi(x) &= \frac{1}{\sqrt{2 \pi}} \int_{-\infty}^{x} e^{-\frac{y^2}{2}} dy \end{align} $$
  • Mean: $$\text{mean}(X) = \mu$$
  • Variance: $$\text{var}(X) = \sigma^2$$

Continuous Exponential Distribution

See Also: Wikipedia


Properties of Continuous Exponential Distribution

  • Parameters: $\lambda$
    • Where $\lambda > 0, \lambda \in \mathbb{R}$
    • Where $\lambda$ is the occurrence rate
  • PDF: $$p(x) = \lambda e^{-\lambda x}$$
  • CDF: $$F(x) = 1 - e^{-\lambda x}$$
  • Mean: $$\text{mean}(X) = \frac{1}{\lambda}$$
  • Variance: $$\text{var}(X) = \frac{1}{\lambda^2}$$

Continuous Weibull Distribution

See Also: Wikipedia


Properties of Continuous Weibull Distribution

  • Parameters: $\lambda$, $k$
    • Where $\lambda > 0$
    • Where $k > 0$
    • Where $\lambda$ is the scale
    • Where $k$ is the shape
  • PDF: $$ p(x) = \begin{cases} \frac{k}{\lambda} \left(\frac{x}{\lambda}\right)^{k - 1} e^{-(x / \lambda)^k} & \text{if } x \ge 0 \\ 0 & \text{if } x < 0 \end{cases} $$
  • CDF: $$ F(x) = \begin{cases} 1 - e^{-(x / \lambda)^k} & \text{if } x \ge 0 \\ 0 & \text{if } x < 0 \end{cases} $$
  • Mean: $$\text{mean}(X) = \lambda \Gamma(1 + 1 / k)$$
  • Variance: $$\text{var}(X) = \lambda^2 \left[\Gamma\left(1 + \frac{2}{k}\right) - \left(\Gamma\left(1 + \frac{1}{k}\right)\right)^2\right]$$

Continuous Cauchy Distribution

See Also: Wikipedia


Properties of Continuous Cauchy Distribution

  • Parameters: $x_0$, $\gamma$
    • Where $\gamma > 0$
    • Where $x_0$ is the location
    • Where $\gamma$ is the scale
  • PDF: $$p(x) = \frac{1}{\pi \gamma \left[1 + \left(\frac{x - x_0}{\gamma}\right)^2\right]}$$
  • CDF: $$F(x) = \frac{1}{\pi} \arctan\left(\frac{x - x_0}{\gamma}\right) + \frac{1}{2}$$
  • Mean: $$\text{mean}(X) = \text{undefined}$$
  • Variance: $$\text{var}(X) = \text{undefined}$$
  • Median: $$\text{median}(X) = x_0$$
  • IQR: $$\text{iqr}(X) = 2\gamma$$

Continuous Gamma Distribution

See Also: Wikipedia

  • A conjugate prior for many continuous distributions.


Properties of Continuous Gamma Distribution

  • $\Gamma(x)$ is the gamma function.
    • The gamma function is a generalization of a factorial: $$\Gamma(\alpha) = (\alpha - 1)\Gamma(\alpha - 1)$$
  • $\gamma(s, x)$ is the lower incomplete gamma function
  • Parameters: $\alpha$, $\beta$
    • Where $\alpha > 0$
    • Where $\beta > 0$
    • Where $\alpha$ is the shape
    • Where $\beta$ is the rate
  • PDF: $$p(x) = \frac{\beta^\alpha}{\Gamma(\alpha)} x^{\alpha - 1} e^{-\beta x}$$
  • CDF: $$F(x) = \frac{1}{\Gamma(\alpha)} \gamma(\alpha, \beta x)$$
  • Mean: $$\text{mean}(X) = \frac{\alpha}{\beta}$$
  • Variance: $$\text{var}(X) = \frac{\alpha}{\beta^2}$$

Continuous Beta Distribution

See Also: Wikipedia

  • A conjugate prior for many discrete distributions.


Properties of Continuous Beta Distribution

  • $B(\alpha, \beta) = \frac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha + \beta)}$
  • Parameters: $\alpha$, $\beta$
    • Where $\alpha > 0$
    • Where $\beta > 0$
    • Where $\alpha$ is the shape
    • Where $\beta$ is the shape
  • PDF: $$p(x) = \frac{x^{\alpha - 1} (1 - x)^{\beta - 1}}{B(\alpha, \beta)}$$
  • Mean: $$\text{mean}(X) = \frac{\alpha}{\alpha + \beta}$$
  • Variance: $$\text{var}(X) = \frac{\alpha \beta}{(\alpha + \beta)^2 (\alpha + \beta + 1)}$$

Normal Approximation to Binomial Distribution

  • The De Moivre-Laplace limit theorem (a special case of the central limit theorem) states that if a discrete binomial random variable is expressed in standard coordinates, then its distribution converges to the standard normal distribution ($X \sim N(0, 1)$).
  • Accordingly, if $X$ is a discrete binomial random variable with $n$ number of trials and $p$ probability of each trial's success, then for any $a < b$, as $n \to \infty$: $$P\left(\left\{a \le \frac{X - \mu}{\sigma} \le b\right\}\right) \to \Phi(b) - \Phi(a)$$
    • Where $\mu = np$
    • Where $\sigma^2 = np(1 - p)$
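
A scipy.stats sketch of this normal approximation; $n$, $p$, and the bounds $a$, $b$ are arbitrary example values.

```python
import numpy as np
from scipy.stats import binom, norm

n, p = 400, 0.3                       # arbitrary example
mu, sigma = n * p, np.sqrt(n * p * (1 - p))

a, b = -1.0, 2.0                      # bounds in standard coordinates
# Exact binomial probability that (X - mu) / sigma lies in [a, b].
lo, hi = int(np.ceil(mu + a * sigma)), int(np.floor(mu + b * sigma))
exact = binom.cdf(hi, n, p) - binom.cdf(lo - 1, n, p)

approx = norm.cdf(b) - norm.cdf(a)    # Phi(b) - Phi(a)
print(exact, approx)                  # close for large n
```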

How Often a Normal Random Variable is How Far from the Mean

  • About $68\%$ of the time, a normal random variable takes a value within one standard deviation of the mean.
  • About $95\%$ of the time, a normal random variable takes a value within two standard deviations of the mean.
  • About $99\%$ of the time, a normal random variable takes a value within three standard deviations of the mean.

6. Samples and Populations

Sample Mean

  • Assumption: Sampling with Replacement

Properties of Sample and Population Mean

  • The sample mean is a random variable. It is random, because different samples from the population will have different values of the sample mean.
  • The expected value of this random variable is the population mean.

Expressions for Mean and Variance of the Sample Mean

  • Let $X^{(N)}$ be a random variable for the mean of $N$ samples $x_i$. $$ \begin{align} \mathbb{E}[X^{(N)}] &= \text{popmean}(\{X\}) \\ \text{var}(X^{(N)}) &= \frac{\text{popstd}(\{X\})^2}{N} \\ \text{std}(X^{(N)}) &= \frac{\text{popstd}(\{X\})}{\sqrt{N}} \end{align} $$
  • If you draw $N$ samples, because the standard deviation of your estimate of the mean is $\frac{\text{popstd}(\{X\})}{\sqrt{N}}$,
    1. The more samples you draw, the better your estimate becomes.
    2. The estimate improves rather slowly.

Sample Mean and Distributions

  • A population and the sampling process can be replaced by a probability distribution and the drawing of IID samples.
  • Assume a set of $N$ data items $x_i$ drawn as IID samples from some probability distribution $P(X)$: $$ \begin{align} X^{(N)} &= \frac{\sum_{i} x_i}{N} \\ \mathbb{E}[X^{(N)}] &= \mathbb{E}_{P(X)}[X] \\ \text{var}(X^{(N)}) &= \frac{\text{var}(P(X))}{N} \end{align} $$

Confidence Interval for a Population Mean

  • Choose some fraction $f$; an $f$ confidence interval for a population mean is an interval constructed using the sample mean.
  • It has the property that for that fraction $f$ of all samples, the population mean will lie inside the interval constructed from each sample's mean.

Centered Confidence Interval for a Population Mean

  • Choose some $0 < \alpha < 0.5$. A $1 - 2\alpha$ centered confidence interval for a population mean is an interval $[a, b]$ constructed using the sample mean.
  • It has the property that for $\alpha$ of all samples, the population mean is greater than $b$, and for another $\alpha$ of all samples, the population mean is less than $a$. For all other samples, the population mean will lie inside the interval.

Estimating the Variance of the Sample Mean

Variances of Sample Means

  • If $N$ is large, $$\text{popstd}(\{x\}) \approx \text{std}(\{x\}) = \sqrt{\frac{1}{N} \sum_{i = 1}^{N} \left(x_i - \text{mean}(\{x\}) \right)^2}$$
  • If $N$ is small, $$\text{popstd}(\{x\}) \approx \text{stdunbiased}(\{x\}) = \sqrt{\frac{1}{N - 1} \sum_{i = 1}^{N} \left(x_i - \text{mean}(\{x\}) \right)^2}$$
  • Let $X^{(N)}$ be a random variable for the mean of $N$ samples $x_i$. An estimate of the standard deviation of $X^{(N)}$ is: $$\text{stderr}(\{x\}) = \frac{\text{stdunbiased}(\{x\})}{\sqrt{N}}$$
  • The $\text{popstd}(\{x\}) \approx \text{std}(\{x\})$ approximation is biased because it tends to be slightly too small. $N - 1$ replaces $N$ because the deviations $x_i - \text{mean}(\{x\})$ sum to zero, so only $N - 1$ of them are independent (i.e., there are $N - 1$ degrees of freedom).

T-Distribution of the Sample Mean

  • Student's t-distribution is a probability distribution taken from a family, indexed by a number (the degrees of freedom of the distribution).
    • If the number of degrees of freedom is large, the distribution is very similar to a normal distribution.
    • Else, the tails are somewhat heavier than those of a normal distribution.
  • The sample mean yields the value of a t-random variable with $N - 1$ degrees of freedom: $$T = \frac{\text{mean}(\{x\}) - \text{popmean}(\{X\})}{\text{stderr}(\{x\})}$$
  • If $N$ is large enough ($N \ge 30$), the sample mean yields the value of a standard normal random variable, $Z$: $$Z = \frac{\text{mean}(\{x\}) - \text{popmean}(\{X\})}{\text{stderr}(\{x\})}$$

Confidence Intervals for Population Means

  • Assume the sample is large enough so that $\frac{\text{mean}(\{x\}) - \text{popmean}(\{X\})}{\text{stderr}(\{x\})}$ is a standard normal random variable.
  • For about $68\%$ of samples: $$\text{mean}(\{x\}) - \text{stderr}(\{x\}) \le \text{popmean}(\{X\}) \le \text{mean}(\{x\}) + \text{stderr}(\{x\})$$
  • For about $95\%$ of samples: $$\text{mean}(\{x\}) - 2\cdot\text{stderr}(\{x\}) \le \text{popmean}(\{X\}) \le \text{mean}(\{x\}) + 2\cdot\text{stderr}(\{x\})$$
  • For about $99\%$ of samples: $$\text{mean}(\{x\}) - 3\cdot\text{stderr}(\{x\}) \le \text{popmean}(\{X\}) \le \text{mean}(\{x\}) + 3\cdot\text{stderr}(\{x\})$$

Constructing a Centered $1 - 2\alpha$ Confidence Interval for a Population Mean for a Large Sample

  1. Draw a sample $\{x\}$ of $N$ items from a population. $$\text{stdunbiased}(\{x\}) = \sqrt{\frac{1}{N - 1} \sum_{i = 1}^{N} \left(x_i - \text{mean}(\{x\}) \right)^2}$$
  2. Estimate the standard error. $$\text{stderr}(\{x\}) = \frac{\text{stdunbiased}(\{x\})}{\sqrt{N}}$$
  3. If $N$ is large enough, $T$ is a standard normal variable. $$T = \frac{\text{mean}(\{x\}) - \text{popmean}(\{X\})}{\text{stderr}(\{x\})}$$
  4. Compute $b$ such that for a standard normal variable, $P(\{T \ge b\}) = \alpha$.

Confidence Interval

$$[\text{mean}(\{x\}) - b\cdot\text{stderr}(\{x\}), \text{mean}(\{x\}) + b\cdot\text{stderr}(\{x\})]$$
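
A sketch of this construction for a large sample; the data are synthetic and $\alpha = 0.025$ is an arbitrary choice (giving a centered 95% interval with $b \approx 1.96$).

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
x = rng.normal(loc=10.0, scale=3.0, size=200)     # synthetic sample of N = 200 items

alpha = 0.025                                     # centered 1 - 2*alpha = 95% interval
stdunbiased = x.std(ddof=1)                       # divides by N - 1
stderr = stdunbiased / np.sqrt(len(x))
b = norm.ppf(1 - alpha)                           # P(T >= b) = alpha for a standard normal

lo, hi = x.mean() - b * stderr, x.mean() + b * stderr
print(f"{1 - 2 * alpha:.0%} confidence interval for the population mean: [{lo:.3f}, {hi:.3f}]")
```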

Constructing a Centered $1 - 2\alpha$ Confidence Interval for a Population Mean for a Small Sample

  1. Draw a sample $\{x\}$ of $N$ items from a population. $$\text{stdunbiased}(\{x\}) = \sqrt{\frac{1}{N - 1} \sum_{i = 1}^{N} \left(x_i - \text{mean}(\{x\}) \right)^2}$$
  2. Estimate the standard error. $$\text{stderr}(\{x\}) = \frac{\text{stdunbiased}(\{x\})}{\sqrt{N}}$$
  3. If $N$ is small, $T$ is a t-random variable. $$T = \frac{\text{mean}(\{x\}) - \text{popmean}(\{X\})}{\text{stderr}(\{x\})}$$
  4. Compute $b$ such that for a t-random variable with $N - 1$ degrees of freedom, $P(\{T \ge b\}) = \alpha$.

Confidence Interval

$$[\text{mean}(\{x\}) - b\cdot\text{stderr}(\{x\}), \text{mean}(\{x\}) + b\cdot\text{stderr}(\{x\})]$$

Estimating Standard Error of Any Statistic - The Bootstrap

  • Goal: Estimate the standard error for a statistic $S$ evaluated on a dataset of $N$ items $\{x\}$.
  1. Compute $r$ bootstrap replicates of the dataset. Write the $i$'th replicate $\{x\}_i$. Obtain each by:
    1. Building a uniform probability distribution on the numbers $1, ..., N$.
    2. Drawing $N$ independent samples from this distribution. Write $s(i)$ for the $i$'th such sample.
    3. Building a new dataset $\{x_{s(1)}, ..., x_{s(N)}\}$.
  2. For each replicate, compute $S(\{x\}_i)$.
  3. Compute the statistic $\bar{S}$. $$\bar{S} = \frac{\sum_{i} S(\{x\}_i)}{r}$$
  4. Estimate the standard error for $S$. $$\text{stderr}(\{S\}) = \sqrt{\frac{\sum_{i} \left[ S(\{x\}_i) - \bar{S} \right]^2}{r - 1}}$$
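
A sketch of this bootstrap procedure, estimating the standard error of the median (a statistic with no simple closed-form standard error); the dataset and the number of replicates are arbitrary choices.

```python
import numpy as np

def bootstrap_stderr(x, statistic, r=1000, seed=0):
    """Estimate the standard error of `statistic` on dataset x from r bootstrap replicates."""
    rng = np.random.default_rng(seed)
    n = len(x)
    values = []
    for _ in range(r):
        replicate = x[rng.integers(0, n, size=n)]   # draw N items with replacement
        values.append(statistic(replicate))
    values = np.array(values)
    return np.sqrt(np.sum((values - values.mean()) ** 2) / (r - 1))

rng = np.random.default_rng(3)
x = rng.exponential(scale=2.0, size=150)            # synthetic, skewed data
print("bootstrap standard error of the median:", bootstrap_stderr(x, np.median))
```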

7. The Significance of Evidence

Significance

  • The significance of the evidence against a hypothesis can be assessed by finding what fraction of samples would give sample means like the one observed from the evidence if the hypothesis is true.
  • The test statistic is a random variable calculated from the sample data and used in a hypothesis test.

p-Values

  • The p-value represents the fraction of samples that would give a more extreme value of the test statistic than that observed, if the hypothesis was true.

p-Value and The Null Hypothesis

  • The null hypothesis is a general statement or default position that there is no relationship between two measured phenomena.
  • A small p-value means that very few samples would display more extreme behavior than what was observed, if the null hypothesis is true.
  • A small p-value means that, to believe our null hypothesis, we are forced to believe we have an extremely odd sample.
  • Formally, the p-value is described as an assessment of the significance of the evidence against the null hypothesis.
  • The p-value is smaller when the evidence against the null hypothesis is stronger; decide a threshold (e.g., $0.05$) for how small a p-value must be to reject the null hypothesis.

Generalized Tests of Significance

  1. Determine a statistic which can be used to test the particular proposition you have in mind.
    1. This statistic needs to depend on your data.
    2. This statistic needs to depend on your hypothesis.
    3. This statistic needs to have a known distribution under sampling variation.
  2. Compute the value of this statistic.
  3. Look at the distribution to determine what fraction of samples would have a more extreme value.
  4. If this fraction is small, the evidence suggests your hypothesis isn't true.

The T-Test of Significance for a Hypothesized Mean

  • Let the initial hypothesis be that the population has a known mean: $\mu$.
  • Let $\{x\}$ be the sample.
  • Let $N$ be the sample size.
  1. Compute the sample mean: $\text{mean}(\{x\})$.
  2. Estimate the standard error. $$\text{stderr}(\{x\}) = \frac{\text{stdunbiased}(\{x\})}{\sqrt{N}}$$
  3. Compute the test statistic. $$s = \frac{\mu - \text{mean}(\{x\})}{\text{stderr}(\{x\})}$$
  4. Compute the p-value.
  • The p-value summarizes the extent to which the data contradicts the hypothesis.
  • A small p-value implies that, if the hypothesis is true, the sample is very unusual.
  • The smaller the p-value, the more strongly the evidence contradicts the hypothesis.

Computing a Two-Sided p-Value for a T-Test

  • For a two-sided test, evaluate: $$p = (1 - f) = 1 - \int_{-\lvert s \rvert}^{\lvert s \rvert} p_t(u; N - 1) du = P(\{S > \lvert s \rvert\} \cup \{S < -\lvert s \rvert\})$$
  • For a one-sided test, evaluate whichever of the following matches the alternative hypothesis: $$ \begin{aligned} p &= P(\{S > \lvert s \rvert\}) \\ p &= P(\{S < -\lvert s \rvert\}) \end{aligned} $$
  • Where $p_t(u; N - 1)$ is the probability density of a t-distribution with $N - 1$ degrees of freedom.
  • If $N > 30$, it is enough to replace $p_t$ with the density of a standard normal distribution.
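
A sketch of the one-sample t-test above, computing the statistic and two-sided p-value directly and checking against `scipy.stats.ttest_1samp`; the sample and the hypothesized mean are synthetic.

```python
import numpy as np
from scipy.stats import t, ttest_1samp

rng = np.random.default_rng(4)
x = rng.normal(loc=5.3, scale=1.0, size=25)     # synthetic sample
mu = 5.0                                        # hypothesized population mean

N = len(x)
stderr = x.std(ddof=1) / np.sqrt(N)
s = (x.mean() - mu) / stderr                    # test statistic (sign convention of scipy)

p_two_sided = 2 * t.sf(abs(s), df=N - 1)        # fraction of samples more extreme than |s|
print(s, p_two_sided)
print(ttest_1samp(x, mu))                       # same statistic and p-value
```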

Sums and Differences of Normal Random Variables

  • Let $X_1$ be a normal random variable with mean $\mu_1$ and standard deviation $\sigma_1$.
  • Let $X_2$ be a normal random variable with mean $\mu_2$ and standard deviation $\sigma_2$.
  • Let $X_1$ and $X_2$ be independent.
  1. For any constant $c_1 \ne 0$, $c_1 X_1$ is a normal random variable with mean $c_1 \mu_1$ and standard deviation $\lvert c_1 \rvert \sigma_1$.
  2. For any constant $c_2$, $X_1 + c_2$ is a normal random variable with mean $\mu_1 + c_2$ and standard deviation $\sigma_1$.
  3. $X_1 + X_2$ is a normal random variable with mean $\mu_1 + \mu_2$ and standard deviation $\sqrt{\sigma^2_1 + \sigma^2_2}$.

Testing Whether Two Populations Have the Same Mean, for Known Population Standard Deviations

  • The initial hypothesis is that the populations have the same, unknown, mean.
  • Let $\{x\}$ be the sample of the first population.
  • Let $\{y\}$ be the sample of the second population.
  • Let $k_x$ be the sample size of the first population.
  • Let $k_y$ be the sample size of the second population.
  1. Compute the sample means for each population, $\text{mean}(\{x\})$ and $\text{mean}(\{y\})$.
  2. Compute the standard error for the difference between the means. $$s_{ed} = \sqrt{\frac{\text{popsd}(\{X\})^2}{k_x} + \frac{\text{popsd}(\{Y\})^2}{k_y}}$$
  3. Compute the value of the test statistic. $$s = \frac{\text{mean}(\{x\}) - \text{mean}(\{y\})}{s_{ed}}$$
  4. Compute the p-value using the normal distribution.

Testing Whether Two Populations Have the Same Mean, for Same, Unknown Population Standard Deviations

  • The initial hypothesis is that the populations have the same, unknown, mean.
  • Let $\{x\}$ be the sample of the first population.
  • Let $\{y\}$ be the sample of the second population.
  • Let $k_x$ be the sample size of the first population.
  • Let $k_y$ be the sample size of the second population.
  1. Compute the sample means for each population, $\text{mean}(\{x\})$ and $\text{mean}(\{y\})$.
  2. Compute the standard error for the difference between the means, using the pooled estimate of the standard deviation. $$s_{ed} = \sqrt{\left( \frac{\text{stdunbiased}(\{x\})^2 (k_x - 1) + \text{stdunbiased}(\{y\})^2 (k_y - 1)}{k_x + k_y - 2} \right) \left( \frac{1}{k_x} + \frac{1}{k_y} \right)}$$
  3. Compute the value of the test statistic. $$s = \frac{\text{mean}(\{x\}) - \text{mean}(\{y\})}{s_{ed}}$$
  4. Compute the p-value using the t-distribution with the following number of degrees of freedom: $$N = k_x + k_y - 2$$

Testing Whether Two Populations Have the Same Mean, for Different, Unknown Population Standard Deviations

  • The initial hypothesis is that the populations have the same, unknown, mean.
  • Let $\{x\}$ be the sample of the first population.
  • Let $\{y\}$ be the sample of the second population.
  • Let $k_x$ be the sample size of the first population.
  • Let $k_y$ be the sample size of the second population.
  1. Compute the sample means for each population, $\text{mean}(\{x\})$ and $\text{mean}(\{y\})$.
  2. Compute the standard error for the difference between the means. $$s_{ed} = \sqrt{\frac{\text{stdunbiased}(\{x\})^2}{k_x} + \frac{\text{stdunbiased}(\{y\})^2}{k_y}}$$
  3. Compute the value of the test statistic. $$s = \frac{\text{mean}(\{x\}) - \text{mean}(\{y\})}{s_{ed}}$$
  4. Compute the p-value using the t-distribution with the following number of degrees of freedom: $$N = \frac{\left( \frac{\text{stdunbiased}(\{x\})^2}{k_x} + \frac{\text{stdunbiased}(\{y\})^2}{k_y} \right)^2}{\frac{\left[ \frac{\text{stdunbiased}(\{x\})^2}{k_x} \right]^2}{k_x - 1} + \frac{\left[ \frac{\text{stdunbiased}(\{y\})^2}{k_y} \right]^2}{k_y - 1}}$$
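
A sketch of this unequal-variance (Welch) case, checked against `scipy.stats.ttest_ind` with `equal_var=False`; both samples are synthetic.

```python
import numpy as np
from scipy.stats import t, ttest_ind

rng = np.random.default_rng(5)
x = rng.normal(loc=10.0, scale=2.0, size=40)     # synthetic samples
y = rng.normal(loc=11.0, scale=4.0, size=60)

vx, vy = x.var(ddof=1) / len(x), y.var(ddof=1) / len(y)
sed = np.sqrt(vx + vy)                           # standard error of the difference
s = (x.mean() - y.mean()) / sed                  # test statistic

# Welch-Satterthwaite degrees of freedom.
df = (vx + vy) ** 2 / (vx ** 2 / (len(x) - 1) + vy ** 2 / (len(y) - 1))
p = 2 * t.sf(abs(s), df=df)

print(s, df, p)
print(ttest_ind(x, y, equal_var=False))          # same statistic and p-value
```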

The F-Test of Significance for Equality of Variance

  • Given two datasets $\{x\}$ of $N_x$ items and $\{y\}$ of $N_y$ items, assess the significance of evidence against the hypothesis that the populations represented by these two datasets have the same variance.
  • Assume that the alternative hypothesis is that the population represented by $\{x\}$ has the larger variance.
  1. Compute the value of the test statistic. $$F = \frac{\text{stdunbiased}(\{x\})^2}{\text{stdunbiased}(\{y\})^2}$$
  2. Compute the p-value using the f-distribution with the following numbers of degrees of freedom: $$\{ N_x - 1, N_y - 1 \}$$ $$p = \int_F^\infty p_f(u; N_x - 1, N_y - 1) du$$

The $\chi^2$-Test of Significance of Fit to a Model

  • The model consists of $k$ disjoint events $\mathcal{E}_1, ..., \mathcal{E}_k$ which cover the space of outcomes and the probability $P(\mathcal{E})$ of each event.
  • The model has $p$ unknown parameters.
  1. Perform $N$ experiments and record the number of times each event occurs in the experiments.
  2. The theoretical frequency of the $i$'th event for this experiment is $NP(\mathcal{E}_i)$.
  3. Write $f_o(\mathcal{E}_i)$ for the observed frequency of event $i$ and $f_t(\mathcal{E}_i)$ for the theoretical frequency of the event under the null hypothesis.
  4. Compute the value of the test statistic. $$C = \sum_i \frac{\left(f_o(\mathcal{E}_i) - f_t(\mathcal{E}_i) \right)^2}{f_t(\mathcal{E}_i)}$$
  5. Compute the p-value using the $\chi^2$-distribution with the following number of degrees of freedom: $$N = k - p - 1$$ $$p = \int_C^\infty p_{\chi^2}(u; k - p - 1) du$$
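
A sketch of this $\chi^2$ goodness-of-fit computation for a model with no fitted parameters ($p = 0$), checked against `scipy.stats.chisquare`; the die counts are invented.

```python
import numpy as np
from scipy.stats import chi2, chisquare

# Observed counts from N rolls of a die; is this consistent with a fair die?
observed = np.array([30, 14, 34, 45, 27, 30])          # invented counts, N = 180
N = observed.sum()
theoretical = np.full(6, N / 6)                        # N * P(E_i) under the null

C = np.sum((observed - theoretical) ** 2 / theoretical)
df = 6 - 0 - 1                                         # k - p - 1, with p = 0 fitted parameters
p_value = chi2.sf(C, df)

print(C, p_value)
print(chisquare(observed, theoretical))                # same statistic and p-value
```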

8. Experiments

Overview

  • Background: An experiment tries to evaluate the effects of one or more treatments.
    1. Allocate subjects to groups at random, so that each group looks similar.
    2. Apply the treatments at different levels to different groups.
    3. Observe whether the groups are different after the treatments.
  • Goal: How do you determine whether the differences between groups are due to the treatments?

Evaluating Whether a Treatment Has Significant Effects with a One-Way ANOVA for Balanced Experiments

  1. Choose $L$ levels of treatment.
  2. Randomize $LG$ subjects into $L$ treatment groups of $G$ subjects each.
  3. Treat each subject, and record the results.

Terminology

  • Value for $j$'th Subject in $i$'th Treatment Level Group: $$x_{ij}$$
  • Overall Mean: $$\hat{\mu} = \frac{\sum_{ij} x_{ij}}{GL}$$
  • $i$'th Group Mean: $$\hat{\mu}_i = \frac{\sum_{j} x_{ij}}{G}$$
  • Sum of Squares Within Group (i.e. Errors): $$SS_W = \sum_{ij} \left( x_{ij} - \hat{\mu}_i \right)^2$$
  • Sum of Squares Between Groups (i.e. Treatment): $$SS_B = G \left[ \sum_{i} \left( \hat{\mu} - \hat{\mu}_i \right)^2 \right]$$
  • Mean Squares Within Group (i.e. Residual Variation): $$MS_W = \frac{SS_W}{L(G - 1)}$$
  • Mean Squares Between Group (i.e. Treatment Variation): $$MS_B = \frac{SS_B}{L - 1}$$
  • Value of F-Statistic: $$F = \frac{MS_B}{MS_W}$$
  • p-Value: $$\{ L - 1, L(G - 1) \}$$ $$p = \int_F^\infty p_f(u; L - 1, L(G - 1)) du$$

Construction

| | Deg. of Freedom | Sum Sq. | Mean Sq. | F Value | Pr(>F) |
| --- | --- | --- | --- | --- | --- |
| Treatment | $L - 1$ | $SS_B$ | $MS_B$ | $\frac{MS_B}{MS_W}$ | p-value |
| Residuals | $L(G - 1)$ | $SS_W$ | $MS_W$ | | |

Interpretation

  • If the p-value is small enough, then only an extremely unlikely set of samples could explain the difference between the levels of treatment as sampling error; it is more likely the treatment has an effect.
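
A sketch of the balanced one-way ANOVA computation above, checked against `scipy.stats.f_oneway`; the $L = 3$ groups of $G = 8$ subjects are synthetic.

```python
import numpy as np
from scipy.stats import f, f_oneway

rng = np.random.default_rng(6)
L, G = 3, 8                                      # 3 treatment levels, 8 subjects each
groups = [rng.normal(loc=m, scale=1.0, size=G) for m in (5.0, 5.5, 7.0)]  # synthetic data

grand_mean = np.mean(np.concatenate(groups))
group_means = np.array([g.mean() for g in groups])

ss_w = sum(np.sum((g - g.mean()) ** 2) for g in groups)       # within-group sum of squares
ss_b = G * np.sum((group_means - grand_mean) ** 2)            # between-group sum of squares
ms_w = ss_w / (L * (G - 1))                                   # residual variation
ms_b = ss_b / (L - 1)                                         # treatment variation
F = ms_b / ms_w
p = f.sf(F, L - 1, L * (G - 1))

print(F, p)
print(f_oneway(*groups))                                      # same F statistic and p-value
```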

Unbalanced Experiments

  • Balanced Experiments: Each group has the same number of subjects.
  • Unbalanced Experiments: Each group has a different number of subjects.

Unbalanced Residual Variation

  • Assume $i$'th group has $G_i$ subjects.
  • Degrees of Freedom: $\sum_i G_i - L$
$$MS_W = \frac{1}{L} \sum_i \left[ \frac{\sum_j (x_{ij} - \hat{\mu}_i)^2}{G_i - 1} \right]$$

Unbalanced Treatment Variation

  • Assume $i$'th group has $G_i$ subjects.
  • Degrees of Freedom: $L - 1$
$$MS_B = \frac{\sum_i G_i (\hat{\mu}_{i} - \hat{\mu})^2}{L - 1}$$