Probability and Other Preliminaries

  • probability distribution
  • rules of probability
  • review of linear algebra

Preview

  • this week's topics mainly concern a brief review of probability and linear algebra
  • the lab class introduces the use of Jupyter, Python and Pandas

Probability Distribution

The probability distribution function of a random variable \(X\) is

$$ F(x) = P(X \leq x) $$

where \(\{X \leq x\}\) denotes the event that \(X\) takes a value smaller than or equal to \(x\).

The derivative

$$ p(x) = \frac{dF(x)}{dx} $$

is called the probability density function of \(X\) .

Wikipedia: Probability distribution / Probability density function
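
A quick numerical illustration may help (a minimal sketch assuming NumPy and SciPy are available, which are not part of these notes): numerically differentiating the distribution function \(F(x)\) of a standard normal recovers its density \(p(x)\).

```python
import numpy as np
from scipy.stats import norm

x = np.linspace(-3, 3, 7)
F = norm.cdf(x)            # distribution function F(x) = P(X <= x)
p = norm.pdf(x)            # density p(x) = dF/dx

# a central finite difference of F approximates the density
h = 1e-5
p_numeric = (norm.cdf(x + h) - norm.cdf(x - h)) / (2 * h)
print(np.allclose(p, p_numeric, atol=1e-6))   # True
```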

Joint Distribution

The joint distribution function of two random variables \(X\) and \(Y\) is the probability of the joint event \(\{X \leq x, Y \leq y\}\), ie,

$$ F(x, y) = P(X \leq x, Y \leq y) $$

The derivative

$$ p(x, y) = \frac{\partial^2 F(x, y)}{\partial x \partial y} $$

is called the joint density function of \(X\) and \(Y\) .

  • \(X\) and \(Y\) are independent if and only if \(p(x,y) = p(x)p(y)\)

Wikipedia: Joint probability distribution
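
As a toy illustration (an assumed example using NumPy, not from the notes), a discrete joint distribution can be stored as a table; \(X\) and \(Y\) are independent exactly when the table equals the outer product of its marginals.

```python
import numpy as np

# joint probability table p(x, y): rows index x, columns index y
p_xy = np.array([[0.1, 0.2],
                 [0.3, 0.4]])

p_x = p_xy.sum(axis=1)          # marginal p(x)
p_y = p_xy.sum(axis=0)          # marginal p(y)

# independence holds iff p(x, y) = p(x) p(y) for every cell
print(np.allclose(p_xy, np.outer(p_x, p_y)))   # False: this table is not independent
```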

Conditional Distribution

Given two jointly distributed random variables \(X\) and \(Y\), the conditional probability distribution of \(Y\) given \(X\) is the probability distribution of \(Y\) when \(X\) is known to be a particular value.

The conditional density function of \(y\) given the occurrence of the value \(x\) is

$$ p(y|x) = \frac{p(x,y)}{p(x)} $$

Wikipedia: Conditional probability distribution
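
Continuing the assumed toy table from above, the conditional distribution \(p(y|x)\) is obtained by dividing each row of \(p(x,y)\) by the marginal \(p(x)\).

```python
import numpy as np

p_xy = np.array([[0.1, 0.2],
                 [0.3, 0.4]])
p_x = p_xy.sum(axis=1)

# p(y | x) = p(x, y) / p(x), one row per value of x
p_y_given_x = p_xy / p_x[:, None]
print(p_y_given_x)              # each row is a distribution over y
print(p_y_given_x.sum(axis=1))  # [1. 1.]
```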

Gaussian (Normal) Distribution

The probability density of the Gaussian distribution is

$$ p(x\ |\ \mu,\sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(x-\mu)^2}{2\sigma^2} \right) $$

where \(\mu\) is the mean and \(\sigma^2\) is the variance of the distribution.

  • very common in natural and social sciences
  • \(\sigma\) is the standard deviation
  • about 68% of values drawn from a Gaussian distribution are within one standard deviation \(\sigma\) of the mean \(\mu\)

Wikipedia: Normal distribution
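
A one-line check of the 68% figure (a sketch assuming SciPy): the probability mass within one standard deviation of the mean is \(F(\mu+\sigma) - F(\mu-\sigma)\).

```python
from scipy.stats import norm

mu, sigma = 0.0, 1.0
# probability of falling within one standard deviation of the mean
within_one_sigma = norm.cdf(mu + sigma, mu, sigma) - norm.cdf(mu - sigma, mu, sigma)
print(within_one_sigma)   # ~0.6827
```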

Multivariate Gaussian (Normal) Distribution

The probability density of the \(k\)-dimensional Gaussian distribution is

$$ p(\mathbf{x}\ |\ \boldsymbol{\mu},\boldsymbol{\Sigma}) = \frac{1}{\sqrt{(2\pi)^k |\boldsymbol{\Sigma}|}} \exp\left( -\frac{1}{2} (\mathbf{x}-\boldsymbol{\mu})^\top \boldsymbol{\Sigma}^{-1} (\mathbf{x}-\boldsymbol{\mu}) \right) $$

where \(\boldsymbol{\mu}\) is the \(k\times 1\) mean vector and \(\boldsymbol{\Sigma}\) is the \(k\times k\) covariance matrix.

  • \(|\boldsymbol{\Sigma}|\) and \(\boldsymbol{\Sigma}^{-1}\) are the determinant and the inverse of the covariance
  • the symbol \(^\top\) indicates the transpose

Wikipedia: Multivariate normal distribution
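
As a sanity check on the formula (a sketch assuming NumPy and SciPy), the density can be implemented directly and compared against scipy.stats.multivariate_normal.

```python
import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.3],
                  [0.3, 1.0]])
x = np.array([0.5, 0.5])

k = len(mu)
diff = x - mu
norm_const = 1.0 / np.sqrt((2 * np.pi) ** k * np.linalg.det(Sigma))
p = norm_const * np.exp(-0.5 * diff @ np.linalg.inv(Sigma) @ diff)

print(np.isclose(p, multivariate_normal(mu, Sigma).pdf(x)))   # True
```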

Notation

Formally we should write out \(p(X=x,Y=y)\) .

In practice we often use \(p(x,y)\) .

  • this looks very much like we might write a multivariate function, eg, \(f(x,y) = \frac{x}{y}\)
  • for a multivariate function though, \(f(x,y)\neq f(y,x)\)
  • however \(p(x,y) = p(y,x)\) because \(p(X=x,Y=y) = p(Y=y,X=x)\)

We now quickly review the rules of probability.

Normalisation

All distributions are normalised.

  • thinking of \(p(x)\) as a limiting relative frequency, where \(n_x\) is the number of times outcome \(x\) occurs in \(N\) trials, we have \(\sum_{x\in {\cal X}} n_{x} = N\), which gives
    $$ \sum_{x\in {\cal X}} p(x) = \lim_{N\rightarrow\infty} \frac{\sum_{x\in {\cal X}} n_x}{N} = \lim_{N\rightarrow\infty} \frac{N}{N} = 1 $$

A similar result can be derived for the marginal and conditional distributions.
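
For a continuous density the corresponding statement is \(\int_{-\infty}^{\infty} p(x)\,dx = 1\); a quick numerical check (a sketch assuming SciPy) confirms this for the Gaussian density.

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

# integrate the standard normal density over the real line
total, _ = quad(norm.pdf, -np.inf, np.inf)
print(total)   # ~1.0
```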

Product Rule and Sum Rule

The product rule of probability:

$$ \underbrace{p(x,y)}_{\text{joint probability}} = \underbrace{p(y|x)}_{\text{conditional probability}}\cdot\ p(x) $$

The sum rule of probability:

$$ \underbrace{p(y)}_{\text{marginal probability}} = \sum_{x\in {\cal X}} p(x,y) = \sum_{x\in {\cal X}} p(y|x)p(x) $$

Wikipedia: Product rule / Probability axioms
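
Both rules are easy to verify on the assumed toy joint table used earlier: the sum rule is a column sum, and the product rule reassembles the joint from \(p(y|x)\) and \(p(x)\).

```python
import numpy as np

p_xy = np.array([[0.1, 0.2],
                 [0.3, 0.4]])
p_x = p_xy.sum(axis=1)                 # marginal p(x)
p_y = p_xy.sum(axis=0)                 # sum rule: p(y) = sum_x p(x, y)
p_y_given_x = p_xy / p_x[:, None]      # conditional p(y | x)

# product rule: p(x, y) = p(y | x) p(x)
print(np.allclose(p_xy, p_y_given_x * p_x[:, None]))                # True
# sum rule written as sum_x p(y | x) p(x)
print(np.allclose(p_y, (p_y_given_x * p_x[:, None]).sum(axis=0)))   # True
```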

Bayes' Theorem

Bayes' theorem follows immediately from the product rule:

$$ p(x|y) = \frac{p(x,y)}{p(y)} = \frac{p(y|x)p(x)}{\displaystyle \sum_{x\in {\cal X}} p(y|x)p(x)} $$

Wikipedia: Bayes' theorem

(Example) There are two barrels in front of you. Barrel One (B1) contains 20 apples and 4 oranges. Barrel Two (B2) contains 4 apples and 8 oranges. You choose a barrel at random and select a fruit. It is an apple. What is the probability that the barrel was Barrel One?

(Solution) We are given that:

$$ \begin{align*} p(\text{apple}\ |\ \text{B}_1) = & \frac{20}{24} & \qquad p(\text{B}_1) = 0.5 \\ p(\text{apple}\ |\ \text{B}_2) = & \frac{4}{12} & \qquad p(\text{B}_2) = 0.5 \end{align*} $$

Use the sum rule to calculate

$$ p(\text{apple}) = p(\text{apple}\ |\ \text{B}_1)p(\text{B}_1) + p(\text{apple}\ |\ \text{B}_2)p(\text{B}_2) = \frac{20}{24}\times 0.5 + \frac{4}{12}\times 0.5 = \frac{7}{12} $$

and Bayes' theorem tells us that:

$$ p(\text{B}_1\ |\ \text{apple}) = \frac{p(\text{apple}\ |\ \text{B}_1)p(\text{B}_1)}{p(\text{apple})} = \frac{\frac{20}{24}\times 0.5}{\frac{7}{12}} = \frac{5}{7} $$
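
The same arithmetic in a few lines of Python (a sketch of the calculation above, assuming nothing beyond the given numbers):

```python
p_apple_given_b1, p_b1 = 20 / 24, 0.5
p_apple_given_b2, p_b2 = 4 / 12, 0.5

# sum rule: p(apple)
p_apple = p_apple_given_b1 * p_b1 + p_apple_given_b2 * p_b2

# Bayes' theorem: p(B1 | apple)
p_b1_given_apple = p_apple_given_b1 * p_b1 / p_apple
print(p_apple, p_b1_given_apple)   # 0.5833... (= 7/12), 0.7142... (= 5/7)
```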

Expected Value

The expected value (or mean, average) of a random variable \(X\) is

$$ \mathbb{E}[X] = \int_{-\infty}^{\infty} xp(x) dx $$
  • for a discrete random variable, \(\mathbb{E}[X] = \sum_{x\in {\cal X}} x p(x)\), summing over the set of all possible outcomes \({\cal X}\)

The expected value of a function \(f(x)\) is

$$ \mathbb{E}[f(x)] = \int_{-\infty}^{\infty} f(x) p(x) dx $$

Wikipedia: Expected value
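
A minimal numerical check (a sketch assuming SciPy, with an arbitrarily chosen Gaussian): quadrature of \(x\,p(x)\) and the sample mean of draws from \(p(x)\) both approximate \(\mathbb{E}[X]\).

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

mu, sigma = 1.5, 2.0

# E[X] by numerical integration of x p(x)
expectation, _ = quad(lambda x: x * norm.pdf(x, mu, sigma), -np.inf, np.inf)

# Monte Carlo estimate from samples
samples = norm.rvs(mu, sigma, size=100_000, random_state=0)
print(expectation, samples.mean())   # both ~1.5
```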

Variance

The variance is the expected value of \(f(x) = (x - \mathbb{E}[X])^2\), ie,

$$ \mathbb{V}ar[X] = \mathbb{E}[(X - \mathbb{E}[X])^2] = \int_{-\infty}^{\infty} (x - \mathbb{E}[X])^2 p(x) dx $$
  • for a discrete random variable, \(\mathbb{V}ar[X] = \sum_{x\in {\cal X}} (x - \mathbb{E}[X])^2 p(x)\)

(note) \( \mathbb{V}ar[X] = \mathbb{E}[(X - \mathbb{E}[X])^2] = \mathbb{E}[X^2 - 2X\mathbb{E}[X] + \mathbb{E}[X]^2] = \mathbb{E}[X^2] - 2\mathbb{E}[X]\mathbb{E}[X] + \mathbb{E}[X]^2 = \mathbb{E}[X^2] - \mathbb{E}[X]^2 \)

Wikipedia: Variance
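
The identity \(\mathbb{V}ar[X] = \mathbb{E}[X^2] - \mathbb{E}[X]^2\) from the note can be checked on samples from the same assumed Gaussian as above:

```python
import numpy as np
from scipy.stats import norm

samples = norm.rvs(1.5, 2.0, size=100_000, random_state=0)

var_definition = np.mean((samples - samples.mean()) ** 2)
var_identity = np.mean(samples ** 2) - samples.mean() ** 2
print(var_definition, var_identity)   # both ~4.0 (= sigma^2)
```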

Derivatives with Vectors

We have scalars \(x\), \(y\), and \(n\)- and \(m\)-dimensional vectors \(\mathbf{x}\), \(\mathbf{y}\), where

$$ \mathbf{x} = \left( \begin{array}{c} x_1 \\ \vdots \\ x_n \end{array} \right) \qquad \mathbf{y} = \left( \begin{array}{c} y_1 \\ \vdots \\ y_m \end{array} \right) $$

Derivatives with vectors using the denominator-layout notation:

$$ \frac{\partial \mathbf{y}}{\partial x} = \left( \frac{\partial y_1}{\partial x} \cdots \frac{\partial y_m}{\partial x} \right) \qquad \frac{\partial y}{\partial \mathbf{x}} = \left( \begin{array}{c} \frac{\partial y}{\partial x_1} \\ \vdots \\ \frac{\partial y}{\partial x_n} \end{array} \right) \qquad \frac{\partial \mathbf{y}}{\partial \mathbf{x}} = \left( \begin{array}{ccc} \frac{\partial y_1}{\partial x_1} & \cdots & \frac{\partial y_m}{\partial x_1} \\ \vdots & & \vdots \\ \frac{\partial y_1}{\partial x_n} & \cdots & \frac{\partial y_m}{\partial x_n} \end{array} \right) $$

Some Scalar-by-Vector Identities

For vectors \(\mathbf{a}\), \(\mathbf{w}\) and a square matrix \(\mathbf{A}\) :

$$ \begin{align*} \frac{\partial \mathbf{a}^\top \mathbf{w}}{\partial \mathbf{w}} = & \mathbf{a} \\ \frac{\partial \mathbf{w}^\top \mathbf{A} \mathbf{w}}{\partial \mathbf{w}} = & (\mathbf{A} + \mathbf{A}^\top)\mathbf{w} \end{align*} $$

Wikipedia: Matrix calculus
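
Both identities can be verified numerically with finite differences (a sketch assuming NumPy; the gradient is formed in denominator layout, one partial per component of \(\mathbf{w}\)):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
a = rng.standard_normal(n)
A = rng.standard_normal((n, n))
w = rng.standard_normal(n)

def numerical_grad(f, w, h=1e-6):
    # central finite differences, one partial per component of w
    g = np.zeros_like(w)
    for i in range(len(w)):
        e = np.zeros_like(w)
        e[i] = h
        g[i] = (f(w + e) - f(w - e)) / (2 * h)
    return g

# d(a^T w)/dw = a
print(np.allclose(numerical_grad(lambda w: a @ w, w), a))
# d(w^T A w)/dw = (A + A^T) w
print(np.allclose(numerical_grad(lambda w: w @ A @ w, w), (A + A.T) @ w))
```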
