Probability and Other Preliminaries

  • probability distribution
  • rules of probability
  • review of linear algebra

Preview

  • this week's topics mainly concern a brief review of probability and linear algebra
  • the lab class introduces the use of Jupyter, Python and Pandas

Probability Distribution

The probability distribution function of a random variable \(X\) is

$$ F(x) = P(X \leq x) $$

where \(\{X \leq x\}\) denotes the event that \(X\) takes a value smaller than or equal to \(x\).

The derivative

$$ p(x) = \frac{dF(x)}{dx} $$

is called the probability density function of \(X\) .

Wikipedia: Probability distribution / Probability density function
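
A quick numerical illustration may help (a minimal sketch assuming NumPy and SciPy are available, which are not part of these notes): numerically differentiating the distribution function \(F(x)\) of a standard normal recovers its density \(p(x)\).

```python
import numpy as np
from scipy.stats import norm

x = np.linspace(-3, 3, 7)
F = norm.cdf(x)            # distribution function F(x) = P(X <= x)
p = norm.pdf(x)            # density p(x) = dF/dx

# a central finite difference of F approximates the density
h = 1e-5
p_numeric = (norm.cdf(x + h) - norm.cdf(x - h)) / (2 * h)
print(np.allclose(p, p_numeric, atol=1e-6))   # True
```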

Joint Distribution

The joint distribution function of two random variables \(X\) and \(Y\) is the probability of the joint event \(\{X \leq x, Y \leq y\}\), ie,

$$ F(x, y) = P(X \leq x, Y \leq y) $$

The derivative

$$ p(x, y) = \frac{\partial^2 F(x, y)}{\partial x \partial y} $$

is called the joint density function of \(X\) and \(Y\) .

  • \(X\) and \(Y\) are independent if and only if \(p(x,y) = p(x)p(y)\)

Wikipedia: Joint probability distribution
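
As a toy illustration (an assumed example using NumPy, not from the notes), a discrete joint distribution can be stored as a table; \(X\) and \(Y\) are independent exactly when the table equals the outer product of its marginals.

```python
import numpy as np

# joint probability table p(x, y): rows index x, columns index y
p_xy = np.array([[0.1, 0.2],
                 [0.3, 0.4]])

p_x = p_xy.sum(axis=1)          # marginal p(x)
p_y = p_xy.sum(axis=0)          # marginal p(y)

# independence holds iff p(x, y) = p(x) p(y) for every cell
print(np.allclose(p_xy, np.outer(p_x, p_y)))   # False: this table is not independent
```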

Conditional Distribution

Given two jointly distributed random variables \(X\) and \(Y\), the conditional probability distribution of \(Y\) given \(X\) is the probability distribution of \(Y\) when \(X\) is known to be a particular value.

The conditional density function of \(y\) given the occurrence of the value \(x\) is

$$ p(y|x) = \frac{p(x,y)}{p(x)} $$

Wikipedia: Conditional probability distribution
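
Continuing the assumed toy table from above, the conditional distribution \(p(y|x)\) is obtained by dividing each row of \(p(x,y)\) by the marginal \(p(x)\).

```python
import numpy as np

p_xy = np.array([[0.1, 0.2],
                 [0.3, 0.4]])
p_x = p_xy.sum(axis=1)

# p(y | x) = p(x, y) / p(x), one row per value of x
p_y_given_x = p_xy / p_x[:, None]
print(p_y_given_x)              # each row is a distribution over y
print(p_y_given_x.sum(axis=1))  # [1. 1.]
```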

Gaussian (Normal) Distribution

The probability density of the Gaussian distribution is

$$ p(x\ |\ \mu,\sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(x-\mu)^2}{2\sigma^2} \right) $$

where \(\mu\) is the mean and \(\sigma^2\) is the variance of the distribution.

  • very common in natural and social sciences
  • \(\sigma\) is the standard deviation
  • about 68% of values drawn from a Gaussian distribution are within one standard deviation \(\sigma\) of the mean \(\mu\)

Wikipedia: Normal distribution
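
A one-line check of the 68% figure (a sketch assuming SciPy): the probability mass within one standard deviation of the mean is \(F(\mu+\sigma) - F(\mu-\sigma)\).

```python
from scipy.stats import norm

mu, sigma = 0.0, 1.0
# probability of falling within one standard deviation of the mean
within_one_sigma = norm.cdf(mu + sigma, mu, sigma) - norm.cdf(mu - sigma, mu, sigma)
print(within_one_sigma)   # ~0.6827
```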

Multivariate Gaussian (Normal) Distribution

The probability density of the \(k\)-dimensional Gaussian distribution is

$$ p(\mathbf{x}\ |\ \boldsymbol{\mu},\boldsymbol{\Sigma}) = \frac{1}{\sqrt{(2\pi)^k |\boldsymbol{\Sigma}|}} \exp\left( -\frac{1}{2} (\mathbf{x}-\boldsymbol{\mu})^\top \boldsymbol{\Sigma}^{-1} (\mathbf{x}-\boldsymbol{\mu}) \right) $$

where \(\boldsymbol{\mu}\) is the \(k\times 1\) mean vector and \(\boldsymbol{\Sigma}\) is the \(k\times k\) covariance matrix.

  • \(|\boldsymbol{\Sigma}|\) and \(\boldsymbol{\Sigma}^{-1}\) are the determinant and the inverse of the covariance
  • the symbol \(^\top\) indicates the transpose

Wikipedia: Multivariate normal distribution
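
As a sanity check on the formula (a sketch assuming NumPy and SciPy), the density can be implemented directly and compared against scipy.stats.multivariate_normal.

```python
import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.3],
                  [0.3, 1.0]])
x = np.array([0.5, 0.5])

k = len(mu)
diff = x - mu
norm_const = 1.0 / np.sqrt((2 * np.pi) ** k * np.linalg.det(Sigma))
p = norm_const * np.exp(-0.5 * diff @ np.linalg.inv(Sigma) @ diff)

print(np.isclose(p, multivariate_normal(mu, Sigma).pdf(x)))   # True
```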

Notation

Formally we should write out \(p(X=x,Y=y)\) .

In practice we often use \(p(x,y)\) .

  • this looks very much like we might write a multivariate function, eg, \(f(x,y) = \frac{x}{y}\)
  • for a multivariate function though, \(f(x,y)\neq f(y,x)\)
  • however \(p(x,y) = p(y,x)\) because \(p(X=x,Y=y) = p(Y=y,X=x)\)

We now quickly review the rules of probability.

Normalisation

All distributions are normalised.

  • thinking of \(p(x)\) as a limiting relative frequency, where \(n_x\) is the number of times outcome \(x\) occurs in \(N\) trials, we have \(\sum_{x\in {\cal X}} n_{x} = N\), which gives
    $$ \sum_{x\in {\cal X}} p(x) = \lim_{N\rightarrow\infty} \frac{\sum_{x\in {\cal X}} n_x}{N} = \lim_{N\rightarrow\infty} \frac{N}{N} = 1 $$

A similar result can be derived for the marginal and conditional distributions.
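
For a continuous density the corresponding statement is \(\int_{-\infty}^{\infty} p(x)\,dx = 1\); a quick numerical check (a sketch assuming SciPy) confirms this for the Gaussian density.

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

# integrate the standard normal density over the real line
total, _ = quad(norm.pdf, -np.inf, np.inf)
print(total)   # ~1.0
```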

Product Rule and Sum Rule

The product rule of probability:

$$ \underbrace{p(x,y)}_{\text{joint probability}} = \underbrace{p(y|x)}_{\text{conditional probability}}\cdot\ p(x) $$

The sum rule of probability:

$$ \underbrace{p(y)}_{\text{marginal probability}} = \sum_{x\in {\cal X}} p(x,y) = \sum_{x\in {\cal X}} p(y|x)p(x) $$

Wikipedia: Product rule / Probability axioms
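
Both rules are easy to verify on the assumed toy joint table used earlier: the sum rule is a column sum, and the product rule reassembles the joint from \(p(y|x)\) and \(p(x)\).

```python
import numpy as np

p_xy = np.array([[0.1, 0.2],
                 [0.3, 0.4]])
p_x = p_xy.sum(axis=1)                 # marginal p(x)
p_y = p_xy.sum(axis=0)                 # sum rule: p(y) = sum_x p(x, y)
p_y_given_x = p_xy / p_x[:, None]      # conditional p(y | x)

# product rule: p(x, y) = p(y | x) p(x)
print(np.allclose(p_xy, p_y_given_x * p_x[:, None]))                # True
# sum rule written as sum_x p(y | x) p(x)
print(np.allclose(p_y, (p_y_given_x * p_x[:, None]).sum(axis=0)))   # True
```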

Bayes' Theorem

Bayes' theorem follows immediately from the product rule:

$$ p(x|y) = \frac{p(x,y)}{p(y)} = \frac{p(y|x)p(x)}{\displaystyle \sum_{x\in {\cal X}} p(y|x)p(x)} $$

Wikipedia: Bayes' theorem

(Example) There are two barrels in front of you. Barrel One (B1) contains 20 apples and 4 oranges. Barrel Two (B2) contains 4 apples and 8 oranges. You choose a barrel at random and select a fruit. It is an apple. What is the probability that the barrel was Barrel One?

(Solution) We are given that:

$$ \begin{align*} p(\text{apple}\ |\ \text{B}_1) = & \frac{20}{24} & \qquad p(\text{B}_1) = 0.5 \\ p(\text{apple}\ |\ \text{B}_2) = & \frac{4}{12} & \qquad p(\text{B}_2) = 0.5 \end{align*} $$

Use the sum rule to calculate

$$ p(\text{apple}) = p(\text{apple}\ |\ \text{B}_1)p(\text{B}_1) + p(\text{apple}\ |\ \text{B}_2)p(\text{B}_2) = \frac{20}{24}\times 0.5 + \frac{4}{12}\times 0.5 = \frac{7}{12} $$

and Bayes' theorem tells us that:

$$ p(\text{B}_1\ |\ \text{apple}) = \frac{p(\text{apple}\ |\ \text{B}_1)p(\text{B}_1)}{p(\text{apple})} = \frac{\frac{20}{24}\times 0.5}{\frac{7}{12}} = \frac{5}{7} $$
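
The same arithmetic in a few lines of Python (a sketch of the calculation above, assuming nothing beyond the given numbers):

```python
p_apple_given_b1, p_b1 = 20 / 24, 0.5
p_apple_given_b2, p_b2 = 4 / 12, 0.5

# sum rule: p(apple)
p_apple = p_apple_given_b1 * p_b1 + p_apple_given_b2 * p_b2

# Bayes' theorem: p(B1 | apple)
p_b1_given_apple = p_apple_given_b1 * p_b1 / p_apple
print(p_apple, p_b1_given_apple)   # 0.5833... (= 7/12), 0.7142... (= 5/7)
```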

Expected Value

The expected value (or mean, average) of a random variable \(X\) is

$$ \mathbb{E}[X] = \int_{-\infty}^{\infty} xp(x) dx $$
  • for a discrete random variable, \(\mathbb{E}[X] = \sum_{x\in {\cal X}} x p(x)\), summing over the set of all possible outcomes \({\cal X}\)

The expected value of a function \(f(x)\) is

$$ \mathbb{E}[f(x)] = \int_{-\infty}^{\infty} f(x) p(x) dx $$

Wikipedia: Expected value
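
A minimal numerical check (a sketch assuming SciPy, with an arbitrarily chosen Gaussian): quadrature of \(x\,p(x)\) and the sample mean of draws from \(p(x)\) both approximate \(\mathbb{E}[X]\).

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

mu, sigma = 1.5, 2.0

# E[X] by numerical integration of x p(x)
expectation, _ = quad(lambda x: x * norm.pdf(x, mu, sigma), -np.inf, np.inf)

# Monte Carlo estimate from samples
samples = norm.rvs(mu, sigma, size=100_000, random_state=0)
print(expectation, samples.mean())   # both ~1.5
```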

Variance

The variance is the expected value of \(f(x) = (x - \mathbb{E}[X])^2\), ie,

$$ \mathbb{V}ar[X] = \mathbb{E}[(X - \mathbb{E}[X])^2] = \int_{-\infty}^{\infty} (x - \mathbb{E}[X])^2 p(x) dx $$
  • for a discrete random variable, \(\mathbb{V}ar[X] = \sum_{x\in {\cal X}} (x - \mathbb{E}[X])^2 p(x)\)

(note) \( \mathbb{V}ar[X] = \mathbb{E}[(X - \mathbb{E}[X])^2] = \mathbb{E}[X^2 - 2X\mathbb{E}[X] + \mathbb{E}[X]^2] = \mathbb{E}[X^2] - 2\mathbb{E}[X]\mathbb{E}[X] + \mathbb{E}[X]^2 = \mathbb{E}[X^2] - \mathbb{E}[X]^2 \)

Wikipedia: Variance
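
The identity \(\mathbb{V}ar[X] = \mathbb{E}[X^2] - \mathbb{E}[X]^2\) from the note can be checked on samples from the same assumed Gaussian as above:

```python
import numpy as np
from scipy.stats import norm

samples = norm.rvs(1.5, 2.0, size=100_000, random_state=0)

var_definition = np.mean((samples - samples.mean()) ** 2)
var_identity = np.mean(samples ** 2) - samples.mean() ** 2
print(var_definition, var_identity)   # both ~4.0 (= sigma^2)
```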

Derivatives with Vectors

We have scalars \(x\), \(y\), and \(n\)- and \(m\)-dimensional vectors \(\mathbf{x}\), \(\mathbf{y}\), where

$$ \mathbf{x} = \left( \begin{array}{c} x_1 \\ \vdots \\ x_n \end{array} \right) \qquad \mathbf{y} = \left( \begin{array}{c} y_1 \\ \vdots \\ y_m \end{array} \right) $$

Derivatives with vectors using the denominator-layout notation:

$$ \frac{\partial \mathbf{y}}{\partial x} = \left( \frac{\partial y_1}{\partial x} \cdots \frac{\partial y_m}{\partial x} \right) \qquad \frac{\partial y}{\partial \mathbf{x}} = \left( \begin{array}{c} \frac{\partial y}{\partial x_1} \\ \vdots \\ \frac{\partial y}{\partial x_n} \end{array} \right) \qquad \frac{\partial \mathbf{y}}{\partial \mathbf{x}} = \left( \begin{array}{ccc} \frac{\partial y_1}{\partial x_1} & \cdots & \frac{\partial y_m}{\partial x_1} \\ \vdots & & \vdots \\ \frac{\partial y_1}{\partial x_n} & \cdots & \frac{\partial y_m}{\partial x_n} \end{array} \right) $$

Some Scalar-by-Vector Identities

For vectors \(\mathbf{a}\), \(\mathbf{w}\) and a square matrix \(\mathbf{A}\) :

$$ \begin{align*} \frac{\partial \mathbf{a}^\top \mathbf{w}}{\partial \mathbf{w}} = & \mathbf{a} \\ \frac{\partial \mathbf{w}^\top \mathbf{A} \mathbf{w}}{\partial \mathbf{w}} = & (\mathbf{A} + \mathbf{A}^\top)\mathbf{w} \end{align*} $$

Wikipedia: Matrix calculus
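
Both identities can be verified numerically with finite differences (a sketch assuming NumPy; the gradient is formed in denominator layout, one partial per component of \(\mathbf{w}\)):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
a = rng.standard_normal(n)
A = rng.standard_normal((n, n))
w = rng.standard_normal(n)

def numerical_grad(f, w, h=1e-6):
    # central finite differences, one partial per component of w
    g = np.zeros_like(w)
    for i in range(len(w)):
        e = np.zeros_like(w)
        e[i] = h
        g[i] = (f(w + e) - f(w - e)) / (2 * h)
    return g

# d(a^T w)/dw = a
print(np.allclose(numerical_grad(lambda w: a @ w, w), a))
# d(w^T A w)/dw = (A + A^T) w
print(np.allclose(numerical_grad(lambda w: w @ A @ w, w), (A + A.T) @ w))
```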
