Probability and Other Preliminaries
- probability distribution
- rules of probability
- review of linear algebra
Preview
- the topics this week mainly concern a brief review of probability and linear algebra
- the lab class introduces the use of Jupyter, Python and Pandas
Probability Distribution
The probability distribution function of a random variable \(X\) is
$$
F(x) = P(X \leq x)
$$
where \(\{X \leq x\}\) denotes the event consisting of all outcomes in which \(X\) takes a value smaller than or equal to \(x\).
The derivative
$$
p(x) = \frac{dF(x)}{dx}
$$
is called the probability density function of \(X\) .
Wikipedia:
Probability distribution /
Probability density function
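For illustration, a minimal Python sketch (assuming NumPy and SciPy, as in the lab environment) checking that the density is the derivative of the distribution function for a standard normal:

```python
import numpy as np
from scipy.stats import norm

x = np.linspace(-3, 3, 601)
F = norm.cdf(x)                   # distribution function F(x) = P(X <= x)
p_numeric = np.gradient(F, x)     # numerical derivative dF(x)/dx
p_exact = norm.pdf(x)             # closed-form density

print(np.max(np.abs(p_numeric - p_exact)))   # small, limited by finite-difference error
```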
Joint Distribution
The joint distribution function of two random variables \(X\) and \(Y\) is the probability of the joint event \(\{X \leq x, Y \leq y\}\), i.e.,
$$
F(x, y) = P(X \leq x, Y \leq y)
$$
The derivative
$$
p(x, y) = \frac{\partial^2 F(x, y)}{\partial x \partial y}
$$
is called the joint density function of \(X\) and \(Y\) .
- \(X\) and \(Y\) are independent if and only if \(p(x,y) = p(x)p(y)\)
Wikipedia: Joint probability distribution
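A minimal Python sketch of these definitions on a discrete example (the joint table below is made up for illustration): the marginals are obtained by summing the joint table, and independence is checked by comparing the table with the product of its marginals.

```python
import numpy as np

# hypothetical joint probability table p(x, y); rows index x, columns index y
p_xy = np.array([[0.10, 0.20],
                 [0.15, 0.30],
                 [0.05, 0.20]])

p_x = p_xy.sum(axis=1)   # marginal p(x)
p_y = p_xy.sum(axis=0)   # marginal p(y)

# X and Y are independent iff p(x, y) = p(x) p(y) for every cell
print(np.allclose(p_xy, np.outer(p_x, p_y)))   # False for this particular table
```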
Conditional Distribution
Given two jointly distributed random variables \(X\) and \(Y\), the conditional probability distribution of \(Y\) given \(X\) is the probability distribution of \(Y\) when \(X\) is known to be a particular value.
The conditional density function of \(y\) given the occurrence of the value \(x\) is
$$
p(y|x) = \frac{p(x,y)}{p(x)}
$$
Wikipedia: Conditional probability distribution
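Continuing the discrete sketch above, the conditional table follows by dividing each row of the (made-up) joint table by the corresponding marginal:

```python
import numpy as np

p_xy = np.array([[0.10, 0.20],           # hypothetical joint table p(x, y)
                 [0.15, 0.30],
                 [0.05, 0.20]])

p_x = p_xy.sum(axis=1, keepdims=True)    # marginal p(x)
p_y_given_x = p_xy / p_x                 # conditional p(y | x) = p(x, y) / p(x)

print(p_y_given_x.sum(axis=1))           # each row sums to 1: [1. 1. 1.]
```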
Gaussian (Normal) Distribution
The probability density of the Gaussian distribution is
$$
p(x\ |\ \mu,\sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(x-\mu)^2}{2\sigma^2} \right)
$$
where \(\mu\) is the mean and \(\sigma^2\) is the variance of the distribution.
- very common in natural and social sciences
- \(\sigma\) is the standard deviation
Wikipedia: Normal distribution
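A minimal Python sketch of the density formula (the function name is arbitrary; SciPy's norm.pdf is used only as a cross-check):

```python
import numpy as np
from scipy.stats import norm

def gaussian_pdf(x, mu, sigma2):
    """Density of N(mu, sigma2), written directly from the formula above."""
    return np.exp(-(x - mu) ** 2 / (2.0 * sigma2)) / np.sqrt(2.0 * np.pi * sigma2)

x = np.linspace(-4.0, 4.0, 9)
print(np.allclose(gaussian_pdf(x, mu=0.0, sigma2=1.0), norm.pdf(x)))   # True
```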
The Normal (Gaussian) Distribution
- about 68% of values drawn from a Gaussian distribution lie within one standard deviation \(\sigma\) of the mean \(\mu\)
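The 68% figure can be checked empirically with a quick sampling sketch (sample size chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 0.0, 1.0
samples = rng.normal(mu, sigma, size=1_000_000)

# fraction of samples within one standard deviation of the mean
print(np.mean(np.abs(samples - mu) <= sigma))   # close to 0.68 (more precisely 0.6827)
```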
Multivariate Gaussian (Normal) Distribution
The probability density of the \(k\)-dimensional Gaussian distribution is
$$
p(\mathbf{x}\ |\ \boldsymbol{\mu},\boldsymbol{\Sigma}) = \frac{1}{\sqrt{(2\pi)^k |\boldsymbol{\Sigma}|}} \exp\left( -\frac{1}{2} (\mathbf{x}-\boldsymbol{\mu})^\top \boldsymbol{\Sigma}^{-1} (\mathbf{x}-\boldsymbol{\mu}) \right)
$$
where \(\boldsymbol{\mu}\) is the \(k\times 1\) mean vector and \(\boldsymbol{\Sigma}\) is the \(k\times k\) covariance matrix.
- \(|\boldsymbol{\Sigma}|\) and \(\boldsymbol{\Sigma}^{-1}\) are the determinant and the inverse of the covariance matrix
- the symbol \(^\top\) denotes the transpose
Wikipedia: Multivariate normal distribution
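A minimal NumPy sketch of the multivariate density (mean, covariance and evaluation point are made up; an explicit inverse is used here for clarity, though it is not the numerically preferred approach):

```python
import numpy as np

def mvn_pdf(x, mu, Sigma):
    """k-dimensional Gaussian density, following the formula above."""
    k = len(mu)
    diff = x - mu
    norm_const = np.sqrt((2.0 * np.pi) ** k * np.linalg.det(Sigma))
    quad = diff @ np.linalg.inv(Sigma) @ diff
    return np.exp(-0.5 * quad) / norm_const

mu = np.array([0.0, 0.0])
Sigma = np.array([[1.0, 0.3],
                  [0.3, 2.0]])
print(mvn_pdf(np.array([0.5, -1.0]), mu, Sigma))
# agrees with scipy.stats.multivariate_normal(mean=mu, cov=Sigma).pdf([0.5, -1.0])
```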
Notation
Formally we should write out \(p(X=x, Y=y)\).
In practice we often use \(p(x,y)\).
- this looks very much like how we might write a multivariate function, e.g., \(f(x,y) = \frac{x}{y}\)
- for a multivariate function though, \(f(x,y)\neq f(y,x)\)
- however \(p(x,y) = p(y,x)\) because \(p(X=x,Y=y) = p(Y=y,X=x)\)
We now quickly review the rules of probability.
Normalisation
All probability distributions are normalised, i.e., \(\int_{-\infty}^{\infty} p(x)\, dx = 1\) (or \(\sum_{x\in {\cal X}} p(x) = 1\) in the discrete case).
A similar result can be derived for the marginal and conditional distributions.
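As a quick numerical check (a sketch assuming SciPy), integrating a Gaussian density over the whole real line returns 1:

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

total, _ = quad(norm.pdf, -np.inf, np.inf)   # integrate the standard normal density
print(total)                                 # 1.0 up to numerical error
```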
Product Rule and Sum Rule
The product rule of probability:
$$
\underbrace{p(x,y)}_{\text{joint probability}} = \underbrace{p(y|x)}_{\text{conditional probability}}\cdot\ p(x)
$$
The sum rule of probability:
$$
\underbrace{p(y)}_{\text{marginal probability}} = \sum_{x\in {\cal X}} p(x,y) = \sum_{x\in {\cal X}} p(y|x)p(x)
$$
Wikipedia:
Product rule /
Probability axioms
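A minimal discrete Python sketch of both rules (the prior \(p(x)\) and the conditional table \(p(y|x)\) are made up):

```python
import numpy as np

p_x = np.array([0.2, 0.5, 0.3])              # hypothetical p(x) over three states of X
p_y_given_x = np.array([[0.7, 0.3],          # hypothetical p(y | x); rows index x
                        [0.4, 0.6],
                        [0.1, 0.9]])

p_xy = p_y_given_x * p_x[:, None]            # product rule: p(x, y) = p(y | x) p(x)
p_y = p_xy.sum(axis=0)                       # sum rule:     p(y) = sum_x p(y | x) p(x)

print(p_y, p_y.sum())                        # marginal over Y; sums to 1
```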
Bayes' Theorem
Bayes' theorem follows immediately from the product rule:
$$
p(x|y) = \frac{p(x,y)}{p(y)} = \frac{p(y|x)p(x)}{\displaystyle \sum_{x\in {\cal X}} p(y|x)p(x)}
$$
Wikipedia: Bayes' theorem
(Example)
There are two barrels in front of you.
Barrel One (B1) contains 20 apples and 4 oranges.
Barrel Two (B2) contains 4 apples and 8 oranges.
You choose a barrel randomly and select a fruit.
It is an apple.
What is the probability that the barrel was Barrel One?
(Solution) we are given that:
$$
\begin{align*}
p(\text{apple}\ |\ \text{B}_1) = & \frac{20}{24} & \qquad p(\text{B}_1) = 0.5 \\
p(\text{apple}\ |\ \text{B}_2) = & \frac{4}{12} & \qquad p(\text{B}_2) = 0.5
\end{align*}
$$
Use the sum rule to calculate
$$
p(\text{apple}) = p(\text{apple}\ |\ \text{B}_1)p(\text{B}_1) + p(\text{apple}\ |\ \text{B}_2)p(\text{B}_2) = \frac{20}{24}\times 0.5 + \frac{4}{12}\times 0.5 = \frac{7}{12}
$$
and Bayes' theorem tells us that:
$$
p(\text{B}_1\ |\ \text{apple}) = \frac{p(\text{apple}\ |\ \text{B}_1)p(\text{B}_1)}{p(\text{apple})} = \frac{\frac{20}{24}\times 0.5}{\frac{7}{12}} = \frac{5}{7}
$$
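The same numbers reproduced in a few lines of Python:

```python
# barrel example: likelihoods, priors, sum rule, then Bayes' theorem
p_apple_given_b1 = 20 / 24
p_apple_given_b2 = 4 / 12
p_b1 = p_b2 = 0.5

p_apple = p_apple_given_b1 * p_b1 + p_apple_given_b2 * p_b2   # 7/12
p_b1_given_apple = p_apple_given_b1 * p_b1 / p_apple          # 5/7

print(p_apple, p_b1_given_apple)   # 0.5833..., 0.7142...
```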
Expected Value
The expected value (or mean, average) of a random variable \(X\) is
$$
\mathbb{E}[X] = \int_{-\infty}^{\infty} xp(x) dx
$$
- in the discrete case, \(\mathbb{E}[X] = \sum_{x\in {\cal X}} x\, p(x)\), where \({\cal X}\) is the set of all possible values
The expected value of a function \(f(x)\) is
$$
\mathbb{E}[f(x)] = \int_{-\infty}^{\infty} f(x) p(x) dx
$$
Wikipedia: Expected value
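A minimal sketch (assuming SciPy) evaluating these integrals numerically for a Gaussian example with made-up parameters:

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

mu, sigma = 1.0, 2.0   # hypothetical N(1, 4) example

# E[X] = integral of x p(x) dx
mean, _ = quad(lambda x: x * norm.pdf(x, loc=mu, scale=sigma), -np.inf, np.inf)

# E[f(X)] for f(x) = x**2
second_moment, _ = quad(lambda x: x ** 2 * norm.pdf(x, loc=mu, scale=sigma), -np.inf, np.inf)

print(mean, second_moment)   # ~1.0 and ~5.0 (= mu^2 + sigma^2)
```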
Variance
The variance is the expected value of \(f(x) = (x - \mathbb{E}[X])^2\), i.e.,
$$
\mathbb{V}ar[X] = \mathbb{E}[(X - \mathbb{E}[X])^2] = \int_{-\infty}^{\infty} (x - \mathbb{E}[X])^2 p(x) dx
$$
- in the discrete case, \(\mathbb{V}ar[X] = \sum_{x\in {\cal X}} (x - \mathbb{E}[X])^2 p(x)\)
(note)
$$
\mathbb{V}ar[X]
= \mathbb{E}[(X - \mathbb{E}[X])^2]
= \mathbb{E}[X^2 - 2X\mathbb{E}[X] + \mathbb{E}[X]^2]
= \mathbb{E}[X^2] - 2\mathbb{E}[X]\mathbb{E}[X] + \mathbb{E}[X]^2
= \mathbb{E}[X^2] - \mathbb{E}[X]^2
$$
Wikipedia: Variance
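The identity in the note can be checked with samples (a sketch; the distribution and sample size are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(1.0, 2.0, size=1_000_000)         # samples with true variance 4

var_definition = np.mean((x - x.mean()) ** 2)    # E[(X - E[X])^2]
var_shortcut = np.mean(x ** 2) - x.mean() ** 2   # E[X^2] - E[X]^2

print(var_definition, var_shortcut)              # both close to 4
```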
Derivatives with Vectors
We have scalars \(x\), \(y\), and \(n\)- and \(m\)-dimensional vectors \(\mathbf{x}\), \(\mathbf{y}\), where
$$
\mathbf{x} = \left( \begin{array}{c} x_1 \\ \vdots \\ x_n \end{array} \right)
\qquad
\mathbf{y} = \left( \begin{array}{c} y_1 \\ \vdots \\ y_m \end{array} \right)
$$
Derivatives with vectors using the denominator-layout notation:
$$
\frac{\partial \mathbf{y}}{\partial x} = \left( \frac{\partial y_1}{\partial x} \cdots \frac{\partial y_m}{\partial x} \right)
\qquad
\frac{\partial y}{\partial \mathbf{x}} = \left( \begin{array}{c} \frac{\partial y}{\partial x_1} \\ \vdots \\ \frac{\partial y}{\partial x_n} \end{array} \right)
\qquad
\frac{\partial \mathbf{y}}{\partial \mathbf{x}} = \left( \begin{array}{ccc} \frac{\partial y_1}{\partial x_1} & \cdots & \frac{\partial y_m}{\partial x_1} \\ \vdots & & \vdots \\ \frac{\partial y_1}{\partial x_n} & \cdots & \frac{\partial y_m}{\partial x_n} \end{array} \right)
$$
Some Scalar-by-Vector Identities
For vectors \(\mathbf{a}\), \(\mathbf{w}\) and a square matrix \(\mathbf{A}\):
$$
\begin{align*}
\frac{\partial \mathbf{a}^\top \mathbf{w}}{\partial \mathbf{w}} = & \mathbf{a} \\
\frac{\partial \mathbf{w}^\top \mathbf{A} \mathbf{w}}{\partial \mathbf{w}} = & (\mathbf{A} + \mathbf{A}^\top)\mathbf{w}
\end{align*}
$$
Wikipedia: Matrix calculus
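Both identities can be verified numerically with a small finite-difference sketch (all names and sizes below are made up):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 4
a = rng.normal(size=n)
A = rng.normal(size=(n, n))
w = rng.normal(size=n)

def numerical_gradient(f, w, eps=1e-6):
    """Central finite-difference gradient of a scalar function f at w."""
    g = np.zeros_like(w)
    for i in range(len(w)):
        e = np.zeros_like(w)
        e[i] = eps
        g[i] = (f(w + e) - f(w - e)) / (2 * eps)
    return g

# d(a^T w)/dw = a
print(np.allclose(numerical_gradient(lambda v: a @ v, w), a))
# d(w^T A w)/dw = (A + A^T) w
print(np.allclose(numerical_gradient(lambda v: v @ A @ v, w), (A + A.T) @ w, atol=1e-5))
```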