Pattern Recognition and Machine Learning: Chapter 1
1.2 Probability Theory
The multivariate Gaussian is given by \(N(x \vert{} \mu, \Sigma) = \frac{1}{(2\pi)^{D/2}}\frac{1}{\vert{} \Sigma \vert{}^{1/2}} \exp\left(-\frac{1}{2}(x - \mu)^T\Sigma^{-1}(x-\mu)\right)\)
where we have a \(D\)-dimensional vector \(x\) of continuous variables, a \(D \times D\) covariance matrix \(\Sigma\) with \(\Sigma_{ij} = \mathrm{cov}(x_i, x_j)\), and \(\vert{} \Sigma \vert{}\) denotes the determinant of the covariance matrix.
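A minimal numerical sketch of this density using NumPy; the function name and the example parameters are illustrative, not from the text:

```python
import numpy as np

def gaussian_density(x, mu, Sigma):
    """Evaluate the multivariate Gaussian density N(x | mu, Sigma)."""
    D = mu.shape[0]
    diff = x - mu
    # Normalisation constant: (2*pi)^(D/2) * |Sigma|^(1/2)
    norm = (2 * np.pi) ** (D / 2) * np.sqrt(np.linalg.det(Sigma))
    # Quadratic form (x - mu)^T Sigma^{-1} (x - mu), computed with a solve
    # rather than an explicit matrix inverse
    quad = diff @ np.linalg.solve(Sigma, diff)
    return np.exp(-0.5 * quad) / norm

# Example: a 2-D Gaussian evaluated at a single point (illustrative values)
mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.3],
                  [0.3, 1.0]])
print(gaussian_density(np.array([0.5, 0.5]), mu, Sigma))
```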
Example: Maximum Likelihood with Gaussian Distribution
Given observations \(x = (x_1, \ldots, x_N)^T\) of a scalar variable, assumed IID, where each \(x_n\) is drawn from a Gaussian distribution with mean \(\mu\) and variance \(\sigma^2\), we can write down the likelihood:
\[p(x \vert{} \mu, \sigma^2) = \prod_{n=1}^{N} N(x_n \vert{} \mu, \sigma^2)\]The log-likelihood is then \(\ln p(x \vert{} \mu, \sigma^2) = \frac{-1}{2\sigma^2}\sum_{n=1}^{N}(x_n - \mu)^2 - \frac{N}{2}\ln(\sigma^2) - \frac{N}{2}\ln(2\pi)\)
Maximum likelihood gives us:
\(\hat{\mu} = \frac{1}{N}\sum_{n=1}^{N}x_n\) and \(\hat{\sigma}^2 = \frac{1}{N}\sum_{n=1}^{N}(x_n - \hat{\mu})^2\)
i.e., the sample mean and the (uncorrected) sample variance. If we apply Bessel’s correction and multiply the sample variance by \(\frac{N}{N-1}\), we obtain an unbiased estimate of the variance.
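A minimal numerical sketch of these estimators (the data-generating parameters below are illustrative, not from the text):

```python
import numpy as np

rng = np.random.default_rng(0)

# Draw N IID samples from a Gaussian with (illustrative) true parameters
mu_true, sigma2_true, N = 2.0, 4.0, 1000
x = rng.normal(mu_true, np.sqrt(sigma2_true), size=N)

mu_hat = x.mean()                            # maximum-likelihood mean
sigma2_hat = ((x - mu_hat) ** 2).mean()      # maximum-likelihood (biased) variance
sigma2_unbiased = sigma2_hat * N / (N - 1)   # Bessel's correction

print(mu_hat, sigma2_hat, sigma2_unbiased)
```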
Downsides of simple maximum likelihood estimation
- We can show that while the mean obtained through this estimation is unbiased, the variance is a biased estimate of the true distribution variance.
- Biased in this context means that \(E[\hat{\theta}] - \theta \neq{} 0\)
- Conversely, for an unbiased estimator, if you take many samples of \(x\) (with the number of samples tending towards infinity) and compute \(\hat{\theta}\) for each sample, then the difference between the average estimate and the true parameter tends to \(0\).
- Practically, this doesn’t say much since:
- It doesn’t make any statement about the difference between a single point estimate and the true parameter
- In statistical learning, the underlying parameter is often unknown, so this quantity is impossible to compute.
- But it does provide some insight into why model averaging and ensemble learning work well.
- We have \(E[\hat{\mu}] = E[\frac{1}{N}\sum_{n=1}^{N}x_n] = \frac{1}{N}\sum_{n=1}^{N}E[x_n] = \frac{1}{N}\sum_{n=1}^{N}\mu = \mu\)
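- Similarly, using \(E[x_n^2] = \mu^2 + \sigma^2\) and \(E[\hat{\mu}^2] = \mu^2 + \frac{\sigma^2}{N}\) (since \(\mathrm{var}(\hat{\mu}) = \frac{\sigma^2}{N}\) for IID samples), we get \(E[\hat{\sigma}^2] = E[\frac{1}{N}\sum_{n=1}^{N}x_n^2 - \hat{\mu}^2] = (\mu^2 + \sigma^2) - (\mu^2 + \frac{\sigma^2}{N}) = \frac{N-1}{N}\sigma^2\), which underestimates the true variance by exactly the factor that Bessel’s correction removes.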