# Pattern Recognition and Machine Learning: Chapter 1

#### 1.2 Probability Theory

The multivariate Gaussian is given by $$N(x \vert{} \mu, \Sigma) = \frac{1}{(2\pi)^{D/2}}\frac{1}{\vert{} \Sigma \vert{}^{1/2}} \exp\left(-\frac{1}{2}(x - \mu)^T\Sigma^{-1}(x-\mu)\right)$$

Here $$x$$ is a $$D$$-dimensional vector of continuous variables, $$\mu$$ is the $$D$$-dimensional mean vector, $$\Sigma$$ is the $$D \times D$$ covariance matrix with $$\Sigma_{ij} = \text{cov}(x_i, x_j)$$, and $$\vert{} \Sigma \vert{}$$ denotes the determinant of the covariance matrix.
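
As a quick sanity check on the formula, consider the special case $$D = 1$$ with $$\Sigma = \sigma^2$$, so that $$\vert{} \Sigma \vert{}^{1/2} = \sigma$$ and $$\Sigma^{-1} = 1/\sigma^2$$. The density then reduces to the familiar univariate Gaussian:

$$N(x \vert{} \mu, \sigma^2) = \frac{1}{(2\pi\sigma^2)^{1/2}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$$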

#### Example: Maximum Likelihood with Gaussian Distribution

Suppose we have $$N$$ observations $$x = (x_1, \ldots, x_N)^T$$ of a scalar variable. Under the IID assumption, with each $$x_n$$ drawn from a Gaussian distribution with mean $$\mu$$ and variance $$\sigma^2$$, we can write down the likelihood:

$$p(x \vert{} \mu, \sigma^2) = \prod_{n=1}^{N} N(x_n \vert{} \mu, \sigma^2)$$

The log-likelihood is then $$\ln p(x \vert{} \mu, \sigma^2) = -\frac{1}{2\sigma^2}\sum_{n=1}^{N}(x_n - \mu)^2 - \frac{N}{2}\ln(\sigma^2) - \frac{N}{2}\ln(2\pi)$$
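
As a sketch of the intermediate step, the log-likelihood is maximized by setting its partial derivatives with respect to $$\mu$$ and $$\sigma^2$$ to zero (treating $$\sigma^2$$ as a single variable):

$$\frac{\partial}{\partial \mu} \ln p(x \vert{} \mu, \sigma^2) = \frac{1}{\sigma^2}\sum_{n=1}^{N}(x_n - \mu) = 0$$

$$\frac{\partial}{\partial \sigma^2} \ln p(x \vert{} \mu, \sigma^2) = \frac{1}{2\sigma^4}\sum_{n=1}^{N}(x_n - \mu)^2 - \frac{N}{2\sigma^2} = 0$$

The first condition does not involve $$\sigma^2$$, so it can be solved for $$\mu$$ on its own, and the result substituted into the second.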

Maximum likelihood gives us:

$$\hat{\mu} = \frac{1}{N}\sum_{n=1}^{N}x_n$$ and $$\hat{\sigma^2} = \frac{1}{N}\sum_{n=1}^{N}(x_n - \hat{\mu})^2$$

i.e., the usual sample mean and the (uncorrected) sample variance. If we apply Bessel’s correction and multiply the sample variance by $$\frac{N}{N-1}$$, we obtain an unbiased estimate of the variance. (Taking the square root of the corrected variance does not, in general, give an unbiased estimate of the standard deviation.)
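
A minimal sketch of these estimates in Python, using only the standard library. The data values are made up for illustration, and the variable names (`mu_hat`, `sigma2_ml`, `sigma2_unbiased`) are assumptions of this example rather than notation from the text:

```python
import statistics

# Hypothetical observations of a scalar quantity (illustrative values only).
x = [2.1, 1.9, 2.4, 2.0, 1.6]
N = len(x)

# Maximum likelihood estimates: sample mean and uncorrected (divide-by-N) variance.
mu_hat = sum(x) / N
sigma2_ml = sum((x_n - mu_hat) ** 2 for x_n in x) / N

# Bessel's correction rescales the ML variance by N / (N - 1).
sigma2_unbiased = sigma2_ml * N / (N - 1)

# The standard library exposes both conventions directly:
# pvariance divides by N, variance divides by N - 1.
assert abs(sigma2_ml - statistics.pvariance(x)) < 1e-12
assert abs(sigma2_unbiased - statistics.variance(x)) < 1e-12
```

The two estimates differ only by the factor $$\frac{N}{N-1}$$, so the gap between them shrinks as $$N$$ grows.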

#### Downsides of simple maximum likelihood estimation

• We can show that while the mean obtained through this estimation is unbiased, the variance is a biased estimate of the true distribution variance.
• Biased in this context means that $$E[\hat{\theta}] - \theta \neq{} 0$$
• Conversely, an estimator is unbiased if, when you draw many independent samples of $$x$$ (the number of samples tending towards infinity) and compute $$\hat{\theta}$$ for each sample, the difference between the average estimate and the actual parameter tends to $$0$$. For a biased estimator, this difference does not vanish.
• Practically, this doesn’t say much, since:
  • It makes no statement about the difference between a single point estimate and the true parameter.
  • In statistical learning, the underlying parameter is usually unknown, so the bias cannot be computed directly.
• But it does provide some insight into why model averaging and ensemble learning work well.
• For the mean, we have $$E[\hat{\mu}] = E\left[\frac{1}{N}\sum_{n=1}^{N}x_n\right] = \frac{1}{N}\sum_{n=1}^{N}E[x_n] = \frac{1}{N}\sum_{n=1}^{N}\mu = \mu$$, so $$\hat{\mu}$$ is unbiased.
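
The bias of the variance estimate can also be checked empirically. For IID Gaussian data, the expectation of the maximum likelihood variance is $$\frac{N-1}{N}\sigma^2$$ rather than $$\sigma^2$$. The following is a minimal simulation sketch, not a proof: the function names, the seed, and the parameter values ($$\mu = 1$$, $$\sigma = 2$$, $$N = 5$$, 20000 simulated datasets) are all assumptions chosen for illustration.

```python
import random
import statistics


def ml_variance(sample):
    """Maximum likelihood variance estimate: divide by N, not N - 1."""
    mu_hat = sum(sample) / len(sample)
    return sum((v - mu_hat) ** 2 for v in sample) / len(sample)


def average_estimates(mu, sigma, n, trials, seed=0):
    """Average the ML mean and ML variance over many simulated datasets."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    mu_hats, var_hats = [], []
    for _ in range(trials):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        mu_hats.append(sum(sample) / n)
        var_hats.append(ml_variance(sample))
    return statistics.fmean(mu_hats), statistics.fmean(var_hats)


mean_of_mu_hat, mean_of_var_hat = average_estimates(
    mu=1.0, sigma=2.0, n=5, trials=20000
)
print(mean_of_mu_hat)   # should land close to mu = 1.0
print(mean_of_var_hat)  # should land close to (n - 1) / n * sigma^2 = 3.2, not 4.0
```

With a small $$N$$ such as 5 the shortfall is easy to see; increasing `n` moves the averaged variance estimate toward $$\sigma^2$$.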