Theorem (Jensen’s Inequality). Let \(f\) be a convex function (i.e. \(f'' \geqslant 0\), or, for vector-valued inputs, the Hessian \(H\) is positive semi-definite, \(H \geqslant 0\)) and let \(X\) be a random variable. Then \(\mathbb{E}[f(X)] \geqslant f(\mathbb{E}[X])\). If \(f\) is strictly convex (i.e. \(f'' > 0\), or \(H\) is positive definite, \(H > 0\)), then \(\mathbb{E}[f(X)] = f(\mathbb{E}[X])\) if and only if \(X=\mathbb{E}[X]\) with probability \(1\), i.e. \(X\) is a constant.

Jensen’s inequality also holds for concave and strictly concave functions with the direction of the inequality reversed. Note that \(f\) is concave if \(-f\) is convex, and \(f\) is strictly concave if \(-f\) is strictly convex.
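As a quick numerical check, the following sketch (plain NumPy, with an arbitrarily chosen discrete random variable) compares \(\mathbb{E}[f(X)]\) with \(f(\mathbb{E}[X])\) for the convex function \(f(x)=x^2\) and the concave function \(f(x)=log(x)\):

```python
import numpy as np

# An arbitrary discrete random variable X (values and probabilities are made up).
values = np.array([1.0, 2.0, 5.0])
probs = np.array([0.2, 0.5, 0.3])

E_X = np.sum(probs * values)

# Convex f(x) = x^2:  E[f(X)] >= f(E[X])
print(np.sum(probs * values**2), ">=", E_X**2)

# Concave f(x) = log(x):  E[f(X)] <= f(E[X])
print(np.sum(probs * np.log(values)), "<=", np.log(E_X))
```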

Let \(S = \{x^{(1)}, x^{(2)}, ..., x^{(m)}\}\) be a training set. The goal is to fit the parameters \(\theta\) of a model \(p(x,z;\theta)\), where \(z\) are the latent (unobserved) random variables of the model, given only the observed \(x^{(i)}\). Let us define the log-likelihood of the model as follows:

$$ \begin{eqnarray} l(\theta) &=& log \prod_{i=1}^m p(x^{(i)}; \theta)\\ &=& \sum_{i=1}^m log \sum_{z^{(i)}} p(x^{(i)},z^{(i)}; \theta) \end{eqnarray} $$

Note that the term \(\sum_{z^{(i)}}\) is a summation over all possible values that \(z^{(i)}\) can take.

Let us define \(Q_i\) to be a distribution over \(z^{(i)}\), i.e. \(\sum_{z^{(i)}} Q_i(z^{(i)}) = 1\) and \(Q_i(z^{(i)})\geqslant0\). Therefore:

$$ l(\theta) = \sum_{i=1}^m log \sum_{z^{(i)}} Q_i(z^{(i)})\frac{p(x^{(i)},z^{(i)}; \theta)} {Q_i(z^{(i)})} $$

Noting that the term \(\sum_{z^{(i)}} Q_i(z^{(i)})\frac{p(x^{(i)},z^{(i)}; \theta)}{Q_i(z^{(i)})}\) is simply \(\mathbb{E}_{z^{(i)} \sim Q_i} \left[ \frac{p(x^{(i)},z^{(i)}; \theta)}{Q_i(z^{(i)})}\right]\) and applying Jensen’s inequality (for the concave function \(log\)), we have:

$$ l(\theta) \geqslant \sum_{i=1}^m \sum_{z^{(i)}}Q_i(z^{(i)})log\left(\frac{p(x^{(i)},z^{(i)};\theta)}{Q_i(z^{(i)})}\right) $$
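To see the bound numerically, here is a minimal sketch for a single training example with a discrete latent variable; the joint probabilities and the distribution \(Q\) are made-up numbers chosen purely for illustration:

```python
import numpy as np

# Joint p(x, z; theta) for one fixed x and z in {0, 1, 2} (made-up numbers).
p_xz = np.array([0.10, 0.05, 0.15])

# An arbitrary distribution Q over z (sums to 1).
Q = np.array([0.5, 0.2, 0.3])

log_lik = np.log(p_xz.sum())                # log sum_z p(x, z; theta)
lower_bound = np.sum(Q * np.log(p_xz / Q))  # sum_z Q(z) log(p(x, z; theta) / Q(z))

print(log_lik, ">=", lower_bound)           # the Jensen lower bound always holds
```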

Suppose that we have some current guess \(\theta\) of the parameters of the model. To turn the inequality above into an equality at this particular value of \(\theta\), it suffices (since \(log\) is strictly concave) that the quantity inside the expectation be constant with probability \(1\), i.e. that:

$$ \frac{p(x^{(i)},z^{(i)}; \theta)}{Q_i(z^{(i)})} = c $$

where \(c\) is a constant that does not depend on \(z^{(i)}\). This is achieved by choosing \(Q_i(z^{(i)})\) proportional to \(p(x^{(i)},z^{(i)}; \theta)\); since \(Q_i\) must sum to \(1\), the normalizing constant is \(c = \sum_{z^{(i)}} p(x^{(i)},z^{(i)}; \theta)\). Note that this choice makes the lower bound on \(l(\theta)\) tight at the current \(\theta\). Thus:

$$ \begin{eqnarray} Q_i(z^{(i)}) &=& \frac{p(x^{(i)},z^{(i)}; \theta)}{\sum_{z^{(i)}}p(x^{(i)},z^{(i)}; \theta)}\\ &=& \frac{p(x^{(i)},z^{(i)}; \theta)}{p(x^{(i)}; \theta)}\\ &=& p(z^{(i)} \vert x^{(i)}; \theta) \end{eqnarray} $$
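Continuing the toy sketch above (reusing its `p_xz` and `log_lik`), choosing \(Q_i\) to be this posterior indeed makes the bound an equality:

```python
# With Q equal to the posterior p(z | x; theta), the bound holds with equality.
Q_post = p_xz / p_xz.sum()
tight_bound = np.sum(Q_post * np.log(p_xz / Q_post))
print(log_lik, "==", tight_bound)           # both equal log p(x; theta)
```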

The EM algorithm is then given as follows:

Repeat until convergence
{

  1. (E-step) For each \(i\) set:

    $$ Q_i(z^{(i)}) := p(z^{(i)} \vert x^{(i)}; \theta) $$
  2. (M-step) Update the parameters:

$$ \theta := \underset{\theta}{argmax} \sum_{i=1}^m \sum_{z^{(i)}} Q_i(z^{(i)})log\left(\frac{p(x^{(i)},z^{(i)};\theta)}{Q_i(z^{(i)})}\right) $$

}

If we define \(J = \sum_{i=1}^m \sum_{z^{(i)}}Q_i(z^{(i)})log\left(\frac{p(x^{(i)},z^{(i)};\theta)}{Q_i(z^{(i)})}\right)\), then the EM algorithm is essentially coordinate ascent on \(J\). The E-step maximizes \(J\) with respect to \(Q\) (by making the bound tight) and the M-step maximizes \(J\) with respect to \(\theta\). Note that while we have not proved \(J\) to be concave, it generally turns out to be so for the models we encounter.
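A minimal sketch of this coordinate-ascent view, assuming the caller supplies hypothetical `e_step` and `m_step` callables (the names and interface here are illustrative, not fixed):

```python
def em(x, theta, e_step, m_step, n_iters=100, tol=1e-6):
    """Generic EM driver: coordinate ascent on the lower bound J."""
    prev_J = float("-inf")
    for _ in range(n_iters):
        Q = e_step(x, theta)     # E-step: maximize J over Q by setting Q_i(z) = p(z | x_i; theta)
        theta, J = m_step(x, Q)  # M-step: maximize J over theta, return the new bound value
        if J - prev_J < tol:     # stop once the bound no longer improves
            break
        prev_J = J
    return theta
```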

Let us now prove that the EM algorithm converges, in the sense that the likelihood never decreases. Let \(\theta^{(t)}\) denote the parameters at the start of iteration \(t\), and let \(Q_i^{(t)}(z^{(i)}) = p(z^{(i)} \vert x^{(i)}; \theta^{(t)})\) be the distributions chosen in the corresponding E-step. Since this choice makes the lower bound tight at \(\theta^{(t)}\):

$$ l(\theta^{(t)}) = \sum_{i=1}^m \sum_{z^{(i)}}Q_i^{(t)}(z^{(i)})log\left(\frac{p(x^{(i)},z^{(i)};\theta^{(t)})}{Q_i^{(t)}(z^{(i)})}\right) $$

We know that:

$$ l(\theta^{(t+1)}) \geqslant \sum_{i=1}^m \sum_{z^{(i)}} Q_i(z^{(i)})log\left(\frac{p(x^{(i)},z^{(i)};\theta^{(t+1)})}{Q_i(z^{(i)})}\right) $$

for any choice of the distributions \(Q_i\), since this is just the Jensen lower bound. In particular, for \(Q_i = Q_i^{(t)}\):

$$ l(\theta^{(t+1)}) \geqslant \sum_{i=1}^m \sum_{z^{(i)}} Q_i^{(t)}(z^{(i)})log\left(\frac{p(x^{(i)},z^{(i)};\theta^{(t+1)})}{Q_i^{(t)}(z^{(i)})}\right) $$

Note that \(\theta^{(t+1)}\) is explicitly chosen in the M-step to maximize the right-hand side above over \(\theta\), so that:

$$ \sum_{i=1}^m \sum_{z^{(i)}}Q_i^{(t)}(z^{(i)})log\left(\frac{p(x^{(i)},z^{(i)};\theta^{(t+1)})}{Q_i^{(t)}(z^{(i)})}\right) \geqslant \sum_{i=1}^m \sum_{z^{(i)}}Q_i^{(t)}(z^{(i)})log\left(\frac{p(x^{(i)},z^{(i)};\theta^{(t)})}{Q_i^{(t)}(z^{(i)})}\right) $$

Therefore:

$$ \begin{eqnarray} l(\theta^{(t+1)}) &\geqslant& \sum_{i=1}^m \sum_{z^{(i)}}Q_i^{(t)}(z^{(i)})log\left(\frac{p(x^{(i)},z^{(i)};\theta^{(t)})}{Q_i^{(t)}(z^{(i)})}\right) \\ &=& l(\theta^{(t)}) \end{eqnarray} $$

Hence, the log-likelihood never decreases from one EM iteration to the next.
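As a numerical illustration of this, the following sketch runs EM on synthetic 1-D data with a two-component Gaussian mixture (the updates are the one-dimensional case of those derived in the next section) and prints the log-likelihood at each iteration; the printed values never decrease:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2.0, 1.0, 200), rng.normal(3.0, 1.5, 300)])

phi = np.array([0.5, 0.5])   # mixing proportions (initial guess)
mu = np.array([-1.0, 1.0])   # component means (initial guess)
var = np.array([1.0, 1.0])   # component variances (initial guess)

def gauss(x, mu, var):
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

for t in range(20):
    # E-step: responsibilities w[i, j] = Q_i(z_i = j)
    joint = phi * gauss(x[:, None], mu, var)       # p(x_i, z_i = j), shape (m, 2)
    w = joint / joint.sum(axis=1, keepdims=True)

    # Log-likelihood at the current parameters: never decreases across iterations.
    print(t, np.log(joint.sum(axis=1)).sum())

    # M-step: closed-form updates
    Nj = w.sum(axis=0)
    phi = Nj / len(x)
    mu = (w * x[:, None]).sum(axis=0) / Nj
    var = (w * (x[:, None] - mu) ** 2).sum(axis=0) / Nj
```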

Mixture of Gaussians

Let us derive the EM equations for the Mixture of Gaussians model. In the E-step we compute:

$$ w^{(i)}_j = Q_i(z^{(i)}=j) = P(z^{(i)} =j \vert x^{(i)};\phi,\Sigma,\mu) $$
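A sketch of this E-step in NumPy, with the Gaussian density written out explicitly; `X` is the \(m \times n\) data matrix, `phi`, `mu`, `Sigma` hold the current parameter estimates, and all names are illustrative:

```python
import numpy as np

def gaussian_pdf(X, mu, Sigma):
    """Density of N(mu, Sigma) evaluated at every row of X."""
    n = X.shape[1]
    diff = X - mu
    quad = np.einsum('ij,jk,ik->i', diff, np.linalg.inv(Sigma), diff)  # (x - mu)^T Sigma^{-1} (x - mu)
    norm = (2 * np.pi) ** (n / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * quad) / norm

def e_step(X, phi, mu, Sigma):
    """w[i, j] = Q_i(z_i = j) = p(z_i = j | x_i; phi, mu, Sigma)."""
    m, k = X.shape[0], len(phi)
    joint = np.empty((m, k))
    for j in range(k):
        joint[:, j] = phi[j] * gaussian_pdf(X, mu[j], Sigma[j])  # p(x_i, z_i = j)
    return joint / joint.sum(axis=1, keepdims=True)              # normalize over j
```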

In the M-step we update the parameters of the model by maximizing the following:

$$ \begin{eqnarray} l(\phi,\Sigma,\mu) &=& \sum_{i=1}^m\sum_{z^{(i)}}Q_i(z^{(i)})log\left(\frac{p(x^{(i)}\vert z^{(i)};\Sigma,\mu)p(z^{(i)};\phi)}{Q_i(z^{(i)})}\right)\\ &=& \sum_{i=1}^m\sum_{j=1}^kQ_i(z^{(i)}=j)log\left(\frac{p(x^{(i)}\vert z^{(i)}=j;\Sigma,\mu)p(z^{(i)}=j;\phi)}{Q_i(z^{(i)}=j)}\right)\\ &=& \sum_{i=1}^m\sum_{j=1}^kw^{(i)}_jlog\left(\frac{\frac{1}{(2\pi)^{n/2}\vert \Sigma_j \vert^{1/2}}exp\left(-\frac{1}{2}(x^{(i)}-\mu_j)^T\Sigma_j^{-1}(x^{(i)}-\mu_j)\right)\phi_j}{w^{(i)}_j}\right)\\ \end{eqnarray} $$

Therefore:

$$ \begin{eqnarray} \nabla_{\mu_l}l(\phi,\Sigma,\mu) &=& \nabla_{\mu_l} \sum_{i=1}^m w^{(i)}_l \left(-\frac{1}{2}(x^{(i)}-\mu_l)^T\Sigma_l^{-1}(x^{(i)}-\mu_l)\right)\\ &=& -\frac{1}{2}\nabla_{\mu_l} \sum_{i=1}^m w^{(i)}_l \left((x^{(i)})^T\Sigma_l^{-1}x^{(i)}-2\mu_l^T\Sigma_l^{-1}x^{(i)}+\mu_l^T\Sigma_l^{-1}\mu_l\right)\\ &=& \sum_{i=1}^m w^{(i)}_l \Sigma_l^{-1}(x^{(i)}-\mu_l) \end{eqnarray} $$

Setting this to zero gives:

$$ \mu_l = \frac{\sum_{i=1}^m w^{(i)}_lx^{(i)}}{\sum_{i=1}^m w^{(i)}_l} $$
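In code, given the responsibility matrix `w` of shape \(m \times k\) from the E-step and the data matrix `X`, this update is just a weighted average (a sketch, computing all components at once):

```python
def update_means(X, w):
    """mu_l = sum_i w[i, l] x_i / sum_i w[i, l], for every l at once; returns shape (k, n)."""
    return (w.T @ X) / w.sum(axis=0)[:, None]
```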

Also:

$$ \begin{eqnarray} \nabla_{\Sigma_{l}} l(\phi,\Sigma,\mu) &=& \sum_{i=1}^mw^{(i)}_l \nabla_{\Sigma_{l}} log\left(\frac{\frac{1}{(2\pi)^{n/2}\vert \Sigma_l \vert^{1/2}}exp\left(-\frac{1}{2}(x^{(i)}-\mu_l)^T\Sigma_l^{-1}(x^{(i)}-\mu_l)\right)\phi_l}{w^{(i)}_l}\right)\\ &=& \sum_{i=1}^mw^{(i)}_l \left(-\frac{1}{2}\Sigma_l^{-1}+\frac{1}{2}\Sigma_l^{-1}(x^{(i)}-\mu_l)(x^{(i)}-\mu_l)^T\Sigma_l^{-1}\right) \end{eqnarray} $$

Note that the derivation above is similar to that for Gaussian Discriminant Analysis.

Setting this to zero gives:

$$ \Sigma_l = \frac{\sum_{i=1}^mw^{(i)}_l(x^{(i)}-\mu_l)(x^{(i)}-\mu_l)^T}{\sum_{i=1}^mw^{(i)}_l} $$
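The corresponding covariance update, under the same assumptions as the mean-update sketch above:

```python
import numpy as np

def update_covariances(X, w, mu):
    """Sigma_l = sum_i w[i, l] (x_i - mu_l)(x_i - mu_l)^T / sum_i w[i, l]."""
    k, n = mu.shape
    Sigma = np.empty((k, n, n))
    for l in range(k):
        diff = X - mu[l]                                           # shape (m, n)
        Sigma[l] = (w[:, l, None] * diff).T @ diff / w[:, l].sum()
    return Sigma
```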

Lastly, since only the \(log(\phi_j)\) terms of \(l\) depend on \(\phi\):

$$ \nabla_{\phi_l}l(\phi,\Sigma,\mu) = \nabla_{\phi_l}\sum_{i=1}^m\sum_{j=1}^k w^{(i)}_jlog(\phi_j) $$

However, we must ensure that \(\sum_j \phi_j = 1\). We therefore construct a Lagrangian for the above objective subject to this constraint, with Lagrange multiplier \(\beta\):

$$ \mathcal{L}(\phi,\Sigma,\mu) = \sum_{i=1}^m\sum_{j=1}^k w^{(i)}_jlog(\phi_j) + \beta(\sum_{j=1}^k\phi_j-1) $$

Therefore:

$$ \nabla_{\phi_l}\mathcal{L} = \sum_{i=1}^m w^{(i)}_l\frac{1}{\phi_l} + \beta $$

Setting this to zero gives:

$$ \phi_l = \frac{\sum_{i=1}^m w^{(i)}_l}{-\beta} $$

But we know that \(\sum_j \phi_j=1\). Summing the expression above over \(l\) gives \(-\beta=\sum_{i=1}^m\sum_{j=1}^k w^{(i)}_j\), and since \(\sum_{j=1}^k w^{(i)}_j = \sum_{j=1}^k Q_i(z^{(i)}=j)=1\) for each \(i\), we have \(-\beta = m\). So:

$$ \phi_l = \frac{1}{m}\sum_{i=1}^m w^{(i)}_l $$
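Finally, the mixing proportions are simply the average responsibilities, and the three closed-form updates assemble into a complete M-step; the sketch below reuses the hypothetical `e_step`, `update_means`, and `update_covariances` helpers from the earlier sketches:

```python
def update_phis(w):
    """phi_l = (1/m) sum_i w[i, l]."""
    return w.mean(axis=0)

def m_step(X, w):
    """One M-step: the three closed-form updates derived above."""
    mu = update_means(X, w)
    Sigma = update_covariances(X, w, mu)
    phi = update_phis(w)
    return phi, mu, Sigma

def fit_gmm(X, phi, mu, Sigma, n_iters=100):
    """Alternate E- and M-steps; the log-likelihood is non-decreasing across iterations."""
    for _ in range(n_iters):
        w = e_step(X, phi, mu, Sigma)
        phi, mu, Sigma = m_step(X, w)
    return phi, mu, Sigma
```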