Let \(S = \{x^{(1)},x^{(2)},...,x^{(m)}\}\) be a training set where each \(x^{(i)} \in \mathbb{R}^n\) and \(n>m\). Suppose that we wish to model \(S\) using a Gaussian distribution. Note that the covariance matrix, \(\Sigma\) \(\in\) \(\mathbb{R}^{n \times n}\), is estimated as:
$$ \Sigma = \frac{1}{m}\sum_{i=1}^m (x^{(i)}-\mu)(x^{(i)}-\mu)^T $$
Note that \((x^{(i)}-\mu)(x^{(i)}-\mu)^T\) is essentially:
$$ \begin{bmatrix} (x^{(i)}-\mu)_1(x^{(i)}-\mu)^T \\ (x^{(i)}-\mu)_2(x^{(i)}-\mu)^T \\ \vdots \\ (x^{(i)}-\mu)_n(x^{(i)}-\mu)^T \end{bmatrix} $$
where \((x^{(i)}-\mu)_j\) is the \(j^{th}\) element of the vector \((x^{(i)}-\mu)\). Note that each row of the matrix above is equal to \((x^{(i)}-\mu)^T\) multiplied by a scalar. Hence, the rank of this matrix is at most \(1\). The covariance matrix is thus a summation of \(m\) matrices, each of rank at most \(1\). Note also that the \(m^{th}\) term is completely determined by the other \(m-1\) terms. This can be shown in the following way:
$$ \begin{eqnarray} \mu &=& \frac{1}{m}\sum_{i=1}^mx^{(i)}\\ x^{(m)} &=& m\mu - \sum_{i=1}^{m-1}x^{(i)} \end{eqnarray} $$
Therefore:
$$ \Sigma = \frac{1}{m}\left(\sum_{i=1}^{m-1}(x^{(i)}-\mu)(x^{(i)}-\mu)^T + (x^{(m)}-\mu)(x^{(m)}-\mu)^T\right) $$
where \(x^{(m)}=m\mu - \sum_{i=1}^{m-1}x^{(i)}\), so that \(x^{(m)}-\mu = -\sum_{i=1}^{m-1}(x^{(i)}-\mu)\). The covariance matrix is therefore a sum of \(m\) matrices of rank at most \(1\), and the columns of every term (including the \(m^{th}\), by the identity above) lie in the span of the \(m-1\) vectors \(x^{(1)}-\mu,...,x^{(m-1)}-\mu\). We know that the rank of a sum of matrices is at most the sum of the ranks of its terms (see this), and here the column space of \(\Sigma\) is in fact contained in a span of dimension at most \(m-1\). Therefore, the rank of the covariance matrix, in this case, is at most \(m-1\). However, because \(n>m\), the covariance matrix (of size \(n \times n\)) does not have full rank (i.e. not all rows/columns of \(\Sigma\) are linearly independent). Hence, \(\Sigma\) is singular, or more specifically \(\vert \Sigma \vert = 0\). Therefore, \(\Sigma^{-1}\) does not exist.
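To see this numerically, here is a minimal NumPy sketch (the values of \(n\) and \(m\) are arbitrary illustrative choices):

```python
import numpy as np

np.random.seed(0)
n, m = 10, 5                          # n > m: more dimensions than training examples
X = np.random.randn(m, n)             # m examples, each a row vector in R^n
mu = X.mean(axis=0)

# Sigma = (1/m) * sum_i (x^(i) - mu)(x^(i) - mu)^T
Sigma = (X - mu).T @ (X - mu) / m

print(np.linalg.matrix_rank(Sigma))   # prints 4, i.e. m - 1
print(np.linalg.det(Sigma))           # ~0: Sigma is singular
```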
The probability density of a random variable \(X\) that has a Gaussian distribution is given by:
$$ p(X=x) = \frac{1}{(2\pi)^{n/2}\vert\Sigma\vert^{1/2}}exp\left(-\frac{1}{2}(x-\mu)^T\Sigma^{-1} (x-\mu)\right) $$
However, because \(\Sigma^{-1}\) does not exist and \(\frac{1}{\vert\Sigma\vert^{1/2}}\) involves a division by zero, the density \(p(X=x)\) is not defined.
1. Restrictions on \(\Sigma\)
In order to make sure that the covariance matrix, \(\Sigma\), is not singular we may put one of the following restrictions on it:
- \(\Sigma\) should be a diagonal matrix, where \(\Sigma_{jj} = \frac{1}{m}\sum_{i=1}^m (x^{(i)}_j-\mu_j)^2\) for \(j=1,...,n\).
- \(\Sigma\) should be equal to \(\sigma^2I\), where \(I\) is an \(n\times n\) identity matrix and \(\sigma^2=\frac{1}{mn}\sum_{i=1}^m\sum_{j=1}^n (x^{(i)}_j-\mu_j)^2\).
Note that while we may obtain a non-singular covariance matrix under either of the two restrictions above for \(m \geqslant 2\) (\(m=1\) would mean that every term \((x^{(i)}_j-\mu_j)\) is just \(0\) because the mean would equal the only example in our training set), restricting \(\Sigma\) to a diagonal matrix implies the absence of correlation (and, for a Gaussian, independence) between the different dimensions of \(x\), which is usually not the case. Below, we describe the factor analysis model, which tries to avoid this problem.
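For concreteness, the two restricted estimators can be computed as follows (a minimal sketch; `X` is an \(m \times n\) matrix whose rows are the training examples):

```python
import numpy as np

def diagonal_sigma(X):
    """Restriction 1: diagonal Sigma, Sigma_jj = (1/m) sum_i (x^(i)_j - mu_j)^2."""
    mu = X.mean(axis=0)
    return np.diag(((X - mu) ** 2).mean(axis=0))

def isotropic_sigma(X):
    """Restriction 2: Sigma = sigma^2 * I, sigma^2 = (1/(mn)) sum_i sum_j (x^(i)_j - mu_j)^2."""
    mu = X.mean(axis=0)
    sigma2 = ((X - mu) ** 2).mean()          # pooled over all examples and dimensions
    return sigma2 * np.eye(X.shape[1])
```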
2. The Factor Analysis Model
Let \(S = \{x^{(1)},...,x^{(m)}\}\), where \(x^{(i)} \in \mathbb{R}^{n}\), be a set of training examples. The factor analysis model assumes the following generative process (in the ensuing discussion we will drop the superscript \((i)\) on \(x\) for conciseness when there is no fear of ambiguity):
$$ \begin{eqnarray} z &\sim& \mathcal{N}(0,I)\\ \epsilon &\sim& \mathcal{N}(0,\Psi)\\ x &=& \mu + \Lambda z + \epsilon \end{eqnarray} $$
where \(z \in \mathbb{R}^k\) (typically with \(k < n\)) and \(\epsilon \in \mathbb{R}^{n}\) are random variables, and the parameters of the model are \(\Lambda \in \mathbb{R}^{n \times k}\), \(\mu \in \mathbb{R}^n\) and the diagonal matrix \(\Psi \in \mathbb{R}^{n \times n}\).
Consider the equation for \(x\). Suppose that we hold \(z\) constant. Then:
$$ \begin{eqnarray} \mathbb{E}[x \vert z] &=& \mathbb{E}[\mu + \Lambda z + \epsilon]\\ &=& \mathbb{E}[\mu + \Lambda z] + \mathbb{E}[\epsilon]\\ &=& \mu + \Lambda z \end{eqnarray} $$
Note that \(\mathbb{E}[x \vert z]\) indicates that \(z\) has been fixed to some value. Also note that the last step follows from the fact that \(z\) is fixed and so the term \(\mu + \Lambda z\) is constant (and the expectation of a constant is just that constant). Also:
$$ \begin{eqnarray} Cov(x \vert z) &=& Cov(\mu + \Lambda z + \epsilon)\\ &=& Cov(\mu + \Lambda z) + Cov(\epsilon)\\ &=& \Psi \end{eqnarray} $$
Note that the last step follows from the fact that the covariance of a constant is just zero. (To see this consider: \(Cov(a) = \mathbb{E}[(a-\mathbb{E}[a])(a-\mathbb{E}[a])^T]\) \(=\mathbb{E}[(a-a)(a-a)^T]=0\), where \(a\) is a constant).
Therefore, the factor analysis model can equivalently be written as:
$$ \begin{eqnarray} z &\sim& \mathcal{N}(0,I)\\ x \vert z &\sim& \mathcal{N}(\mu + \Lambda z, \Psi) \end{eqnarray} $$
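Either formulation describes the same generative process, which we can sample from directly. Below is a minimal NumPy sketch (the parameter values are arbitrary placeholders chosen only for illustration):

```python
import numpy as np

def sample_factor_analysis(mu, Lam, Psi, num_samples, rng):
    """Draw x = mu + Lam z + eps with z ~ N(0, I_k) and eps ~ N(0, Psi)."""
    n, k = Lam.shape
    Z = rng.standard_normal((num_samples, k))                        # z ~ N(0, I)
    eps = rng.multivariate_normal(np.zeros(n), Psi, size=num_samples)
    return mu + Z @ Lam.T + eps, Z

rng = np.random.default_rng(0)
n, k = 5, 2
mu = np.zeros(n)
Lam = rng.standard_normal((n, k))
Psi = np.diag(rng.uniform(0.1, 0.5, size=n))    # diagonal noise covariance
X, Z = sample_factor_analysis(mu, Lam, Psi, num_samples=10_000, rng=rng)
```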
However, in deriving the factor analysis model we will use the initial formulation of the model. Let us write the joint distribution of \(z\) and \(x\) as follows:
$$ \begin{bmatrix} z \\ x \end{bmatrix} \sim \mathcal{N}(\mu_{zx}, \Sigma) $$
where \(\mu_{zx} \in \mathbb{R}^{n+k}\), \(\Sigma \in \mathbb{R}^{(n+k) \times (n+k)}\). Note that:
$$ \begin{eqnarray} \mathbb{E}[x] &=& \mathbb{E}[\mu + \Lambda z + \epsilon]\\ &=& \mu + \Lambda\mathbb{E}[z] + \mathbb{E}[\epsilon]\\ &=& \mu \end{eqnarray} $$
Note that \(z\) is no longer fixed. Therefore:
$$ \mu_{zx} = \begin{bmatrix} \vec{0} \\ \mu\end{bmatrix} $$
Consider the following:
$$ \begin{eqnarray} Cov(z) &=& \mathbb{E}[(z-\mathbb{E}[z])(z-\mathbb{E}[z])^T]\\ &=& \mathbb{E}[zz^T] \end{eqnarray} $$
Similarly \(Cov(\epsilon)=\mathbb{E}[\epsilon\epsilon^T]\). In order to derive the joint covariance matrix \(\Sigma\) we would need \(\Sigma_{zz}\), \(\Sigma_{zx}\), \(\Sigma_{xz}\) and \(\Sigma_{xx}\) (see this). Clearly \(\Sigma_{zz}\) \(= I\). Also:
$$ \begin{eqnarray} \Sigma_{zx} &=& \mathbb{E}[(z-\mathbb{E}[z])(x-\mathbb{E}[x])^T]\\ &=& \mathbb{E}[(z-\mathbb{E}[z])(\mu+\Lambda z + \epsilon - \mu)^T]\\ &=& \mathbb{E}[z(\Lambda z + \epsilon)^T]\\ &=& \mathbb{E}[zz^T\Lambda^T + z\epsilon^T]\\ &=& \mathbb{E}[zz^T]\Lambda^T + \mathbb{E}[z]\mathbb{E}[\epsilon^T]\\ &=& Cov(z)\Lambda^T\\ &=& \Lambda^T \end{eqnarray} $$
The third-to-last step used the fact that \(z\) and \(\epsilon\) are independent and so \(\mathbb{E}[z\epsilon^T]=\mathbb{E}[z]\mathbb{E}[\epsilon^T]\); the last two steps used \(\mathbb{E}[z]=0\) (so that \(\mathbb{E}[zz^T]=Cov(z)\) and \(\mathbb{E}[z]\mathbb{E}[\epsilon^T]=0\)) and \(Cov(z)=I\). Similarly, we can show that \(\Sigma_{xz}=\Lambda\). Lastly:
$$ \begin{eqnarray} \Sigma_{xx} &=& \mathbb{E}[(x-\mathbb{E}[x])(x-\mathbb{E}[x])^T]\\ &=& \mathbb{E}[(\Lambda z + \epsilon)(\Lambda z + \epsilon)^T]\\ &=& \mathbb{E}[\Lambda zz^T \Lambda^T + \Lambda z\epsilon^T + \epsilon z^T \Lambda^T + \epsilon \epsilon^T]\\ &=& \Lambda\mathbb{E}[zz^T]\Lambda^T + \Lambda\mathbb{E}[z]\mathbb{E}[\epsilon^T] + \mathbb{E}[\epsilon]\mathbb{E}[z^T]\Lambda^T + \mathbb{E}[\epsilon\epsilon^T]\\ &=& \Lambda \Lambda^T + \Psi \end{eqnarray} $$
Hence:
$$ \begin{bmatrix} z \\ x \end{bmatrix} \sim \mathcal{N}\left(\begin{bmatrix} \vec{0} \\ \mu \end{bmatrix}, \begin{bmatrix} I & \Lambda^T \\ \Lambda & \Lambda\Lambda^T + \Psi \end{bmatrix}\right) $$
From this it follows that \(x \sim \mathcal{N}(\mu, \Lambda\Lambda^T+\Psi)\). Therefore the log likelihood is given by:
$$ l(\mu,\Lambda,\Psi) = log\prod_{i=1}^m \frac{1}{(2\pi)^{n/2}\vert \Lambda\Lambda^T + \Psi \vert^{1/2}}exp\left(-\frac{1}{2}(x^{(i)}-\mu)^T(\Lambda\Lambda^T + \Psi)^{-1}(x^{(i)}-\mu)\right) $$
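As a sanity check on the two results above, the sample covariance of data generated by the earlier sampling sketch should approach \(\Lambda\Lambda^T+\Psi\), and the log likelihood can be evaluated directly from the marginal distribution of \(x\). A minimal sketch (reusing the `X`, `mu`, `Lam`, `Psi` names from the sampling sketch; `scipy` is assumed to be available):

```python
import numpy as np
from scipy.stats import multivariate_normal

# The sample covariance of the generated X should be close to Lam Lam^T + Psi.
emp_cov = np.cov(X, rowvar=False)
print(np.abs(emp_cov - (Lam @ Lam.T + Psi)).max())   # small; shrinks as num_samples grows

def log_likelihood(X, mu, Lam, Psi):
    """l(mu, Lambda, Psi) = sum_i log N(x^(i); mu, Lam Lam^T + Psi)."""
    return multivariate_normal(mean=mu, cov=Lam @ Lam.T + Psi).logpdf(X).sum()

print(log_likelihood(X, mu, Lam, Psi))
```

Monitoring this quantity across EM iterations is a convenient convergence check, since each iteration should not decrease it.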
We shall use the Expectation-Maximization Algorithm to find optimal values for the parameters of this model.
For the E-step we need to set the value of \(Q_i(z^{(i)})\) to \(p(z^{(i)} \vert x^{(i)})\). Note that \(z^{(i)} \vert x^{(i)} \sim \mathcal{N}(\mu_{z^{(i)} \vert x^{(i)}},\) \(\Sigma_{z^{(i)} \vert x^{(i)}})\), where \(\mu_{z^{(i)} \vert x^{(i)}}\) and \(\Sigma_{z^{(i)} \vert x^{(i)}}\) are given by (see this for why this is so):
$$ \begin{eqnarray} \mu_{z^{(i)} \vert x^{(i)}} &=& \mu_{z^{(i)}} + \Sigma_{z^{(i)}x^{(i)}}\Sigma_{x^{(i)}x^{(i)}}^{-1}(x^{(i)}-\mu_{x^{(i)}})\\ &=&\Lambda^T(\Lambda\Lambda^T+\Psi)^{-1}(x^{(i)}-\mu)\\ \Sigma_{z^{(i)} \vert x^{(i)}} &=& \Sigma_{z^{(i)}z^{(i)}}-\Sigma_{z^{(i)}x^{(i)}}\Sigma_{x^{(i)}x^{(i)}}^{-1}\Sigma_{x^{(i)}z^{(i)}}\\ &=&I - \Lambda^T(\Lambda\Lambda^T+\Psi)^{-1}\Lambda \end{eqnarray} $$
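A minimal sketch of this E-step (computing \(\mu_{z^{(i)} \vert x^{(i)}}\) for all examples at once; note that \(\Sigma_{z^{(i)} \vert x^{(i)}}\) does not depend on \(x^{(i)}\), so it is shared by all examples; the function and variable names are illustrative):

```python
import numpy as np

def e_step(X, mu, Lam, Psi):
    """Return mu_{z|x} for every example (as rows of an m x k matrix) and the shared Sigma_{z|x} (k x k)."""
    k = Lam.shape[1]
    G = np.linalg.inv(Lam @ Lam.T + Psi)        # (Lam Lam^T + Psi)^{-1}
    Mu_z_given_x = (X - mu) @ G @ Lam           # row i is mu_{z^(i)|x^(i)}^T
    Sigma_z_given_x = np.eye(k) - Lam.T @ G @ Lam
    return Mu_z_given_x, Sigma_z_given_x
```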
For the M-step we need to maximize the following:
$$ \begin{eqnarray} f(\mu,\Lambda,\Psi) &=& \sum_{i=1}^m\int_{z^{(i)}}Q_i(z^{(i)})log\left(\frac{p(x^{(i)},z^{(i)})}{Q_i(z^{(i)})}\right)\\ &=& \sum_{i=1}^m\int_{z^{(i)}}Q_i(z^{(i)})log\left(\frac{p(x^{(i)} \vert z^{(i)})p(z^{(i)})}{Q_i(z^{(i)})}\right)\\ &=& \sum_{i=1}^m\int_{z^{(i)}}Q_i(z^{(i)})\left(log\left(p(x^{(i)} \vert z^{(i)})\right)+log\left(p(z^{(i)})\right)-log\left({Q_i(z^{(i)})}\right)\right) \end{eqnarray} $$
Dropping all terms that do not depend on \(\mu\), \(\Lambda\) and \(\Psi\) gives (note that \(Q_i(z^{(i)})\) was fixed during the E-step and therefore does not depend on these parameters):
$$ \begin{eqnarray} f(\mu,\Lambda,\Psi)&=&\sum_{i=1}^m\int_{z^{(i)}}Q_i(z^{(i)})log\left(p(x^{(i)} \vert z^{(i)})\right)\\ &=& \sum_{i=1}^m \mathbb{E}_{z^{(i)} \sim Q_i}\left[log\left(p(x^{(i)}\vert z^{(i)})\right)\right]\\ &=& \sum_{i=1}^m \mathbb{E}_{z^{(i)}\sim Q_i}\left[log\left(\frac{1}{(2\pi)^{n/2}\vert\Psi\vert^{1/2}}exp\left(-\frac{1}{2}(x^{(i)}-\mu-\Lambda z^{(i)})^T\Psi^{-1}(x^{(i)}-\mu-\Lambda z^{(i)})\right)\right)\right]\\ &=& \sum_{i=1}^m \mathbb{E}_{z^{(i)}\sim Q_i}\left[log\left(\frac{1}{(2\pi)^{n/2}}\right) - \frac{1}{2}log\vert \Psi \vert -\frac{1}{2}(x^{(i)}-\mu-\Lambda z^{(i)})^T\Psi^{-1}(x^{(i)}-\mu-\Lambda z^{(i)})\right]\\ &=& \sum_{i=1}^m \mathbb{E}_{z^{(i)}\sim Q_i}\left[- \frac{1}{2}log\vert \Psi \vert - \frac{1}{2}(x^{(i)}-\mu-\Lambda z^{(i)})^T\Psi^{-1}(x^{(i)}-\mu-\Lambda z^{(i)})\right]\\ \end{eqnarray} $$
In the last step above, we have dropped all terms that do not depend on \(\mu\), \(\Lambda\) and \(\Psi\). Note that \(z^{(i)} \sim Q_i\) means that \(z^{(i)}\) is being sampled from the probability distribution \(Q_i\), which in this case is just equal to the distribution of \(z^{(i)}\) conditioned on \(x^{(i)}\) (i.e. \(z^{(i)} \sim \mathcal{N}(\mu_{z^{(i)} \vert x^{(i)}},\Sigma_{z^{(i)} \vert x^{(i)}})\)).
We shall now maximize \(f\) with respect to \(\Lambda\) by only considering the terms that depend on \(\Lambda\). We will be making use of the properties of the trace of a matrix (see this). We will also drop the subscript in \(\mathbb{E}_{z^{(i)}\sim Q_i}\) for convenience.
$$ \begin{eqnarray} \nabla_\Lambda f(\mu,\Lambda,\Psi) &=& -\frac{1}{2}\sum_{i=1}^m \mathbb{E}\left[ \nabla_\Lambda\left((z^{(i)})^T\Lambda^T\Psi^{-1}\Lambda z^{(i)}-(z^{(i)})^T\Lambda^T\Psi^{-1}(x^{(i)}-\mu)-(x^{(i)}-\mu)^T\Psi^{-1}\Lambda z^{(i)}\right)\right]\\ &=& -\frac{1}{2}\sum_{i=1}^m \mathbb{E}\left[\nabla_\Lambda tr\left((z^{(i)})^T\Lambda^T\Psi^{-1}\Lambda z^{(i)}\right)-2\nabla_\Lambda tr\left((z^{(i)})^T\Lambda^T\Psi^{-1}(x^{(i)}-\mu)\right)\right]\\ &=& -\frac{1}{2}\sum_{i=1}^m \mathbb{E}\left[\nabla_{\Lambda} tr\left(\Lambda z^{(i)}(z^{(i)})^T\Lambda^T\Psi^{-1}\right)-2\nabla_{\Lambda} tr\left(\Lambda^T\Psi^{-1}(x^{(i)}-\mu)(z^{(i)})^T\right)\right]\\ &=& -\frac{1}{2}\sum_{i=1}^m \mathbb{E}\left[\Psi^{-T}\Lambda z^{(i)}(z^{(i)})^T+\Psi^{-1}\Lambda z^{(i)}(z^{(i)})^T-2\Psi^{-1}(x^{(i)}-\mu)(z^{(i)})^T\right]\\ &=&\sum_{i=1}^m \mathbb{E}\left[-\Psi^{-1}\Lambda z^{(i)}(z^{(i)})^T+\Psi^{-1}(x^{(i)}-\mu)(z^{(i)})^T\right] \end{eqnarray} $$
The first two steps expanded the quadratic form, dropped the term that does not depend on \(\Lambda\), and used the fact that \(a=tr(a)\) when \(a\) is a scalar together with \(tr(A^T)=tr(A)\) and the symmetry of \(\Psi\) (\(\Psi^T=\Psi\)) to combine the two cross terms. The third step used the cyclic property \(tr(AB)=tr(BA)\). The fourth step used three properties: \(\nabla_{A}tr(ABA^TC)\) \(=CAB+C^TAB^T\), \(\nabla_A tr(AB)= B^T\) and \(\nabla_{A^T}f(A)=(\nabla_{A}f(A))^T\) (the last two combine to give \(\nabla_\Lambda tr(\Lambda^TB)=B\)). The last step used \(\Psi^{-T}=\Psi^{-1}\).
Setting \(\nabla_\Lambda f(\mu,\Lambda, \Psi)\) to zero gives:
$$ \begin{eqnarray} 0 &=& \sum_{i=1}^m \mathbb{E}\left[-\Psi^{-1}\Lambda z^{(i)}(z^{(i)})^T+\Psi^{-1}(x^{(i)}-\mu)(z^{(i)})^T\right]\\ \sum_{i=1}^m \Psi^{-1}\Lambda \mathbb{E}\left[z^{(i)}(z^{(i)})^T\right] &=& \sum_{i=1}^m \Psi^{-1}(x^{(i)}-\mu) \mathbb{E}\left[(z^{(i)})^T\right]\\ \Lambda &=& \sum_{i=1}^m (x^{(i)}-\mu) \mathbb{E}_{z^{(i)}\sim Q_i} \left[(z^{(i)})^T\right]\left(\sum_{i=1}^m\mathbb{E}_{z^{(i)}\sim Q_i}\left[z^{(i)}(z^{(i)})^T\right]\right)^{-1}\\ \end{eqnarray} $$
We know that \(\mathbb{E}_{z^{(i)}\sim Q_i}[(z^{(i)})^T]=\mu_{z^{(i)} \vert x^{(i)}}^T\). Also we know that \(Cov(z^{(i)})=\mathbb{E}[z^{(i)}(z^{(i)})^T]-\mathbb{E}[z^{(i)}]\mathbb{E}[(z^{(i)})^T]\) (see this) and so \(\mathbb{E}_{z^{(i)}\sim Q_i}[z^{(i)}(z^{(i)})^T]=Cov(z^{(i)})+\mathbb{E}_{z^{(i)}\sim Q_i}[z^{(i)}]\mathbb{E}_{z^{(i)}\sim Q_i}[(z^{(i)})^T]=\Sigma_{z^{(i)}\vert x^{(i)}}+\mu_{z^{(i)} \vert x^{(i)}}\mu_{z^{(i)} \vert x^{(i)}}^T\). Therefore:
$$ \Lambda = \left(\sum_{i=1}^m (x^{(i)}-\mu) \mu_{z^{(i)} \vert x^{(i)}}^T\right)\left(\sum_{i=1}^m\left(\Sigma_{z^{(i)}\vert x^{(i)}}+\mu_{z^{(i)} \vert x^{(i)}}\mu_{z^{(i)} \vert x^{(i)}}^T\right)\right)^{-1} $$
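In code, this update might look as follows (a minimal sketch that reuses the quantities returned by the `e_step` function above):

```python
import numpy as np

def update_Lambda(X, mu, Mu_z_given_x, Sigma_z_given_x):
    """Lambda = [sum_i (x^(i)-mu) mu_{z|x}^T] [sum_i (Sigma_{z|x} + mu_{z|x} mu_{z|x}^T)]^{-1}."""
    m = X.shape[0]
    lhs = (X - mu).T @ Mu_z_given_x                              # n x k
    rhs = m * Sigma_z_given_x + Mu_z_given_x.T @ Mu_z_given_x    # k x k, equals sum_i E[z z^T]
    return lhs @ np.linalg.inv(rhs)
```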
We shall now derive the update for \(\mu\):
$$ \begin{eqnarray} \nabla_\mu f(\mu,\Lambda,\Psi) &=& -\frac{1}{2}\nabla_\mu\sum_{i=1}^m\mathbb{E}\left[-\mu^T\Psi^{-1}(x^{(i)}-\Lambda z^{(i)})+\mu^T\Psi^{-1}\mu-(x^{(i)}-\Lambda z^{(i)})^T\Psi^{-1}\mu\right]\\ &=& -\frac{1}{2}\sum_{i=1}^m\mathbb{E}\left[-2\nabla_\mu tr\left(\mu^T\Psi^{-1}(x^{(i)}-\Lambda z^{(i)})\right)+\nabla_\mu tr\left(\mu^T\Psi^{-1}\mu\right)\right]\\ &=& -\frac{1}{2}\sum_{i=1}^m\mathbb{E}\left[-2\Psi^{-1}(x^{(i)}-\Lambda z^{(i)})+2\Psi^{-1}\mu \right]\\ &=& \sum_{i=1}^m\mathbb{E}\left[\Psi^{-1}(x^{(i)}-\Lambda z^{(i)})-\Psi^{-1}\mu \right] \end{eqnarray} $$
The reader is referred to this post for the properties of traces that have been used in the derivation above. Setting the above to zero gives:
$$ \begin{eqnarray} 0 &=& \sum_{i=1}^m\mathbb{E}\left[\Psi^{-1}(x^{(i)}-\Lambda z^{(i)})-\Psi^{-1}\mu \right]\\ \sum_{i=1}^m \mu &=& \sum_{i=1}^m\mathbb{E}\left[x^{(i)}-\Lambda z^{(i)}\right]\\ \mu &=& \frac{1}{m}\left(\sum_{i=1}^m x^{(i)}-\Lambda\sum_{i=1}^m\mathbb{E}\left[ z^{(i)}\right]\right)\\ &\approx& \frac{1}{m}\sum_{i=1}^m x^{(i)} \end{eqnarray} $$
The third step used the fact that \(x^{(i)}\) is observed, so the expectation (which is taken only over \(z^{(i)} \sim Q_i\)) leaves it unchanged. The last step assumes that \(\sum_{i=1}^m \mathbb{E}[z^{(i)}] \approx 0\) if \(m\) is very large, because the prior distribution of \(z\) was defined to be a Gaussian with zero mean. In practice, \(\mu\) is therefore simply set to the sample mean of the training examples, and this only needs to be computed once.
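In code this is a single line (reusing the data matrix `X` from the earlier sketches):

```python
mu = X.mean(axis=0)   # sample mean over the m training examples; fixed across EM iterations
```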
Let us now derive the update for \(\Psi\). We have (see the derivation of \(\Sigma\) for the Gaussian Discriminant Analysis in this post):
$$ \nabla_{\Psi}f(\mu,\Lambda,\Psi) = -\frac{1}{2}\sum_{i=1}^m\mathbb{E}\left[\Psi^{-1}-\Psi^{-1}(x^{(i)}-\mu-\Lambda z^{(i)})(x^{(i)}-\mu-\Lambda z^{(i)})^T\Psi^{-1}\right] $$
Setting this to zero gives:
$$ \begin{eqnarray} 0 &=& -\frac{1}{2}\sum_{i=1}^m\mathbb{E}\left[\Psi^{-1}-\Psi^{-1}(x^{(i)}-\mu-\Lambda z^{(i)})(x^{(i)}-\mu-\Lambda z^{(i)})^T\Psi^{-1}\right]\\ \Psi &=& \frac{1}{m}\sum_{i=1}^m\mathbb{E}\left[(x^{(i)}-\mu-\Lambda z^{(i)})(x^{(i)}-\mu-\Lambda z^{(i)})^T\right]\\ &=& \frac{1}{m}\sum_{i=1}^m\mathbb{E}\left[(x^{(i)}-\mu)(x^{(i)}-\mu)^T -\Lambda z^{(i)}(x^{(i)}-\mu)^T -(x^{(i)}-\mu)(z^{(i)})^T\Lambda^T+\Lambda z^{(i)}(z^{(i)})^T\Lambda^T\right]\\ &=& \frac{1}{m}\sum_{i=1}^m(x^{(i)}-\mu)(x^{(i)}-\mu)^T -\Lambda \mathbb{E}\left[z^{(i)}\right](x^{(i)}-\mu)^T -(x^{(i)}-\mu)\mathbb{E}\left[(z^{(i)})^T\right]\Lambda^T+\Lambda \mathbb{E}\left[z^{(i)}(z^{(i)})^T\right]\Lambda^T\\ &=& \frac{1}{m}\sum_{i=1}^m(x^{(i)}-\mu)(x^{(i)}-\mu)^T -\Lambda \mathbb{E}\left[z^{(i)}\right](x^{(i)}-\mu)^T \end{eqnarray} $$
The last step used the equation we arrived at earlier when deriving the update for \(\Lambda\) (note that this equation holds for the updated value of \(\Lambda\), so the \(\Lambda\) appearing in the \(\Psi\) update should be the new one), i.e.:
$$ \sum_{i=1}^m \Lambda \mathbb{E}\left[z^{(i)}(z^{(i)})^T\right] = \sum_{i=1}^m (x^{(i)}-\mu) \mathbb{E}\left[(z^{(i)})^T\right] $$
Therefore, the update for \(\Psi\) is given by:
$$ \Psi = \frac{1}{m}diag\left\{\sum_{i=1}^m\left((x^{(i)}-\mu)(x^{(i)}-\mu)^T -\Lambda \mu_{z^{(i)} \vert x^{(i)}}(x^{(i)}-\mu)^T\right)\right\} $$
Note that the \(diag\{\cdot\}\) operation keeps only the diagonal entries of the matrix inside the braces; this is where the restriction that \(\Psi\) is a diagonal matrix is imposed.
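Putting everything together, a minimal, unoptimized sketch of the full EM loop might look like this (it assumes `numpy` is imported as `np` and reuses the `e_step` and `update_Lambda` helpers from the earlier sketches; the initialization scheme is an arbitrary choice):

```python
def update_Psi(X, mu, Lam, Mu_z_given_x):
    """Psi = (1/m) diag{ sum_i (x^(i)-mu)(x^(i)-mu)^T - Lam mu_{z|x} (x^(i)-mu)^T }."""
    m = X.shape[0]
    Xc = X - mu
    S = Xc.T @ Xc - Lam @ Mu_z_given_x.T @ Xc
    return np.diag(np.diag(S)) / m

def fit_factor_analysis(X, k, num_iters=100, seed=0):
    """Fit the factor analysis model with a k-dimensional latent variable z by EM."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    mu = X.mean(axis=0)                                 # computed once (see above)
    Lam = rng.standard_normal((n, k))                   # random initialization
    Psi = np.diag(np.diag(np.cov(X, rowvar=False)))     # start from per-dimension variances
    for _ in range(num_iters):
        Mu_zx, Sigma_zx = e_step(X, mu, Lam, Psi)       # E-step
        Lam = update_Lambda(X, mu, Mu_zx, Sigma_zx)     # M-step: Lambda
        Psi = update_Psi(X, mu, Lam, Mu_zx)             # M-step: Psi (uses the new Lambda)
    return mu, Lam, Psi
```

Tracking the `log_likelihood` function defined earlier across iterations is a useful sanity check, since EM should never decrease it.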