Let \(D=\{s^{(1)},...,s^{(m)}\}\), where \(s^{(i)} \in \mathbb{R}^n\), be a set of examples generated by \(n\) independent sources, and let \(S = \{x^{(1)},...,x^{(m)}\}\), where \(x^{(i)}=As^{(i)}\) and \(x^{(i)} \in \mathbb{R}^n\), be the set of examples observed by \(n\) different observers. \(A \in \mathbb{R}^{n \times n}\) is called the mixing matrix and is unknown. Note that \(s^{(i)}_j\) is the value generated by the \(j^{th}\) source in the \(i^{th}\) example and \(x^{(i)}_j\) is the value recorded by the \(j^{th}\) observer in the \(i^{th}\) example. The goal of the Independent Component Analysis (ICA) algorithm is to recover the set \(D\) given the set \(S\). In the subsequent discussion we will let \(W = A^{-1}\) denote the unmixing matrix and \(w_i^T\) denote the \(i^{th}\) row of \(W\). Our goal is thus to find \(W\).
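For concreteness, here is a minimal sketch of this setup (assuming NumPy; the dimensions, seed, and source distributions are arbitrary illustrative choices, not part of the definition above):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 3, 1000                       # n sources/observers, m examples

# Non-Gaussian sources: row i of S_true is the example s^(i) in R^n.
# (Laplace and uniform are arbitrary choices; any non-Gaussian density works.)
S_true = np.column_stack([
    rng.laplace(size=m),
    rng.uniform(-1, 1, size=m),
    rng.laplace(size=m),
])

A = rng.normal(size=(n, n))          # unknown mixing matrix
X = S_true @ A.T                     # observations: x^(i) = A s^(i)

W_star = np.linalg.inv(A)            # the unmixing matrix ICA tries to recover
```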

Ambiguities

It turns out that as long as the \(s^{(i)}\)’s have a non-Gaussian distribution, the following are the only sources of ambiguity in recovering \(W\). For most applications these ambiguities do not matter much:

  1. Let \(P\) denote any \(n \times n\) permutation matrix, i.e. a matrix with exactly one \(1\) in each row and each column and zeros elsewhere. It is easy to verify that \(Ps^{(i)}\) has the same elements as \(s^{(i)}\), just in a different order. Given only the set \(S\) there is therefore no way to distinguish between \(W\) and \(PW\): both unmix the observations into the same sources, only ordered differently.
  2. If the \(j^{th}\) column of \(A\) were scaled by some \(\alpha \neq 0\) and the corresponding source values \(s^{(i)}_j\) scaled by \(1/\alpha\), the observations \(x^{(i)}\) would be unchanged, so given only the set \(S\) there is no way to recover the value of \(\alpha\). Both ambiguities are illustrated in the sketch after this list.
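The sketch below (assuming NumPy; the particular permutation and scale factor are arbitrary choices) checks both ambiguities numerically:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 3
A = rng.normal(size=(n, n))          # mixing matrix
W = np.linalg.inv(A)                 # true unmixing matrix
s = rng.laplace(size=n)              # one source example s^(i)
x = A @ s                            # corresponding observation x^(i)

# Permutation ambiguity: PW unmixes x into the same values in a different order.
P = np.eye(n)[[2, 0, 1]]             # an arbitrary permutation matrix
print(W @ x)                         # recovers s
print(P @ W @ x)                     # recovers a permutation of s

# Scaling ambiguity: scaling a column of A by alpha and the matching source
# by 1/alpha leaves the observation unchanged, so the scale is unrecoverable.
alpha = 5.0
A2 = A.copy()
A2[:, 0] *= alpha
s2 = s.copy()
s2[0] /= alpha
print(np.allclose(A @ s, A2 @ s2))   # True
```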

If the \(s^{(i)}\)’s are Gaussian then we also have the following ambiguity (we drop the superscript \((i)\) for conciseness): if \(s \sim \mathcal{N}(0,I)\), then \(x\) also has a Gaussian distribution with zero mean. Since \(\mathbb{E}[ss^T]=I\), its covariance is:

$$ Cov(x)=\mathbb{E}[xx^T] =\mathbb{E}[Ass^TA^T] = A\mathbb{E}[ss^T]A^T = AA^T $$

Now let \(R\) be an arbitrary orthogonal matrix (i.e. \(R^TR=RR^T=I\)) and let \(A'=AR\). Then if \(x'=A's\), we have:

$$ Cov(x')=\mathbb{E}[x'x'^T] =\mathbb{E}[ARss^TR^TA^T]=AR\mathbb{E}[ss^T]R^TA^T=ARR^TA^T=AA^T $$

Therefore, both \(x\) and \(x'\) have the distribution \(\mathcal{N}(0,AA^T)\). Hence, there is no way to differentiate between the two cases and subsequently find the ‘right’ \(W\). Consequently, the ICA algorithm only works for non-Gaussian \(s^{(i)}\)’s.
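A quick empirical check of this indistinguishability (a sketch assuming NumPy; the orthogonal matrix is obtained here from a QR decomposition, which is just one convenient way to construct one):

```python
import numpy as np

rng = np.random.default_rng(2)
n, m = 3, 200_000

A = rng.normal(size=(n, n))
R, _ = np.linalg.qr(rng.normal(size=(n, n)))   # an arbitrary orthogonal matrix

S = rng.normal(size=(m, n))                    # Gaussian sources s ~ N(0, I)
X1 = S @ A.T                                   # x  = A s
X2 = S @ (A @ R).T                             # x' = A R s

# Both sample covariances approach A A^T, so A and A' = AR are indistinguishable.
print(np.allclose(np.cov(X1.T), np.cov(X2.T), atol=0.05))   # True
```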

The ICA Algorithm

It can be shown that if \(s^{(i)}\) is vector-valued with density \(p_s\) and \(x^{(i)}=As^{(i)}\) for a square and invertible matrix \(A\), then (where \(\vert W \vert\) denotes the determinant of \(W\)):

$$ p_x(x^{(i)}) = p_s(Wx^{(i)})\vert W \vert $$
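To see where the \(\vert W \vert\) factor comes from, consider the scalar case \(n=1\) with \(x = as\), \(a > 0\), and \(w = 1/a\) (a standard change-of-variables argument, included here only as a sanity check):

$$ F_x(x) = P(X \le x) = P(as \le x) = P\left(s \le wx\right) = F_s(wx) \quad\Rightarrow\quad p_x(x) = \frac{d}{dx}F_s(wx) = p_s(wx)\,w $$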

Since the \(n\) sources are independent, the density of each source vector is the product of the per-source densities: \(p(s^{(i)}) = \prod_{j=1}^n p_s(s^{(i)}_j)\). Assuming further that the \(m\) training examples are drawn independently, the joint probability of all the \(x^{(i)}\)’s is given by:

$$ p(x) = \prod_{i=1}^m \left(\left(\prod_{j=1}^n p_s(w^T_jx^{(i)})\right) \vert W \vert\right) $$

Let us assume that the cumulative distribution function of each source is the sigmoid function \(g(s)= \frac{1}{1+\exp(-s)}\). The density of each source, \(p_s(s)\), is thus given by \(g'(s)\). Note that \(g'(z) =g(z)(1-g(z))\). We define our log likelihood function \(l(W)\) to be the log of \(p(x)\). Therefore:

$$ \begin{eqnarray} l(W) &=& \log \prod_{i=1}^m\left(\left(\prod_{j=1}^n g'(w^T_jx^{(i)})\right)\vert W \vert\right)\\ &=& \sum_{i=1}^m\left(\sum_{j=1}^n \log\left(g'(w^T_jx^{(i)})\right) + \log\vert W \vert\right)\\ \end{eqnarray} $$
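For completeness, the identity \(g'(z)=g(z)(1-g(z))\), which is used in the derivative below, follows directly from the definition of \(g\):

$$ g'(z) = \frac{\exp(-z)}{\left(1+\exp(-z)\right)^2} = \frac{1}{1+\exp(-z)} \cdot \frac{\exp(-z)}{1+\exp(-z)} = g(z)\left(1-g(z)\right) $$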

Taking the derivative of the first term with respect to \(w_k^T\) (i.e. the \(k^{th}\) row of \(W\)), and noting that only the \(j=k\) term of the inner sum depends on \(w_k\), gives:

$$ \begin{eqnarray} \frac{\partial}{\partial w_k^T}\sum_{j=1}^n \log\left(g'(w^T_jx^{(i)})\right) &=& \frac{1}{g'(w^T_kx^{(i)})}\frac{\partial}{\partial w_k^T}\left(g(w^T_kx^{(i)})(1-g(w^T_kx^{(i)}))\right)\\ &=& \frac{1}{g'(w^T_kx^{(i)})}\left(g'(w^T_kx^{(i)})\left(1-g(w^T_kx^{(i)})\right) - g(w^T_kx^{(i)})g'(w^T_kx^{(i)})\right)\frac{\partial}{\partial w_k^T}\left(w^T_kx^{(i)}\right)\\ &=& \left(1-2g(w^T_kx^{(i)})\right)(x^{(i)})^T \end{eqnarray} $$

Also:

$$ \begin{eqnarray} \nabla_W log(\vert W \vert) &=& \frac{1}{\vert W \vert} \nabla_W \vert W \vert\\ &=& \frac{1}{\vert W \vert} \vert W \vert (W^T)^{-1}\\ &=& (W^T)^{-1} \end{eqnarray} $$

Therefore:

$$ \nabla_W l(W) = \sum_{i=1}^m \left(\begin{bmatrix} 1-2g(w^T_1x^{(i)}) \\ 1-2g(w^T_2x^{(i)}) \\ \vdots \\ 1-2g(w^T_nx^{(i)}) \end{bmatrix}(x^{(i)})^T + (W^T)^{-1}\right) $$
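As a sanity check, the following sketch (assuming NumPy; the synthetic data, seed, and tolerances are arbitrary) compares this gradient against central finite differences of the log likelihood:

```python
import numpy as np

def g(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood(W, X):
    # l(W) = sum_i [ sum_j log g'(w_j^T x^(i)) + log|det W| ]
    Z = X @ W.T                                    # Z[i, j] = w_j^T x^(i)
    _, logdet = np.linalg.slogdet(W)
    return np.sum(np.log(g(Z) * (1.0 - g(Z)))) + X.shape[0] * logdet

def gradient(W, X):
    # grad_W l(W) = sum_i [ (1 - 2 g(W x^(i))) (x^(i))^T + (W^T)^{-1} ]
    Z = X @ W.T
    return (1.0 - 2.0 * g(Z)).T @ X + X.shape[0] * np.linalg.inv(W.T)

rng = np.random.default_rng(3)
n, m = 3, 50
X = rng.laplace(size=(m, n))
W = np.eye(n) + 0.1 * rng.normal(size=(n, n))
G = gradient(W, X)

eps = 1e-6
for _ in range(5):                                 # spot-check a few entries
    a, b = rng.integers(n), rng.integers(n)
    Wp, Wm = W.copy(), W.copy()
    Wp[a, b] += eps
    Wm[a, b] -= eps
    numeric = (log_likelihood(Wp, X) - log_likelihood(Wm, X)) / (2 * eps)
    print(np.isclose(G[a, b], numeric, atol=1e-4))  # True
```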

We may now maximize \(l(W)\) using stochastic gradient ascent: for each training example \(x^{(i)}\) we update \(W := W + \alpha \, \nabla_W l_i(W)\), where \(\nabla_W l_i(W)\) is the \(i^{th}\) term of the sum above and \(\alpha\) is the learning rate. Once the algorithm converges, the sources are recovered as \(s^{(i)} = Wx^{(i)}\).
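Putting everything together, here is a minimal sketch of the whole procedure on synthetic data (assuming NumPy; the function name ica, the learning rate, the number of passes, and the source distribution are arbitrary illustrative choices):

```python
import numpy as np

def g(z):
    return 1.0 / (1.0 + np.exp(-z))

def ica(X, alpha=0.01, n_passes=20, seed=0):
    """Estimate an unmixing matrix W from observations X (one example per row)
    by stochastic gradient ascent on the log likelihood derived above."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    W = np.eye(n)
    for _ in range(n_passes):
        for i in rng.permutation(m):               # one example at a time
            x = X[i]                               # x^(i), shape (n,)
            # per-example gradient: (1 - 2 g(W x)) x^T + (W^T)^{-1}
            grad = np.outer(1.0 - 2.0 * g(W @ x), x) + np.linalg.inv(W.T)
            W += alpha * grad
    return W

# Usage: mix non-Gaussian sources, then try to unmix them.
rng = np.random.default_rng(42)
n, m = 3, 2000
S = rng.laplace(size=(m, n))                       # non-Gaussian sources
A = rng.normal(size=(n, n))                        # unknown mixing matrix
X = S @ A.T                                        # observations x^(i) = A s^(i)

W = ica(X)
S_hat = X @ W.T                                    # recovered sources
```

Because of the ambiguities discussed earlier, the columns of S_hat match the true sources only up to permutation and scaling.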