Independent Component Analysis
Let \(D=\{s^{(1)},...,s^{(m)}\}\), where \(s^{(i)} \in \mathbb{R}^n\), be a set of examples generated by \(n\) independent sources, and let \(S = \{x^{(1)},...,x^{(m)}\}\), where \(x^{(i)}=As^{(i)}\) and \(x^{(i)} \in \mathbb{R}^n\), be the set of examples seen by \(n\) different observers. \(A \in \mathbb{R}^{n \times n}\) is called the mixing matrix and is unknown. Note that \(s^{(i)}_j\) is the output of source \(j\) in the \(i^{th}\) example, and \(x^{(i)}_j\) is the value seen by the \(j^{th}\) observer in the \(i^{th}\) example. The goal of the Independent Component Analysis (ICA) algorithm is to recover the set \(D\) given the set \(S\). In the subsequent discussion we will let \(W = A^{-1}\) denote the unmixing matrix and \(w_i^T\) denote the \(i^{th}\) row of \(W\). Our goal is thus to find \(W\).
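To make the setup concrete, here is a minimal numpy sketch of the generative model. The source distribution (Laplace), the sizes \(n=2\) and \(m=1000\), and the particular matrix \(A\) are all illustrative choices, not part of the model itself:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 2, 1000

# Independent non-Gaussian sources (Laplace is an illustrative choice);
# S holds one example s^{(i)} per column.
S = rng.laplace(size=(n, m))

# An (in practice unknown) square mixing matrix A, and the
# observations x^{(i)} = A s^{(i)}.
A = np.array([[1.0, 0.5],
              [0.3, 2.0]])
X = A @ S

# ICA must recover W ~ A^{-1} from X alone, up to the ambiguities
# discussed next.
```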
Ambiguities
It turns out that, as long as the \(s^{(i)}\)’s have a non-Gaussian distribution, the following are the only sources of ambiguity in recovering \(W\). For most applications, however, these ambiguities do not matter much:
- Let \(P\) denote any \(n \times n\) permutation matrix, i.e. a matrix with exactly one \(1\) in each row and each column and zeros elsewhere. For any vector \(z\), \(Pz\) contains the same elements as \(z\) in a different order. Given only the set \(S\), there is no way to distinguish between \(W\) and \(PW\): both unmix the \(x^{(i)}\)’s into a valid set of independent sources, just in a different order.
- If the \(j^{th}\) column of \(A\) were scaled by some \(\alpha \neq 0\) and the corresponding source values \(s^{(i)}_j\) scaled by \(1/\alpha\), the observations would be unchanged, so given only the set \(S\) there is no way to recover the value of \(\alpha\). Both ambiguities are illustrated in the sketch below.
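The following quick numerical check confirms both ambiguities; the matrix sizes and the particular permutation are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 3))
s = rng.laplace(size=3)

# Permutation ambiguity: (A P^T)(P s) = A s, so permuted sources with a
# correspondingly permuted mixing matrix yield the exact same observation.
P = np.eye(3)[[2, 0, 1]]          # a 3x3 permutation matrix
assert np.allclose((A @ P.T) @ (P @ s), A @ s)

# Scaling ambiguity: scaling column j of A by alpha and s_j by 1/alpha
# also leaves the observation unchanged.
alpha = 4.2
A2, s2 = A.copy(), s.copy()
A2[:, 1] *= alpha
s2[1] /= alpha
assert np.allclose(A2 @ s2, A @ s)
```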
If the \(s^{(i)}\)’s are Gaussian then we also have the following ambiguity (we drop the superscript \((i)\) for conciseness): if \(s \sim \mathcal{N}(0,I)\), then \(x\) also has a Gaussian distribution with zero mean and covariance
\[
\mathbb{E}[xx^T] = \mathbb{E}[Ass^TA^T] = A\,\mathbb{E}[ss^T]\,A^T = AA^T,
\]
i.e. \(x \sim \mathcal{N}(0, AA^T)\).
Now let \(R\) be an arbitrary orthogonal matrix (i.e. \(R^TR=RR^T=I\)) and let \(A'=AR\). Then if \(x'=A's\), we have:
\[
\mathbb{E}[x'x'^T] = \mathbb{E}[ARss^TR^TA^T] = AR\,\mathbb{E}[ss^T]\,R^TA^T = ARR^TA^T = AA^T.
\]
Therefore, both \(x\) and \(x'\) have the distribution \(\mathcal{N}(0,AA^T)\). Hence, there is no way to differentiate between the two cases and subsequently find the ‘right’ \(W\). Consequently, the ICA algorithm only works for non-Gaussian \(s^{(i)}\)’s.
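This ambiguity is easy to observe empirically. In the sketch below (sizes and the random orthogonal \(R\) are arbitrary), the sample covariances of \(x\) and \(x'\) both converge to \(AA^T\), so no method based on the observed data can tell the two mixing matrices apart:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 2, 200_000

A = np.array([[1.0, 0.5],
              [0.3, 2.0]])
R, _ = np.linalg.qr(rng.normal(size=(n, n)))   # a random orthogonal matrix

S = rng.normal(size=(n, m))       # Gaussian sources s ~ N(0, I)
X = A @ S
X2 = (A @ R) @ S

print(np.cov(X))    # both sample covariances are close to A @ A.T
print(np.cov(X2))
```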
The ICA Algorithm
It can be shown (via the change-of-variables formula for densities) that if \(s^{(i)}\) is vector-valued with density \(p_s\) and \(x^{(i)}=As^{(i)}\) for a square and invertible matrix \(A\), then the density of \(x^{(i)}\) is:
\[
p_x(x) = p_s(Wx)\,|W|,
\]
where \(|W|\) denotes the absolute value of the determinant of \(W\).
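Since this identity holds for any source density, we can sanity-check it with a Gaussian source, for which the transformed density is known in closed form (\(x = As \sim \mathcal{N}(0, AA^T)\)); the Gaussian here is only a convenient test case, and scipy is assumed available:

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 3))
W = np.linalg.inv(A)
x = rng.normal(size=3)

# Left side: density of x = As with s ~ N(0, I), i.e. N(0, AA^T).
lhs = multivariate_normal.pdf(x, mean=np.zeros(3), cov=A @ A.T)
# Right side: p_s(Wx) |det W| with p_s the N(0, I) density.
rhs = multivariate_normal.pdf(W @ x, mean=np.zeros(3), cov=np.eye(3)) \
      * abs(np.linalg.det(W))
assert np.isclose(lhs, rhs)
```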
Since we have assumed that the \(n\) sources are independent, the density of a source vector factorizes as \(p(s) = \prod_{j=1}^n p_s(s_j)\). The density of a single observation \(x\) is then given by:
\[
p(x) = \left(\prod_{j=1}^n p_s(w_j^T x)\right)|W|.
\]
Let us assume that the cumulative distribution function of each source is the sigmoid function \(g(s)= \frac{1}{1+\exp(-s)}\). The probability density function of each source is thus given by \(g'(s)\). Note that \(g'(z) = g(z)(1-g(z))\). We also define our log likelihood function to be the log of the joint density of the \(m\) independent examples. Therefore:
\[
\ell(W) = \sum_{i=1}^m \left( \sum_{j=1}^n \log g'(w_j^T x^{(i)}) + \log|W| \right).
\]
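A direct transcription of \(\ell(W)\) into numpy might look as follows; the function name is hypothetical, and the `logaddexp` form of \(\log g'\) is just a numerically stable way of writing \(\log g(z) + \log(1-g(z))\):

```python
import numpy as np

def log_likelihood(W, X):
    """l(W) = sum_i [ sum_j log g'(w_j^T x^{(i)}) + log|det W| ],
    with g the sigmoid. X holds one example x^{(i)} per column."""
    Z = W @ X                                   # Z[j, i] = w_j^T x^{(i)}
    # log g(z) = -log(1 + e^{-z});  log(1 - g(z)) = -log(1 + e^{z})
    log_gprime = -np.logaddexp(0.0, -Z) - np.logaddexp(0.0, Z)
    m = X.shape[1]
    return log_gprime.sum() + m * np.log(abs(np.linalg.det(W)))
```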
Differentiating the first term with respect to \(w_k\) (i.e. the \(k^{th}\) row of \(W\)), and using \(\log g'(z) = \log g(z) + \log(1-g(z))\) together with \(g'(z) = g(z)(1-g(z))\), gives:
\[
\nabla_{w_k} \log g'(w_k^T x^{(i)}) = \left(1 - 2g(w_k^T x^{(i)})\right) x^{(i)}.
\]
Also, since \(\nabla_W |W| = |W|\,(W^{-1})^T\):
\[
\nabla_W \log|W| = (W^T)^{-1}.
\]
Therefore, the gradient of the log likelihood contributed by a single example \(x^{(i)}\) is:
\[
\begin{bmatrix} 1-2g(w_1^T x^{(i)}) \\ \vdots \\ 1-2g(w_n^T x^{(i)}) \end{bmatrix} {x^{(i)}}^T + (W^T)^{-1}.
\]
We may now use stochastic gradient ascent to update \(W\): for each example \(x^{(i)}\), we add the above gradient, scaled by a learning rate \(\alpha\), to \(W\). Once the algorithm converges, the sources are recovered as \(s^{(i)} = Wx^{(i)}\).
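A minimal sketch of the resulting algorithm is below, assuming `X` holds one example per column (e.g. the synthetic data generated earlier); the learning rate, epoch count, and identity initialization are illustrative choices, not prescribed by the derivation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ica_sgd(X, lr=0.001, epochs=10, seed=0):
    """Stochastic gradient ascent on l(W). X: one example per column."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    W = np.eye(n)
    for _ in range(epochs):
        for i in rng.permutation(m):        # visit examples in random order
            x = X[:, i]
            # Per-example gradient: (1 - 2 g(W x)) x^T + (W^T)^{-1}
            grad = np.outer(1.0 - 2.0 * sigmoid(W @ x), x) \
                   + np.linalg.inv(W.T)
            W += lr * grad                  # ascent step
    return W

# Example usage on the synthetic data from the first sketch:
# W = ica_sgd(X)
# S_hat = W @ X   # recovered sources, up to permutation and scaling
```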