Let \(D=\{s^{(1)},...,s^{(m)}\}\), where \(s^{(i)} \in \mathbb{R}^n\), be a set of examples generated by \(n\) independent sources, and let \(S = \{x^{(1)},...,x^{(m)}\}\), where \(x^{(i)}=As^{(i)}\) and \(x^{(i)} \in \mathbb{R}^n\), be the set of examples observed by \(n\) different observers. \(A \in \mathbb{R}^{n \times n}\) is called the mixing matrix and is unknown. Note that \(s^{(i)}_j\) is the value generated by the \(j^{th}\) source in the \(i^{th}\) example and \(x^{(i)}_j\) is the value recorded by the \(j^{th}\) observer in the \(i^{th}\) example. The goal of the Independent Component Analysis (ICA) algorithm is to recover the set \(D\) given the set \(S\). In the subsequent discussion we will let \(W = A^{-1}\) denote the unmixing matrix and \(w_i^T\) denote the \(i^{th}\) row of \(W\). Our goal is thus to find \(W\).
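For concreteness, here is a minimal sketch of this setup (assuming NumPy; the dimensions, seed, and source distributions are arbitrary illustrative choices, not part of the definition above):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 3, 1000                       # n sources/observers, m examples

# Non-Gaussian sources: row i of S_true is the example s^(i) in R^n.
# (Laplace and uniform are arbitrary choices; any non-Gaussian density works.)
S_true = np.column_stack([
    rng.laplace(size=m),
    rng.uniform(-1, 1, size=m),
    rng.laplace(size=m),
])

A = rng.normal(size=(n, n))          # unknown mixing matrix
X = S_true @ A.T                     # observations: x^(i) = A s^(i)

W_star = np.linalg.inv(A)            # the unmixing matrix ICA tries to recover
```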

Ambiguities

It turns out that as long as the \(s^{(i)}\)’s have a non-Gaussian distribution, the following are the only sources of ambiguity in recovering \(W\). For most applications these ambiguities do not matter much:

  1. Let \(P\) denote any \(n \times n\) permutation matrix, i.e. a matrix with exactly one \(1\) in each row and each column and zeros elsewhere. It is easy to verify that \(Ps^{(i)}\) has the same elements as \(s^{(i)}\), just in a different order. Given only the set \(S\) there is therefore no way to distinguish between \(W\) and \(PW\): both unmix the observations into the same sources, only ordered differently.
  2. If the \(j^{th}\) column of \(A\) were scaled by some \(\alpha \neq 0\) and the corresponding source values \(s^{(i)}_j\) scaled by \(1/\alpha\), the observations \(x^{(i)}\) would be unchanged, so given only the set \(S\) there is no way to recover the value of \(\alpha\). Both ambiguities are illustrated in the sketch after this list.
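The sketch below (assuming NumPy; the particular permutation and scale factor are arbitrary choices) checks both ambiguities numerically:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 3
A = rng.normal(size=(n, n))          # mixing matrix
W = np.linalg.inv(A)                 # true unmixing matrix
s = rng.laplace(size=n)              # one source example s^(i)
x = A @ s                            # corresponding observation x^(i)

# Permutation ambiguity: PW unmixes x into the same values in a different order.
P = np.eye(n)[[2, 0, 1]]             # an arbitrary permutation matrix
print(W @ x)                         # recovers s
print(P @ W @ x)                     # recovers a permutation of s

# Scaling ambiguity: scaling a column of A by alpha and the matching source
# by 1/alpha leaves the observation unchanged, so the scale is unrecoverable.
alpha = 5.0
A2 = A.copy()
A2[:, 0] *= alpha
s2 = s.copy()
s2[0] /= alpha
print(np.allclose(A @ s, A2 @ s2))   # True
```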

If the \(s^{(i)}\)’s are Gaussian then we also have the following ambiguity (we drop the superscript \((i)\) for conciseness): if \(s \sim \mathcal{N}(0,I)\), then \(x\) also has a Gaussian distribution with zero mean. Since \(\mathbb{E}[ss^T]=I\), its covariance is:

$$ Cov(x)=\mathbb{E}[xx^T] =\mathbb{E}[Ass^TA^T] = A\mathbb{E}[ss^T]A^T = AA^T $$

Now let \(R\) be an arbitrary orthogonal matrix (i.e. \(R^TR=RR^T=I\)) and let \(A'=AR\). Then if \(x'=A's\), we have:

$$ Cov(x')=\mathbb{E}[x'x'^T] =\mathbb{E}[ARss^TR^TA^T]=AR\mathbb{E}[ss^T]R^TA^T=ARR^TA^T=AA^T $$

Therefore, both \(x\) and \(x'\) have the distribution \(\mathcal{N}(0,AA^T)\). Hence, there is no way to differentiate between the two cases and subsequently find the ‘right’ \(W\). Consequently, the ICA algorithm only works for non-Gaussian \(s^{(i)}\)’s.
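A quick empirical check of this indistinguishability (a sketch assuming NumPy; the orthogonal matrix is obtained here from a QR decomposition, which is just one convenient way to construct one):

```python
import numpy as np

rng = np.random.default_rng(2)
n, m = 3, 200_000

A = rng.normal(size=(n, n))
R, _ = np.linalg.qr(rng.normal(size=(n, n)))   # an arbitrary orthogonal matrix

S = rng.normal(size=(m, n))                    # Gaussian sources s ~ N(0, I)
X1 = S @ A.T                                   # x  = A s
X2 = S @ (A @ R).T                             # x' = A R s

# Both sample covariances approach A A^T, so A and A' = AR are indistinguishable.
print(np.allclose(np.cov(X1.T), np.cov(X2.T), atol=0.05))   # True
```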

The ICA Algorithm

It can be shown that if \(s^{(i)}\) is vector-valued with density \(p_s\) and \(x^{(i)}=As^{(i)}\) for a square and invertible matrix \(A\), then (where \(\vert W \vert\) denotes the determinant of \(W\)):

$$ p_x(x^{(i)}) = p_s(Wx^{(i)})\vert W \vert $$
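To see where the \(\vert W \vert\) factor comes from, consider the scalar case \(n=1\) with \(x = as\), \(a > 0\), and \(w = 1/a\) (a standard change-of-variables argument, included here only as a sanity check):

$$ F_x(x) = P(X \le x) = P(as \le x) = P\left(s \le wx\right) = F_s(wx) \quad\Rightarrow\quad p_x(x) = \frac{d}{dx}F_s(wx) = p_s(wx)\,w $$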

Since the \(n\) sources are independent, the density of each source vector is the product of the per-source densities: \(p(s^{(i)}) = \prod_{j=1}^n p_s(s^{(i)}_j)\). Assuming further that the \(m\) training examples are drawn independently, the joint probability of all the \(x^{(i)}\)’s is given by:

$$ p(x) = \prod_{i=1}^m \left(\left(\prod_{j=1}^n p_s(w^T_jx^{(i)})\right) \vert W \vert\right) $$

Let us assume that the cumulative distribution function of each source is the sigmoid function \(g(s)= \frac{1}{1+\exp(-s)}\). The density of each source, \(p_s(s)\), is thus given by \(g'(s)\). Note that \(g'(z) =g(z)(1-g(z))\). We define our log likelihood function \(l(W)\) to be the log of \(p(x)\). Therefore:

$$ \begin{eqnarray} l(W) &=& \log \prod_{i=1}^m\left(\left(\prod_{j=1}^n g'(w^T_jx^{(i)})\right)\vert W \vert\right)\\ &=& \sum_{i=1}^m\left(\sum_{j=1}^n \log\left(g'(w^T_jx^{(i)})\right) + \log\vert W \vert\right)\\ \end{eqnarray} $$
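For completeness, the identity \(g'(z)=g(z)(1-g(z))\), which is used in the derivative below, follows directly from the definition of \(g\):

$$ g'(z) = \frac{\exp(-z)}{\left(1+\exp(-z)\right)^2} = \frac{1}{1+\exp(-z)} \cdot \frac{\exp(-z)}{1+\exp(-z)} = g(z)\left(1-g(z)\right) $$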

Taking the derivative of the first term with respect to \(w_k^T\) (i.e. the \(k^{th}\) row of \(W\)), and noting that only the \(j=k\) term of the inner sum depends on \(w_k\), gives:

$$ \begin{eqnarray} \frac{\partial}{\partial w_k^T}\sum_{j=1}^n \log\left(g'(w^T_jx^{(i)})\right) &=& \frac{1}{g'(w^T_kx^{(i)})}\frac{\partial}{\partial w_k^T}\left(g(w^T_kx^{(i)})(1-g(w^T_kx^{(i)}))\right)\\ &=& \frac{1}{g'(w^T_kx^{(i)})}\left(g'(w^T_kx^{(i)})\left(1-g(w^T_kx^{(i)})\right) - g(w^T_kx^{(i)})g'(w^T_kx^{(i)})\right)\frac{\partial}{\partial w_k^T}\left(w^T_kx^{(i)}\right)\\ &=& \left(1-2g(w^T_kx^{(i)})\right)(x^{(i)})^T \end{eqnarray} $$

Also:

$$ \begin{eqnarray} \nabla_W log(\vert W \vert) &=& \frac{1}{\vert W \vert} \nabla_W \vert W \vert\\ &=& \frac{1}{\vert W \vert} \vert W \vert (W^T)^{-1}\\ &=& (W^T)^{-1} \end{eqnarray} $$

Therefore:

$$ \nabla_W l(W) = \sum_{i=1}^m \left(\begin{bmatrix} 1-2g(w^T_1x^{(i)}) \\ 1-2g(w^T_2x^{(i)}) \\ \vdots \\ 1-2g(w^T_nx^{(i)}) \end{bmatrix}(x^{(i)})^T + (W^T)^{-1}\right) $$
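As a sanity check, the following sketch (assuming NumPy; the synthetic data, seed, and tolerances are arbitrary) compares this gradient against central finite differences of the log likelihood:

```python
import numpy as np

def g(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood(W, X):
    # l(W) = sum_i [ sum_j log g'(w_j^T x^(i)) + log|det W| ]
    Z = X @ W.T                                    # Z[i, j] = w_j^T x^(i)
    _, logdet = np.linalg.slogdet(W)
    return np.sum(np.log(g(Z) * (1.0 - g(Z)))) + X.shape[0] * logdet

def gradient(W, X):
    # grad_W l(W) = sum_i [ (1 - 2 g(W x^(i))) (x^(i))^T + (W^T)^{-1} ]
    Z = X @ W.T
    return (1.0 - 2.0 * g(Z)).T @ X + X.shape[0] * np.linalg.inv(W.T)

rng = np.random.default_rng(3)
n, m = 3, 50
X = rng.laplace(size=(m, n))
W = np.eye(n) + 0.1 * rng.normal(size=(n, n))
G = gradient(W, X)

eps = 1e-6
for _ in range(5):                                 # spot-check a few entries
    a, b = rng.integers(n), rng.integers(n)
    Wp, Wm = W.copy(), W.copy()
    Wp[a, b] += eps
    Wm[a, b] -= eps
    numeric = (log_likelihood(Wp, X) - log_likelihood(Wm, X)) / (2 * eps)
    print(np.isclose(G[a, b], numeric, atol=1e-4))  # True
```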

We may now maximize \(l(W)\) using stochastic gradient ascent: for each training example \(x^{(i)}\) we update \(W := W + \alpha \, \nabla_W l_i(W)\), where \(\nabla_W l_i(W)\) is the \(i^{th}\) term of the sum above and \(\alpha\) is the learning rate. Once the algorithm converges, the sources are recovered as \(s^{(i)} = Wx^{(i)}\).
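Putting everything together, here is a minimal sketch of the whole procedure on synthetic data (assuming NumPy; the function name ica, the learning rate, the number of passes, and the source distribution are arbitrary illustrative choices):

```python
import numpy as np

def g(z):
    return 1.0 / (1.0 + np.exp(-z))

def ica(X, alpha=0.01, n_passes=20, seed=0):
    """Estimate an unmixing matrix W from observations X (one example per row)
    by stochastic gradient ascent on the log likelihood derived above."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    W = np.eye(n)
    for _ in range(n_passes):
        for i in rng.permutation(m):               # one example at a time
            x = X[i]                               # x^(i), shape (n,)
            # per-example gradient: (1 - 2 g(W x)) x^T + (W^T)^{-1}
            grad = np.outer(1.0 - 2.0 * g(W @ x), x) + np.linalg.inv(W.T)
            W += alpha * grad
    return W

# Usage: mix non-Gaussian sources, then try to unmix them.
rng = np.random.default_rng(42)
n, m = 3, 2000
S = rng.laplace(size=(m, n))                       # non-Gaussian sources
A = rng.normal(size=(n, n))                        # unknown mixing matrix
X = S @ A.T                                        # observations x^(i) = A s^(i)

W = ica(X)
S_hat = X @ W.T                                    # recovered sources
```

Because of the ambiguities discussed earlier, the columns of S_hat match the true sources only up to permutation and scaling.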