Let us start with the following two lemmas:

Lemma (The Union Bound). Let \(A_1,A_2,...,A_k\) be \(k\) (not necessarily independent) events. Then:

$$ P(A_1 \cup ... \cup A_k) \leqslant P(A_1) + ... + P(A_k) $$

Lemma (The Hoeffding Inequality/Chernoff Bound). Let \(Z_1,...,Z_m\) be \(m\) independent and identically distributed (iid) random variables drawn from a Bernoulli(\(\phi\)) distribution. Let \(\hat{\phi}=\frac{1}{m}\sum_{i=1}^{m}Z_i\) be the mean of these random variables and let any \(\gamma > 0\) be fixed. Then:

$$ P(\vert \phi-\hat{\phi} \vert > \gamma) \leqslant 2\exp(-2\gamma^2m) $$
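
To get a sense of how this bound behaves, here is a small simulation sketch (the choices of \(\phi\), \(m\), \(\gamma\), and the number of trials are arbitrary) that compares the empirical frequency of large deviations with the right-hand side of the inequality:

```python
import numpy as np

# Empirically check the Hoeffding bound P(|phi - phi_hat| > gamma) <= 2*exp(-2*gamma^2*m).
# phi, m, gamma and the number of trials are arbitrary illustrative choices.
phi, m, gamma, trials = 0.3, 500, 0.05, 20000

rng = np.random.default_rng(0)
Z = rng.binomial(1, phi, size=(trials, m))    # `trials` independent samples of size m
phi_hat = Z.mean(axis=1)                      # empirical mean for each sample

empirical = np.mean(np.abs(phi - phi_hat) > gamma)
bound = 2 * np.exp(-2 * gamma**2 * m)

print(f"empirical P(|phi - phi_hat| > gamma) = {empirical:.4f}")
print(f"Hoeffding bound                      = {bound:.4f}")
```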

Suppose that we are given a training set \(S = \{(x^{(i)},y^{(i)}); i=1,...,m\}\) where \(y^{(i)} \in \{0,1\}\) and the training examples \((x^{(i)},y^{(i)})\) are drawn iid from some random distribution \(\mathcal{D}\). Let us define the training error or the empirical risk of a hypothesis \(h\) to be:

$$ \hat{\mathcal{E}}(h) = \frac{1}{m}\sum_{i=1}^m Z_i $$

where \(Z_i = 1\{h(x^{(i)}) \neq y^{(i)}\}\) is the indicator random variable for whether \(h\) misclassifies the training example \((x^{(i)},y^{(i)})\). Let us also define the generalization error to be:

$$ \mathcal{E}(h) = P_{(x,y)\sim\mathcal{D}}(h(x) \neq y) $$

Let \(\mathcal{H}\) be the set of all hypotheses considered by the learning algorithm. The goal of empirical risk minimization (ERM) is to find the hypothesis \(\hat{h}\) such that

$$ \hat{h} = \underset{h}{argmin}\ \hat{\mathcal{E}}(h) $$
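
As a concrete illustration of empirical risk minimization, the sketch below minimizes the training error over a small finite class of threshold hypotheses \(h_a(x) = 1\{x \geqslant a\}\); the hypothesis class, the data-generating process, and the noise level are assumptions made purely for illustration:

```python
import numpy as np

# Empirical risk minimization over a finite, illustrative class of threshold hypotheses
# h_a(x) = 1{x >= a}: pick the hypothesis with the lowest training error.
rng = np.random.default_rng(0)
m = 200
x = rng.uniform(-1, 1, size=m)
y = (x >= 0.25).astype(int)                  # "true" threshold at 0.25 (illustrative)
y ^= (rng.random(m) < 0.05).astype(int)      # flip 5% of the labels as noise

thresholds = np.linspace(-1, 1, 41)          # finite H with k = 41 hypotheses
train_errors = [np.mean((x >= a).astype(int) != y) for a in thresholds]
h_hat = thresholds[int(np.argmin(train_errors))]
print(f"ERM threshold = {h_hat:.3f}, training error = {min(train_errors):.3f}")
```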

The Case of Finite \(\mathcal{H}\)

Suppose that \(\vert \mathcal{H} \vert =k\) and let \(h_i\) be a hypothesis in \(\mathcal{H}\). Since \(\hat{\mathcal{E}}(h_i)\) is the mean of \(m\) iid Bernoulli(\(\mathcal{E}(h_i)\)) random variables, the Hoeffding inequality gives:

$$ P(\vert \mathcal{E}(h_i)-\hat{\mathcal{E}}(h_i) \vert > \gamma) \leqslant 2\exp(-2\gamma^2m) $$

Let \(A_i\) be the event that \(\vert \mathcal{E}(h_i)-\hat{\mathcal{E}}(h_i) \vert > \gamma\). Then from the Union Bound we have:

$$ \begin{eqnarray} P(\exists h_i \in \mathcal{H}.\ \vert \mathcal{E}(h_i)-\hat{\mathcal{E}}(h_i) \vert > \gamma) &\leqslant& \sum_{i=1}^kP(A_i)\\ &\leqslant& \sum_{i=1}^k2\exp(-2\gamma^2m)\\ &=& 2k\exp(-2\gamma^2m) \end{eqnarray} $$

Subtracting both sides from \(1\) (and noting that the complement of the \(\exists\) event is the corresponding \(\forall\) event) gives:

$$ P(\forall h_i \in \mathcal{H}.\ \vert \mathcal{E}(h_i)-\hat{\mathcal{E}}(h_i) \vert \leqslant\gamma)\geqslant 1-2k\exp(-2\gamma^2m) $$

Suppose that we require that the probability that \(\vert \mathcal{E}(h_i)-\hat{\mathcal{E}}(h_i) \vert \leqslant \gamma\) for all \(h_i\) in \(\mathcal{H}\) be at least \(1-\delta\), for some fixed \(\delta, \gamma > 0\). Setting the right-hand side above equal to \(1-\delta\) and solving, the minimum value of \(m\) required is:

$$ \begin{eqnarray} 1 - \delta &=& 1-2k\exp(-2\gamma^2m)\\ m &=& \frac{1}{2\gamma^2}log(\frac{2k}{\delta}) \end{eqnarray} $$

Note that this is the minimum value of \(m\) required, because increasing \(m\) only increases the probability \(1-2k\exp(-2\gamma^2m)\) beyond \(1-\delta\). This minimum value of \(m\) required to achieve a certain performance is known as the sample complexity.
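
For instance, plugging in some arbitrary illustrative values of \(k\), \(\gamma\), and \(\delta\), the sample complexity can be evaluated directly; note how mildly it grows with \(k\):

```python
import math

# Sample complexity m >= (1 / (2*gamma^2)) * log(2k / delta); the values are illustrative.
gamma, delta = 0.05, 0.01
for k in [10, 1000, 100000]:
    m = (1.0 / (2 * gamma**2)) * math.log(2 * k / delta)
    print(f"k = {k:6d}: m >= {math.ceil(m)} examples")   # logarithmic dependence on k
```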

Also note that, holding \(m\) and \(\delta\) fixed and solving for \(\gamma\) instead, with probability at least \(1-\delta\) we have for all \(h \in \mathcal{H}\):

$$ \begin{eqnarray} \gamma &=& \sqrt{\frac{1}{2m}log(\frac{2k}{\delta})}\\ \vert \mathcal{E}(h)-\hat{\mathcal{E}}(h)\vert &\leqslant& \sqrt{\frac{1}{2m}log(\frac{2k}{\delta})} \end{eqnarray} $$

Let \(\hat{h}=\underset{h}{argmin}\ \hat{\mathcal{E}}(h)\) and \(h^*=\underset{h}{argmin}\ \mathcal{E}(h)\). Then with probability at least \(1-\delta\) we have:

$$ \begin{eqnarray} \mathcal{E}(\hat{h}) &\leqslant& \hat{\mathcal{E}}(\hat{h})+\gamma\\ &\leqslant& \hat{\mathcal{E}}(h^*)+\gamma\\ &\leqslant& \mathcal{E}(h^*)+2\gamma \end{eqnarray} $$

where the first and third inequalities use the uniform convergence result \(\vert \mathcal{E}(h)-\hat{\mathcal{E}}(h)\vert \leqslant \gamma\) (applied to \(\hat{h}\) and \(h^*\) respectively), and the second uses the fact that \(\hat{h}\) minimizes the empirical risk, so \(\hat{\mathcal{E}}(\hat{h}) \leqslant \hat{\mathcal{E}}(h^*)\).

We may summarize the above discussion in the following theorem:

Theorem. Let \(\vert \mathcal{H} \vert = k\), and let any \(m\), \(\delta\) be fixed. Then with probability at least \(1-\delta\) we have that:

$$ \mathcal{E}(\hat{h}) \leqslant \left(\underset{h\in\mathcal{H}}{min}\ \mathcal{E}(h)\right)+2\sqrt{\frac{1}{2m}log(\frac{2k}{\delta})} $$

Note that switching to a larger hypothesis space \(\mathcal{H}' \supseteq \mathcal{H}\) can only decrease the first term (the “bias”) but will increase the second term (the “variance”), since \(k\) grows. This is known as the bias-variance tradeoff.

Corollary. Let \(\vert \mathcal{H} \vert = k\), and let any \(\gamma\), \(\delta\) be fixed. Then for \(\mathcal{E}(\hat{h}) \leqslant \underset{h\in\mathcal{H}}{min}\ \mathcal{E}(h)+2\gamma\) to hold with probability at least \(1-\delta\), it suffices that:

$$ \begin{eqnarray} m &\geqslant& \frac{1}{2\gamma^2}log(\frac{2k}{\delta})\\ &=& O(\frac{1}{\gamma^2}log(\frac{k}{\delta}) ) \end{eqnarray} $$

The Case of Infinite \(\mathcal{H}\)

Given a set \(S = \{x^{(1)},x^{(2)},...,x^{(d)}\}\) and a hypothesis class \(\mathcal{H}\), we say that \(\mathcal{H}\) shatters \(S\) if \(\mathcal{H}\) can realize any labelling on \(S\), i.e. for every set of labels \(\{y^{(1)},...,y^{(d)}\}\) there is some \(h \in \mathcal{H}\) with \(h(x^{(i)})=y^{(i)}\) for all \(i\). We define the Vapnik-Chervonenkis (VC) dimension of \(\mathcal{H}\), \(VC(\mathcal{H})\), to be the size of the largest set that can be shattered by \(\mathcal{H}\). Note that to show that \(VC(\mathcal{H}) \geqslant d\) we only need to exhibit at least one set of size \(d\) that \(\mathcal{H}\) shatters; there might well exist other sets of size \(d\) that \(\mathcal{H}\) cannot shatter.
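
As an illustration of these definitions, the sketch below checks shattering for an (illustrative) class of threshold functions \(h_a(x) = 1\{x \geqslant a\}\) on the real line by enumerating every labelling; it confirms that a single point can be shattered but a pair cannot, so this class has VC dimension \(1\):

```python
from itertools import product

# Check whether the (illustrative) class of thresholds h_a(x) = 1{x >= a} shatters a point set.
def shatters(points):
    """Return True if every labelling of `points` is realized by some threshold."""
    xs = sorted(points)
    # Candidate thresholds: below all points, between consecutive points, above all points.
    candidates = [xs[0] - 1.0] + [(a + b) / 2.0 for a, b in zip(xs, xs[1:])] + [xs[-1] + 1.0]
    for labelling in product([0, 1], repeat=len(points)):
        realized = any(
            all((1 if x >= a else 0) == y for x, y in zip(points, labelling))
            for a in candidates
        )
        if not realized:
            return False
    return True

print(shatters([0.0]))        # True: a single point can be shattered
print(shatters([0.0, 1.0]))   # False: the labelling (1, 0) cannot be realized, so VC(H) = 1
```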

The following theorem, due to Vapnik, can then be shown:

Theorem. Let \(\mathcal{H}\) be given and let \(d = VC(\mathcal{H})\). Then with probability at least \(1-\delta\) we have that for all \(h \in \mathcal{H}\):

$$ \vert \mathcal{E}(h)-\hat{\mathcal{E}}(h)\vert \leqslant O\left(\sqrt{\frac{d}{m}log\frac{m}{d}+\frac{1}{m}log\frac{1}{\delta}}\right) $$

Thus, with probability \(1-\delta\) we also have:

$$ \mathcal{E}(\hat{h}) \leqslant \mathcal{E}(h^*) + O\left(\sqrt{\frac{d}{m}log\frac{m}{d}+\frac{1}{m}log\frac{1}{\delta}}\right) $$

Note that as \(m\) becomes large the bound above shrinks to zero, i.e. \(\hat{\mathcal{E}}(h)\) converges to \(\mathcal{E}(h)\) uniformly over \(\mathcal{H}\).
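
Since the \(O(\cdot)\) hides constants, only the scaling can be illustrated; the sketch below evaluates the expression inside it for arbitrary illustrative values of \(d\) and \(\delta\) and shows it shrinking as \(m\) grows:

```python
import math

# The O(.) in the VC bound hides constants, so this only illustrates how the expression
# inside it scales with m; d and delta are arbitrary illustrative choices.
d, delta = 20, 0.05
for m in [100, 1000, 10000, 100000]:
    value = math.sqrt((d / m) * math.log(m / d) + (1 / m) * math.log(1 / delta))
    print(f"m = {m:6d}: sqrt((d/m)log(m/d) + (1/m)log(1/delta)) = {value:.3f}")
```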

Corollary. For \(\vert \mathcal{E}(h)-\hat{\mathcal{E}}(h)\vert \leqslant \gamma\) (and hence \(\mathcal{E}(\hat{h}) \leqslant \mathcal{E}(h^*) +2\gamma\)) to hold for all \(h \in \mathcal{H}\) with probability \(1-\delta\), it suffices that \(m=O_{\gamma,\delta}(d)\) (where the subscripts \(\gamma\), \(\delta\) indicate that \(O\) is hiding some constants that may depend on \(\gamma\) and \(\delta\)).

Note that all results proved above are for algorithms that use empirical risk minimization.

Feature Selection

Consider the following problem: suppose that we are given a training set \(S=\{(x^{(i)},y^{(i)}); i=1,...,m\}\) where each \(x^{(i)}\) is a high-dimensional vector representing the features of the \(i\)th training example. To reduce the dimensionality (say, to reduce the amount of computation required to train our model), we may wish to select a subset of, say, the \(k\) most “relevant” features. One way this can be done is to calculate the mutual information \(MI(x_{i},y)\) between each feature \(x_{i}\) and \(y\):

$$ MI(x_i,y) = \sum_{x_i \in \{0,1\}}\sum_{y \in \{0,1\}}p(x_i,y)log\frac{p(x_i,y)}{p(x_i)p(y)} $$

where we have assumed that both \(x_i\) and \(y\) are binary-valued. In general, the summation would be over the domains of the variables. Note that this may also be expressed as a Kullback-Leibler (KL) divergence:

$$ MI(x_i,y) = KL(p(x_i,y)\ \vert\vert\ p(x_i)p(y)) $$

which, informally put, gives a measure of how different the probability distributions \(p(x_i,y)\) and \(p(x_i)p(y)\) are.
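
A minimal sketch of this feature-selection scheme for binary features, estimating each \(MI(x_i,y)\) from empirical counts and keeping the \(k\) highest-scoring features; the synthetic data and the small smoothing constant (used to avoid \(log\ 0\)) are illustrative assumptions:

```python
import numpy as np

def mutual_information(xi, y, eps=1e-12):
    """Empirical mutual information between a binary feature column xi and binary labels y."""
    mi = 0.0
    for a in (0, 1):
        for b in (0, 1):
            p_ab = np.mean((xi == a) & (y == b)) + eps   # joint p(x_i = a, y = b), smoothed
            p_a = np.mean(xi == a) + eps                 # marginal p(x_i = a)
            p_b = np.mean(y == b) + eps                  # marginal p(y = b)
            mi += p_ab * np.log(p_ab / (p_a * p_b))
    return mi

# Illustrative synthetic data: feature 0 mostly agrees with y, the rest are pure noise.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=1000)
X = rng.integers(0, 2, size=(1000, 5))
X[:, 0] = y ^ (rng.random(1000) < 0.1)

scores = [mutual_information(X[:, j], y) for j in range(X.shape[1])]
k = 2
selected = np.argsort(scores)[::-1][:k]                  # indices of the k highest-MI features
print("MI scores:", np.round(scores, 4), "selected features:", selected)
```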

Bayesian Statistics

Until now, we have considered \(\theta\) to be an unknown but fixed quantity. This is the view taken in frequentist statistics. In contrast, the Bayesian view of statistics considers \(\theta\) to be a random variable with some prior probability distribution \(p(\theta)\). Note that for a training set \(S=\{(x^{(i)},y^{(i)}); i=1,...,m\}\):

$$ \begin{eqnarray} p(\theta \vert S) &=& \frac{P(S \vert \theta)P(\theta)}{P(S)}\\ &=& \frac{\prod_{i=1}^m P(y^{(i)}\vert x^{(i)}, \theta)P(\theta)}{\int_\theta \prod_{i=1}^m P(y^{(i)}\vert x^{(i)},\theta)p(\theta)d\theta} \end{eqnarray} $$

Thus, for a test example \(T = (x,y)\) we have:

$$ \begin{eqnarray} p(y \vert x, S) &=& \int_\theta p(y \vert x, S, \theta)p(\theta \vert x, S)d\theta\\ &=& \int_\theta p(y \vert x,\theta)p(\theta \vert S)d\theta \end{eqnarray} $$

Note that the second step makes use of the assumption that, given \(\theta\), the test example \(T\) is independent of the training set \(S\), so \(p(y \vert x, S, \theta) = p(y \vert x, \theta)\), and that the test input \(x\) alone tells us nothing about \(\theta\), so \(p(\theta \vert x, S) = p(\theta \vert S)\). Note that:

$$ \mathbb{E}(y \vert x, S) = \int_y y\ p(y \vert x, S)dy $$
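
When \(\theta\) is low-dimensional the integral over \(\theta\) can be approximated directly. The sketch below does this on a grid for a one-dimensional logistic model \(p(y=1 \vert x,\theta) = 1/(1+e^{-\theta x})\) with a Gaussian prior; the data, the prior scale \(\tau\), and the grid are illustrative assumptions:

```python
import numpy as np

# Posterior predictive p(y | x, S) = integral of p(y | x, theta) p(theta | S) dtheta,
# approximated on a grid for a 1-D logistic model. All numerical choices are illustrative.
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x_train = rng.normal(size=50)
y_train = (rng.random(50) < sigmoid(2.0 * x_train)).astype(int)   # data from "true" theta = 2.0

theta = np.linspace(-10, 10, 2001)                    # grid over theta
tau = 1.0
log_prior = -theta**2 / (2 * tau**2)                  # theta ~ N(0, tau^2), up to a constant
p1 = sigmoid(np.outer(theta, x_train))                # p(y = 1 | x_i, theta) for each grid theta
log_lik = (y_train * np.log(p1) + (1 - y_train) * np.log(1 - p1)).sum(axis=1)

log_post = log_prior + log_lik
post = np.exp(log_post - log_post.max())
post /= post.sum()                                    # normalized p(theta | S) on the grid

x_test = 0.5
pred = np.sum(sigmoid(theta * x_test) * post)         # the integral, as a weighted sum
print(f"p(y = 1 | x = {x_test}, S) is approximately {pred:.3f}")
```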

However, as \(\theta\) is usually very high-dimensional, computing an integral over all possible \(\theta\)'s is computationally very expensive. So instead, we replace \(\theta\) by a single-point estimate, the maximum a posteriori (MAP) estimate, given by:

$$ \theta_{MAP} = \underset{\theta}{argmax} \left(\prod_{i=1}^m p(y^{(i)}\vert x^{(i)},\theta)\right)p(\theta) $$

A common choice for the prior \(p(\theta)\) is \(\theta \sim \mathcal{N}(0,\tau^2I)\). The Bayesian approach usually leads to less overfitting.
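
For example, taking logarithms shows that with this Gaussian prior the MAP estimate is simply the maximum-likelihood estimate with an added L2 penalty, which is why this choice tends to shrink \(\theta\) and reduce overfitting:

$$ \theta_{MAP} = \underset{\theta}{argmax} \left(\sum_{i=1}^m log\ p(y^{(i)}\vert x^{(i)},\theta) - \frac{1}{2\tau^2}\vert\vert\theta\vert\vert^2\right) $$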

The Perceptron and Large Margin Classifiers

Consider the problem of online learning. Suppose that we have a sequence of examples \((x^{(1)},y^{(1)}), (x^{(2)},y^{(2)}),...,(x^{(m)},y^{(m)})\), where \(y^{(i)} \in \{-1,+1\}\). Instead of feeding the entire set into our model at once, in online learning we feed in these examples one by one and make a prediction at each step as follows:

$$ h_\theta(x) = g(\theta^Tx) $$

where:

$$ g(z) = \begin{cases} +1, & \text{if}\ z \geqslant 0 \\ -1, & \text{if}\ z<0 \end{cases} $$

At each step, if \(h_\theta(x^{(i)}) \neq y^{(i)}\), i.e. the prediction was wrong, we will make the following update:

$$ \theta :=\theta + y^{(i)}x^{(i)} $$
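
A minimal sketch of this online perceptron, run on a synthetic (illustrative) stream of linearly separable examples:

```python
import numpy as np

# Online perceptron: predict with g(theta^T x) and update theta only on a mistake.
# The synthetic data stream below is an illustrative assumption.
rng = np.random.default_rng(0)
d, m = 5, 200
u = rng.normal(size=d)
u /= np.linalg.norm(u)                       # unit vector defining the true separator
X = rng.normal(size=(m, d))
y = np.where(X @ u >= 0, 1, -1)              # labels in {-1, +1}

theta = np.zeros(d)
mistakes = 0
for x_i, y_i in zip(X, y):
    prediction = 1 if theta @ x_i >= 0 else -1
    if prediction != y_i:                    # update only when the prediction is wrong
        theta += y_i * x_i
        mistakes += 1

print(f"mistakes made on the stream: {mistakes}")
```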

The following theorem gives a bound on the number of mistakes this algorithm (the perceptron algorithm) makes.

Theorem (Block, 1962 and Novikoff, 1962). Let a sequence of examples \((x^{(1)},y^{(1)}), (x^{(2)},y^{(2)}),...,(x^{(m)},y^{(m)})\) be given. Suppose that \(\vert\vert x^{(i)} \vert\vert \leqslant D\) for all \(i\), and further that there exists a unit-length vector \(u\) such that \(y^{(i)}u^Tx^{(i)} \geqslant \gamma\) for all examples in the sequence. Then the total number of mistakes the perceptron makes on this sequence is at most \((\frac{D}{\gamma})^2\).

Proof. Let \(\theta^{(k)}\) be the weights of the perceptron when it makes the \(k\)th mistake, so that \(\theta^{(1)} = \vec{0}\) (the weights only change when a mistake is made). Let the \(k\)th mistake be on the example \((x^{(i)},y^{(i)})\). Therefore:

$$ (x^{(i)})^T\theta^{(k)}y^{(i)} \leqslant 0 $$

Also, as \(\theta^{(k+1)} = \theta^{(k)} + y^{(i)}x^{(i)}\), we have:

$$ \begin{eqnarray} (\theta^{(k+1)})^Tu &=& (\theta^{(k)})^Tu + y^{(i)}(x^{(i)})^Tu\\ &\geqslant& (\theta^{(k)})^Tu + \gamma\\ &\geqslant& (\theta^{(k-1)})^Tu + \gamma + \gamma\\ &\geqslant& k\gamma \end{eqnarray} $$

where the last step follows by induction on \(k\), using \((\theta^{(1)})^Tu = 0\).

Also:

$$ \begin{eqnarray} \vert\vert \theta^{(k+1)} \vert\vert^2 &=& \vert\vert \theta^{(k)} \vert\vert^2 + \vert\vert x^{(i)}\vert\vert^2 + 2y^{(i)}(x^{(i)})^T\theta^{(k)}\\ &\leqslant& \vert\vert \theta^{(k)} \vert\vert^2 + \vert\vert x^{(i)}\vert\vert^2\\ &\leqslant& \vert\vert \theta^{(k)} \vert\vert^2 + D^2\\ &\leqslant& \vert\vert \theta^{(k-1)} \vert\vert^2 + D^2 + D^2\\ &\leqslant& kD^2 \end{eqnarray} $$

Note that the first inequality used the fact that \(y^{(i)}(x^{(i)})^T\theta^{(k)} \leqslant 0\) (since the \(k\)th mistake was made on \((x^{(i)},y^{(i)})\)), and the last step follows by induction, using \(\theta^{(1)} = \vec{0}\).

Therefore:

$$ \begin{eqnarray} \sqrt{k}D &\geqslant& \vert\vert \theta^{(k+1)} \vert\vert\\ &\geqslant& (\theta^{(k+1)})^Tu\\ &\geqslant& k\gamma\\ \end{eqnarray} $$

Note that the second inequality above used the fact that \(z^Tu = \vert\vert z \vert\vert\ \vert\vert u \vert\vert cos\phi \leqslant \vert\vert z \vert\vert\ \vert\vert u \vert\vert\) and that \(u\) is a unit vector, i.e. \(\vert\vert u \vert\vert = 1\).

So:

$$ k \leqslant (\frac{D}{\gamma})^2 $$
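
To see the bound in action, the following sketch generates a stream satisfying the theorem's assumptions (bounded \(\vert\vert x^{(i)}\vert\vert\) and margin at least \(\gamma\) with respect to a unit vector \(u\)), runs the perceptron over it, and compares the number of mistakes with \((\frac{D}{\gamma})^2\); all data-generating choices are illustrative:

```python
import numpy as np

# Check the Block-Novikoff bound: on a stream with ||x|| <= D and margin y * u^T x >= gamma,
# the perceptron makes at most (D / gamma)^2 mistakes. Data generation is illustrative.
rng = np.random.default_rng(1)
d, m = 5, 5000
u = rng.normal(size=d)
u /= np.linalg.norm(u)                        # the unit-length vector of the theorem

X, y = [], []
while len(X) < m:
    x = rng.normal(size=d)
    margin = x @ u
    if abs(margin) >= 0.1:                    # keep only examples with margin at least 0.1
        X.append(x)
        y.append(1 if margin >= 0 else -1)
X, y = np.array(X), np.array(y)

D = np.max(np.linalg.norm(X, axis=1))         # bound on ||x||
gamma = np.min(y * (X @ u))                   # realized margin
theta, mistakes = np.zeros(d), 0
for x_i, y_i in zip(X, y):
    if (1 if theta @ x_i >= 0 else -1) != y_i:
        theta += y_i * x_i
        mistakes += 1

print(f"mistakes = {mistakes}, bound (D/gamma)^2 = {(D / gamma) ** 2:.1f}")
```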