1. Kernel ridge regression
(a) We showed in the lecture notes that \(\sum_{i=1}^m(\theta^Tx^{(i)}-y^{(i)})^2 = (X\theta-\vec{y})^T(X\theta-\vec{y})\). Also, we showed that \(\nabla_\theta \frac{1}{2}(X\theta-\vec{y})^T(X\theta-\vec{y}) = X^TX\theta-X^T\vec{y}\). Setting the gradient of \(J(\theta)\) to zero:
$$ \begin{eqnarray} \nabla_\theta J(\theta) &=& 0\\ \nabla_\theta \left(\frac{1}{2}(X\theta-\vec{y})^T(X\theta-\vec{y}) + \frac{\lambda}{2} \theta^T\theta\right) &=& 0\\ X^TX\theta - X^T\vec{y} + \lambda\theta &=& 0\\ (X^TX + \lambda I)\theta &=& X^T\vec{y}\\ \theta &=& (X^TX + \lambda I)^{-1}X^T\vec{y} \end{eqnarray} $$
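For reference, the closed-form solution above translates directly into code. Below is a minimal numpy sketch (the data and variable names are made up for illustration) that computes \(\theta = (X^TX+\lambda I)^{-1}X^T\vec{y}\), using a linear solve rather than an explicit matrix inverse:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))                       # m = 50 examples, n = 5 features
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=50)
lam = 0.1                                          # regularization strength lambda

# theta = (X^T X + lambda I)^{-1} X^T y, computed via a linear solve
n = X.shape[1]
theta = np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ y)
```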
(b) Note that:
$$ \begin{eqnarray} \theta^T\phi(x_{\text{new}}) &=& \left[(\phi(X)^T\phi(X)+\lambda I)^{-1}\phi(X)^T\vec{y}\right]^T\phi(x_{\text{new}})\\ &=& \vec{y}^T\phi(X)(\phi(X)^T\phi(X)+\lambda I)^{-1}\phi(x_{\text{new}})\\ &=& \vec{y}^T(\phi(X)\phi(X)^T+\lambda I)^{-1}\phi(X)\phi(x_{\text{new}}) \end{eqnarray} $$
Here, \(\phi(X)\phi(X)^T\) is a matrix and \(\phi(X)\phi(x_{\text{new}})\) is a vector whose entries are dot products of the form \(\phi(x^{(i)})^T\phi(x^{(j)})\) and \(\phi(x^{(i)})^T\phi(x_{\text{new}})\), respectively. Each of these dot products can be computed efficiently using the kernel trick, without ever forming \(\phi\) explicitly.
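To make this concrete, here is a minimal numpy sketch (the Gaussian kernel choice, toy data, and function names are my own illustrative assumptions) that computes the prediction \(\vec{y}^T(K+\lambda I)^{-1}k(x_{\text{new}})\), where \(K_{ij}=K(x^{(i)},x^{(j)})\) and \(k(x_{\text{new}})_i = K(x^{(i)},x_{\text{new}})\):

```python
import numpy as np

def gaussian_kernel(a, b, tau=1.0):
    """K(a, b) = exp(-||a - b||^2 / tau^2); the kernel choice is an assumption."""
    return np.exp(-np.sum((a - b) ** 2) / tau ** 2)

def kernel_ridge_predict(X, y, x_new, lam=0.1, tau=1.0):
    m = len(X)
    K = np.array([[gaussian_kernel(X[i], X[j], tau) for j in range(m)]
                  for i in range(m)])                              # K_ij = K(x_i, x_j)
    k_new = np.array([gaussian_kernel(x, x_new, tau) for x in X])  # k_i = K(x_i, x_new)
    # theta^T phi(x_new) = y^T (K + lambda I)^{-1} k(x_new)
    return y @ np.linalg.solve(K + lam * np.eye(m), k_new)

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 3))
y = np.sin(X[:, 0]) + 0.05 * rng.normal(size=30)
print(kernel_ridge_predict(X, y, rng.normal(size=3)))
```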
2. \(\ell_2\) norm soft margin SVMs
(a) Allowing \(\xi_i\) to be negative has no effect on the optimal value of the objective function: if the constraint \(y^{(i)}(w^Tx^{(i)}+b) \geqslant 1-\xi_i\) is satisfied for some negative \(\xi_i\) (i.e. \(y^{(i)}(w^Tx^{(i)}+b) \geqslant 1+\vert\xi_i\vert\)), then it is also satisfied for \(\xi_i = 0\). Since \(\xi_i=0\) gives a strictly smaller penalty \(\frac{C}{2}\xi_i^2\) than any negative \(\xi_i\), the optimizer will always prefer \(\xi_i=0\) over a negative value.
(b) The Lagrangian is given by:
$$ \mathcal{L}(w,b,\xi_i,\alpha_i) = \frac{1}{2}\vert\vert w \vert\vert^2 + \frac{C}{2}\sum_{i=1}^m \xi_i^2 + \sum_{i=1}^m\alpha_i\left(1-\xi_i-y^{(i)}(w^Tx^{(i)}+b)\right) $$
(c) Minimizing the Lagrangian with respect to the parameters gives:
$$ \begin{eqnarray} \nabla_w\mathcal{L}(w,b,\xi_i,\alpha_i) &=& 0\\ w - \sum_{i=1}^m\alpha_iy^{(i)}x^{(i)} &=& 0\\ w &=& \sum_{i=1}^m\alpha_iy^{(i)}x^{(i)} \end{eqnarray} $$
And:
$$ \begin{eqnarray} \frac{\partial\mathcal{L}(w,b,\xi_i,\alpha_i)}{\partial b} &=& 0\\ \sum_{i=1}^m \alpha_iy^{(i)}&=& 0 \end{eqnarray} $$
Finally:
$$ \begin{eqnarray} \frac{\partial\mathcal{L}(w,b,\xi_i,\alpha_i)}{\partial \xi_j} &=& 0\\ C\xi_j-\alpha_j &=& 0\\ \xi_j &=& \frac{\alpha_j}{C} \end{eqnarray} $$
(d) Plugging in the expressions for \(w\) and \(\xi_i\) found above and using the fact that \(\sum_{i=1}^m \alpha_iy^{(i)}= 0\) gives:
$$ \begin{eqnarray} W(\alpha)&=& \min_{w,b,\xi_i}\mathcal{L}(w,b,\xi_i,\alpha_i)\\ &=& \sum_{i=1}^m \alpha_i- \frac{1}{2}\sum_{i=1}^m \sum_{j=1}^m \alpha_i\alpha_jy^{(i)}y^{(j)}{x^{(i)}}^Tx^{(j)} - \frac{1}{2}\sum_{i=1}^m\frac{\alpha_i^2}{C} \end{eqnarray} $$
The dual problem is thus given by:
$$ \begin{eqnarray} \max_{\alpha}\ && W(\alpha)\\ \text{s.t.}\ && \alpha_i \geqslant 0,\quad i=1,\ldots,m\\ && \sum_{i=1}^m\alpha_iy^{(i)} = 0 \end{eqnarray} $$
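As a sanity check of this dual, here is a minimal sketch (assuming `cvxpy` and `numpy` are available; the toy data, the choice \(C=1\), and all variable names are mine) that maximizes \(W(\alpha)\) subject to the constraints and then recovers \(w\), \(\xi\), and \(b\) from the conditions derived in part (c):

```python
import cvxpy as cp
import numpy as np

# Toy 2-class data (an assumption for the demo).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(+2.0, size=(20, 2)), rng.normal(-2.0, size=(20, 2))])
y = np.concatenate([np.ones(20), -np.ones(20)])
m, C = len(y), 1.0

# Dual: max_a  sum_i a_i - 1/2 ||sum_i a_i y_i x_i||^2 - 1/(2C) sum_i a_i^2
#       s.t.   a_i >= 0,  sum_i a_i y_i = 0
a = cp.Variable(m)
w_expr = X.T @ cp.multiply(y, a)                  # w = sum_i a_i y_i x_i
objective = cp.Maximize(cp.sum(a)
                        - 0.5 * cp.sum_squares(w_expr)
                        - (0.5 / C) * cp.sum_squares(a))
constraints = [a >= 0, cp.sum(cp.multiply(y, a)) == 0]
cp.Problem(objective, constraints).solve()

# Recover the primal variables using the stationarity conditions from part (c).
alpha = a.value
w = X.T @ (alpha * y)                             # w = sum_i alpha_i y_i x_i
xi = alpha / C                                    # xi_i = alpha_i / C
i = int(np.argmax(alpha))                         # an index with alpha_i > 0
b = y[i] * (1 - xi[i]) - X[i] @ w                 # from y_i (w^T x_i + b) = 1 - xi_i
print("training accuracy:", np.mean(np.sign(X @ w + b) == y))
```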
3. SVM with Gaussian Kernel
(a) Note that to make a correct prediction on \(x^{(j)}\) we require:
$$ \begin{cases}f(x^{(j)}) > 0 & \text{if} & y^{(j)} = +1\\ f(x^{(j)}) < 0 & \text{if} & y^{(j)} = -1\end{cases} $$
Or equivalently:
$$ \begin{cases}f(x^{(j)}) - y^{(j)}> -1 & \text{if} & y^{(j)} = +1\\ f(x^{(j)}) - y^{(j)}< +1 & \text{if} & y^{(j)} = -1\end{cases} $$
This implies that if \(\vert f(x^{(j)})-y^{(j)} \vert < 1\) then we must have made the right prediction. We need to find a value of \(\tau\) for which this inequality holds for all \(j=1,\ldots,m\). Note that with \(\alpha_i=1\) for all \(i\) and \(b=0\) we have:
$$ \begin{eqnarray} \vert f(x^{(j)})-y^{(j)} \vert &=& \left\vert \sum_{i=1}^m y^{(i)}\exp\left(-\frac{\vert\vert x^{(i)}-x^{(j)}\vert\vert^2}{\tau^2}\right)-y^{(j)}\right\vert\\ &=& \left\vert \sum_{\substack{i=1\\i\neq j}}^m y^{(i)}\exp\left(-\frac{\vert\vert x^{(i)}-x^{(j)}\vert\vert^2}{\tau^2}\right)\right\vert\\ &\leqslant& \sum_{\substack{i=1\\i\neq j}}^m \vert y^{(i)}\vert\exp\left(-\frac{\vert\vert x^{(i)}-x^{(j)}\vert\vert^2}{\tau^2}\right)\\ &\leqslant& \sum_{\substack{i=1\\i\neq j}}^m \exp\left(-\frac{\epsilon^2}{\tau^2}\right)\\ &=& (m-1)\exp\left(-\frac{\epsilon^2}{\tau^2}\right) \end{eqnarray} $$
where the second line uses \(K(x^{(j)},x^{(j)})=1\), the third line is the triangle inequality, and the fourth line uses \(\vert y^{(i)}\vert=1\) together with the assumption that \(\vert\vert x^{(i)}-x^{(j)}\vert\vert \geqslant \epsilon\) for \(i\neq j\). If there exists some \(\tau\) for which \(\exp(-\frac{\epsilon^2}{\tau^2})(m-1)<1\), then we can be certain that \(\vert f(x^{(j)})-y^{(j)} \vert < 1\) for all \(j\):
$$ \begin{eqnarray} \exp\left(-\frac{\epsilon^2}{\tau^2}\right)(m-1) &<& 1\\ -\frac{\epsilon^2}{\tau^2} &<& \log\frac{1}{m-1}\\ \frac{\epsilon^2}{\tau^2} &>& \log(m-1)\\ \tau^2 &<& \frac{\epsilon^2}{\log(m-1)}\\ \tau &<& \frac{\epsilon}{\sqrt{\log(m-1)}} \end{eqnarray} $$
Here we assume \(m>2\) so that \(\log(m-1)>0\); when \(m\leqslant 2\) the condition \(\exp(-\frac{\epsilon^2}{\tau^2})(m-1)<1\) holds for any \(\tau>0\).
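As a quick numerical check of this bound, the sketch below (a toy construction of mine: grid points whose pairwise distances are all at least \(\epsilon\), with arbitrary labels) builds \(f(x)=\sum_i y^{(i)}K(x^{(i)},x)\) with all \(\alpha_i=1\) and \(b=0\), picks \(\tau\) just below \(\epsilon/\sqrt{\log(m-1)}\), and verifies that every training point is classified correctly:

```python
import numpy as np

rng = np.random.default_rng(0)
eps = 1.0
# 4x4 grid with spacing eps, so every pairwise distance is at least eps.
X = np.stack(np.meshgrid(np.arange(4), np.arange(4)), axis=-1).reshape(-1, 2) * eps
y = rng.choice([-1.0, 1.0], size=len(X))          # arbitrary labels
m = len(X)

tau = 0.9 * eps / np.sqrt(np.log(m - 1))          # just below the bound from part (a)

def f(x):
    """Decision value with alpha_i = 1 and b = 0: sum_i y_i exp(-||x_i - x||^2 / tau^2)."""
    return np.sum(y * np.exp(-np.sum((X - x) ** 2, axis=1) / tau ** 2))

preds = np.array([np.sign(f(x)) for x in X])
print("training error:", np.mean(preds != y))     # expected: 0.0
```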
(b) Yes, the model will always achieve zero training error. The objective function with slack variables \(\xi_i \geqslant 0\) contains the penalty term \(C\sum_{i=1}^m \xi_i\), where \(C\) is some constant, and part (a) shows that the training set is separable in the feature space induced by the Gaussian kernel, so the constraints can all be satisfied with \(\xi_i=0\). One can therefore always choose \(C\) large enough (i.e. \(C \rightarrow \infty\)) that the objective function is minimized only when all \(\xi_i=0\), which forces zero training error.
(c) No, the model will not necessarily achieve zero training error. Suppose, for example, that \(C=0\). The objective then reduces to \(\min_w \frac{1}{2}\vert\vert w \vert\vert^2\), which is minimized by \(w=0\); the model can then pick \(\xi_i\) large enough to satisfy the constraint \(y^{(i)}(w^Tx^{(i)}+b)\geqslant1-\xi_i\) for every \(i\). The resulting classifier predicts \(\text{sign}(b)\) for every input and thus misclassifies any training example whose label differs from \(\text{sign}(b)\).
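To see the effect of \(C\) empirically, here is an illustrative sketch (it uses scikit-learn's `SVC`, which implements the \(\ell_1\) slack penalty \(C\sum_i\xi_i\) discussed in part (b) rather than the \(\ell_2\) penalty of problem 2; the toy data, \(\tau\), and the two values of \(C\) are arbitrary choices of mine) that trains a Gaussian-kernel SVM with a very large and a very small \(C\) and prints the training error of each:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2))
y = rng.choice([-1, 1], size=40)       # arbitrary labels; only the kernel makes them separable
tau = 0.3
gamma = 1.0 / tau ** 2                 # sklearn's RBF kernel is exp(-gamma * ||x - x'||^2)

for C in (1e6, 1e-6):
    clf = SVC(kernel="rbf", gamma=gamma, C=C).fit(X, y)
    err = np.mean(clf.predict(X) != y)
    print(f"C = {C:g}: training error = {err:.2f}")
```

On data like this, the large-\(C\) model typically drives all slacks to zero (zero training error), while the tiny-\(C\) model effectively ignores the data and misclassifies a large fraction of it.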
5. Uniform Convergence
(a) Let \(A_i\) be the event that \(\mathcal{\hat{E}}(h_i)=0\) even though \(\mathcal{E}(h_i) > \gamma\). If \(\mathcal{E}(h_i) > \gamma\), then the probability of the hypothesis \(h_i\) correctly classifying a single random example is at most \(1-\gamma\), so the probability of \(h_i\) classifying all \(m\) training examples correctly is at most \((1-\gamma)^m\). Hence \(P(A_i) \leqslant (1-\gamma)^m\) (and \(P(A_i) = 0\) when \(\mathcal{E}(h_i) \leqslant \gamma\)). By the union bound:
$$ \begin{eqnarray} P(\exists h_i \in \mathcal{H}: \mathcal{\hat{E}}(h_i) = 0 \text{ and } \mathcal{E}(h_i) > \gamma) &\leqslant& \sum_{i=1}^k P(A_i)\\ &\leqslant& k(1-\gamma)^m\\ &\leqslant& k\exp(-\gamma m) \end{eqnarray} $$
where the last inequality uses \(1-\gamma \leqslant \exp(-\gamma)\). Taking the complement of the event on the left-hand side:
$$ P(\forall h_i \in \mathcal{H}: \mathcal{\hat{E}}(h_i) = 0 \implies \mathcal{E}(h_i) \leqslant \gamma) = 1- P(\exists h_i \in \mathcal{H}: \mathcal{\hat{E}}(h_i) = 0 \text{ and } \mathcal{E}(h_i) > \gamma) \geqslant 1-k\exp(-\gamma m) $$
We want this to hold with probability at least \(1-\delta\), i.e. we need \(k\exp(-\gamma m) \leqslant \delta\). The tightest \(\gamma\) is obtained by setting \(k\exp(-\gamma m) = \delta\) and solving:
$$ \begin{eqnarray} k\exp(-\gamma m) &=& \delta\\ \exp(-\gamma m) &=& \frac{\delta}{k}\\ -\gamma m &=& \log\frac{\delta}{k}\\ \gamma &=& \frac{1}{m}\log\frac{k}{\delta} \end{eqnarray} $$
Hence, with probability at least \(1-\delta\):
$$ \mathcal{E}(\hat{h}) \leqslant \frac{1}{m}\log\frac{k}{\delta} $$
where \(\hat{h}\) is a hypothesis that achieves zero training error.
(b) From part (a), with probability at least \(1-\delta\) we have \(\mathcal{E}(\hat{h}) \leqslant \frac{1}{m}\log\frac{k}{\delta}\). This bound is at most \(\gamma\) exactly when:
$$ \frac{1}{m}\log\frac{k}{\delta} \leqslant \gamma $$
so \(\mathcal{E}(\hat{h})\leqslant \gamma\) holds with probability at least \(1-\delta\) whenever:
$$ m \geqslant \frac{1}{\gamma}\log\frac{k}{\delta} $$
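As a quick numeric illustration of this sample-complexity bound (the values of \(k\), \(\gamma\), and \(\delta\) below are arbitrary example choices):

```python
import math

k, gamma, delta = 10_000, 0.01, 0.05   # |H|, target error, failure probability (example values)
m_min = math.log(k / delta) / gamma    # m >= (1/gamma) * log(k / delta)
print(math.ceil(m_min))                # about 1221 examples suffice
```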