$\mathcal{D}=(X,y)$. $X$ is $N\times D$. $y$ is $N \times 1$.
Computational complexity of computing $\alpha^* = (K+\lambda I)^{-1} y$: solving the linear system costs $\mathcal{O}(N^3)$ time and storing $K$ costs $\mathcal{O}(N^2)$ memory.
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
N = np.arange(0, 10**5, 1000)
size_element = 8  # bytes per float64 entry
Memory_GB = size_element * N**2 / 10**9 / 2  # store only the upper triangle of the symmetric K
power_cpu_GHz = 1.25
operations_per_second_cpu = power_cpu_GHz * 10**9  # 1.25 GHz, assuming one operation per cycle
time_hours = (N**3 / operations_per_second_cpu) / 3600  # O(N^3) cost of the solve
fig, ax = plt.subplots(1, 2, figsize=(16, 6))
ax[0].plot(N, Memory_GB)
ax[1].plot(N, time_hours)
ax[0].set_xlabel("N")
ax[0].set_ylabel("Memory (GB)")
ax[0].set_title("Memory K matrix")
ax[1].set_xlabel("N")
ax[1].set_title("CPU %.2fGHz" % power_cpu_GHz)
_ = ax[1].set_ylabel("Time (hours)")
$$ k(X,X)=K_{ff} = \begin{pmatrix} k(x_1,x_{1}) & k(x_1,x_{2}) &...& k(x_1,x_{N}) \\ k(x_2,x_{1}) & k(x_2,x_{2}) &...& k(x_2,x_{N}) \\ \vdots & \vdots & \ddots &\vdots \\ k(x_N,x_{1}) & k(x_N,x_{2}) &...& k(x_N,x_{N}) \\ \end{pmatrix} \in \mathbb{R}^{N\times N} $$
Suppose we have a kernel function that measures similarities between points in our input space $\mathcal{X}$.
$$ \begin{aligned} k: &\mathcal{X} \times \mathcal{X} \longrightarrow \mathbb{R}^{+}\\ &(x,x')\longrightarrow k(x,x') \end{aligned} $$
This kernel function defines a Reproducing Kernel Hilbert Space (RKHS) [Mercer 1907] that we will call $\mathcal{F}$ (by means of the Moore–Aronszajn theorem).
$\mathcal{F}$ is a space of functions: each $f\in \mathcal{F}$ is a function $f:\mathcal{X} \longrightarrow \mathbb{R}$.
This space consists of functions of the form $f(\color{blue}x)=\sum_{i=1}^\infty a_i k(x_i,\color{blue}x)$ (where $\sum_{i=1}^\infty a_i^2 k(x_i,x_i)< \infty$).
The scalar product between two functions in this space of functions is computed as follows: $$ \langle f,g \rangle_{\mathcal{F}} = \langle \sum_{i=1}^{\infty}a_i k(x_i,\cdot), \sum_{i=1}^{\infty}b_i k(y_i,\cdot)\rangle_{\mathcal{F}}:=\sum_{i=1}^\infty\sum_{j=1}^\infty a_i b_j k(x_i,y_j) $$
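As a quick worked example of these definitions, take the single-term functions $f=a_1\,k(x_1,\cdot)$ and $g=b_1\,k(x_2,\cdot)$: $$ \langle f,g \rangle_{\mathcal{F}} = a_1 b_1\, k(x_1,x_2), \qquad \|f\|^2_{\mathcal{F}} = a_1^2\, k(x_1,x_1), \qquad \langle k(x,\cdot),f \rangle_{\mathcal{F}} = a_1\, k(x_1,x) = f(x), $$ the last identity being the reproducing property that gives the RKHS its name.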
Input space $\mathcal{X}=\mathbb{R}^D$.
Let $\mathcal{D}=\{x_i,y_i\}_{i=1}^N$ be the data for our regression problem, with $x_i\in \mathcal{X}$ and $y_i \in \mathbb{R}$. Kernel Ridge Regression (KRR) aims to solve the problem:
$$ f^* = \arg \min_{\color{red}f \in \mathcal{F}} \underbrace{\overbrace{\sum_{i=1}^N \big( \color{red}f(x_i) - y_i \big)^2}^{Error(\color{red}f)} + \overbrace{\lambda\|\color{red}f\|^2_{\mathcal{F}}}^{Reg(\color{red}f)}}_{J(\color{red}f)} $$
The [Representer Theorem] states that the minimizer $f^*$ of $J(\color{red}f)$ can be represented as: $$ f^*(\color{blue}x) = \sum_{i=1}^N\alpha_i k(x_i,\color{blue}x) = \overbrace{k(\color{blue}x,X)}^{1 \times N}\underbrace{\alpha}_{^{N \times 1}} $$
The risk of $f^*$:
$$ \begin{aligned} J(f^*) &= \sum_{i=1}^N \big(f^*(x_i) - y_i \big)^2 + \lambda\|f^*\|^2_{\mathcal{F}} \\ &= \sum_{i=1}^N \big(k(x_i,X)\alpha - y_i \big)^2 + \lambda\sum_{i=1}^N \sum_{j=1}^N \alpha_i \alpha_j k(x_i,x_j) \\ &= \|K \alpha - y \|^2 + \lambda\alpha^t K \alpha = J(\alpha) \end{aligned} $$
We have transformed the minimization over the space of functions of the RKHS of $k$ into a finite-dimensional minimization problem (in fact, a linear ridge regression) in $\alpha \in \mathbb{R}^N$!
What is the $\alpha^*$ that minimizes $J(\alpha)$?
$$ \alpha^* = (K+ \lambda I)^{-1}y $$
How do we make predictions for new $\color{purple}x$: $$ f^*(\color{purple}x) = \sum_{i=1}^N\alpha_i^* k(x_i,\color{purple}x) = \overbrace{k(\color{purple}x,X)}^{1 \times N}\underbrace{\alpha^*}_{^{N \times 1}} $$
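A minimal NumPy sketch of the whole recipe (the RBF kernel, the lengthscale `ell`, the regularizer `lam` and the toy data below are illustrative assumptions, not part of the derivation):
import numpy as np
rng = np.random.default_rng(0)
N, ell, lam = 200, 0.5, 1e-2                      # toy problem size and hyperparameters (assumed)
X = rng.uniform(-3, 3, size=(N, 1))               # training inputs, N x D
y = np.sin(2 * X[:, 0]) + 0.1 * rng.standard_normal(N)
def rbf(A, B):                                    # assumed kernel k(A, B): squared exponential
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ell**2)
K = rbf(X, X)                                     # K_ff, N x N
alpha = np.linalg.solve(K + lam * np.eye(N), y)   # alpha* = (K + lam I)^{-1} y
X_new = np.linspace(-3, 3, 100)[:, None]          # new inputs
f_new = rbf(X_new, X) @ alpha                     # f*(x) = k(x, X) alpha*
Note that `np.linalg.solve` is used instead of forming the inverse explicitly; both cost $\mathcal{O}(N^3)$, but the solve is numerically more stable.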
What is the value of the risk, $J$ at $\alpha^*$?
$$ \begin{aligned} J(\alpha^*) &= \lambda y^t(K+\lambda I)^{-1} y \\ &= y^t\left(\lambda^{-1}K+ I\right)^{-1} y \end{aligned} $$
We first expand the risk term: $$ \begin{aligned} J(\alpha) &= \|K \alpha - y \|^2 + \lambda\alpha^t K \alpha \\ &= (K\alpha -y)^t (K\alpha -y) + \lambda\alpha^t K \alpha \\ &= \alpha^t K^t K\alpha -2 \alpha^t K y + y^t y + \lambda\alpha^t K \alpha \end{aligned} $$
We compute the gradient: $$ \begin{aligned} \nabla_\alpha J(\alpha) &= 2 K K\alpha -2 K y + 2\lambda K \alpha \\ &= 2 K \big( K\alpha - y + \lambda \alpha\big) \\ &= 2 K \big( (K+ \lambda I) \alpha - y \big) \end{aligned} $$
Now we solve $\nabla_\alpha J(\alpha) = 0$. Since $K$ is positive semi-definite, it is enough to set the second factor to zero: $(K+ \lambda I) \alpha - y = 0$ $\implies$ $\alpha^* = (K+ \lambda I)^{-1}y$
What is the training error ($Error(\color{red}f)$) that the minimum $\alpha^*$ incurs: $$ \require{cancel} \begin{aligned} Error(\alpha^*) &= \|K\alpha^* -y\|^2 = \|K(K+\lambda I)^{-1}y - y\|^2 \\ &= \|( K+\lambda I -\lambda I)(K+\lambda I)^{-1}y - y\|^2 \\ &= \|\cancel{( K+\lambda I)(K+\lambda I)^{-1}}y -\lambda I(K+\lambda I)^{-1}y - y\|^2 \\ &= \|\cancel{y} -\lambda (K+\lambda I)^{-1}y - \cancel{y}\|^2 \\ &= \lambda^2 \| (K+\lambda I)^{-1}y\|^2 \\ &= \lambda^2 \| \alpha^* \|^2 \end{aligned} $$
What is the regularization penalty ($Reg(\color{red}f)$) of $\alpha^*$: $$ \begin{aligned} \frac{Reg(\alpha^*)}{\lambda} &= \alpha^{*t}K\alpha^* \\ &= y^t(K+\lambda I)^{-1}K(K+\lambda I)^{-1} y \\ &= y^t(K+\lambda I)^{-1}(K+\lambda I- \lambda I)(K+\lambda I)^{-1} y \\ &= y^t\cancel{(K+\lambda I)^{-1}}\cancel{(K+\lambda I)}(K+\lambda I)^{-1} y -y^t(K+\lambda I)^{-1}(\lambda I)(K+\lambda I)^{-1} y \\ &= y^t(K+\lambda I)^{-1} y -\lambda y^t(K+\lambda I)^{-1}(K+\lambda I)^{-1} y \\ &= y^t(K+\lambda I)^{-1} y -\lambda\|(K+\lambda I)^{-1} y \|^2 \\ &= y^t(K+\lambda I)^{-1} y -\lambda\|\alpha^* \|^2 \\ Reg(\alpha^*) &= \lambda y^t(K+\lambda I)^{-1} y -\lambda^2\|\alpha^* \|^2 \end{aligned} $$
Putting things together:
$$ \begin{aligned} J(f^*) = J(\alpha^*) &= Error(\alpha^*) + Reg(\alpha^*) \\ &= \cancel{\lambda^2 \| \alpha^* \|^2} + \lambda y^t(K+\lambda I)^{-1} y -\cancel{\lambda^2\|\alpha^* \|^2} \\ &= \lambda y^t(K+\lambda I)^{-1} y \\ &= \lambda y^t\left(\lambda\left(\lambda^{-1}K+ I\right)\right)^{-1} y \\ &= y^t\left(\lambda^{-1}K+ I\right)^{-1} y \end{aligned} $$
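Continuing the NumPy sketch above (same toy kernel and data), we can check this identity numerically:
error = np.sum((K @ alpha - y) ** 2)              # ||K alpha* - y||^2
reg = lam * alpha @ K @ alpha                     # lam * alpha*^t K alpha*
J = lam * y @ np.linalg.solve(K + lam * np.eye(N), y)
print(np.isclose(error + reg, J))                 # True, up to numerical error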
Computational complexity of computing $\alpha^* = (K+\lambda I)^{-1} y$: solving the linear system costs $\mathcal{O}(N^3)$ time and storing $K$ costs $\mathcal{O}(N^2)$ memory.
Cross-validation to adjust $\lambda$ and the parameters of the kernel ($\Theta$).
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
N = np.arange(0, 10**5, 1000)
size_element = 8  # bytes per float64 entry
Memory_GB = size_element * N**2 / 10**9 / 2  # store only the upper triangle of the symmetric K
power_cpu_GHz = 1.25
operations_per_second_cpu = power_cpu_GHz * 10**9  # 1.25 GHz, assuming one operation per cycle
time_hours = (N**3 / operations_per_second_cpu) / 3600  # O(N^3) cost of the solve
fig, ax = plt.subplots(1, 2, figsize=(16, 6))
ax[0].plot(N, Memory_GB)
ax[1].plot(N, time_hours)
ax[0].set_xlabel("N")
ax[0].set_ylabel("Memory (GB)")
ax[0].set_title("Memory K matrix")
ax[1].set_xlabel("N")
ax[1].set_title("CPU %.2fGHz" % power_cpu_GHz)
_ = ax[1].set_ylabel("Time (hours)")
Instead of representing $f^*$ with the $N$ points in $X$, represent it using only $M \ll N$ points $X_u=\{x_{ui}\}_{i=1}^M\subset X$: $$ \color{blue}{f^*}(x) = \sum_{i=1}^{N} \color{blue}{\alpha_i} k(\color{blue}{x_i},x) = \overbrace{k(x,\color{blue}X)}^{1 \times N}\underbrace{\color{blue}\alpha}_{^{N \times 1}} $$
$$ \color{red}{f^*}(x) = \sum_{i=1}^{M} \color{red}{\alpha_{ui}} k(\color{red}{x_{ui}},x) = \overbrace{k(x,\color{red}{X_u})}^{1 \times M}\underbrace{\color{red}{\alpha_u}}_{^{M \times 1}} $$
$$ \begin{aligned} J(\color{blue}{f^*}) &= \sum_{i=1}^N \big(\color{blue}{f^*}(x_i) - y_i \big)^2 + \lambda\|\color{blue}{f^*}\|^2_{\mathcal{F}} \\ &= \sum_{i=1}^N \big(k(x_i,\color{blue}{X})\color{blue}{\alpha} - y_i \big)^2 + \lambda\sum_{i=1}^N \sum_{j=1}^N \color{blue}{\alpha_i} \color{blue}{\alpha_j} k(\color{blue}{x_i},\color{blue}{x_j}) \\ &= \|\color{blue}K \color{blue}\alpha - y \|^2 + \lambda\color{blue}\alpha^t \color{blue}K \color{blue}\alpha = J(\color{blue}\alpha) \end{aligned} $$
$$ \begin{aligned} J(\color{red}{f^*}) &= \sum_{i=1}^N \big(\color{red}{f^*}(x_i) - y_i \big)^2 + \lambda\|\color{red}{f^*}\|^2_{\mathcal{F}} \\ &= \sum_{i=1}^N \big(k(x_i,\color{red}{X_u})\color{red}{\alpha_u} - y_i \big)^2 + \lambda\sum_{i=1}^M \sum_{j=1}^M \color{red}{\alpha_{ui}} \color{red}{\alpha_{uj}} k(\color{red}{x_{ui}},\color{red}{x_{uj}}) \\ &= \|k(X_f,\color{red}{X_u}) \color{red}{\alpha_u} - y \|^2 + \lambda\color{red}{\alpha_u}^t \color{red}{K_{uu}} \color{red}{\alpha_u} = J(\color{red}{\alpha_u}) \end{aligned} $$
$$ k(X_f,X_u)=K_{fu} = \begin{pmatrix} k(x_1,x_{u_1}) & k(x_1,x_{u_2}) &...& k(x_1,x_{u_M}) \\ k(x_2,x_{u_1}) & k(x_2,x_{u_2}) &...& k(x_2,x_{u_M}) \\ \vdots & \vdots & \ddots &\vdots \\ k(x_N,x_{u_1}) & k(x_N,x_{u_2}) &...& k(x_N,x_{u_M}) \\ \end{pmatrix} \in \mathbb{R}^{N\times M} $$
Finding the $\color{red}{\alpha_u^*}$ such that $\nabla_{\color{red}{\alpha_u}}J(\color{red}{\alpha_u^*}) = 0$ leads to:
$$ \begin{aligned} \color{red}{\alpha_u^*} &= \big(K_{uf} K_{fu} + \lambda K_{uu}\big)^{-1}K_{uf}y\\ &= K_{uu}^{-1}K_{uf}\big(\underbrace{K_{fu}K_{uu}^{-1}K_{uf}}_{Q_{ff}} + \lambda I\big)^{-1}y \end{aligned} $$
The Risk $J$ at this $\color{red}{\alpha_u^*}$ is: $$ \begin{aligned} J(\color{red}{f^*}) = J(\color{red}{\alpha_u^*}) &= \lambda y^t\big(\overbrace{K_{fu}K_{uu}^{-1}K_{uf}}^{Q_{ff}}+ \lambda I\big)^{-1} y \\ &= y^t \left( I -K_{fu}\Big(\lambda K_{uu} + K_{uf}K_{fu}\Big)^{-1}K_{uf} \right)y \end{aligned} $$
Use the Matrix Inversion Lemma to see the equivalence!
$$ \begin{aligned} J(\color{red}{\alpha_u}) &= \|K_{fu} \color{red}{\alpha_u} - y \|^2 + \lambda\color{red}{\alpha_u}^t K_{uu} \color{red}{\alpha_u} \\ &= (K_{fu}\color{red}{\alpha_u} -y)^t (K_{fu}\color{red}{\alpha_u} -y) + \lambda\color{red}{\alpha_u}^t K_{uu} \color{red}{\alpha_u}\\ &= \color{red}{\alpha_u}^t K_{uf} K_{fu}\color{red}{\alpha_u} -2 \color{red}{\alpha_u}^t K_{uf} y + y^t y + \lambda\color{red}{\alpha_u}^t K_{uu} \color{red}{\alpha_u} \end{aligned} $$
$$ \begin{aligned} \nabla_\alpha J(\color{red}{\alpha_u}) &= 2 K_{uf} K_{fu}\color{red}{\alpha_u} -2 K_{uf} y + 2\lambda K_{uu} \color{red}{\alpha_u} \\ &= 2 \big( K_{uf} K_{fu}\color{red}{\alpha_u} - K_{uf}y + \lambda K_{uu}\color{red}{\alpha_u}\big) \\ &= 2 \big( (K_{uf} K_{fu} + \lambda K_{uu})\color{red}{\alpha_u} - K_{uf}y \big) \end{aligned} $$
$\nabla_\alpha J(\color{red}{\alpha_u}) = 0$ $\implies$ $(K_{uf} K_{fu} + \lambda K_{uu})\color{red}{\alpha_u} - K_{uf}y = 0$ $\implies$
$$ \begin{aligned} \color{red}{\alpha_u^*} &= \big(K_{uf} K_{fu} + \lambda K_{uu}\big)^{-1}K_{uf}y\\ \color{red}{\alpha_u^*} &= K_{uu}^{-1}K_{uf}\big(\underbrace{K_{fu}K_{uu}^{-1} K_{uf}}_{Q_{ff}} + \lambda I\big)^{-1}y \end{aligned} $$
Last step: Matrix Inversion Lemma.
$$ \Big(Z + UWV^t\Big)^{-1} = Z^{-1} -Z^{-1} U\Big(W^{-1} + V^tZ^{-1}U\Big)^{-1}V^tZ^{-1} $$ $Z$ and $W$ are square invertible matrices. $V$ and $U$ don't need to be.
In our problem (take $Z=\lambda I$, $U=K_{fu}$, $W=K_{uu}^{-1}$, $V^t=K_{uf}$): $$ \begin{aligned} \Big(\lambda I + \underbrace{K_{fu}K_{uu}^{-1}K_{uf}}_{Q_{ff}}\Big)^{-1} &= \lambda^{-1} I -\lambda^{-1} K_{fu}\Big(K_{uu} + K_{uf}\lambda^{-1}K_{fu}\Big)^{-1}K_{uf}\lambda^{-1} \\ &= \lambda^{-1} I -\lambda^{-1}K_{fu}\Big(\lambda K_{uu} + K_{uf}K_{fu}\Big)^{-1}K_{uf} \\ \end{aligned} $$
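A quick numerical check of this identity on toy matrices (assumed RBF kernel; a small jitter is added to $K_{uu}$ so it is invertible):
import numpy as np
rng = np.random.default_rng(1)
N, M, D, lam, ell = 50, 7, 2, 0.1, 1.0            # toy sizes (assumed)
X = rng.standard_normal((N, D))
Xu = X[:M]                                        # pseudo-inputs taken from X
def rbf(A, B):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ell**2)
Kfu = rbf(X, Xu)
Kuu = rbf(Xu, Xu) + 1e-10 * np.eye(M)             # jitter for invertibility
Qff = Kfu @ np.linalg.solve(Kuu, Kfu.T)
lhs = np.linalg.inv(lam * np.eye(N) + Qff)
rhs = (np.eye(N) - Kfu @ np.linalg.solve(lam * Kuu + Kfu.T @ Kfu, Kfu.T)) / lam
print(np.allclose(lhs, rhs))                      # True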
What is the $Error(\color{red}{\alpha_u^*})$:
$$ \begin{aligned} Error(\color{red}{\alpha_u^*}) &= \|K_{fu} \color{red}{\alpha_u^*} - y \|^2 \\ &= \Big\|\overbrace{K_{fu}K_{uu}^{-1}K_{uf}}^{Q_{ff}} \Big(Q_{ff} + \lambda I\Big)^{-1}y - y \Big\|^2 \\ &= \lambda^2\left\| \left(Q_{ff} + \lambda I\right)^{-1}y \right\|^2 \quad \text{(same add-and-subtract $\lambda I$ trick as before)} \end{aligned} $$ Watch out: if $\lambda=0$, $Q_{ff}$ (which has rank at most $M<N$) cannot be inverted.
What is the regularization penalty ($Reg(\color{red}f)$) of $\color{red}{\alpha_u^*}$:
$$ \begin{aligned} \frac{Reg(\color{red}{\alpha_u^*})}{\lambda} &= \color{red}{\alpha_u^*}^t K_{uu}\color{red}{\alpha_u^*} \\ &= y^t\left(Q_{ff} + \lambda I\right)^{-1}K_{fu}\cancel{K_{uu}^{-1}}\cancel{K_{uu}}K_{uu}^{-1}K_{uf}\left(Q_{ff} + \lambda I\right)^{-1}y \\ &= y^t\left(Q_{ff} + \lambda I\right)^{-1}Q_{ff}\left(Q_{ff} + \lambda I\right)^{-1}y \\ &= y^t\left(Q_{ff} + \lambda I\right)^{-1}\left(Q_{ff} + \lambda I - \lambda I\right)\left(Q_{ff} + \lambda I\right)^{-1}y \\ &= y^t(Q_{ff}+\lambda I)^{-1} y -\lambda\left\|(Q_{ff}+\lambda I)^{-1} y \right\|^2 \\ Reg(\color{red}{\alpha_u^*}) &= \lambda y^t\left(Q_{ff}+\lambda I\right)^{-1} y -\lambda^2\left\|(Q_{ff}+\lambda I)^{-1} y \right\|^2 \end{aligned} $$
Putting things together:
$$ \begin{aligned} J(\color{red}{f^*}) = J(\color{red}{\alpha_u^*}) &= Error(\color{red}{\alpha_u^*}) + Reg(\color{red}{\alpha_u^*}) \\ &= \cancel{\lambda^2 \left\| (Q_{ff}+\lambda I)^{-1}y \right\|^2} + \lambda y^t(Q_{ff}+\lambda I)^{-1} y -\cancel{\lambda^2\left\|(Q_{ff}+\lambda I)^{-1}y \right\|^2} \\ &= \lambda y^t\left(Q_{ff}+ \lambda I\right)^{-1} y \\ &= \cancel{\lambda} y^t \left(\cancel{\lambda^{-1}} I -\cancel{\lambda^{-1}}K_{fu}\Big(\lambda K_{uu} + K_{uf}K_{fu}\Big)^{-1}K_{uf} \right)y \\ &= y^t \left( I -K_{fu}\Big(\lambda K_{uu} + K_{uf}K_{fu}\Big)^{-1}K_{uf} \right)y \end{aligned} $$
Last step: Matrix inversion lemma.
How do we make predictions for new $\color{purple}x$: $$ \color{red}{f^*}(\color{purple}x) = \sum_{i=1}^M\color{red}{\alpha_{ui}^*} k(x_{ui},\color{purple}x) = \overbrace{k(\color{purple}x,X_u)}^{1 \times M}\underbrace{\color{red}{\alpha_u^*}}_{^{M \times 1}} $$
Computational complexity: $\mathcal{O}(NM^2)$.
$\lambda$ and the kernel parameters $\Theta$ are adjusted by cross-validation.
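A minimal NumPy sketch of the SoR fit and prediction (the toy data, the RBF kernel and the random choice of $X_u\subset X$ are assumptions made for illustration); note that only $M\times M$ systems are solved:
import numpy as np
rng = np.random.default_rng(0)
N, M, ell, lam = 2000, 20, 0.5, 1e-2              # toy sizes and hyperparameters (assumed)
X = rng.uniform(-3, 3, size=(N, 1))
y = np.sin(2 * X[:, 0]) + 0.1 * rng.standard_normal(N)
Xu = X[rng.choice(N, M, replace=False)]           # M pseudo-inputs drawn from X
def rbf(A, B):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ell**2)
Kfu = rbf(X, Xu)                                  # N x M
Kuu = rbf(Xu, Xu) + 1e-10 * np.eye(M)             # M x M, with a small jitter
alpha_u = np.linalg.solve(Kfu.T @ Kfu + lam * Kuu, Kfu.T @ y)   # (K_uf K_fu + lam K_uu)^{-1} K_uf y
X_new = np.linspace(-3, 3, 100)[:, None]
f_new = rbf(X_new, Xu) @ alpha_u                  # f*(x) = k(x, X_u) alpha_u*
Forming $K_{uf}K_{fu}$ dominates the cost, which gives the $\mathcal{O}(NM^2)$ mentioned above.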
We will again define $\color{green}{f^*}$ in terms of a set of $M$ points $X_u=\{x_{ui}\}_{i=1}^M$, but without restricting $X_u\subset X$.
$\color{green}{X_u}$ are free variables in the input space.
$$ \color{green}{f^*}(x) = \sum_{i=1}^{M} \color{green}{\alpha_{ui}} k(\color{green}{x_{ui}},x) = \overbrace{k(x,\color{green}{X_u})}^{1 \times M}\underbrace{\color{green}{\alpha_u}}_{^{M \times 1}} $$
The risk $J$ would be: $$ \begin{aligned} J(\color{green}{f^*}) &= \sum_{i=1}^N \big(k(x_i,\color{green}{X_u})\color{green}{\alpha_u} - y_i \big)^2 + \lambda\sum_{i=1}^M \sum_{j=1}^M \color{green}{\alpha_{ui}} \color{green}{\alpha_{uj}} k(\color{green}{x_{ui}},\color{green}{x_{uj}}) \\ &= \|k(X_f,\color{green}{X_u}) \color{green}{\alpha_u} - y \|^2 + \lambda\color{green}{\alpha_u}^t \color{green}{K_{uu}} \color{green}{\alpha_u}= J(\color{green}{\alpha_u},\color{green}{X_u}) \end{aligned} $$
$$ \begin{aligned} \color{green}{X_u}(t+1) &= \color{green}{X_u}(t) - \epsilon \nabla_{\color{green}{X_u}} J(\color{green}{\alpha_u}(t),\color{green}{X_u}(t))\\ \color{green}{\alpha_u}(t+1) &= \color{green}{\alpha_u}(t) - \epsilon \nabla_{\color{green}{\alpha_u}} J(\color{green}{\alpha_u}(t),\color{green}{X_u}(t)) \end{aligned} $$
Initialized with $\color{green}{X_u}(0)\subset X$ (a subset of the training points).
The same scheme, but computing stochastic estimates of the gradients $\nabla_{X_u}J$ and $\nabla_{\alpha_u}J$.
If our gradient is an expectation over all the data $\mathcal{D}=\{x_i,y_i\}_{i=1}^N$: $$ \nabla_{\alpha}J(\alpha) = \frac{1}{N}\sum_{i=1}^N g(\alpha,x_i,y_i ) $$ Then if we take a subset (minibatch) $\mathcal{D}_{K} \subset \mathcal{D}$ of size $K$:
$$ \nabla_{\alpha}J(\alpha) \approx \frac{1}{K}\sum_{k=1}^K g(\alpha,x_k,y_k ) $$
SGD: the same update as gradient descent, but using the approximate gradient.
$$ \begin{aligned} \color{green}{X_u}(t+1) &= \color{green}{X_u}(t) - \epsilon \nabla_{\color{green}{X_u}} J(\color{green}{\alpha_u}(t),\color{green}{X_u}(t))\\ \color{green}{\alpha_u}(t+1) &= \color{green}{\alpha_u}(t) - \epsilon \nabla_{\color{green}{\alpha_u}} J(\color{green}{\alpha_u}(t),\color{green}{X_u}(t)) \end{aligned} $$
The computational cost per update does not depend on $N$!
The loss $J$ can be decomposed for stochastic gradient descent:
$$ \begin{aligned} \nabla_{\color{green}{\alpha_u}} J(\color{green}{\alpha_u},\color{green}{X_u}) &= \sum_{i=1}^N 2\left(k(x_i,\color{green}{X_u})\color{green}{\alpha_u} - y_i\right) k(x_i,\color{green}{X_u})^t + 2\lambda k(\color{green}{X_u},\color{green}{X_u})\color{green}{\alpha_u} \\ \frac{\nabla_{\color{green}{\alpha_u}} J(\color{green}{\alpha_u},\color{green}{X_u})}{2N} &= \frac{1}{N}\sum_{i=1}^N \left(k(x_i,\color{green}{X_u})\color{green}{\alpha_u} - y_i\right) k(x_i,\color{green}{X_u})^t +\frac{\lambda}{N} k(\color{green}{X_u},\color{green}{X_u})\color{green}{\alpha_u} \\ &= \mathbb{E}_{(x,y)\sim \mathcal{D}} \left[(k(x,\color{green}{X_u})\color{green}{\alpha_u} - y) k(x,\color{green}{X_u})^t\right] +\frac{\lambda}{N} k(\color{green}{X_u},\color{green}{X_u})\color{green}{\alpha_u} \end{aligned} $$
Compute the stochastic gradient by estimating these expectations with small subsets (minibatches) of $\mathcal{D}$.
$$ \begin{aligned} \nabla_{\color{green}{X_u}} J(\color{green}{\alpha_u},\color{green}{X_u}) &= \sum_{i=1}^N 2\left(k(x_i,\color{green}{X_u})\color{green}{\alpha_u} - y_i\right) \nabla_{\color{green}{X_u}}k(x_i,\color{green}{X_u})\color{green}{\alpha_u} + \lambda \color{green}{\alpha_u}^t \nabla_{\color{green}{X_u}}k(\color{green}{X_u},\color{green}{X_u})\color{green}{\alpha_u} \\ \frac{\nabla_{\color{green}{X_u}} J(\color{green}{\alpha_u},\color{green}{X_u})}{2N} &= \frac{1}{N}\sum_{i=1}^N \left(k(x_i,\color{green}{X_u})\color{green}{\alpha_u} - y_i\right) \nabla_{\color{green}{X_u}}k(x_i,\color{green}{X_u})\color{green}{\alpha_u} + \frac{\lambda}{2N} \color{green}{\alpha_u}^t \nabla_{\color{green}{X_u}}k(\color{green}{X_u},\color{green}{X_u})\color{green}{\alpha_u} \\ &= \mathbb{E}_{(x,y)\sim \mathcal{D}} \left[\left(k(x,\color{green}{X_u})\color{green}{\alpha_u} - y\right) \nabla_{\color{green}{X_u}}k(x,\color{green}{X_u})\color{green}{\alpha_u}\right] + \frac{\lambda}{2N} \color{green}{\alpha_u}^t \nabla_{\color{green}{X_u}}k(\color{green}{X_u},\color{green}{X_u})\color{green}{\alpha_u} \end{aligned} $$
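As an illustration, here is a minibatch estimate of the $\nabla_{\color{green}{\alpha_u}}$ term, reusing the toy SoR variables from the sketch above (so $X_u$ is still a subset of $X$ here); the batch size `B` is an arbitrary choice, and the $\nabla_{\color{green}{X_u}}$ term would additionally need the kernel derivatives:
B = 64                                            # minibatch size (an arbitrary choice)
idx = rng.choice(N, B, replace=False)
Kbu = rbf(X[idx], Xu)                             # B x M block of K_fu
resid = Kbu @ alpha_u - y[idx]                    # minibatch residuals
grad_sgd = Kbu.T @ resid / B + (lam / N) * (Kuu @ alpha_u)          # stochastic estimate of grad/(2N)
grad_full = Kfu.T @ (Kfu @ alpha_u - y) / N + (lam / N) * (Kuu @ alpha_u)
# grad_full is ~0 because alpha_u is already the SoR optimum;
# the minibatch estimate fluctuates around it with O(1/sqrt(B)) noise
print(np.linalg.norm(grad_full), np.linalg.norm(grad_sgd))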
Recall the SoR solution $$ \begin{aligned} \color{red}{\alpha_u^*} &= \big(K_{uf} K_{fu} + \lambda K_{uu}\big)^{-1}K_{uf}y\\ &= K_{uu}^{-1}K_{uf}\big(\underbrace{K_{fu}K_{uu}^{-1}K_{uf}}_{Q_{ff}} + \lambda I\big)^{-1}y \end{aligned} $$ The value of the risk at the optimal $\color{red}{\alpha_u^*}$: $$ \begin{aligned} J(\color{red}{f^*}) = J(\color{red}{\alpha_u^*}) &= \lambda y^t\big(\overbrace{K_{fu}K_{uu}^{-1}K_{uf}}^{Q_{ff}}+ \lambda I\big)^{-1} y \\ &= y^t \left( I -K_{fu}\Big(\lambda K_{uu} + K_{uf}K_{fu}\Big)^{-1}K_{uf} \right)y \end{aligned} $$
We can minimize the optimal risk w.r.t. only $\color{green}{X_u}$:
$$ \begin{aligned} \hat{J}(\color{green}{X_u}) &= J(\color{green}{\alpha_u^*},\color{green}{X_u}) = \lambda y^t\big(\overbrace{K(X,\color{green}{X_u})\color{green}{K_{uu}}^{-1}K(\color{green}{X_u},X)}^{\color{green}{Q_{ff}}}+ \lambda I\big)^{-1} y \\ &= y^t \left( I -K(X,\color{green}{X_u})\Big(\lambda\color{green}{K_{uu}} + K(\color{green}{X_u},X)K(X,\color{green}{X_u})\Big)^{-1}K(\color{green}{X_u},X) \right)y \\ \end{aligned} $$
Algorithm: $$ \color{green}{X_u}(t+1) = \color{green}{X_u}(t) - \epsilon \nabla_{\color{green}{X_u}} \hat{J}(\color{green}{X_u}(t)) $$
Using as starting point: $\color{green}{X_u}(0) \subset X$.
Computational complexity of computing the gradient $\nabla_{\color{green}{X_u}}\hat{J}$ is: $\mathcal{O}(DNM^2)$ [Quiñonero-Candela 2005].
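A toy sketch of this scheme (illustration only: it uses finite-difference gradients of $\hat{J}$ instead of the analytic $\mathcal{O}(DNM^2)$ expression, an assumed RBF kernel, and a normalized step for stability):
import numpy as np
rng = np.random.default_rng(0)
N, D, M, ell, lam = 200, 1, 5, 0.5, 1e-1          # toy sizes and hyperparameters (assumed)
X = rng.uniform(-3, 3, size=(N, D))
y = np.sin(2 * X[:, 0]) + 0.1 * rng.standard_normal(N)
def rbf(A, B):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ell**2)
def J_hat(Xu):                                    # lam * y^t (Q_ff + lam I)^{-1} y
    Kfu = rbf(X, Xu)
    Kuu = rbf(Xu, Xu) + 1e-10 * np.eye(len(Xu))
    Qff = Kfu @ np.linalg.solve(Kuu, Kfu.T)
    return lam * y @ np.linalg.solve(Qff + lam * np.eye(N), y)
def num_grad(Xu, h=1e-5):                         # finite differences: slow but simple
    g = np.zeros_like(Xu)
    for i in range(M):
        for j in range(D):
            Xp, Xm = Xu.copy(), Xu.copy()
            Xp[i, j] += h
            Xm[i, j] -= h
            g[i, j] = (J_hat(Xp) - J_hat(Xm)) / (2 * h)
    return g
Xu = X[rng.choice(N, M, replace=False)].copy()    # X_u(0): a subset of X
print("J_hat before:", J_hat(Xu))
for t in range(50):
    g = num_grad(Xu)
    Xu -= 0.05 * g / (np.linalg.norm(g) + 1e-12)  # normalized gradient descent step
print("J_hat after: ", J_hat(Xu))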
How do we make predictions for new $\color{purple}x$: $$ \color{green}{f^*}(\color{purple}x) = \sum_{i=1}^M\color{green}{\alpha_{ui}^*} k(\color{green}{x_{ui}^*},\color{purple}x) = \overbrace{k(\color{purple}x,\color{green}{X_u^*})}^{1 \times M}\underbrace{\color{green}{\alpha_u^*}}_{^{M \times 1}} $$
Gaussian Processes ($\mathcal{GP}$s) are a Bayesian regression method that assumes that the outputs $y^*$ given the inputs $X^*$ follow a multivariate Gaussian distribution $y^*\mid X^* \sim \mathcal{N}(0,k(X^*,X^*)+\sigma^2 I)$.
When we observe some data $\mathcal{D}=(X,y)$, we condition on it to obtain the predictive distribution $p(y^*\mid X^*,X,y)$.
The mean function of this probabilistic model is the same as the KRR prediction (identifying $\lambda$ with $\sigma^2$)!
By assumption, the model for the data is: $$ p(y^*,y\mid X^*,X) = \mathcal{N}\left(0,\begin{pmatrix} k(X,X)+\sigma^2 I & k(X,X^*) \\ k(X^*,X) &k(X^*,X^*)+\sigma^2 I \end{pmatrix}\right) $$
So we can compute the conditional distribution (standard Gaussian conditioning): $$ \begin{aligned} p(y^*\mid X^*,X,y) = \mathcal{N}\Big(&k(X^*,X)\left(K+\sigma^2 I\right)^{-1}y,\\ & k(X^*,X^*) +\sigma^2 I -k(X^*,X)\left(K+\sigma^2I\right)^{-1}k(X,X^*)\Big) \end{aligned} $$
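A minimal sketch of these conditioning formulas (the toy data, the RBF kernel and the noise level below are assumptions made for illustration):
import numpy as np
rng = np.random.default_rng(0)
N, ell, sigma2 = 100, 0.5, 0.01                   # toy sizes and hyperparameters (assumed)
X = rng.uniform(-3, 3, size=(N, 1))
y = np.sin(2 * X[:, 0]) + np.sqrt(sigma2) * rng.standard_normal(N)
X_star = np.linspace(-3, 3, 50)[:, None]
def rbf(A, B):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ell**2)
K = rbf(X, X)
Ks = rbf(X_star, X)                               # k(X*, X)
Kss = rbf(X_star, X_star)                         # k(X*, X*)
mean = Ks @ np.linalg.solve(K + sigma2 * np.eye(N), y)
cov = Kss + sigma2 * np.eye(len(X_star)) - Ks @ np.linalg.solve(K + sigma2 * np.eye(N), Ks.T)
The mean is exactly the KRR prediction with $\lambda=\sigma^2$; the GP additionally provides the predictive covariance.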
Many proposals over the last years:
nombres = {'snelson_sparse_2006': "Propose to optimize pseudo-inputs (FITC)",
'seeger_fast_2003': "SoR/DTC + search best pseudo-inputs in X",
"williams_using_2000": "Nystrom",
"quinonero-candela_unifying_2005": "Unifying Framework",
"lazaro-gredilla_sparse_2010": "Random Fourier Features: (SSGP)",
"csato_sparse_2002": "SoR/DTC + search best pseudo-inputs in X",
"titsias_variational_2009": "Variational free energy approach (VFE)",
'hensman_gaussian_2013': "Stochastic gradient descent on VFE: (SVI)",
"gal_improving_2015": "Stochastic gradient descent on SSGP",
"matthews_sparse_2015": "Titsias VFE alternative derivation",
"bauer_understanding_2016": "VFE vs FITC",
'bui_unifying_2017': "New framework using expectation propagation (EP)"
}
from IPython.display import HTML
from pybtex.database.input import bibtex
parser = bibtex.Parser()
bib_data = parser.parse_file('../../Sparse_GP.bib')
def make_cita(bibentry):
text_latex = bibentry.rich_fields.get('author')
persons_print =[person.split(",")[0] for person in str(text_latex).split("and ")]
#print(persons_print,len(persons_print))
if len(persons_print)==1:
cita = persons_print[0]
elif len(persons_print)==2:
cita = " and ".join(persons_print)
else:
cita = persons_print[0]+" et al."
cita+=" "+ str(bibentry.rich_fields.get('year'))
return cita
string = ""
for bibentry_key in sorted(bib_data.entries.keys(),key=lambda x: bib_data.entries[x].fields["year"]):
bibentry = bib_data.entries[bibentry_key]
# bibentry.rich_fields.get('title')
if bibentry_key in nombres:
#print(make_cita(bibentry),"--->",nombres[bibentry_key])
string+="<a href='%s'>[%s]</a>: <span>%s</span></br>"%(bibentry.fields["url"],make_cita(bibentry),nombres[bibentry_key])
HTML(string)
We introduce the variable $u$: the noise-free function values at the pseudo-inputs $X_u$. Given $u$, the posterior for the standard $\mathcal{GP}$ is: $$ \begin{aligned} p(y \mid u,X_u,X) = \mathcal{N}(& \overbrace{k(X,X_u)}^{K_{fu}} \left(K_{uu}\right)^{-1} u,\\ & K_{ff} - K_{fu}K_{uu}^{-1}K_{uf} + \sigma^2 I ) \end{aligned} $$
$\color{red}{SoR/DTC}$ approximations: $$ \begin{aligned} p_{\color{red}{SoR/DTC}}(y \mid u,X_u,X) = \mathcal{N}(& K_{fu} \left(K_{uu}\right)^{-1} u,\\ & \color{red}{0} + \sigma^2 I ) \end{aligned} $$
$\color{blue}{FITC}$ approximations: $$ \begin{aligned} p_{\color{blue}{FITC}}(y \mid u,X_u,X) = \mathcal{N}(& K_{fu} \left(K_{uu}\right)^{-1} u,\\ & \color{blue}{diag\Big[K_{ff} - \underbrace{K_{fu}K_{uu}^{-1}K_{uf}}_{Q_{ff}}\Big]} + \sigma^2 I ) \end{aligned} $$
To compute the Marginal Likelihood we integrate out $u$. $$ p(y \mid X_u,X) = \int p(y \mid u,X_u,X) \overbrace{p(u \mid X_u)}^{\mathcal{N}(0,K_{uu})} du = \mathcal{N}(0,K+\sigma^2I) $$
If we use $\color{red}{SoR/DTC}$ approximation:
$$ p_{\color{red}{SoR/DTC}}(y \mid X_u,X) = \int p_{\color{red}{SoR/DTC}}(y \mid u,X_u,X) \overbrace{p(u \mid X_u)}^{\mathcal{N}(0,K_{uu})} du = \mathcal{N}(0,\color{red}{\underbrace{K_{fu}K_{uu}^{-1}K_{uf}}_{Q_{ff}}}+\sigma^2I) $$
If we use $\color{blue}{FITC}$ approximation:
$$ p_{\color{blue}{FITC}}(y \mid X_u,X) = \int p_{\color{blue}{FITC}}(y \mid u,X_u,X) \overbrace{p(u \mid X_u)}^{\mathcal{N}(0,K_{uu})} du = \mathcal{N}(0,\color{blue}{\underbrace{K_{fu}K_{uu}^{-1}K_{uf}}_{Q_{ff}} + {diag}\big[K_{ff} -Q_{ff}\big]}+\sigma^2I) $$
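In code, the three training covariances being compared look like this (toy inputs and an RBF kernel are assumed; $X_u$ is a random subset of $X$):
import numpy as np
rng = np.random.default_rng(0)
N, M, ell, sigma2 = 200, 15, 0.5, 0.01            # toy sizes and hyperparameters (assumed)
X = rng.uniform(-3, 3, size=(N, 1))
Xu = X[rng.choice(N, M, replace=False)]
def rbf(A, B):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ell**2)
Kff = rbf(X, X)
Kfu = rbf(X, Xu)
Kuu = rbf(Xu, Xu) + 1e-10 * np.eye(M)
Qff = Kfu @ np.linalg.solve(Kuu, Kfu.T)
C_exact = Kff + sigma2 * np.eye(N)                                  # full GP
C_sor = Qff + sigma2 * np.eye(N)                                    # SoR / DTC
C_fitc = Qff + np.diag(np.diag(Kff - Qff)) + sigma2 * np.eye(N)     # FITC
print(np.allclose(np.diag(C_fitc), np.diag(C_exact)))               # True: FITC is exact on the diagonal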
Normal $\mathcal{GP}$ $$ \begin{aligned} p_{\mathcal{GP}}(y^* \mid X_u,y,X,X^*) = \mathcal{N}\Big(&K_{*f}\left(K_{ff}+\sigma^2I\right)^{-1}y,\\ & K_{**} +\sigma^2 I -\left(K_{*f}\left(K_{ff}+\sigma^2I\right)^{-1}K_{f*}\right)\Big) \end{aligned} $$
If we use $\color{red}{SoR}$ approximation:
$$ \begin{aligned} p_{\color{red}{SoR}}(y^* \mid X_u,y,X,X^*) = \mathcal{N}\Big(&\color{red}{Q_{*f}}\left(\color{red}{Q_{ff}}+\sigma^2I\right)^{-1}y,\\ & \color{red}{Q_{**}} +\sigma^2 I -\left(\color{red}{Q_{*f}}\left(\color{red}{Q_{ff}}+\sigma^2I\right)^{-1}\color{red}{Q_{f*}}\right)\Big) \end{aligned} $$
If we use $\color{purple}{DTC}$ approximation:
$$ \begin{aligned} p_{\color{purple}{DTC}}(y^* \mid X_u,y,X,X^*) = \mathcal{N}\Big(&\color{purple}{Q_{*f}}\left(\color{purple}{Q_{ff}}+\sigma^2I\right)^{-1}y,\\ & K_{**} +\sigma^2 I -\left(\color{purple}{Q_{*f}}\left(\color{purple}{Q_{ff}}+\sigma^2I\right)^{-1}\color{purple}{Q_{f*}}\right)\Big) \end{aligned} $$
To compute the posterior we are assuming the following joint priors.
We recall the joint model for the normal $\mathcal{GP}$: $$ p_{\mathcal{GP}}(y^*,y\mid X^*,X) = \mathcal{N}\left(0,\begin{pmatrix} K_{ff}+\sigma^2 I & K_{f*} \\ K_{*f} &K_{**}+\sigma^2 I \end{pmatrix}\right) $$
For $\color{red}{SoR}$ this is the model that leads to the aforementioned posterior: $$ p_{\color{red}{SoR}}(y^*,y\mid X^*,X) = \mathcal{N}\left(0,\begin{pmatrix} \color{red}{Q_{ff}}+\sigma^2 I & \color{red}{Q_{f*}} \\ \color{red}{Q_{*f}} &\color{red}{Q_{**}}+\sigma^2 I \end{pmatrix}\right) $$
For $\color{purple}{DTC}$ [Csató and Opper 2002, Seeger et al. 2003]: $$ p_{\color{purple}{DTC}}(y^*,y\mid X^*,X) = \mathcal{N}\left(0,\begin{pmatrix} \color{purple}{Q_{ff}}+\sigma^2 I & \color{purple}{Q_{f*}} \\ \color{purple}{Q_{*f}} &\color{purple}{K_{**}}+\sigma^2 I \end{pmatrix}\right) $$
Normal $\mathcal{GP}$ $$ \begin{aligned} p_{\mathcal{GP}}(y^* \mid X_u,y,X,X^*) = \mathcal{N}\Big(&K_{*f}\left(K_{ff}+\sigma^2I\right)^{-1}y,\\ & K_{**} +\sigma^2 I -\left(K_{*f}\left(K_{ff}+\sigma^2I\right)^{-1}K_{f*}\right)\Big) \end{aligned} $$
If we use $\color{blue}{FITC}$ approximation [Snelson and Ghahramani 2006]:
$$ \begin{aligned} p_{\color{blue}{FITC}}(y^* \mid X_u,y,X,X^*) = \mathcal{N}\Big(&\color{blue}{Q_{*f}}\left(\color{blue}{Q_{ff}}+ \color{blue}{diag\big[K_{ff} -Q_{ff}\big]}+\sigma^2I\right)^{-1}y,\\ & K_{**} +\sigma^2 I -\left(\color{blue}{Q_{*f}}\left(\color{blue}{Q_{ff}}+\color{blue}{diag\big[K_{ff} -Q_{ff}\big]}+\sigma^2I\right)^{-1}\color{blue}{Q_{f*}}\right)\Big) \end{aligned} $$
To compute the posterior we are assuming the following joint priors.
We recall the joint model for the normal $\mathcal{GP}$: $$ p_{\mathcal{GP}}(y^*,y\mid X^*,X) = \mathcal{N}\left(0,\begin{pmatrix} K_{ff}+\sigma^2 I & K_{f*} \\ K_{*f} &K_{**}+\sigma^2 I \end{pmatrix}\right) $$
For $\color{blue}{FITC}$: $$ p_{\color{blue}{FITC}}(y^*,y\mid X^*,X) = \mathcal{N}\left(0,\begin{pmatrix} \color{blue}{Q_{ff}+ {diag}\big[K_{ff} -Q_{ff}\big]}+\sigma^2 I & \color{blue}{Q_{f*}} \\ \color{blue}{Q_{*f}} &\color{blue}{K_{**}}+\sigma^2 I \end{pmatrix}\right) $$
Different approaches try to minimize/maximize similar expressions. Which one is preferable? [Bauer et al. 2016] compares VFE vs. FITC.
Comparison on IASI temperature retrieval data.
!cp ../../spgp_compairson.py .
!cp ../../load_data.py .
!cp ../../keras_okrr.py .
import spgp_compairson
import numpy as np
folder_jsons = "/mnt/erc_gonzalo/git/uv/sparse_gps/salida/"
todas_stats = spgp_compairson.read_jsons(folder_jsons)
cols_mostrar = ["M","N_train","metodo","mean_squared_error_train","mean_squared_error_test"]
SoD_stats = todas_stats[(todas_stats.metodo == "SoD") & (todas_stats.N_train == 50000)]
SoD_stats = SoD_stats.sort_values(by=["M","metodo"])
N_train_show = 30000
todas_stats_fixed_N = todas_stats[(todas_stats.N_train == N_train_show) &
(todas_stats.metodo != "FITC_no_opt_gpflow")& (todas_stats.metodo != "SVI")
& (todas_stats.metodo != "SoD")]
todas_stats_fixed_N = todas_stats_fixed_N.sort_values(by=["M","metodo"])
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('seaborn-whitegrid')
from load_data import update_rc_params
update_rc_params()
METHODS = spgp_compairson.METHODS + spgp_compairson.METHODS_2
metodo_color = dict([(m,"C%d"%i) for i,m in enumerate(METHODS)])
marker_color = dict([(m,mar) for m,mar in zip(METHODS,["^","v",".","o","+","v"]+["^","v",".","o","+","v"])])
def plot_curves(df, value="mean_squared_error_test", index="M", group_name="metodo", ax=None):
    if ax is None:               # use the provided axes if given, otherwise the current one
        ax = plt.gca()
    for name, group in df.groupby(group_name):
        line, = ax.plot(group[index], group[value], c=metodo_color[name],
                        marker=marker_color[name], label=name)
    ax.set_title(value)
    ax.set_xlabel(index)
def plot_curves_show(*args,**kwargs):
plt.figure()
plot_curves(*args,**kwargs)
ax = plt.gca()
value = kwargs["value"]
line, =ax.plot(SoD_stats["M"],SoD_stats[value],c=metodo_color["SoD"],
marker=marker_color["SoD"],label="SoD")
plt.legend()
plt.show()
plot_curves_show(todas_stats_fixed_N,value="mean_squared_error_test")
plot_curves_show(todas_stats_fixed_N,value="fitting_time")
#plot_curves_show(todas_stats_fixed_N,value="mean_squared_error_train")
#todas_stats.head()
M_show = 500
todas_stats_fixed_M = todas_stats[(todas_stats.M == M_show) & (todas_stats.metodo != "SoD") & (todas_stats.metodo != "FITC_no_opt") & (todas_stats.metodo != "SVI")]
todas_stats_fixed_M = todas_stats_fixed_M.sort_values(by=["N_train","metodo"])
def plot_curves_show_fixed_M(*args,**kwargs):
plt.figure()
plot_curves(*args,**kwargs)
ax = plt.gca()
value = kwargs["value"]
SoD_stats_show = SoD_stats[SoD_stats.M >= M_show]
line, =ax.plot(SoD_stats_show["M"],SoD_stats_show[value],c=metodo_color["SoD"],
marker=marker_color["SoD"],label="Full GP")
plt.legend()
plt.show()
plot_curves_show_fixed_M(todas_stats_fixed_M,index="N_train",value="mean_squared_error_test")
plot_curves_show_fixed_M(todas_stats_fixed_M,index="N_train",value="fitting_time")