Jordi’s ML blog

LASSO: Subgradient method, projected gradient descent, proximal gradient descent and coordinate descent.

2026-03-13T00:00:00+00:00

We will cover, superficially, the main optimization approaches for an l1-penalized regression. A Jupyter notebook with the implementation can be found here.

The Least Absolute Shrinkage and Selection Operator (LASSO) problem

Let $\mathbf{X} \in \mathbb{R}^{N \times p}$ and $\mathbf{y} \in \mathbb{R}^{N}$, where $p$ is the number of “features” and $N$ the number of “observations”. The LASSO problem is:

\[\begin{aligned} \min_{\beta \in \mathbb{R}^p} \quad & \frac{1}{2N} \|\mathbf{y} - \mathbf{X}\beta\|_2^2 \\ \text{subject to} \quad & \|\beta\|_1 \leq t \end{aligned}\]

As long as the constraint is active, this has a one-to-one correspondence with a Lagrangian formulation in terms of $\lambda$:

\[\min_{\beta \in \mathbb{R}^p} \left\{ \frac{1}{2N} \|\mathbf{y} - \mathbf{X}\beta\|_2^2 + \lambda \|\beta\|_1 \right\}\]

This is probably one of the easiest interesting problems, stemming from the fact that the absolute value $\vert x \vert$ is not a differentiable function at $x = 0$. This is rather obvious since

\[\lim_{h \to 0^+} \frac{|0 + h| - |0|}{h} \neq \lim_{h \to 0^-} \frac{|0 + h| - |0|}{h}.\]

Subgradient method

A workaround (which easily holds for convex functions) is the subgradient:

A vector $g \in \mathbb{R}^n$ is a subgradient of $f:\mathbb{R}^n \rightarrow \mathbb{R}$ at $x \in \text{dom}\, f$ if $f(z) \geq f(x) + g^{\top}(z - x)$ $\forall z \in \text{dom} \,f.$ The set of all subgradients at $x$ is referred to as the subdifferential ($\partial f(x)$).

For $\vert x\vert$ it is actually very simple:

\[\partial \vert x\vert = \begin{cases} \{1\} & \text{if } x > 0 \\ \{-1\} & \text{if } x < 0 \\ [-1, 1] & \text{if } x = 0 \end{cases}\]

Now, we have that:

\[0 \in \partial f(x) \implies f(y) \geq f(x) \, \forall y \in \text{dom} \, f.\]

And hence that is a condition for optimality.

This allows us to proceed with a subgradient method, akin to gradient descent! However, the convergence rate of this algorithm is $\mathcal{O}(\frac{1}{\sqrt{k} })$, with $k$ the number of iterations. This is remarkably worse than gradient descent, which is an interesting result in itself.

The algorithm is very simple:

Looking at the LASSO problem and taking the (sub)derivative:

\[\frac{\partial \big (\frac{1}{2N} \|\mathbf{y} - \mathbf{X}\beta\|_2^2 + \lambda \|\beta\|_1 \big )}{\partial \beta} = \frac{1}{N}\big(-\mathbf{X}^{\top}\mathbf{y} + \mathbf{X}^{\top}\mathbf{X}\beta\big) + \lambda \partial \|\beta\|_1.\]

And we would be kind of done. However, the step size is a tricky issue in this case. In fact, this is probably one of the most interesting bits, although we will not dive into the theory here (it will be covered in some other notes).

For a convex and L-smooth function (where the gradient is Lipschitz continuous with constant L), gradient descent is guaranteed to converge to the global optimum using a fixed step size $\eta$, provided that $0<\eta\leq\frac{1}{L}$, at a rate $\mathcal{O}(\frac{1}{k})$. However, this is not the case for the subgradient method. The guarantee of convergence to the optimum in the non-differentiable case requires diminishing step sizes. An easy option that works is, for iteration $k$:

\[\eta_k = \eta_0 / \sqrt{k}.\]

In the LASSO case, this has a clear intuition. For a fixed step we will never reach actual 0s because the subgradient of the absolute value remains constant in magnitude regardless of how close a coefficient is to the origin (which is very different from an l2 norm, for example). Instead of settling, the iterates will simply “bounce” across the axis in a perpetual oscillation of size proportional to $\eta \lambda$.

For example, take 50 samples in 100 dimensions with some small noise, starting at $\eta_0 = 0.1$. Only the first 10 coefficients are non-zero, but the subgradient approach gets no actual zeros!
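A minimal NumPy sketch of the subgradient method with the $\eta_0 / \sqrt{k}$ schedule could look as follows. The function names, the synthetic data (seed, noise level, unit coefficients) and $\lambda = 0.1$ are assumptions made for illustration, not the notebook's actual setup. Since the method is not a descent method, the sketch keeps track of the best iterate seen so far.

```python
import numpy as np

def lasso_objective(X, y, beta, lam):
    """(1 / 2N) * ||y - X beta||^2 + lam * ||beta||_1."""
    n = X.shape[0]
    resid = y - X @ beta
    return resid @ resid / (2 * n) + lam * np.abs(beta).sum()

def lasso_subgradient(X, y, lam, eta0=0.1, n_iter=500):
    """Subgradient method with step size eta0 / sqrt(k)."""
    n, p = X.shape
    beta = np.zeros(p)
    best_beta, best_obj = beta.copy(), lasso_objective(X, y, beta, lam)
    for k in range(1, n_iter + 1):
        # np.sign(0) = 0 is a valid element of the subdifferential [-1, 1].
        g = X.T @ (X @ beta - y) / n + lam * np.sign(beta)
        beta = beta - eta0 / np.sqrt(k) * g
        obj = lasso_objective(X, y, beta, lam)
        # Iterates can get worse, so remember the best one.
        if obj < best_obj:
            best_beta, best_obj = beta.copy(), obj
    return best_beta, best_obj

# Hypothetical data: 50 samples, 100 dimensions, 10 active coefficients.
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 100))
beta_true = np.zeros(100)
beta_true[:10] = 1.0
y = X @ beta_true + 0.01 * rng.standard_normal(50)

beta_hat, obj = lasso_subgradient(X, y, lam=0.1)
n_exact_zeros = int(np.sum(beta_hat == 0.0))  # typically 0 for this method
```

Counting exact zeros with `beta_hat == 0.0` is deliberate: small but non-zero coefficients are precisely the failure mode described above.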

Projected gradient descent

We are back at the constrained formulation of the optimization problem. The idea is to keep $\beta$ inside the set described by the constraint $\|\beta\|_1 \leq t$. That is, while applying the gradient step might take the parameters out of the set, we will push them back into it. In particular, we will find the element in the set that is closest to our current value in terms of Euclidean distance.

The projection operator $P_C$ is defined as:

\[P_C(x) := \arg\min_{z \in C} \frac{1}{2} \|z - x\|_2^2,\]

where $C = \{\beta :\|\beta\|_1 \leq t \}$.

The actual solution to $P_C(x)$ is not trivial (see [2]).

Somewhat surprisingly, the convergence rate of projected gradient descent is the same as gradient descent: $\mathcal{O}(\frac{1}{k})$.

In this case we get ~70 zeroes.
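For reference, here is a sketch of the sort-based projection onto the $\ell_1$ ball in the spirit of [2], plugged into a projected gradient loop. To keep it short it uses a fixed step $1/L$ instead of the Armijo backtracking used in the notebook; the function names and the synthetic data are assumptions for illustration.

```python
import numpy as np

def project_l1_ball(v, t):
    """Euclidean projection of v onto {b : ||b||_1 <= t}, for t > 0."""
    if np.abs(v).sum() <= t:
        return v.copy()
    u = np.sort(np.abs(v))[::-1]            # magnitudes, descending
    css = np.cumsum(u)
    ks = np.arange(1, len(u) + 1)
    # Largest index k (1-based) such that u_k > (css_k - t) / k.
    rho = np.nonzero(u * ks > css - t)[0][-1]
    theta = (css[rho] - t) / (rho + 1.0)    # shrinkage amount
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

def lasso_projected_gradient(X, y, t, n_iter=500):
    """Projected gradient descent with fixed step 1 / L."""
    n, p = X.shape
    step = n / np.linalg.norm(X, 2) ** 2    # L = sigma_max(X)^2 / N
    beta = np.zeros(p)
    for _ in range(n_iter):
        grad = X.T @ (X @ beta - y) / n
        beta = project_l1_ball(beta - step * grad, t)
    return beta

# Hypothetical data, with t set to the l1 norm of the true coefficients.
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 100))
beta_true = np.zeros(100)
beta_true[:10] = 1.0
y = X @ beta_true + 0.01 * rng.standard_normal(50)

beta_hat = lasso_projected_gradient(X, y, t=np.abs(beta_true).sum())
```

Note that the projection is itself a soft-thresholding with a data-dependent threshold `theta`, which is why this method does produce exact zeros.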

Proximal gradient descent

The second most common approach (used in ISTA, FISTA…) is using a proximal operator:

\[\text{Prox}_h(z) := \arg\min_{\theta \in \mathbb{R}^p} \frac{1}{2} \|z - \theta\|_2^2 + h(\theta)\]

If $h(\theta)$ is the indicator function of a set $C$ we recover projected gradient descent, so this is indeed a generalization. The interesting bit is that the solution for that operator when $h(\theta) = \lambda\|\theta\|_1$ is an element-wise soft-thresholding operation. This is extremely fast.

Arriving at the soft-thresholding function is kind of fun:

\[\begin{align*} \frac{1}{2} \|z - \theta\|_2^2 + \lambda\|\theta\|_1 &= \frac{1}{2} \sum_{i=1}^p \left [ (z_i - \theta_i)^2 + 2\lambda |\theta_i|\right ] \\ &= \frac{1}{2} \sum_{i=1}^p \left [ z_i^2 - 2 z_i \theta_i + \theta_i^2 + 2\lambda |\theta_i|\right ] \\ &= \frac{1}{2} \sum_{i=1}^p \left [ - 2 z_i \theta_i + \theta_i^2 + 2\lambda |\theta_i|\right ] + \text{const}. \end{align*}\]

The sum is separable, so we can find the minimum coordinate by coordinate:

\[0 \in - z_i + \theta_i + \lambda \partial |\theta_i|, \quad i = 1, \dots, p.\]

Using the subgradient definition, three options appear:

Case 1: If $\theta_i > 0 \implies -z_i + \theta_i + \lambda = 0 \implies \theta_i = z_i - \lambda$ (Valid only if $z_i > \lambda$)
Case 2: If $\theta_i < 0 \implies -z_i + \theta_i - \lambda = 0 \implies \theta_i = z_i + \lambda$ (Valid only if $z_i < -\lambda$)
Case 3: If $\theta_i = 0 \implies z_i \in \lambda [-1, 1]$ (Valid only if $|z_i| \leq \lambda$)

The three validity conditions are mutually exclusive and cover every possible $z_i$, so the known value of $z_i$ determines which case applies and hence the value of $\theta_i$. Note that $\theta_i$ never takes the opposite sign of $z_i$; otherwise we would not be minimizing! In particular, this boils down to:

\[\text{Prox}_{\lambda \|\cdot\|_1}(z_i) = S_\lambda(z_i) = \text{sign}(z_i) \cdot (|z_i| - \lambda)_+\]

This yields ISTA, with convergence rate $\mathcal{O}(\frac{1}{k})$.

Again, we get ~70 zeroes.
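The soft-thresholding operator and the resulting ISTA loop fit in a few lines. This is a sketch with a fixed step $1/L$ (the notebook uses backtracking instead), and the function names are assumptions. Note that the threshold passed to the prox is `step * lam`, because the prox is applied to the step-scaled penalty.

```python
import numpy as np

def soft_threshold(z, lam):
    """S_lam(z) = sign(z) * max(|z| - lam, 0), applied element-wise."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def ista(X, y, lam, n_iter=500):
    """Proximal gradient descent for the LASSO with fixed step 1 / L."""
    n, p = X.shape
    step = n / np.linalg.norm(X, 2) ** 2    # L = sigma_max(X)^2 / N
    beta = np.zeros(p)
    for _ in range(n_iter):
        grad = X.T @ (X @ beta - y) / n     # gradient of the smooth part
        beta = soft_threshold(beta - step * grad, step * lam)
    return beta

# Sanity check on an identity design, where the solution is S_{N lam}(y).
y_toy = np.array([3.0, -0.5, -2.0])
beta_toy = ista(np.eye(3), y_toy, lam=1.0 / 3.0)
```

With $\mathbf{X} = I$ and $N = 3$, the objective decouples into $\frac{1}{6}(y_i - \beta_i)^2 + \lambda |\beta_i|$, whose minimizer is $S_{3\lambda}(y_i)$, so the toy call should shrink each entry of `y_toy` by 1 and zero out the middle one.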

Coordinate descent

This is the most common approach, implemented in glmnet, used in sklearn and so on. The LASSO objective is convex and its non-smooth part is separable across coordinates, so it can be optimized one parameter at a time:

\[\beta_k^{t+1} = \arg\min_{\beta_k} f(\beta_1^{t}, \beta_2^{t}, \dots, \beta_{k-1}^{t}, \beta_k, \beta_{k+1}^{t}, \dots, \beta_p^{t}).\]

This has a closed-form solution, using an argument very close to the ISTA one based on soft-thresholding. Consider the LASSO problem

\[\min_{\beta} \frac{1}{2N}\sum_{i=1}^{N} \left( y_i - \sum_{k=1}^{p} x_{ik}\beta_k \right)^2 + \sum_{k=1}^{p} \lambda |\beta_k|,\]

define:

\[r_i^{(j)} := y_i - \sum_{k \ne j} x_{ik}\beta_k.\]

Then the problem over each coordinate becomes:

\[\min_{\beta_j} \frac{1}{2N}\sum_{i=1}^{N} (r_i^{(j)} - x_{ij}\beta_j)^2 + \lambda |\beta_j|\]

Now, expanding and taking the derivative:

\[\frac{1}{2N}\sum_{i=1}^{N} \left[ (r_i^{(j)})^2 -2x_{ij}r_i^{(j)}\beta_j + x_{ij}^2\beta_j^2 \right] + \lambda |\beta_j|,\] \[-\frac{1}{N}\sum_{i=1}^{N} x_{ij} r_i^{(j)} + \frac{1}{N}\sum_{i=1}^{N} x_{ij}^2 \beta_j + \lambda \, \partial |\beta_j| =0\]

Let

\[a = \frac{1}{N}\sum_{i=1}^{N} x_{ij}^2, \quad b = \frac{1}{N}\sum_{i=1}^{N} x_{ij} r_i^{(j)}\]

Then

\[a \beta_j + \lambda \partial |\beta_j| = b.\]

Case 1: $\beta_j > 0$

\[\beta_j a + \lambda = b \Rightarrow \beta_j = \frac{b - \lambda}{a}\]

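This is valid only if $b > \lambda$; otherwise $\beta_j \leq 0$ and we have a contradiction, since $a$ and $\lambda$ are positive scalars.

Case 2: $\beta_j < 0$

\[\beta_j a - \lambda = b \Rightarrow \beta_j = \frac{b + \lambda}{a}\]

This is valid only if $b < -\lambda$.

Case 3: $\beta_j = 0$

\[\lambda \partial |\beta_j| \ni b \Rightarrow b \in [-\lambda, \lambda]\]

Together, the three cases lead to the soft-thresholding update: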

\[\hat{\beta}_j = \frac{ S_\lambda \left( \frac{1}{N}\sum_{i=1}^{N} r_i^{(j)} x_{ij} \right) }{ \frac{1}{N}\sum_{i=1}^{N} x_{ij}^2 }\]

This approach is nice because it finds the optimal solution coordinate-wise (conditioned on the other parameters). While the implementation we use is naive and consists of repeatedly applying the above equation in a nested loop over iterations and parameters, SOTA implementations exploit the sparsity in cool ways to make this faster.
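A naive version of that nested loop could be sketched as below. The names are assumptions, updates are applied in place (each coordinate sees the latest values of the others), and the only trick is keeping the full residual up to date so that $b$ costs $\mathcal{O}(N)$ per coordinate.

```python
import numpy as np

def soft_threshold(z, lam):
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def lasso_coordinate_descent(X, y, lam, n_sweeps=100, tol=1e-10):
    """Cyclic coordinate descent for (1/2N)||y - X b||^2 + lam ||b||_1."""
    n, p = X.shape
    beta = np.zeros(p)
    resid = y.astype(float).copy()          # y - X @ beta, kept up to date
    a = (X ** 2).sum(axis=0) / n            # a_j = (1/N) sum_i x_ij^2
    for _ in range(n_sweeps):
        max_change = 0.0
        for j in range(p):
            if a[j] == 0.0:                 # all-zero column: nothing to fit
                continue
            # b = (1/N) sum_i x_ij r_i^(j), using r^(j) = resid + x_j beta_j.
            b = X[:, j] @ resid / n + a[j] * beta[j]
            new = soft_threshold(b, lam) / a[j]
            resid -= X[:, j] * (new - beta[j])
            max_change = max(max_change, abs(new - beta[j]))
            beta[j] = new
        if max_change < tol:                # a full sweep changed nothing
            break
    return beta

# Same identity-design sanity check as for ISTA: solution is S_{N lam}(y).
beta_toy = lasso_coordinate_descent(np.eye(3), np.array([3.0, -0.5, -2.0]), lam=1.0 / 3.0)
```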

Putting it all together

We can check the performance of the algorithms by evaluating, at each optimization step, the fit to test data and the distance from the actual parameters.

Note that projected and proximal gradient descent were implemented with backtracking using Armijo’s rule.

Coordinate descent is, in this case, much faster, reaching the tolerance level of $1 \times 10^{-16}$ before the 500-iteration limit, although the comparison is not fair because the iterations are not really comparable. Regardless, it took less time than the others.

Note that projected gradient uses the perfect “penalty” parameter $t = \|\beta\|_1$ (computed from the true coefficients), whereas $\lambda$ was found by trial and error.

In any case, the limitations of my NumPy implementations likely dominate many of the observed performance differences. The only robust conclusion is that the subgradient method is clearly suboptimal for the vanilla LASSO problem and offers little practical justification here. In particular, one should be careful not to ignore the non-differentiability of the l1 norm. For example, performing backpropagation + “gradient descent” through an absolute value function (PyTorch will allow it without complaint) perhaps deserves some thought!

References

[1] Hastie, T., Tibshirani, R., & Wainwright, M. (2015). Statistical Learning with Sparsity.

[2] Duchi, J., Shalev-Shwartz, S., Singer, Y., & Chandra, T. (2008). Efficient Projections onto the ℓ1-Ball for Learning in High Dimensions.

[3] Tibshirani, R. Subgradient Method lecture notes; Gradient descent lecture notes.

Inequality constrained optimization (SVM “the technical problem”).

2026-01-06T00:00:00+00:00

All the code for this post can be found here!

From Optimal Separating Hyperplane to Support Vector Machines

Previously we already explained the basics of Optimal Separating Hyperplanes. There we considered the problem of optimally separating hyperplanes also with expansions of the input space, which allowed perfect separability of the classes. That extension was already in Support Vector Machines (SVMs) territory, but we did not consider the soft margin. Furthermore, there we used an off-the-shelf library for the QP problem. Here we slightly generalize the framework to that of the SVMs. The focus here, however, is on the optimization part of the problem, on the “technical” side. We will dive in quite a lot of detail into inequality constrained optimization theory for convex problems, with the SVM problem being just an illustration (a rather simple one) of the applicability of those algorithms.

In essence, an SVM is an optimal separating hyperplane with two additions. First, the mapping of the input space to a high-dimensional feature space (e.g., polynomial expansion of degree $d$). Second, soft margin classification, which allows non-perfect separability of the classes.

After solving for the primal and plugging in the solution, we have the following dual problem for the optimal separating hyperplane:

\[\begin{align*} \text{maximize}_{\alpha} \quad & \sum_{i=1}^n \alpha_i - \frac{1}{2} \sum_{i=1}^n \sum_{j=1}^n \alpha_i \alpha_j y_i y_j \mathbf{x}_i^\top \mathbf{x}_j \\ \text{subject to} \quad & \sum_{i=1}^n \alpha_i y_i = 0, \\ & \alpha_i \geq 0, \quad i = 1, \dots, n. \end{align*}\]

Generalizing this to the soft margin hyperplane is very simple, introducing a complexity parameter $C$:

\[\begin{align*} \text{maximize}_{\alpha} \quad & \sum_{i=1}^n \alpha_i - \frac{1}{2} \sum_{i=1}^n \sum_{j=1}^n \alpha_i \alpha_j y_i y_j \mathbf{x}_i^\top \mathbf{x}_j \\ \text{subject to} \quad & \sum_{i=1}^n \alpha_i y_i = 0, \\ & 0 \le \alpha_i \le C, \quad i = 1, \dots, n. \end{align*}\]

Intuitively, $C$ is an upper bound on the Lagrange multiplier associated with an observation. It limits the importance of the constraint of properly classifying each point. The higher $C$, the more costly it is to misclassify; with a very big $C$ we recover the hard-margin classifier because the “natural” Lagrange multipliers will be below $C$.
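To make the dual concrete, here is a small sketch that builds $Q_{ij} = y_i y_j \mathbf{x}_i^\top \mathbf{x}_j$ and evaluates the dual objective and the constraints. The helper names and the four-point toy dataset are assumptions for illustration. For these points the hard-margin solution can be worked out by hand: $w = (0.5, 0.5)$, with the two inner points as support vectors.

```python
import numpy as np

def dual_matrix(X, y):
    """Q_ij = y_i y_j x_i^T x_j (linear kernel)."""
    return np.outer(y, y) * (X @ X.T)

def dual_objective(alpha, Q):
    """sum_i alpha_i - 0.5 * alpha^T Q alpha."""
    return alpha.sum() - 0.5 * alpha @ Q @ alpha

def is_feasible(alpha, y, C, atol=1e-9):
    """Check y^T alpha = 0 and 0 <= alpha <= C up to a tolerance."""
    in_box = np.all(alpha >= -atol) and np.all(alpha <= C + atol)
    return bool(abs(y @ alpha) <= atol and in_box)

# Toy data on a line through the origin; classes split by sign.
X = np.array([[1.0, 1.0], [2.0, 2.0], [-1.0, -1.0], [-2.0, -2.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
Q = dual_matrix(X, y)

alpha = np.array([0.25, 0.0, 0.25, 0.0])   # hand-computed optimum
w = (alpha * y) @ X                        # w = sum_i alpha_i y_i x_i
value = dual_objective(alpha, Q)
```

Here `value` equals $\frac{1}{2}\|w\|^2 = 0.25$, as strong duality requires, and the solution does not change for any $C \geq 0.25$.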

The combination of basis expansion with maximum margin separation is extremely interesting from the statistical learning perspective. The conceptual motivation for SVMs is a very rich topic. Rather than selecting a hypothesis by directly controlling model complexity through the number of parameters or network architecture, SVMs introduce an inductive bias based on margin maximization. The inductive step is taken by restricting attention to classifiers that achieve zero empirical risk on the training data (in the separable case), and among these, choosing the one that minimizes an upper bound on the probability of misclassification on unseen data. This shift, from fitting capacity to maximum margin, marks a fundamental departure from more traditional forms of inductive bias.

We will dive into the “conceptual” problem of SVMs in a future post.

For now, let’s dive into inequality constrained convex optimization. Note that an explanation of Newton’s method for equality constrained optimization is available here. We build from there.

Inequality constrained optimization

Often when looking into how SVM works, at some point a QP solver is invoked. We will code that explicitly. Here we illustrate two interior point methods and an ad-hoc algorithm for SVMs. In the first two points we loosely follow chapter 11 from [1].

log-barrier method

It is not immediately clear how one goes about solving an inequality constrained minimization problem. Equality constrained minimization seems easier: we just have to make updates orthogonal to the rows of the constraint matrix. Perhaps the first thing that one can try is to add a penalty that goes to $\infty$ as we approach the boundary of the feasible set, assuming we are minimizing (wlog).

The Log Barrier method is a rigorous approach in that direction.

Define the barrier functional:

\[\phi(x) = - \sum_{i=1}^m \log{(-f_i(x))}.\]

This function clearly goes to $\infty$ when any $f_i(x) \rightarrow 0^-$, and it is only defined for $f_i(x) < 0$. Hence, this is a perfect penalty to add to our objective function, transforming the inequality constrained optimization problem into an equality constrained optimization one. However, to not change the objective we want to optimize, we need to make the penalty very small when the parameters are feasible. To do that, the barrier is multiplied by a “small” constant $t > 0$.

The gradient and Hessian of $\phi(x)$ are very important:

\[\begin{align*} \nabla \phi(x) &= -\sum_{i=1}^m \frac{1}{f_i(x)} \, \nabla f_i(x) \\ \nabla^2 \phi(x) &= \sum_{i=1}^m \frac{1}{f_i(x)^2} \, \nabla f_i(x) \, \nabla f_i(x)^\top - \sum_{i=1}^m \frac{1}{f_i(x)} \, \nabla^2 f_i(x) \end{align*}\]

For the SVM case we have two inequality constraints $f^i_1(\alpha) = -\alpha_i \leq 0$ ($\alpha$ must be positive) and $f^i_2(\alpha) = \alpha_i - C \leq 0$ ($\alpha$ must be below $C$). In this case the barrier functional is (note that we are maximizing, so the first sign is changed to look for $-\infty$):

\[\phi(\alpha) = \sum_{i=1}^n \big( \log{(-f^i_1(\alpha))} + \log{( - f^i_2(\alpha))} \big) = \sum_{i=1}^n \big( \log{\alpha_i} + \log{(C - \alpha_i)} \big).\]

Hence, the new centralizer problem is:

\[\begin{align*} \text{maximize}_{\alpha} \quad & \sum_{i=1}^n \alpha_i - \frac{1}{2} \sum_{i=1}^n \sum_{j=1}^n \alpha_i \alpha_j y_i y_j \mathbf{x}_i^\top \mathbf{x}_j + t \sum_{i=1}^n \Big( \log \alpha_i + \log (C - \alpha_i) \Big) \\ \text{subject to} \quad & \sum_{i=1}^n \alpha_i y_i = 0, \end{align*}\]
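Since sign mistakes are easy to make here, a finite-difference check of the SVM barrier derivatives is cheap insurance. The sketch below (function names and test point are assumptions) compares the analytic gradient $\frac{1}{\alpha} - \frac{1}{C - \alpha}$ and the diagonal Hessian of $\sum_i \log \alpha_i + \log(C - \alpha_i)$ against central differences.

```python
import numpy as np

def barrier(alpha, C):
    """sum_i log(alpha_i) + log(C - alpha_i), defined for 0 < alpha < C."""
    return np.sum(np.log(alpha) + np.log(C - alpha))

def barrier_grad(alpha, C):
    return 1.0 / alpha - 1.0 / (C - alpha)

def barrier_hess(alpha, C):
    # Negative definite: the barrier is concave in the maximization form.
    return -np.diag(1.0 / alpha**2 + 1.0 / (C - alpha) ** 2)

C = 1.0
alpha = np.array([0.2, 0.5, 0.9])           # arbitrary interior point
eps = 1e-6

# Central differences along each coordinate direction.
num_grad = np.array([
    (barrier(alpha + eps * e, C) - barrier(alpha - eps * e, C)) / (2 * eps)
    for e in np.eye(3)
])
num_hess = np.array([
    (barrier_grad(alpha + eps * e, C) - barrier_grad(alpha - eps * e, C)) / (2 * eps)
    for e in np.eye(3)
])
```

Both `num_grad` and `num_hess` should agree with the analytic expressions up to finite-difference error; note how the Hessian entry for $\alpha_i = 0.9$ is already two orders of magnitude larger than for $\alpha_i = 0.5$.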

Clearly, for each value of $t$ we have a different solution for this problem; we call them $\alpha^*(t)$. Now, one could choose a very small $t$ and just run Newton’s method for equality constraints. However, this is unlikely to work well, because the Hessian is likely to vary very rapidly when the values of the inequality constraint functions are close to 0 (as per the Hessian definition above). In that case, the Lipschitz constant of the Hessian is huge, and we may need many iterations (to find a region where the gradient is already tiny) to reach the quadratically convergent phase of Newton’s method. Hence, the idea is to decrease $t$ “slowly” and initialize Newton’s method at the previous solution, effectively having an outer loop decreasing $t$ and an inner loop solving the respective equality constrained problems starting at the solution of the previous iteration.

In any case, it is not completely obvious that this should yield the actual optimum for the original inequality constrained problem. Nevertheless, this is actually the case. In particular, for the general minimization case (assuming $f_0$ convex and differentiable, affine equality constraints ($h_i$) and convex inequality constraints ($f_i$)):

\[\lim_{t \rightarrow 0} f_0(x^*(t)) = f_0(x^*).\]

When $t$ is very small, then the objective value associated to the optimal parameters of the centralizer problem does indeed converge to the objective value of the optimal parameters in the original problem. To see this, note the following parallelism between the stationarity condition of the Lagrangian of the centralizer problem and the stationary condition of the original problem. For the centralizer problem:

\[\nabla f_0(x^*(t)) + t \nabla \phi (x^*(t)) + A^{\top}\nu^*(t) = 0,\]

for the original problem:

\[\nabla f_0(x^*) + \sum_{i=1}^m \lambda^*_i \nabla f_i(x^*) + A^{\top}\nu^* = 0.\]

Since $\nabla \phi(x) = -\sum_{i=1}^m \frac{1}{f_i(x)} \, \nabla f_i(x)$ then we have that we can define

\[\begin{equation} \lambda_i^*(t) = -\frac{t}{f_i(x^*(t))}, \quad i = 1, \dots, m, \label{eq:subst} \end{equation}\]

then the dual becomes (assuming $x^*(t)$ is feasible)

\[f_0(x^*) \geq g(\lambda^*(t), \nu^*(t)) = f_0(x^*(t)) + \sum_{i=1}^m \lambda_i^*(t) f_i(x^*(t)) + \nu^{*}(t)^\top (Ax^*(t) - b) = f_0(x^*(t)) - mt.\]

Hence, $x^*(t)$ is only $mt$ suboptimal so indeed the intuition is correct and we will recover the optimal solution. Note that, without strict convexity, $x^*$ might not be unique!
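As a sanity check of the $mt$ bound, here is a one-dimensional toy problem (an illustration, not taken from [1]) where everything can be solved by hand. With a single constraint, $m = 1$, the central point is exactly $t$ away from the optimum:

```latex
\begin{align*}
\text{minimize}_{x} \quad & x \quad \text{subject to} \quad f_1(x) = -x \le 0,
  \qquad x^* = 0,\ f_0(x^*) = 0, \\
\text{centering problem:} \quad & \min_x \; x - t \log(x)
  \;\Rightarrow\; 1 - \frac{t}{x} = 0 \;\Rightarrow\; x^*(t) = t, \\
\text{multiplier:} \quad & \lambda_1^*(t) = -\frac{t}{f_1(x^*(t))} = -\frac{t}{-t} = 1, \\
\text{gap:} \quad & f_0(x^*(t)) - f_0(x^*) = t = mt \quad \text{with } m = 1.
\end{align*}
```

Here the dual function is $g(\lambda) = \inf_x (1 - \lambda)x$, which equals $0$ at $\lambda = 1$, matching $f_0(x^*(t)) - mt = t - t = 0$.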

Finally, note that we always need to work with a feasible $x$. This is again not trivial; in fact, it requires an optimization step in itself (solving $Ax = b$ for $x \leq 0$ is as hard as an LP!). This boils down to solving:

\[\begin{aligned} \text{minimize}_{x, s} \quad & s \\ \text{subject to} \quad & Ax = b, \\ & f_i(x) \le s, \quad i = 1,\dots,m. \end{aligned}\]

Which is AGAIN an inequality constrained optimization problem. However, it suffices to solve this once with a “big” $t$.

The algorithm pieces for the general case

We now derive Newton’s equations for the general case.

Phase I (finding a feasible initial point)

Construct the Lagrangian for a big $t$:

\[L(x, s, \nu) = s - t\sum_{i = 1}^m \log(s - f_i(x)) + \nu^{\top}(Ax -b).\]

Gradients and Hessians:

\[\begin{align*} \nabla_x L(x, s, \nu) &= \sum_{i=1}^m \frac{t}{s - f_i(x)} \nabla f_i(x) + A^\top \nu, \\ \nabla_s L(x, s, \nu) &= 1 - \sum_{i=1}^m \frac{t}{s - f_i(x)}, \\ \nabla_\nu L(x, s, \nu) &= Ax - b, \end{align*}\] \[\begin{align*} \nabla_x^2 L(x, s, \nu) &= \sum_{i=1}^m \left( \frac{t}{s - f_i(x)}\nabla^2 f_i(x) + \frac{t}{(s - f_i(x))^2} \nabla f_i(x) \nabla f_i(x)^\top \right), \\ \nabla_s^2 L(x, s, \nu) &= \sum_{i=1}^m \frac{t}{(s - f_i(x))^2}, \\ \nabla_{xs}^2 L(x, s, \nu) &= - \sum_{i=1}^m \frac{t}{(s - f_i(x))^2} \nabla f_i(x),\\ \nabla_{\nu x}^2 L(x, s, \nu) &= A, \\ \nabla_{\nu s}^2 L(x, s, \nu) &= 0, \\ \nabla_{\nu \nu}^2 L(x, s, \nu) &= 0. \end{align*}\]

Now, putting it all together:

\[\begin{bmatrix} \nabla_x^2 L & \nabla_{xs}^2 L & A^\top \\ (\nabla_{xs}^2 L)^\top & \nabla_s^2 L & 0 \\ A & 0 & 0 \end{bmatrix} \begin{bmatrix} \Delta x \\ \Delta s \\ \Delta \nu \end{bmatrix} = - \begin{bmatrix} \nabla_x L \\ \nabla_s L \\ \nabla_\nu L \end{bmatrix}\]

Centering step

\[L(x, \nu) = f_0(x) - t\sum_{i = 1}^m \log(-f_i(x)) + \nu^{\top}(Ax -b).\] \[\begin{bmatrix} \nabla^2 f_0(x) + t \nabla^2 \phi(x) & A^\top \\ A & 0 \end{bmatrix} \begin{bmatrix} \Delta x \\ \Delta \nu \end{bmatrix} = - \begin{bmatrix} \nabla f_0(x) + t \nabla \phi(x) + A^\top \nu \\ Ax - b \end{bmatrix}\]

However, since $x$ is feasible, we can focus on the simpler system (this simplification is not applied below, with no effect besides making the whole description and the code a bit more cumbersome):

\[\begin{bmatrix} \nabla^2 f_0(x) + t \nabla^2 \phi(x) & A^\top \\ A & 0 \end{bmatrix} \begin{bmatrix} \Delta x \\ \nu \end{bmatrix} = - \begin{bmatrix} \nabla f_0(x) + t \nabla \phi(x) \\ 0 \end{bmatrix}\]

log-barrier Method SVM algorithm

Now we’ve got everything we need. Bear in mind that we are optimizing the dual of the SVM, so we have a maximization problem! Furthermore, note that there are only 2 inequality constraints per coordinate and 1 equality constraint, acting on $\alpha$ as a vector.

First, let’s get the derivatives and Hessians for the Phase I and central path problems.

The Lagrangian for Phase I is:

\[L(\alpha, s, \nu) = s - t \sum_{i=1}^n \left( \log(s + \alpha_i) + \log(s + C - \alpha_i) \right) + \nu (\mathbf{y}^\top \alpha).\] \[\begin{align*} \nabla_\alpha L &= -t \left( \frac{1}{s + \alpha} - \frac{1}{s + C - \alpha} \right) + \nu \mathbf{y} \quad \text{Using elementwise division, } \nu \in \mathbb{R} \text{ (scalar)},\\ \nabla_s L &= 1 - t \sum_{i=1}^n \left( \frac{1}{s + \alpha_i} + \frac{1}{s + C - \alpha_i} \right), \\ \nabla_\nu L &= \mathbf{y}^\top \alpha. \end{align*}\]

Let $d_1 = \frac{1}{(s+\alpha)^2}$ and $d_2 = \frac{1}{(C + s -\alpha)^2}$ elementwise!

\[\begin{align*} \nabla_{\alpha \alpha}^2 L &= t \, \text{diag}(d_1 + d_2), \\ \nabla_{ss}^2 L &= t \sum_{i=1}^n (d_1 + d_2), \\ \nabla_{\alpha s}^2 L &= t (d_1 - d_2). \end{align*}\]

Finally:

\[\begin{bmatrix} t \, \text{diag}(d_1 + d_2) & t(d_1 - d_2) & \mathbf{y} \\ t(d_1 - d_2)^\top & t \sum (d_1 + d_2) & 0 \\ \mathbf{y}^\top & 0 & 0 \end{bmatrix} \begin{bmatrix} \Delta \alpha \\ \Delta s \\ \Delta \nu \end{bmatrix} = - \begin{bmatrix} \nabla_\alpha L \\ \nabla_s L \\ \mathbf{y}^\top \alpha \end{bmatrix}.\]

The Lagrangian for the Central Path problems is:

\[L(\alpha, \nu) = \mathbf{1}^\top \alpha - \frac{1}{2} \alpha^\top Q \alpha + t \sum_{i=1}^n \Big( \log \alpha_i + \log (C - \alpha_i) \Big) + \nu (\mathbf{y}^\top \alpha),\]

where $Q_{ij} = y_i y_j \mathbf{x}_i^\top \mathbf{x}_j$. Note the change of sign, the penalty goes negative because we are maximizing.

Again, gradients, Hessians, and Newton’s matrix (abusing a bit the elementwise operations…):

\[\begin{align*} \nabla_{\alpha} L &= \mathbf{1} - Q\alpha + t \left( \frac{1}{\alpha} - \frac{1}{C - \alpha} \right) + \nu \mathbf{y}, \\ \nabla_\nu L &= \mathbf{y}^\top \alpha, \end{align*}\]

Let $d_1 = \frac{1}{\alpha^2}$ and $d_2 = \frac{1}{(C -\alpha)^2}$ elementwise!

\[\begin{align*} \nabla_{\alpha \alpha}^2 L &= -Q - t \, \text{diag}(d_1 + d_2), \\ \nabla_{\nu \alpha}^2 L &= \mathbf{y}^\top ,\\ \nabla_{\nu \nu}^2 L &= 0. \\ \end{align*}\]

And finally

\[\begin{bmatrix} -Q - t \, \text{diag}(d_1 + d_2) & \mathbf{y} \\ \mathbf{y}^\top & 0 \end{bmatrix} \begin{bmatrix} \Delta \alpha \\ \Delta \nu \end{bmatrix} = - \begin{bmatrix} \nabla_\alpha L \\ \nabla_\nu L \end{bmatrix}\]

Now all the math is ready, the algorithm then is:

In the above demo we used RBF kernel for coolness purposes, but that has actually nothing to do with the optimization problem since it is literally just a different $Q$. In the code implementation and the above algorithm we are updating $\nu$. This is not necessary and is actually not part of the canonical log-barrier method which is strictly a primal algorithm. Here we are doing a primal-dual algorithm (because of $\nu$), even though this is not necessary since $x$ is guaranteed to be feasible.
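A compact sketch of the centering loop for the SVM dual is given below. It is not the notebook's implementation: it uses the simplified system for feasible iterates (so $\nu$ is solved for but never accumulated), and it replaces Phase I with a hand-built strictly feasible start, which works whenever both classes are present. The function names, default parameters and the toy data are assumptions.

```python
import numpy as np

def barrier_objective(alpha, Q, C, t):
    """Centering objective (maximized) for a strictly feasible alpha."""
    return (alpha.sum() - 0.5 * alpha @ Q @ alpha
            + t * np.sum(np.log(alpha) + np.log(C - alpha)))

def newton_direction(alpha, Q, y, C, t):
    """Solve the simplified KKT system; assumes y @ alpha = 0."""
    n = alpha.size
    grad = 1.0 - Q @ alpha + t * (1.0 / alpha - 1.0 / (C - alpha))
    hess = -Q - t * np.diag(1.0 / alpha**2 + 1.0 / (C - alpha) ** 2)
    kkt = np.block([[hess, y[:, None]], [y[None, :], np.zeros((1, 1))]])
    sol = np.linalg.solve(kkt, -np.concatenate([grad, [0.0]]))
    return sol[:n], grad

def svm_log_barrier(Q, y, C, t0=1.0, mu=0.1, t_min=1e-6, max_newton=50):
    n_pos, n_neg = np.sum(y > 0), np.sum(y < 0)
    # Strictly feasible start: 0 < alpha < C and y @ alpha = 0.
    alpha = 0.5 * C * min(n_pos, n_neg) / np.where(y > 0, n_pos, n_neg)
    t = t0
    while t > t_min:                         # outer loop: shrink t
        for _newton in range(max_newton):    # inner loop: centering
            delta, grad = newton_direction(alpha, Q, y, C, t)
            increase = grad @ delta          # squared Newton decrement
            if increase < 1e-12:
                break
            f_old, step = barrier_objective(alpha, Q, C, t), 1.0
            for _ls in range(60):            # backtracking (Armijo)
                cand = alpha + step * delta
                if (np.all(cand > 0) and np.all(cand < C)
                        and barrier_objective(cand, Q, C, t)
                        >= f_old + 0.25 * step * increase):
                    alpha = cand
                    break
                step *= 0.5
        t *= mu
    return alpha

# Toy problem with known solution alpha = (0.25, 0, 0.25, 0).
X = np.array([[1.0, 1.0], [2.0, 2.0], [-1.0, -1.0], [-2.0, -2.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
Q = np.outer(y, y) * (X @ X.T)
alpha_hat = svm_log_barrier(Q, y, C=10.0)
```

The line search checks feasibility before evaluating the objective, so the logarithms are never taken at a point outside the box.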

Primal-Dual algorithm

The log-barrier method can be interpreted as solving a modified KKT system in each inner loop. A “continuous deformation” that indeed converges to the final canonical KKT system. In particular, again for the standard minimization problem:

\[\begin{aligned} Ax &= b, \\ f_i(x) &\le 0, \quad i = 1,\dots,m, \\ \lambda &\ge 0, \\[6pt] \nabla f_0(x) + \sum_{i=1}^m \lambda_i \nabla f_i(x) + A^{T}\nu &= 0, \\[6pt] -\lambda_i f_i(x) &= t, \quad i = 1,\dots,m. \end{aligned}\]

The only difference is in the complementary slackness condition. In the log-barrier method $\lambda$ is eliminated via substitution (equation \ref{eq:subst}). The idea now is to solve the modified KKT system directly, explicitly optimizing $\lambda$. This greatly simplifies things. First, we are now allowed to work with infeasible values for the primal and dual variables; hence, we skip Phase I altogether (which should make this already trivially twice as fast…). Convergence will ensure feasibility, but it is not a requirement at the start! (Which is why below the evolution of the objective function is not monotonically increasing.) Second, there is only one loop: $t$ is updated at the same time as $\lambda, x, \nu$.

Let’s check the theory again for the standard case and then we fill it in with the SVM problem specifics. The modified KKT equation to solve is:

\[\begin{bmatrix} \nabla^2 f_0(x) + \sum_{i=1}^m \lambda_i \nabla^2 f_i(x) & Df(x)^T & A^T \\ -\mathbf{diag}(\lambda)Df(x) & -\mathbf{diag}(f(x)) & 0 \\ A & 0 & 0 \end{bmatrix} \begin{bmatrix} \Delta x \\ \Delta \lambda \\ \Delta \nu \end{bmatrix} = - \begin{bmatrix} \nabla f_0(x) + Df(x)^{T}\lambda + A^{T}\nu \\ - \operatorname{diag}(\lambda)\, f(x) - t\mathbf{1} \\ Ax - b \end{bmatrix}\]

Define $\hat{\eta}(x, \lambda) = - \sum_{i=1}^m f_i(x)\lambda_i$. If the residual vanishes, that is, if

\[\bigg\|\begin{bmatrix} \nabla f_0(x) + Df(x)^{T}\lambda + A^{T}\nu \\ - \operatorname{diag}(\lambda)\, f(x) - t\mathbf{1} \\ Ax - b \end{bmatrix}\bigg\|_2 = 0,\]

then $\hat{\eta}(x, \lambda) = mt$. If $\lambda, x, \nu$ satisfy the modified KKT conditions, then $mt$ is the duality gap. Note that then $t = \frac{\hat{\eta}(x, \lambda)}{m}$. And here we have the idea for the algorithm: we are going to decrease the duality gap by decreasing $t$ (as before, using $\mu \in (0, 1)$) and at the same time force feasibility by minimizing the residual. ¹

Now, turning back to SVMs.

Residuals:

\[\begin{align*} r_{\text{dual}} &= \mathbf{1} - Q \alpha - \begin{bmatrix} -\mathbf{I} \\ \mathbf{I} \end{bmatrix}^{\top} \lambda + \nu \mathbf{y} \\ r_{\text{cent}} &= -\text{diag}(\lambda)\begin{bmatrix} -\alpha \\ \alpha - C \end{bmatrix} -t\mathbf{1} \quad \lambda \text{ is positive and the function values are negative (when feasible)}\\ r_{\text{pri}} &= \mathbf{y}^\top \alpha. \end{align*}\]

So the KKT system, whose matrix is the Jacobian of the residual vector with respect to $(\alpha, \lambda, \nu)$, is:

\[\begin{bmatrix} - Q & -\begin{bmatrix}-I \\ I \end{bmatrix}^\top & \mathbf{y} \\ -\text{diag}(\lambda) \begin{bmatrix}-I \\ I \end{bmatrix} & -\text{diag}\begin{bmatrix} -\alpha \\ \alpha - C \end{bmatrix} & 0 \\ \mathbf{y}^\top & 0 & 0 \end{bmatrix} \begin{bmatrix} \Delta \alpha \\ \Delta \lambda \\ \Delta \nu \end{bmatrix} = -\begin{bmatrix} \mathbf{1} - Q \alpha - \begin{bmatrix} -\mathbf{I} \\ \mathbf{I} \end{bmatrix}^{\top} \lambda + \nu \mathbf{y} \\ -\text{diag}(\lambda)\begin{bmatrix} -\alpha \\ \alpha - C \end{bmatrix} -t\mathbf{1}\\ \mathbf{y}^\top \alpha \end{bmatrix}.\]

Note that, while in the log-barrier method we effectively inserted the inequality constraint in the Lagrangian and then continued (sort of) like in equality constrained optimization, here we are NOT working on finding stationary points of any Lagrangian! Here everything follows from the residual vector. In other words, the modified KKT conditions here do not arise from the Lagrangian.

The line search must ensure that the $\lambda$ stays positive and the constraint functions negative. It also ensures that the norm of the residuals decreases sufficiently. From [1] §11.7:

One iteration of the primal-dual interior-point algorithm is the same as one step of the infeasible Newton method, applied to solving $\, r_t(x, \lambda, \nu) = 0$, but modified to ensure $\, \lambda > 0$ and $\, f(x) < 0$.

Finally, the algorithm:

Coding-wise, this felt much cleaner than the log-barrier method…
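For completeness, a sketch of the whole primal-dual loop for the SVM dual follows. It takes the residuals in the maximization form used above and builds the Newton matrix as the Jacobian of the residual vector with respect to $(\alpha, \lambda, \nu)$, so that the $\Delta\lambda$ column carries the signs obtained by differentiating $r_{\text{dual}}$ and $r_{\text{cent}}$. The function names, the starting point, the line-search constants and the toy data are assumptions, not the notebook's code.

```python
import numpy as np

def residual(alpha, lam, nu, Q, y, C, t):
    n = alpha.size
    f = np.concatenate([-alpha, alpha - C])          # constraint values
    # -[-I; I]^T lam = lam[:n] - lam[n:]
    r_dual = 1.0 - Q @ alpha + lam[:n] - lam[n:] + nu * y
    r_cent = -lam * f - t
    return np.concatenate([r_dual, r_cent, [y @ alpha]])

def kkt_matrix(alpha, lam, Q, y, C):
    n = alpha.size
    G = np.vstack([-np.eye(n), np.eye(n)])           # Df(alpha)
    f = np.concatenate([-alpha, alpha - C])
    top = np.hstack([-Q, -G.T, y[:, None]])
    mid = np.hstack([-lam[:, None] * G, -np.diag(f), np.zeros((2 * n, 1))])
    bot = np.hstack([y[None, :], np.zeros((1, 2 * n + 1))])
    return np.vstack([top, mid, bot])

def svm_primal_dual(Q, y, C, mu=0.1, tol=1e-8, max_iter=200):
    n = y.size
    alpha = np.full(n, 0.5 * C)      # inside the box; y @ alpha may be != 0
    lam, nu = np.ones(2 * n), 0.0
    for _it in range(max_iter):
        f = np.concatenate([-alpha, alpha - C])
        gap = -f @ lam                               # surrogate duality gap
        t = mu * gap / (2 * n)
        r = residual(alpha, lam, nu, Q, y, C, t)
        if gap < tol and np.linalg.norm(np.concatenate([r[:n], r[-1:]])) < tol:
            break
        d = np.linalg.solve(kkt_matrix(alpha, lam, Q, y, C), -r)
        d_alpha, d_lam, d_nu = d[:n], d[n:3 * n], d[-1]
        # Largest step keeping lam > 0, then backtrack on box and residual.
        neg = d_lam < 0
        s = 0.99 * (min(1.0, np.min(-lam[neg] / d_lam[neg])) if neg.any() else 1.0)
        for _ls in range(60):
            a_new, l_new, n_new = alpha + s * d_alpha, lam + s * d_lam, nu + s * d_nu
            inside = np.all(a_new > 0) and np.all(a_new < C)
            r_new = residual(a_new, l_new, n_new, Q, y, C, t)
            if inside and np.linalg.norm(r_new) <= (1 - 0.01 * s) * np.linalg.norm(r):
                break
            s *= 0.5
        alpha, lam, nu = a_new, l_new, n_new
    return alpha, lam, nu

# Toy problem with known solution alpha = (0.25, 0, 0.25, 0).
X = np.array([[1.0, 1.0], [2.0, 2.0], [-1.0, -1.0], [-2.0, -2.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
Q = np.outer(y, y) * (X @ X.T)
alpha_hat, lam_hat, nu_hat = svm_primal_dual(Q, y, C=10.0)
```

There is a single loop and no Phase I: only $0 < \alpha < C$ and $\lambda > 0$ are enforced along the way, while $\mathbf{y}^\top \alpha = 0$ is reached through the residual.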

Sequential minimal optimization (SMO)

Derivation of the update equations

In 1998 John C. Platt proposed an ad hoc algorithm for SVMs. The idea is basically a coordinate descent approach but with a catch. Again we are going to be dealing with the dual function. Optimizing one multiplier at a time is not possible because the linear equality constraint provides an identity that we have to respect. Imagine we are looking to optimize observation j:

\[- \sum_{i \neq j} y_i\alpha_i = \alpha_j y_j,\]

so we do not have any univariable direction of improvement that respects feasibility. So what now? Turns out that allowing for 2 degrees of freedom at each step actually allows increasing the dual function. The drawback is that then we have $N^2$ possible pairs to optimize, so a naive loop is absolutely terrible. Furthermore, in a real world scenario where SVMs generalize decently it is likely that $\alpha$ will be quite a sparse vector, so there may be updates that do not even increase the dual. In any case, heuristics are needed to solve this.

Let’s first derive the update rules, following [3] [4]. The aim is to exploit a closed form solution for

\[\begin{align*} \text{maximize}_{\alpha} \quad & \sum_{i=1}^n \alpha_i - \frac{1}{2} \sum_{i=1}^n \sum_{j=1}^n \alpha_i \alpha_j y_i y_j \mathbf{x}_i^\top \mathbf{x}_j \\ \text{subject to} \quad & \sum_{i=1}^n \alpha_i y_i = 0, \\ & 0 \le \alpha_i \le C, \quad i = 1, \dots, n. \end{align*}\]

when all $\alpha$ are held fixed but two; let’s call those two $\alpha_i$ and $\alpha_j$. Then the pair-dual is:

\[D(\alpha_i, \alpha_j) = \alpha_i + \alpha_j +\sum_{k \neq i,j}^n \alpha_k - \frac{1}{2} \left[ \alpha_i^2 K_{ii} + \alpha_j^2 K_{jj} + 2 y_i y_j \alpha_i \alpha_j K_{ij} \right] - y_i \alpha_i \sum_{k \neq i,j}^n y_k \alpha_k K_{ik} - y_j \alpha_j \sum_{k \neq i,j}^n y_k \alpha_k K_{jk} - \frac{1}{2} \sum_{k \neq i,j}^n \sum_{m \neq i,j}^n y_k y_m \alpha_k \alpha_m K_{km}.\]

Where $K_{lm}$ is the inner product of $\phi(x_l)$ and $\phi(x_m)$ ($\phi(x)$ being a function that puts $x$ in another space, e.g., $n^{th}$ degree polynomial expansion). That is, a particular entry of the kernel matrix.

Now, we have that, to respect the equality constraint:

\[y_i \alpha_i + y_j \alpha_j = - \sum_{k \neq i, j} y_k\alpha_k.\]

Multiplying both sides by $y_i$ (recall that $y$ is either $1$ or $-1$):

\[\alpha_i + y_iy_j \alpha_j = - y_i\sum_{k \neq i, j} y_k\alpha_k = \zeta,\]

hence,

\[\alpha_i = \zeta - y_iy_j \alpha_j .\]

We can plug this in and we get:

\[\begin{aligned} D(\alpha_j) &= \zeta + (1 - y_i y_j)\alpha_j - \frac{1}{2}\Big[ (\zeta - y_i y_j \alpha_j)^2 K_{ii} + \alpha_j^2 K_{jj} + 2 y_i y_j (\zeta - y_i y_j \alpha_j)\alpha_j K_{ij} \Big] \\ &\quad - y_i (\zeta - y_i y_j \alpha_j)\sum_{k\neq i,j} y_k \alpha_k K_{ik} - y_j \alpha_j \sum_{k\neq i,j} y_k \alpha_k K_{jk} + \text{const}. \end{aligned}\]

Rearranging the term within brackets we get:

\[\begin{aligned} (\zeta - y_i y_j \alpha_j)^2 K_{ii} &+ \alpha_j^2 K_{jj} + 2 y_i y_j (\zeta - y_i y_j \alpha_j)\alpha_j K_{ij} \\ &= \zeta^2 K_{ii} + 2 \zeta y_i y_j \alpha_j (K_{ij}-K_{ii}) + \alpha_j^2 (K_{ii}+K_{jj}-2K_{ij}). \end{aligned}\]

Rearranging the rest:

\[\begin{aligned} & \zeta + (1 - y_i y_j)\alpha_j - y_i (\zeta - y_i y_j \alpha_j)\sum_{k\neq i,j} y_k \alpha_k K_{ik} - y_j \alpha_j \sum_{k\neq i,j} y_k \alpha_k K_{jk} \\ &= \zeta + (1 - y_i y_j)\alpha_j - \zeta y_i \sum_{k\neq i,j} y_k \alpha_k K_{ik} + \alpha_j y_j \sum_{k\neq i,j} y_k \alpha_k K_{ik} - \alpha_j y_j \sum_{k\neq i,j} y_k \alpha_k K_{jk} \\ &= \zeta + (1 - y_i y_j)\alpha_j - \zeta y_i \sum_{k\neq i,j} y_k \alpha_k K_{ik} + \alpha_j y_j \sum_{k\neq i,j} y_k \alpha_k (K_{ik}-K_{jk}). \end{aligned}\]

Putting it together:

\[\begin{aligned} D(\alpha_j) &= - \frac{1}{2}\Big[\zeta^2 K_{ii} + 2 \zeta y_i y_j \alpha_j (K_{ij}-K_{ii}) + \alpha_j^2 (K_{ii}+K_{jj}-2K_{ij}) \Big] \\ &\quad + \zeta + (1 - y_i y_j)\alpha_j - \zeta y_i \sum_{k\neq i,j} y_k \alpha_k K_{ik} + \alpha_j y_j \sum_{k\neq i,j} y_k \alpha_k (K_{ik}-K_{jk}) + \text{const} \\ &= -\frac{1}{2}(K_{ii}+K_{jj}-2K_{ij})\alpha_j^2 +\alpha_j\Big[ (1-y_i y_j) -\zeta y_i y_j (K_{ij}-K_{ii}) + y_j \sum_{k\neq i,j} y_k \alpha_k (K_{ik}-K_{jk}) \Big] +\text{const}. \end{aligned}\]

Now we have a nice form for our scalar quadratic. However, we need to do a bit more work with the linear term. We have that:

\[y_j \sum_{k\neq i,j} y_k \alpha_k (K_{ik}-K_{jk}) = y_j \Bigg(\Big[u_i - b - \alpha_i y_i K_{ii} - \alpha_j y_j K_{ji} \Big] - \Big[u_j - b - \alpha_j y_j K_{jj} - \alpha_i y_i K_{ij} \Big]\Bigg),\]

where

\[u_i = \sum_{k=1}^N \alpha_k y_k K_{ki} + b,\]

that is, the prediction for $x_i$ (note $w = \sum_{k=1}^N \alpha_k y_k \phi(x_k)$). Now, let $E_k = u_k - y_k$ be the error of the prediction. Substituting $\zeta = \alpha_i + y_i y_j \alpha_j$ back:

\[\begin{aligned} & (1-y_i y_j) -\zeta y_i y_j (K_{ij}-K_{ii}) + y_j \Bigg(\Big[u_i - b - \alpha_i y_i K_{ii} - \alpha_j y_j K_{ji} \Big] - \Big[u_j - b - \alpha_j y_j K_{jj} - \alpha_i y_i K_{ij} \Big]\Bigg) \\ &= (1-y_i y_j) + ( \alpha_iy_i y_j + \alpha_j) (K_{ii} - K_{ij}) + \Bigg( y_j(u_i - u_j) + \alpha_j K_{jj} - \alpha_i y_i y_j K_{ii} + \alpha_i y_j y_i K_{ji} - \alpha_j K_{ji}\Bigg) \\ & = (1-y_i y_j) + ( \alpha_i y_i y_j)(K_{ii} - K_{ij} - K_{ii} + K_{ji}) + (\alpha_j)(K_{ii} - K_{ij} + K_{jj} - K_{ji}) + y_j(u_i - u_j) \\ & = (1-y_i y_j) + \alpha_j(K_{ii} - 2K_{ij} + K_{jj}) + y_j(u_i - u_j).\\ \end{aligned}\]

And finally, since $E_i - E_j = u_i - y_i - u_j + y_j$:

\[\begin{aligned} &\alpha_j(K_{ii} - 2K_{ij} + K_{jj}) + y_j(E_i - E_j) \\ &= \alpha_j(K_{ii} - 2K_{ij} + K_{jj}) + y_ju_i - y_j y_i - y_j u_j + 1 \\ &= \alpha_j(K_{ii} - 2K_{ij} + K_{jj}) + (1-y_i y_j) + y_j(u_i - u_j). \end{aligned}\]

Hence we are ready to get the analytical solution:

\[\begin{aligned} D(\alpha_j) = -\frac{1}{2}(K_{ii}+K_{jj}-2K_{ij})\alpha_j^2 +\alpha_j\Big[ \alpha_j^{old}(K_{ii} - 2K_{ij} + K_{jj}) + y_j(E_i - E_j) \Big] +\text{const}. \end{aligned}\]

Importantly, note that $\alpha_j^{old}$ inside the brackets is a given value. Taking the derivative and equating to 0 we arrive at the solution:

\[\begin{equation} \alpha_j^{new} = \alpha_j^{old} + \frac{y_j(E_i - E_j)}{(K_{ii}+K_{jj}-2K_{ij})}, \quad \alpha_i^{new}= \zeta - y_jy_i \alpha_j^{new} = \alpha_i^{old} + y_i y_j(\alpha_j^{old} - \alpha_j^{new}). \label{eq:updateSMO} \end{equation}\]

The second equation follows from the fact that $y_jy_i\alpha_j^{old} + \alpha_i^{old} = \zeta$ since this is respected throughout.

Great! We’ve got a solution. Nevertheless, we still need to enforce the “box” constraints. There are two cases, either $y_iy_j$ is $1$ or $-1$. If it is $1$ then

\[\alpha_j + \alpha_i = \zeta.\]

Since $\alpha$ cannot be higher than $C$ or below 0, we have that if $\zeta < C$ the minimum for $\alpha_j$ is 0 and the maximum is $\zeta$. In the other case, if $\zeta \geq C$, the minimum is $\zeta - C$ (which implies $\alpha_i = C$, the upper limit) and the maximum is $C$. Hence the lower bound for $y_iy_j = 1$ is $\text{max}(0, \zeta - C)$ and the upper bound is $\text{min}(\zeta, C)$. Reasoning in the same way for $y_iy_j = -1$ (where $\alpha_i - \alpha_j = \zeta$) we have that the lower bound is $\text{max}(0, -\zeta)$ and the upper bound is $\text{min}(C - \zeta, C)$. These bounds act like a clip on the update: they ensure that $\alpha$ remains feasible (e.g., think of projected gradient descent).
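
The analytic step and the clip can be sketched in a few lines of plain Python. This is only an illustration under the notation above, not the code from the notebook: the function name `smo_pair_update` and its argument names are hypothetical, and skipping pairs with non-positive curvature is a simplification.

```python
def smo_pair_update(alpha_i, alpha_j, y_i, y_j, E_i, E_j, K_ii, K_jj, K_ij, C):
    """One analytic SMO step for the pair (i, j), followed by clipping to the box."""
    s = y_i * y_j
    zeta = alpha_i + s * alpha_j            # conserved by the equality constraint
    eta = K_ii + K_jj - 2.0 * K_ij          # curvature of the scalar quadratic
    if eta <= 0.0:
        return alpha_i, alpha_j             # degenerate pair: skipped (simplification)

    # Feasible interval for alpha_j, as derived in the text
    if s == 1:
        low, high = max(0.0, zeta - C), min(zeta, C)
    else:
        low, high = max(0.0, -zeta), min(C - zeta, C)

    alpha_j_new = alpha_j + y_j * (E_i - E_j) / eta   # unconstrained optimum
    alpha_j_new = min(max(alpha_j_new, low), high)    # clip to [low, high]
    alpha_i_new = alpha_i + s * (alpha_j - alpha_j_new)
    return alpha_i_new, alpha_j_new
```

Note that $y_i\alpha_i + y_j\alpha_j$ is the same before and after the call, which is exactly the equality constraint being respected.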

Finally, the update for $w$:

\[\begin{aligned} w^{new} &= \sum_{n \neq i, j} y_n \alpha^{old}_n x_n + y_j \alpha_j^{new}x_j + y_i \alpha_i^{new}x_i\\ &= w^{old} + y_j \alpha_j^{new}x_j - y_j \alpha_j^{old}x_j + y_i \alpha_i^{new}x_i - y_i \alpha_i^{old}x_i\\ &= w^{old} + y_j \triangle \alpha_jx_j + y_i \triangle \alpha_ix_i\\ \end{aligned}\]

For $b$ there are two cases. If $\alpha_i$ or $\alpha_j$ is not at the bounds, the corresponding point is a support vector on the margin, and hence its error must be 0. In that case we look for the $b$ that forces $u_k = y_k$. Suppose that $\alpha_i$ is not saturated, then:

\[b_i = y_i - (w^{new})^{\top}\phi(x_i) = y_i - \sum_{n=1}^N y_n \alpha^{new}_n K_{ni},\]

the same applies for $b_j$. If neither is saturated, either works. If both are saturated (non-support vectors or just wrongly classified), we take the average. The above formulation is intuitive but slow; in practice, it is better to use:

\[\begin{aligned} b_1 &= b^{old} - E^{old}_i - y_i(\alpha_i^{new} - \alpha_i^{old})K_{ii} - y_j(\alpha_j^{new} - \alpha_j^{old})K_{ij} \\ b_2 &= b^{old} - E^{old}_j - y_i(\alpha_i^{new} - \alpha_i^{old})K_{ij} - y_j(\alpha_j^{new} - \alpha_j^{old})K_{jj}. \end{aligned}\]

These arise from setting $E_k^{new} = 0$, with $k = i$ for $b_1$ and $k = j$ for $b_2$, in:

\[\begin{aligned} E_k^{new} - E_k^{old} &= \sum_{n=1}^N y_n \alpha^{new}_n K_{nk} + b^{new} - \sum_{n=1}^N y_n \alpha^{old}_n K_{nk} - b^{old} \\ &= y_i(\alpha_i^{new} - \alpha_i^{old})K_{ik} + y_j(\alpha_j^{new} - \alpha_j^{old})K_{jk} + b^{new} - b^{old} \quad \text{(keeping only the terms that change)}. \end{aligned}\]
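
A minimal sketch of this bias rule in plain Python, assuming the quantities above are already available as scalars. The function name `update_bias`, its arguments and the tolerance `tol` are hypothetical choices for the illustration.

```python
def update_bias(b_old, E_i, E_j, y_i, y_j, d_alpha_i, d_alpha_j,
                K_ii, K_ij, K_jj, alpha_i_new, alpha_j_new, C, tol=1e-8):
    """Pick b from b_1, b_2 depending on which multiplier is strictly inside (0, C)."""
    b1 = b_old - E_i - y_i * d_alpha_i * K_ii - y_j * d_alpha_j * K_ij
    b2 = b_old - E_j - y_i * d_alpha_i * K_ij - y_j * d_alpha_j * K_jj
    if tol < alpha_i_new < C - tol:      # alpha_i not saturated: force E_i = 0
        return b1
    if tol < alpha_j_new < C - tol:      # alpha_j not saturated: force E_j = 0
        return b2
    return 0.5 * (b1 + b2)               # both saturated: take the average
```

Here `d_alpha_i` and `d_alpha_j` stand for $\alpha^{new} - \alpha^{old}$ of each multiplier.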

Pair selection heuristics

As mentioned, a naive implementation has $N^2$ pairs to optimize. However, SMO uses heuristics to reduce the candidates. First, we have to ensure that at least one of the multipliers is violating the KKT conditions. Second, the support vectors (non-saturated Lagrange multipliers) are the most likely to play a big role. The first heuristic, then, is to loop over the whole dataset and update the multipliers that violate the KKT conditions, then loop only over the support vectors (several times if needed) until all support vectors comply with the KKT conditions. Then the process is repeated, to make sure “no more” updates are possible. When no updates are made, the algorithm stops.

The second heuristic is to select the second multiplier once the first is selected. In the implementation for this post, we tried to use the argmax of the denominator of the update of $\alpha_j$ to select just the best step for each $\alpha_j$. While it worked ok-ish, it did not converge very close to the sklearn implementation of the SVM. Using all indices seemed to work perfectly but is very wasteful. Taking just the top 50 seemed like an ok compromise, so in this implementation we do 50 updates for each $\alpha_j$, focusing on the top $\alpha_i$ that are more likely to yield a change. Another option could be to compute the denominator, but this implies computing the whole kernel matrix. In any case, this is likely very far from actual top-performing implementations of the algorithm, but that is beyond the scope of this post.

One particularly interesting bit of the SMO implementation is that, since the error at the beginning is known ($E = -y$, since $w = 0$ and $b = 0$ give $u = 0$) and we update just a couple of multipliers at a time, we do not need to recompute all the errors: we only add the contribution of the two multipliers (and the bias) that we updated. This is very important.

Putting all together

The algorithm is relatively simple without the heuristics (essentially a “coupled” coordinate descent). In the code it looks something like follows:
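
For reference, here is a compact numpy sketch of such a loop. It is not the code from the notebook: it is a simplified variant in which, for each multiplier that violates the KKT conditions, partners are tried in random order until one makes progress, and none of the top-50 heuristics are used. The name `simplified_smo` and the parameters `tol`, `max_sweeps` and `seed` are assumptions made for the illustration.

```python
import numpy as np

def simplified_smo(K, y, C=1.0, tol=1e-3, max_sweeps=200, seed=0):
    """Simplified SMO on a precomputed kernel matrix K; labels y in {-1, +1}."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y, dtype=float)
    n = len(y)
    alpha, b = np.zeros(n), 0.0
    E = -y.copy()                          # u = 0 initially, so E = u - y = -y
    for _ in range(max_sweeps):
        changed = False
        for j in range(n):
            r = y[j] * E[j]
            violates = (r < -tol and alpha[j] < C) or (r > tol and alpha[j] > 0)
            if not violates:
                continue
            for i in rng.permutation(n):   # try partners until one makes progress
                if i == j:
                    continue
                s = y[i] * y[j]
                zeta = alpha[i] + s * alpha[j]
                eta = K[i, i] + K[j, j] - 2.0 * K[i, j]
                if eta <= 1e-12:
                    continue
                if s > 0:
                    low, high = max(0.0, zeta - C), min(zeta, C)
                else:
                    low, high = max(0.0, -zeta), min(C - zeta, C)
                a_j = min(max(alpha[j] + y[j] * (E[i] - E[j]) / eta, low), high)
                if abs(a_j - alpha[j]) < 1e-8:
                    continue
                a_i = alpha[i] + s * (alpha[j] - a_j)
                d_i, d_j = a_i - alpha[i], a_j - alpha[j]
                b1 = b - E[i] - y[i] * d_i * K[i, i] - y[j] * d_j * K[i, j]
                b2 = b - E[j] - y[i] * d_i * K[i, j] - y[j] * d_j * K[j, j]
                if 0.0 < a_i < C:
                    b_new = b1
                elif 0.0 < a_j < C:
                    b_new = b2
                else:
                    b_new = 0.5 * (b1 + b2)
                # Error cache: only two multipliers and the bias changed
                E += y[i] * d_i * K[i] + y[j] * d_j * K[j] + (b_new - b)
                alpha[i], alpha[j], b = a_i, a_j, b_new
                changed = True
                break
        if not changed:                    # a full sweep without updates: stop
            break
    return alpha, b
```

Predictions are then `np.sign(K_test @ (alpha * y) + b)`, where `K_test` holds the kernel between test and training points.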

In the optimization process one can tell the difference from the “global” Newton’s method step.

SVM in MNIST: Telling a 5 from a 3

An important thing to note is that, when selecting the support vectors to infer $b$ (for support vectors on the margin the prediction matches the label exactly), we have to allow for some numerical room. That is, the filter has to be something like $0 + \epsilon < \alpha < C - \epsilon$. Finally, after all this, we can contemplate how the default SVM implementation obliterates our algorithms! This is obviously expected since the sklearn SVM optimization process is based on libsvm, which offers a highly optimized SMO algorithm. For some reason the gamma parameter in the sklearn SVM behaves differently than in my implementation. In any case, here are the results for 1232 training examples in 256 dimensions.

Quite cool that the algorithms are somewhat decent, taking into account this was pure numpy!

References

These notes are a personal synthesis and extensions based on lectures from Mastermath (NL) Continuous Optimization Course 2025 along with standard references (below). Any errors are my own.

Boyd, S., & Vandenberghe, L. Convex Optimization.
Vapnik, V. N. The Nature of Statistical Learning Theory.
HMC resource: SMO.
Platt, J. C. A fast algorithm for training SVMs.

Steepest descent: A geometrical take from gradient descent to Newton’s method.

2025-12-25T00:00:00+00:00

Gradient descent with exact line search: convergence analysis

A key element in understanding gradient descent is its relationship with the condition number of the Hessian of the function we are minimizing. This will give an interesting framework to understand the advantages of Newton’s method. Let’s follow §9.3.1 from [1]. The idea here is to relate the number of iterations gradient descent needs to reach the optimum to the condition number of the Hessian.

The gradient descent algorithm with exact line search:

Algorithm: Gradient descent with exact line search

Given x0 in R^n, tolerance epsilon > 0

for k = 0, 1, 2, ...:
    g_k = grad f(x_k)
    if ||g_k|| <= epsilon:
        return x_k
    alpha_k = argmin_{alpha > 0} f(x_k - alpha*g_k)
    x_{k+1} = x_k - alpha_k * g_k

Note that this involves solving:

\[\alpha_k = \arg\min_{\alpha > 0} f(x_k - \alpha g_k),\]

which may or may not be difficult. For simplicity, let us assume that the exact optimal step is available.
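
For a quadratic the exact step is available in closed form, which makes for a small runnable sketch. Assuming $f(x) = \frac{1}{2}x^{\top}Qx + q^{\top}x$ with $Q$ positive definite (the names `Q`, `q` and the function name are assumptions of this example), the gradient is $g = Qx + q$ and minimizing $f(x - \alpha g)$ over $\alpha$ gives $\alpha = \frac{g^{\top}g}{g^{\top}Qg}$.

```python
import numpy as np

def gd_exact_line_search(Q, q, x0, eps=1e-8, max_iter=10_000):
    """Gradient descent with exact line search on f(x) = 0.5 x^T Q x + q^T x."""
    x = np.asarray(x0, dtype=float)
    for k in range(max_iter):
        g = Q @ x + q                        # gradient at x_k
        if np.linalg.norm(g) <= eps:
            return x, k
        alpha = (g @ g) / (g @ Q @ g)        # exact minimizer along -g for a quadratic
        x = x - alpha * g
    return x, max_iter
```

With `Q = np.diag([1.0, 10.0])` the iterates zig-zag towards the optimum, and the number of iterations grows as the ratio of the two diagonal entries grows.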

Let $f: \mathbb{R}^n \rightarrow \mathbb{R}$ be a twice differentiable and strongly convex quadratic function (we relax this later).

In what follows we first derive a lower bound on the norm of the gradient and then an upper bound on the function value after one iteration of the algorithm. The idea is to provide an upper bound on the distance to the optimal value as a function of the distance at the previous step.

Consider the second order Taylor approximation to $f$ around $x$ with Lagrange remainder:

\[f(y) = f(x) + \nabla f(x)^{\top}(y-x) + \frac{1}{2} (y-x)^{\top}\nabla ^2 f(z) (y - x),\]

for some $z = \theta x + (1-\theta) y$ with $\theta \in [0, 1]$. Since $f$ is strongly convex, its Hessian must be positive definite ¹. Now, we want to find a lower and an upper bound of $\frac{1}{2} (y-x)^{\top}\nabla ^2 f(z) (y - x)$. By considering the smallest eigenvalue $\nu_{min}$ of $\nabla ^2 f(z)$ ² we obtain a lower bound:

\[\begin{equation} f(y) \geq f(x) + \nabla f(x)^{\top}(y-x) + \frac{1}{2} \nu_{min}\|y - x\|^2_2. \label{eq:lowb} \end{equation}\]

From here, we can derive a lower bound on the optimal function value. We look for the $\bar{y}$ that minimizes the r.h.s. of equation \ref{eq:lowb}, equating the first derivative w.r.t. $y$ to $0$:

\[\nabla f(x) + \nu_{min}(y-x) = 0,\]

then,

\[\bar{y} = x - \frac{\nabla f(x)}{\nu_{min}}.\]

Hence, plugging in, we have

\[\begin{align*} f(x) &+ \nabla f(x)^{\top}(y - x) + \frac{1}{2}\nu_{\min}\|y - x\|_2^2 \\ &\ge f(x) + \nabla f(x)^{\top}\!\left(-\frac{\nabla f(x)}{\nu_{\min}}\right) + \frac{1}{2}\nu_{\min} \|\frac{\nabla f(x)}{\nu_{\min}}\|_2^2 \\ &= f(x) - \frac{1}{\nu_{\min}}\|\nabla f(x)\|_2^2 + \frac{1}{2\nu_{\min}}\|\nabla f(x)\|_2^2 \\ &= f(x) - \frac{1}{2\nu_{\min}}\|\nabla f(x)\|_2^2 . \end{align*}\]

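Finally, since the above holds for any $y$, in particular for the minimizer $x^*$ with $f(x^*) = p^*$, we have $p^* \geq f(x) - \frac{1}{2\nu_{\min}}\|\nabla f(x)\|_2^2$, that is,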

\[\begin{equation} 2\nu_{\min}(-p^* + f(x)) \leq \|\nabla f(x)\|_2^2 \label{eq:lowbound_g} \end{equation}\]

Now for the upper bound:

\[\begin{equation} f(y) \leq f(x) + \nabla f(x)^{\top}(y-x) + \frac{1}{2} \nu_{max}\|y - x\|^2_2. \label{eq:uppb} \end{equation}\]

Let’s plug in the line search step in the above equation \eqref{eq:uppb} :

\[\begin{aligned} f(x - \alpha\nabla f(x)) &\leq f(x) + \nabla f(x)^{\top}\big((x - \alpha\nabla f(x)) - x\big) + \tfrac{1}{2}\nu_{\max}\|(x - \alpha\nabla f(x)) - x\|_2^2 \\ &= f(x) + \nabla f(x)^{\top}\big(-\alpha\nabla f(x)\big) + \tfrac{1}{2}\nu_{\max}\|(\alpha\nabla f(x))\|_2^2 \\ &= f(x) - \alpha\|\nabla f(x)\|^2_2 + \tfrac{\alpha^2}{2}\nu_{\max}\|\nabla f(x)\|_2^2. \end{aligned} \\\]

Now, minimizing over $\alpha$ (the right-hand side is a simple quadratic in $\alpha$):

\[-\|\nabla f(x)\|^2_2 + \alpha\nu_{\max}\|\nabla f(x)\|_2^2 = 0\]

so the optimal $\alpha$:

\[\alpha = \frac{\|\nabla f(x)\|^2_2}{\nu_{\max}\|(\nabla f(x))\|_2^2} = \frac{1}{\nu_{\max}}.\]

Plugging in:

\[\begin{aligned} f(x - \alpha\nabla f(x)) & \leq f(x) - \frac{1}{\nu_{\max}}\|\nabla f(x)\|^2_2 + \frac{1}{2\nu_{\max}^2}\nu_{\max}\|\nabla f(x)\|_2^2 \\ &= f(x) - \frac{1}{\nu_{\max}}\|\nabla f(x)\|^2_2 + \frac{1}{2\nu_{\max}}\|\nabla f(x)\|_2^2 \\ &= f(x) - \frac{1}{2\nu_{\max}}\|\nabla f(x)\|^2_2. \end{aligned} \\\]

Subtracting the optimal value $p^*$ and using equation \ref{eq:lowbound_g}:

\[\begin{aligned} f(x - \alpha \nabla f(x)) - p^* \leq f(x) - p^* - \frac{1}{2\nu_{\max}}\|\nabla f(x)\|^2_2 \leq f(x) - p^* - \frac{\nu_{\min}}{\nu_{\max}}(-p^* + f(x)) = (1 - \frac{\nu_{\min}}{\nu_{\max}})(f(x) - p^*) \end{aligned} \\\]

We are almost there. Apparently, if $\frac{\nu_{\min}}{\nu_{\max}}$ is close to $0$, it is possible that we make almost no progress towards the optimum, since the bound allows the value of the function after applying the optimal step to be nearly the same. Let’s look closer:

Let

\[\begin{aligned} x_{k+1} := x_k - \alpha_k \nabla f(x_k),\\ \rho := 1 - \frac{\nu_{\min}}{\nu_{\max}}. \end{aligned} \\\]

Then, since $f(x_{k+1}) - p^* \leq \rho \bigl(f(x_k) - p^*\bigr)$, we have that

\[f(x_{k+2}) - p^* \leq \rho \bigl(f(x_{k+1}) - p^*\bigr) \leq \rho^2 \bigl(f(x_k) - p^*\bigr),\]

So, we have found a way of constructing an upper bound for any iteration in the algorithm based on the initial difference and the smallest and biggest eigenvalues of the Hessian:

\[f(x_{k}) - p^* \leq \rho^k \bigl(f(x_0) - p^*\bigr).\]

Solving for $k$, we can make an explicit formula for the maximum number of iterations required to reach a particular loss. In particular, let $\epsilon$ be the desired suboptimality, we have that $f(x_{k}) - p^* \leq \epsilon$ if:

\[\begin{aligned} \rho^k &\leq \frac{\epsilon}{f(x_0) - p^*}; \\ k\log(\rho) &\leq \log\Big(\frac{\epsilon}{f(x_0) - p^*}\Big); \\ k &\geq \frac{\log\big(\frac{\epsilon}{f(x_0) - p^*}\big)}{\log(\rho)} \quad \text{(the inequality flips because } \log(\rho) < 0\text{)};\\ k &\geq \frac{\log\big(\frac{f(x_0) - p^*}{\epsilon}\big)}{\log(\frac{1}{\rho})}. \end{aligned}\]

Looking closer at the denominator we see that, if the condition number of the Hessian is big, $\rho$ approaches 1 and hence the denominator approaches 0, increasing the number of iterations required!
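
To get a feeling for the numbers, the bound can be evaluated directly. This is a small sketch in plain Python; the function name `iteration_bound` and its arguments are hypothetical, with `gap0` standing for $f(x_0) - p^*$.

```python
import math

def iteration_bound(nu_min, nu_max, gap0, eps):
    """Smallest integer k with rho^k * gap0 <= eps, where rho = 1 - nu_min / nu_max."""
    rho = 1.0 - nu_min / nu_max
    if rho == 0.0:
        return 1                      # condition number 1: one exact step is enough
    k = math.log(gap0 / eps) / math.log(1.0 / rho)
    return max(0, math.ceil(k))

for kappa in (2, 10, 100, 1000):
    print(kappa, iteration_bound(1.0, float(kappa), 1.0, 1e-6))
```

The bound grows roughly linearly with the condition number, since $\log(1/\rho) \approx \nu_{\min}/\nu_{\max}$ when that ratio is small.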

A note for arbitrary convex functions

Above, treating the quadratic case allowed us to make intuitive use of the maximum and minimum eigenvalues. However, in general, this is not possible since the Hessian is a function of $x$. Nevertheless, the above reasoning holds when replacing the maximum and minimum eigenvalues with $M$ and $m$ respectively, where:

\[\nabla^2 f(x) - m I \in \mathbb{S}^n_+, \qquad M I - \nabla^2 f(x) \in \mathbb{S}^n_+ \quad \forall x \in Q.\]

For some set $Q$ being the domain of $f$, e.g. $\mathbb{R}^n$. $\mathbb{S}^n_+$ is the set of PSD matrices.

Steepest descent

Gradient descent is a particular instance of a more general framework. Let us now consider steepest descent; we will follow §9.4 from [1].

Again, we start with a Taylor approximation, in this case, a first order one:

\[f(x + v) \approx f(x) + \nabla f(x)^{\top}(v).\]

The idea is now to minimize this approximation with respect to $v$, but with a constraint on the norm of $v$. The “normalized” steepest descent direction is defined then as:

\[\triangle x_{nsd} = \text{argmin}\{\nabla f(x)^\top v \: |\: \|v\| \leq 1 \},\]

for some norm. Selecting this norm is the key here.

We can also look at the “unnormalized” steepest descent direction:

\[\triangle x_{sd} = \|\nabla f(x)\|_*\triangle x_{nsd}.\]

We have that:

\[\|\nabla f(x)\|_* = \text{sup}\{\nabla f(x)^\top z \: : \|z\| \leq 1\}\]

i.e. what is the maximum directional derivative of $f$ at $x$ over all directions with unit norm? Following that line of thought we must have that:

\[\nabla f(x)^\top \triangle x_{nsd} = -\|\nabla f(x)\|_*.\]

Why? Because $ \triangle x_{nsd}$ is trying to find the direction with unit norm that minimizes the directional derivative. So, evaluating that directional derivative is exactly the definition of the dual norm up to a sign.

Euclidean norm: Gradient Descent

For the Euclidean norm we have that, since we need a unit-norm vector pointing in the direction opposite to the gradient (angle $\pi$ with the gradient),

\[\triangle x_{nsd} = -\frac{\nabla f(x)}{\|\nabla f(x)\|_2},\]

and hence,

\[\triangle x_{sd} = \|\nabla f(x)\|_*\triangle x_{nsd} = -{\|\nabla f(x)\|_2}\frac{\nabla f(x)}{\|\nabla f(x)\|_2} = -\nabla f(x).\]

Quadratic norm: Newton’s Method

The description above was intended to motivate the advantage of Newton’s method over gradient descent. In particular, the problem gradient descent has with high condition numbers of the Hessian, even for simple quadratic functions.

It turns out that we can propose a change of basis where this problem disappears! In fact, for quadratic functions we will need just one iteration. While this is somewhat obvious when treating Newton’s method as a root finding algorithm for a second degree Taylor approximation to the function, that interpretation does not illustrate the interesting geometry that is going on “behind” it.

So, let’s consider the quadratic norm to see how to get rid of the problematic condition number:

\[\|z\|_P = (z^\top P z)^{0.5} = \|P^{0.5}z\|_2,\]

where $P \in \mathbb{S}_{++}$ (a positive definite matrix). Let us then consider

\[\triangle x_{nsd} = \text{argmin}\{\nabla f(x)^\top v \: |\: \|v\|_P \leq 1 \}.\]

We need to solve this inequality constrained optimization problem. Here we solve it in a simple way, but there are other ways³:

\[\begin{aligned} \triangle x_{nsd} &= \text{argmin}\{\nabla f(x)^\top v \: |\: \|v\|_P \leq 1 \}, \\ &= \text{argmin}\{\nabla f(x)^\top v \: |\: v^\top Pv = 1 \}, \quad \text{since the objective is linear, the constraint must be saturated} \\ &= \text{argmin}\{\nabla f(x)^\top v \: |\: (P^{0.5}v)^\top(P^{0.5}v) = 1 \}, \\ &= P^{-0.5}\text{argmin}\{\nabla f(x)^\top P^{-0.5}u \: |\: u^\top u = 1 \}, \quad v = P^{-0.5}u \\ &= P^{-0.5}\text{argmin}\{ (P^{-0.5}\nabla f(x))^\top u \: |\: \| u \|_2= 1 \} \\ &= -\frac{P^{-1}\nabla f(x)}{\| P^{-0.5}\nabla f(x) \|_2}.\\ \end{aligned}\]

And then, since the dual norm is $\|\nabla f(x)\|_* = \| P^{-0.5}\nabla f(x) \|_2$, we get $\triangle x_{sd} = -P^{-1}\nabla f(x)$.
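
A quick numerical sanity check of this result, as a sketch in numpy. It computes $-P^{-1}\nabla f(x)$ scaled by the dual norm and compares the directional derivative against random directions of unit $P$-norm. The function name and the example values of `P` and the gradient are assumptions of the illustration.

```python
import numpy as np

def steepest_descent_quadratic_norm(grad, P):
    """Normalized and unnormalized steepest descent directions for the norm ||.||_P."""
    p_inv_g = np.linalg.solve(P, grad)          # P^{-1} grad
    dual_norm = np.sqrt(grad @ p_inv_g)         # ||P^{-1/2} grad||_2
    return -p_inv_g / dual_norm, -p_inv_g

P = np.array([[2.0, 0.0], [0.0, 8.0]])
g = np.array([2.0, 4.0])
nsd, sd = steepest_descent_quadratic_norm(g, P)

rng = np.random.default_rng(0)
best_random = np.inf
for _ in range(1000):
    v = rng.normal(size=2)
    v /= np.sqrt(v @ P @ v)                     # rescale to unit P-norm
    best_random = min(best_random, g @ v)

print(g @ nsd, best_random)   # no random direction does better than nsd
```

The first printed value equals minus the dual norm of the gradient, in line with the identity $\nabla f(x)^\top \triangle x_{nsd} = -\|\nabla f(x)\|_*$.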

Change of basis interpretation

Let $\bar{x} = Ax$ and $\bar{f}(\bar{x}) = f(A^{-1}\bar{x})$ and let $A \in \mathbb{S}_{++}^n$. Essentially, $\bar{f}$ is a function that describes the same level sets in another basis:

Illustration of a coordinate transformation: the left plot shows the original function, and the right plot shows the function in a different basis. The quadratic norm allows us to perform gradient descent for the same function in a different basis. In particular, the linear transformation consists of a stretch of the vertical axis. Importantly, in this basis the condition number of the Hessian has improved.

Consider the derivative:

\[\nabla \bar{f}(\bar{x}) = A^{-1}\nabla f(A^{-1}\bar{x}) = A^{-1}\nabla f(x).\]

Now, mapping back to the original space, the search direction is:

\[v = A^{-1}\big(-\nabla \bar{f}(\bar{x})\big) = -A^{-2}\nabla f(x)\]

So, if we let $A = P^{0.5}$, this is exactly the (unnormalized) steepest descent direction for the quadratic norm, $-P^{-1}\nabla f(x)$. This shows that steepest descent with a quadratic norm is just gradient descent in a different basis ⁴ !

Newton’s Method algorithm

We now have all the pieces (it was more difficult than I expected…)!

Let $A = (\nabla^2 f(x))^{0.5}$. Then, $\nabla^2 \bar{f}(\bar{x}) = A^{-1}\nabla^2 f(x)A^{-1} = I$, so the condition number is 1.

And that’s it: Newton’s method is gradient descent after a change of basis that alleviates the condition number of the Hessian.
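
This can be checked numerically on a quadratic. The sketch below assumes $f(x) = \frac{1}{2}x^{\top}Qx + q^{\top}x$ (the names `Q`, `q` and both function names are assumptions), builds $A = Q^{0.5}$ from an eigendecomposition, and takes a single unit gradient step in the new coordinates.

```python
import numpy as np

def sqrtm_spd(Q):
    """Symmetric square root of a symmetric positive definite matrix."""
    w, V = np.linalg.eigh(Q)
    return (V * np.sqrt(w)) @ V.T

def gd_step_in_newton_basis(Q, q, x):
    """One unit gradient step on f_bar(x_bar) = f(A^{-1} x_bar), mapped back to x."""
    A = sqrtm_spd(Q)
    A_inv = np.linalg.inv(A)
    hess_bar = A_inv @ Q @ A_inv                # Hessian in the new basis
    grad_bar = A_inv @ (Q @ x + q)              # gradient in the new basis
    x_new = A_inv @ (A @ x - grad_bar)          # step in x_bar, then back to x
    return hess_bar, x_new

Q = np.array([[10.0, 2.0], [2.0, 1.0]])
q = np.array([1.0, -1.0])
hess_bar, x_new = gd_step_in_newton_basis(Q, q, np.array([3.0, -2.0]))
print(np.round(hess_bar, 6))                    # identity matrix
print(x_new, np.linalg.solve(Q, -q))            # both are the minimizer
```

One gradient step in the new basis lands on the minimizer, which is what a Newton step does on a quadratic.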

Algorithm: Newton's method

Given x_0 in R^n, tolerance epsilon > 0

for k = 0, 1, 2, ...:
  Calculate Newton's step and decrement (g_k = grad f(x_k), H_k = hess f(x_k)):
    v = -H_k^{-1} g_k
    lambda^2 = g_k^T H_k^{-1} g_k
  If lambda^2 / 2 <= epsilon:
    return x_k
  Line search:
    Get alpha_k
  Update:
    x_{k+1} = x_k + alpha_k v
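
The pseudocode can be turned into a short numpy sketch. The line search is not specified above, so a backtracking (Armijo) rule is assumed here. The function name, the parameters `beta` and `c`, and the test function $f(x) = \sum_i e^{x_i} + \frac{1}{2}\|x\|_2^2$ are all choices made for the illustration.

```python
import numpy as np

def newton(f, grad, hess, x0, eps=1e-12, beta=0.5, c=1e-4, max_iter=100):
    """Newton's method with backtracking line search (a sketch, not tuned)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g, H = grad(x), hess(x)
        v = -np.linalg.solve(H, g)              # Newton step
        lam_sq = -(g @ v)                       # decrement squared: g^T H^{-1} g
        if lam_sq / 2.0 <= eps:
            break
        alpha = 1.0
        # Backtrack until the Armijo sufficient-decrease condition holds
        while f(x + alpha * v) > f(x) - c * alpha * lam_sq:
            alpha *= beta
        x = x + alpha * v
    return x

f = lambda x: np.sum(np.exp(x)) + 0.5 * x @ x
grad = lambda x: np.exp(x) + x
hess = lambda x: np.diag(np.exp(x)) + np.eye(len(x))
x_star = newton(f, grad, hess, [3.0, -2.0])
print(x_star)   # each coordinate solves exp(x) + x = 0
```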

Convergence analysis for Newton’s method

This is going to be a digest of §9.5.3 from [1], hopefully bringing us some intuition. Alright, recall that for quadratic functions Newton’s method is trivial, and in fact, those can be solved analytically. So here we will focus on arbitrary strongly convex and twice differentiable functions. Additionally we are going to assume:

\[\|\nabla^2 f(x) - \nabla^2 f(y) \|_2 \leq L\|x-y\|_2,\]

i.e. the Hessian is Lipschitz continuous. Essentially, $L$ will measure how good is the second order Taylor approximation for $f$.

The idea is to show that if we start close enough to the optimum $x^*$ then Newton’s method will converge quadratically. In this case we consider iterates of Newton’s method with unit stepsize (“pure Newton’s step”).

Let $\triangle_{nt}^k = -(\nabla^2 f(x_k))^{-1} \nabla f(x_k)$:

\[\begin{aligned} \|\nabla f(x_k + \triangle_{nt}^k)\|_2 = \|\nabla f(x_{k+1})\|_2 &= \| \nabla f(x_k + \triangle_{nt}^k) - \nabla f(x_k) + \nabla^2 f(x_k) (\nabla^2 f(x_k))^{-1} \nabla f(x_k)\|_2 \\ &= \| \nabla f(x_k + \triangle_{nt}^k) - \nabla f(x_k) - \nabla^2 f(x_k) \triangle_{nt}^k\|_2 \\ &= \| \int_0^1 \big( \nabla^2 f(x_k + t\triangle_{nt}^k) - \nabla^2 f(x_k) \big) dt \triangle_{nt}^k\|_2 \quad \text{(Fundamental theorem of calculus)} \\ &\leq \int_0^1 \| \big( \nabla^2 f(x_k + t\triangle_{nt}^k) - \nabla^2 f(x_k) \big) \triangle_{nt}^k\|_2dt \quad \text{(Triangle inequality)} \\ &\leq \int_0^1 \| \big( \nabla^2 f(x_k + t\triangle_{nt}^k) - \nabla^2 f(x_k) \big)\|_2\|\triangle_{nt}^k\|_2dt \quad \text{(Operator norm inequality)} \\ &\leq \int_0^1 L(t \|\triangle_{nt}^k \|_2 ) \|\triangle_{nt}^k \|_2 dt \quad \text{(Lipschitz continuity definition)} \\ &= \frac{L}{2}\|-(\nabla^2 f(x_k))^{-1} \nabla f(x_k)\|^2_2 \\ &\leq \frac{L}{2}\| m^{-1}\nabla f(x_k)\|^2_2 \quad \text{(Effect of the Hessian is bounded below by smallest eigenvalue possible in the domain)} \\ &= \frac{L}{2m^2}\| \nabla f(x_k)\|^2_2.\\ \end{aligned}\]

So, the norm of the gradient at $k+1$ is bounded above by a value proportional to the squared norm of the gradient at $k$. We have that ⁵ $f(x) - p^* \leq \frac{1}{2m} \| \nabla f(x)\|_2^2$ and that $\frac{L}{2m^2} \|\nabla f(x_{k+1})\|_2 \leq \big( \frac{L}{2m^2}\| \nabla f(x_k)\|_2 \big)^2$ so:

\[\begin{aligned} f(x_k) - p^* &\leq \frac{1}{2m} \| \nabla f(x_k)\|_2^2 \\ &= \frac{2m^3}{L^2} \big(\frac{L}{2m^2}\| \nabla f(x_k)\|_2 \big)^2 \\ &\leq \frac{2m^3}{L^2} \big(\frac{L}{2m^2}\| \nabla f(x_{k-1})\|_2 \big)^4 \\ &\leq \cdots \\ &\leq \frac{2m^3}{L^2} \big(\frac{L}{2m^2}\| \nabla f(x_{0})\|_2 \big)^{2^{k + 1}} \\ \end{aligned}\]

So, if $\| \nabla f(x_{0})\|_2 < \frac{2m^2}{L}$ then convergence is extremely fast! Much faster than gradient descent. However, if we are far away from the optimum the algorithm could be very slow or diverge. In that case, line search is essential.

Importantly, note that the notion of distance depends on $L$ and $m$. In particular, note that if $L$ is huge, the gradient must be very small in order to enter the quadratic convergence regime! The more stable the Hessian is, the better. Intuitively, lower $L$ implies that the function resembles a quadratic function and hence the second order Taylor approximation is better. This is particularly relevant for log-barrier methods but we will not dive into that here.
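
The squaring of the gradient norm is easy to observe in one dimension. The sketch below runs pure Newton steps (unit step size) on the hypothetical example $f(x) = e^x + \frac{1}{2}x^2$, for which $f''(x) = e^x + 1 \geq 1$, so $m = 1$. Along the iterates, which stay at $x \leq 0$, the second derivative has Lipschitz constant at most 1, so the constant $\frac{L}{2m^2}$ is at most $0.5$.

```python
import math

def pure_newton_1d(x0, n_steps=4):
    """Pure Newton steps on f(x) = exp(x) + x^2 / 2; returns |f'(x_k)| per iterate."""
    x, grads = x0, []
    for _ in range(n_steps + 1):
        g = math.exp(x) + x           # f'(x)
        h = math.exp(x) + 1.0         # f''(x)
        grads.append(abs(g))
        x -= g / h                    # unit Newton step
    return grads

grads = pure_newton_1d(0.0)
for k, g in enumerate(grads):
    print(k, g)
```

Each printed gradient norm is roughly the square of the previous one (times a constant), so the number of correct digits doubles per iteration.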

Some comments on equality constrained optimization

Let’s consider the following minimization problem:

\[\begin{aligned} \text{minimize}_{x \in \mathbb{R}^n} \quad & f(x) \\ \text{subject to} \quad & Ax = b \end{aligned}\]

Assume $f(x)$ is strongly convex and differentiable as before. Furthermore, $A \in \mathbb{R}^{m \times n}$ encodes $m$ linear constraints and we have far fewer constraints than dimensions, $n \gg m$. This implies that the system of equations is underdetermined and (if the constraints are linearly independent) the solutions form an affine subspace of dimension $n-m$, leaving room for optimization. We want to find the solution that minimizes $f(x)$ in that subspace.

The Lagrangian encodes the idea that, at the optimum, the derivative of the function must be proportional to the derivative of the constraint (otherwise we could move in some direction orthogonal to the constraint and improve the function value while staying feasible):

\[L(x, \nu) = f(x) + \nu^{\top}(Ax - b),\]

defines a saddle, we want to minimize the function w.r.t $x$ and maximize it w.r.t $\nu$. The optimum is the critical point of the saddle. There is much more to this, but we do not dive into it here.

Gradient descent

Importantly, minimizing the Lagrangian jointly over $(x, \nu)$ is not well defined: it is unbounded below, so it does not have a minimum. Hence, applying gradient descent to the Lagrangian makes no sense. Treating $\nu$ as a hyperparameter and using gradient descent on $x$ alone is likely wrong in most cases.

Another, but still not proper, way to go about this is to use gradient descent on the euclidean norm of the derivative of the Lagrangian.

\[\begin{equation} \begin{aligned} \nabla_x L(x, \nu) &= (\nabla f(x) + A^{\top} \nu), \\ \nabla_{\nu} L(x, \nu) &= (Ax - b) \\ \end{aligned} \label{eq:derivatives_const} \end{equation}\]

Then,

\[\begin{aligned} \frac{1}{2} \|F\|^2_2 &= \frac{1}{2} \Big( \|\nabla f(x) + A^{\top}\nu \|_2^2 + \|Ax - b\|_2^2 \Big) \\ &= \frac{1}{2} \bigg( (\nabla f(x) + A^{\top}\nu)^{\top} (\nabla f(x) + A^{\top}\nu) + (Ax - b)^{\top}(Ax - b) \bigg) \\ &= \frac{1}{2} \bigg( \|\nabla f(x)\|^2_2 + 2\nu^{\top} A \nabla f(x) + \nu^{\top}AA^{\top}\nu + x^{\top}A^{\top}Ax - 2x^{\top}A^{\top}b + \|b\|^2_2 \bigg). \end{aligned}\]

Taking derivatives (again):

\[\begin{aligned} \nabla_x \frac{1}{2}\|F\|^2_2 &= \nabla^2 f(x) \big( \nabla f(x) + A^{\top}\nu \big) + A^{\top}(Ax -b) \\ \nabla_{\nu} \frac{1}{2}\|F\|^2_2 &= A \nabla f(x) + AA^{\top}\nu \end{aligned}\]

Here we explicitly target the critical point of the Lagrangian. This is also a bad idea, since it requires computing the Hessian and yet it is still a first order method.

An example of a proper approach is projected gradient descent (PGD). In this case it is quite simple, we only need to project the gradient update $\nabla f(x)$ into the kernel of $A$, starting from a feasible point (assuming constraints are linearly independent):

\[g_{pgd} = (I - A^{\top}\big(AA^{\top}\big)^{-1}A)\nabla f(x) \quad \text{Remove from } \nabla f(x) \text{ what lies in the row space}.\]

Here we achieve that $A g_{pgd} = 0$ and hence, for any step size $\alpha$, $A (x - \alpha g_{pgd} ) = Ax = b$, so we keep feasibility if the original point was feasible.
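
A minimal numpy sketch of this projected gradient descent for a quadratic objective $\frac{1}{2}x^{\top}Qx + q^{\top}x$. The function name, the choice of the least-norm solution of $Ax = b$ as feasible start, and the default step and tolerance are assumptions of the example.

```python
import numpy as np

def projected_gd(Q, q, A, b, step=1e-3, tol=1e-4, max_iter=100_000):
    """Minimize 0.5 x^T Q x + q^T x subject to A x = b by projected gradient descent."""
    AAt = A @ A.T                                   # invertible if rows are independent
    x = A.T @ np.linalg.solve(AAt, b)               # feasible start: least-norm solution
    proj = np.eye(A.shape[1]) - A.T @ np.linalg.solve(AAt, A)   # projector onto ker(A)
    for _ in range(max_iter):
        g = proj @ (Q @ x + q)                      # gradient without its row-space part
        if np.linalg.norm(g) < tol:
            break
        x = x - step * g                            # A g = 0, so A x stays equal to b
    return x
```

Forming the dense projector is fine for a small example. For large $n$ one would rather apply it implicitly with a factorization of $AA^{\top}$.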

Just an example of minimizing the norm of the gradient of the Lagrangian (root seeking) and projected GD. The function under consideration is $\frac{1}{2}x^{\top}Qx + q^{\top}x$. In this case $x \in \mathbb{R}^{1000}$ and $A \in \mathbb{R}^{100\times 1000}$. For both algorithms a fixed step size of 0.001 is used (just to keep it simple); the loop stops if the gradient norm is below 0.0001.

Importantly, while projected gradient descent enforces feasibility during the whole training process, minimizing the norm of the gradient of the Lagrangian reaches feasibility asymptotically.

Newton’s method application

While gradient descent cannot be applied directly, Newton’s method is a root seeking algorithm so it might be applied directly to the Lagrangian⁶ . We can start from equations in \ref{eq:derivatives_const}, then calculate the second derivative and we have the Newton system:

\[\begin{pmatrix} \nabla^2 f(x_k) & A^\top \\ A & 0 \end{pmatrix} \begin{pmatrix} \Delta x_k \\ \Delta \nu_k \end{pmatrix} = -\begin{pmatrix} \nabla f(x_k) + A^\top \nu_k \\ Ax_k - b \end{pmatrix} = -\begin{pmatrix} \nabla f(x_k) + A^\top \nu_k \\ 0 \end{pmatrix}\]

and the update is

\[x_{k+1} = x_k + \Delta x_k, \quad \nu_{k+1} = \nu_k + \Delta \nu_k.\]

There is an important distinction in the actual algorithm depending on whether the starting point is feasible ($Ax = b$) or not. If the start is not feasible we are essentially looking at a primal-dual optimization algorithm and we look at minimizing the residual vector. This has implications for the line search, convergence criteria and so on. The second equality in the Newton system would not be true in that case. However, it is often pretty simple to provide a feasible starting point. In that case, we can actually do some simple algebra:

\[\begin{aligned} \nabla^2 f(x_k) \Delta x_k + A^\top \Delta \nu_k &= - \nabla f(x_k) - A^\top \nu_k \\ \nabla^2 f(x_k) \Delta x_k &= - \nabla f(x_k) - A^\top (\nu_k + \Delta \nu_k) \end{aligned}\]

so let $\nu = (\nu_k + \Delta \nu_k)$ to reach the equivalent simpler system:

\[\begin{pmatrix} \nabla^2 f(x_k) & A^\top \\ A & 0 \end{pmatrix} \begin{pmatrix} \Delta x_k \\ \nu \end{pmatrix} = -\begin{pmatrix} \nabla f(x_k) \\ 0 \end{pmatrix}.\]

In the latter we actually do not care much about $\nu$ until convergence, since it does not play any role in determining the updates for $x$.
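
As a sketch, one Newton step from a feasible point amounts to a single linear solve of this simpler system. The function name and the small quadratic example are assumptions of the illustration. For a quadratic objective one step already lands on the constrained optimum.

```python
import numpy as np

def newton_step_equality(hess, grad, A):
    """Solve [[H, A^T], [A, 0]] [dx; nu] = -[grad; 0] at a feasible point."""
    n, m = hess.shape[0], A.shape[0]
    kkt = np.block([[hess, A.T], [A, np.zeros((m, m))]])
    rhs = -np.concatenate([grad, np.zeros(m)])
    sol = np.linalg.solve(kkt, rhs)
    return sol[:n], sol[n:]                         # step dx and multiplier nu

Q = np.diag([1.0, 2.0, 3.0, 4.0, 5.0])
q = np.array([1.0, -1.0, 0.5, 0.0, 2.0])
A = np.array([[1.0, 1.0, 1.0, 1.0, 1.0], [1.0, -1.0, 0.0, 0.0, 0.0]])
b = np.array([1.0, 0.0])

x0 = A.T @ np.linalg.solve(A @ A.T, b)              # feasible start
dx, nu = newton_step_equality(Q, Q @ x0 + q, A)
x1 = x0 + dx
print(np.linalg.norm(A @ x1 - b))                   # feasibility is preserved
print(np.linalg.norm(Q @ x1 + q + A.T @ nu))        # stationarity of the Lagrangian
```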

In the above example, a very naive implementation of Newton’s method takes only 0.068 seconds.

References

These notes are a personal synthesis and extensions based on lectures from Mastermath (NL) Continuous Optimization Course 2025 along with standard references (below). Any errors are my own.

Boyd, S., & Vandenberghe, L.
Convex Optimization

Footnotes

1. Why convexity requires second derivative PSD?

Pendulum, equations of motion from first principles and solutions using ML.

2024-07-12T00:00:00+00:00

References

These notes are based on:

[1] Harvard notes on Lagrangian method

[2] Classical Mechanics: The Theoretical Minimum

[3] Deep learning for universal linear embeddings of nonlinear dynamics

Introduction

Lately I have been reading “Classical Mechanics: The theoretical minimum”. Since I mostly deal with purely data-driven approaches, I wanted to dive a bit into the first-principle derivation of predictive models.

Here I look into the simplest pendulum, a mass whose motion is restricted to a unit circle.

First I derive a differential equation in which we can study the motion of the pendulum. Then I solve this differential equation in a data-driven way (i.e. construct a function of $t$ that returns the angle and velocity of the pendulum given some initial conditions) using a linear model on a coordinate system generated by an autoencoder.

Derivation of the potential energy from first principles

Potential energy is just the amount of work that is potentially stored. In this case it is clear that when the red ball is at the top of the circle (0,1) the potential energy is maximum. This is because we are assuming that the only acting force here is gravity, so the higher up, the more potential energy. Then, we can be sure that the total potential energy is proportional to the y coordinate.

\[V(\theta) = mg(1-\cos(\theta))\]

For $\theta$ being the angle between the black and red line (see above .gif). On $(0, 1)$ the potential energy is at its maximum since $\cos(\pi) = -1.$ $mg$ is the constant to which the potential energy is proportional. Intuitively, this should combine how strong the downward pull is (gravity, on Earth $9.8\, m/s^2$) and the mass of the ball (assumed to be 1).

So far I have only used the fact that, the higher up, the more potential energy.

Derivation of the kinetic energy from first principles

Assuming that the acceleration is constant (which makes sense because the force applied is constant), kinetic energy is defined as $T(\theta) = \frac{1}{2}mv^2$. This comes from the sum of the work applied over distance, distance being $d = \frac{1}{2}vt$ (assuming speed 0 at time 0, think of the area of a triangle) and work per unit of time being the increase of momentum, $F = ma = m\frac{v}{t}$ because $a$ is constant, so $W = Fd = mad = \frac{1}{2}mv^2.$

Since the motion is restricted to a unit circle, the speed along the arc equals the change in the angle per unit of time, so $v = \dot{\theta}.$ Then the kinetic energy is $T(\dot{\theta}) = \frac{1}{2}m\dot{\theta}^2.$ Going down is indeed negative velocity (the angle is getting smaller).

Derivation of the equation of motion

Now, since the total energy is constant, I could use that to try to get the equation of motion. However, using the principle of stationary action is cooler. The action is defined as the time integral, along a candidate path, of the difference between kinetic and potential energy (why this works is a matter for further study…).

\[S = \int_{t_1}^{t_2} (T(\dot{\theta}) - V(\theta)) dt = \int_{t_1}^{t_2} L(\theta, \dot{\theta}) dt\]

And we want stationary trajectories, such that the first derivative of S w.r.t the trajectory is 0. This yields¹:

\[\frac{d}{dt}\frac{\partial L}{\partial \dot{\theta}} - \frac{\partial L}{\partial \theta} = 0\]

So, let’s plug in and see what pops out.

\[\frac{\partial L}{\partial \theta} = \frac{\partial}{\partial \theta}(T(\dot{\theta}) - V(\theta)) = \frac{\partial}{\partial \theta}(- V(\theta)) = -mg\sin(\theta)\]

Because $T$ does not depend on $\theta$. Now, the other term:

\[\frac{\partial L}{\partial \dot{\theta}} = \frac{\partial}{\partial \dot{\theta}}(T(\dot{\theta}) - V(\theta)) = \frac{\partial}{\partial \dot{\theta}}(\frac{1}{2}m\dot{\theta}^2) = m\dot{\theta}\]

Since potential energy does not depend on velocity. Taking the derivative w.r.t time of the previous equation we get:

\[\frac{d}{dt}\frac{\partial L}{\partial \dot{\theta}} = m\ddot{\theta}\]

Plugging this in the Euler-Lagrange equation we get:

\[m\ddot{\theta} = - mg\sin(\theta)\]

Rewriting this equation we get:

\[\ddot{\theta} = - g\sin(\theta)\]

This has no closed-form solution in terms of elementary functions (some work can be done but is beyond me right now…), but it is possible to “solve” it numerically.

Numerical solution, simulation.

The first thing to do is to reframe this second-order differential equation as a system of two first-order differential equations. That is, with $z_1 = \theta$ and $z_2 = \dot{\theta}$ we have $\dot{z_1} = \dot{\theta}$ and $\dot{z_2} = \ddot{\theta}.$ Now, for an initial condition of position (angle) and velocity:

\[\mathbf{f}(z_1, z_2) = (z_2, - g\sin(z_1))\]

Note that the function returns the change in position and the change in velocity. With that, even a simple numerical ODE solver such as Euler’s method should do. I use the default settings of ‘solve_ivp’ from SciPy (an explicit Runge-Kutta method, RK45). Then, for a given initial condition (in this case $0.85 \pi$ initial position and $0$ initial velocity), we can visualize the solution:
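As a rough sketch of this simulation step (not the notebook's actual code), the system can be integrated with `solve_ivp` as follows. The names `pendulum_rhs` and `energy` are made up for illustration, unit mass and unit length are assumed, and the tolerances are tightened relative to the defaults so that energy conservation is easy to check.

```python
import numpy as np
from scipy.integrate import solve_ivp

G = 9.8  # gravitational acceleration; unit mass and unit length are assumed


def pendulum_rhs(t, z):
    """Right-hand side f(z1, z2) = (z2, -g sin(z1)) of the first-order system."""
    theta, omega = z
    return [omega, -G * np.sin(theta)]


def energy(theta, omega):
    """Total energy T + V with m = 1, using V = g (1 - cos(theta))."""
    return 0.5 * omega**2 + G * (1.0 - np.cos(theta))


z0 = [0.85 * np.pi, 0.0]               # initial angle and initial velocity
t_eval = np.linspace(0.0, 10.0, 1001)  # times at which the solution is stored

# Tolerances are an assumption of this sketch (tighter than the SciPy defaults)
sol = solve_ivp(pendulum_rhs, (0.0, 10.0), z0, t_eval=t_eval, rtol=1e-9, atol=1e-9)

theta, omega = sol.y
energy_drift = np.max(np.abs(energy(theta, omega) - energy(*z0)))
print(f"max energy drift: {energy_drift:.2e}")
```

Without friction the total energy should stay (numerically) constant, which is a cheap sanity check on both the derivation and the solver settings.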

It looks a bit weird, and that is because it does not have any friction: it conserves the initial energy forever. We can visualize the evolution of the angle in time and we will see that it is indeed a perfectly periodic wave (at such a large amplitude it is not exactly sinusoidal). With friction we expect a damped oscillator.

Adding friction

Friction can be added to the equation of motion as a term that is proportional to the velocity. This is, the equation of motion becomes:

\[\ddot{\theta} = - g\sin(\theta) - c \dot{\theta}\]

This makes the animations look more realistic; here I use a damping factor of $c = 0.1$.
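A minimal sketch of the damped system, under the same assumptions as before (unit mass and length, hypothetical function names, tolerances chosen for this example). Along a solution, $\frac{dE}{dt} = -c\,\dot{\theta}^2 \leq 0$, so the total energy should never increase; the snippet checks exactly that.

```python
import numpy as np
from scipy.integrate import solve_ivp

G = 9.8  # gravitational acceleration (unit mass and unit length assumed)
C = 0.1  # damping factor quoted in the text


def damped_rhs(t, z):
    """Right-hand side (z2, -g sin(z1) - c z2) of the damped pendulum."""
    theta, omega = z
    return [omega, -G * np.sin(theta) - C * omega]


def energy(theta, omega):
    """Total energy T + V with m = 1."""
    return 0.5 * omega**2 + G * (1.0 - np.cos(theta))


t_eval = np.linspace(0.0, 30.0, 3001)
sol = solve_ivp(damped_rhs, (0.0, 30.0), [0.85 * np.pi, 0.0],
                t_eval=t_eval, rtol=1e-8, atol=1e-8)

e = energy(*sol.y)  # energy along the trajectory
print(f"initial energy: {e[0]:.3f}, final energy: {e[-1]:.3f}")
```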

Solving the pendulum equations.

As I mentioned above, here I am going to take a data-driven approach. Nevertheless, using a naive neural network, even with a PINN-like loss function (using the second derivative as an extra penalty), does not work. It trivially interpolates the seen data, but it is not able to generalize to unseen $t$, which is also sort of expected. There is an interesting paper about this: Neural Networks Fail to Learn Periodic Functions and How to Fix It; nevertheless, their proposed solution also did not work for me. As an example:

The title of each plot is the initial condition, blue is the simulated solution and orange is the neural network solution. It is clear that, beyond the training data, the neural network fails.

So, naive approaches will not work, and probably there is a lesson to be learned here about the extrapolating capabilities of neural networks. Anyway, now I dive into how to address this using a Koopman operator-like approach. Essentially this boils down to constructing a basis where the system advances linearly. In this case, I use a simple autoencoder to approximate this basis. In what follows I essentially replicate the paper by Lusch et al. (reference 3).

As explained in the paper, the pendulum has a continuous spectrum, meaning that it has infinitely many possible frequencies. This implies that a simple linear model (which essentially describes a limited number of frequencies and/or decay/growth rates) will not be able to capture the full range of behaviors of the system. To address this, more flexibility is needed. What they propose is to use an auxiliary network to parameterize the eigenvalues of this linear model in a continuous fashion. In this way, we do not need an infinite-dimensional operator to approximate infinitely many frequencies.

Long story short, construct some latent space with an autoencoder such that we can find the next time step using a linear model in this latent space.

Complex numbers as rotation matrices, using Jordan canonical form to parameterize the eigenvalues.

A bit of background is needed to understand the next approach. In the paper they parameterize the eigenvalues directly, and, since we do not want to have to deal with complex types inside our neural network, a bit of math is needed. Complex eigenvalue pairs can be represented in a real block-diagonal matrix using the Jordan canonical form. This is because multiplying two complex numbers is the same as multiplying a vector by a 2x2 matrix:

\[(a + bi)(c + di) = (ac - bd) + (ad + bc)i\]

and

\[\begin{bmatrix} a & -b \\ b & a \end{bmatrix} \begin{bmatrix} c \\ d \end{bmatrix} = \begin{bmatrix} ac - bd \\ ad + bc \end{bmatrix}.\]

To specify a particular angle, we can use the polar coordinate form to see that (using the angle sum and difference identities):

\[(a + bi)(c + di) = r(\cos(\theta) + i\sin(\theta)) \cdot z(\cos(\phi) + i\sin(\phi)) = r z (\cos(\theta + \phi) + i\sin(\theta + \phi))\]

For a magnitude of 1, this is equivalent to a rotation matrix (determinant 1) in 2D:

\[\begin{bmatrix} \cos(\theta) & -\sin(\theta) \\ \sin(\theta) & \cos(\theta) \end{bmatrix} \cdot \begin{bmatrix} \cos(\phi) \\ \sin(\phi) \end{bmatrix} = \begin{bmatrix} \cos(\theta + \phi) \\ \sin(\theta + \phi) \end{bmatrix}.\]
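The two identities above are easy to verify numerically. The following is a small sketch with arbitrarily chosen numbers; `as_matrix` and `rotation` are illustrative helper names, not functions from the notebook.

```python
import numpy as np


def as_matrix(z):
    """Real 2x2 matrix [[a, -b], [b, a]] that acts like multiplication by z = a + bi."""
    return np.array([[z.real, -z.imag], [z.imag, z.real]])


def rotation(angle):
    """2D rotation matrix, i.e. the matrix form of a unit-magnitude complex number."""
    return np.array([[np.cos(angle), -np.sin(angle)],
                     [np.sin(angle), np.cos(angle)]])


# Complex product versus matrix-vector product
z1, z2 = 1.5 - 0.5j, -2.0 + 3.0j
product = z1 * z2
via_matrix = as_matrix(z1) @ np.array([z2.real, z2.imag])
print(np.allclose(via_matrix, [product.real, product.imag]))

# Rotating a unit vector at angle phi by theta gives a unit vector at theta + phi
theta, phi = 0.7, 1.1
rotated = rotation(theta) @ np.array([np.cos(phi), np.sin(phi)])
print(np.allclose(rotated, [np.cos(theta + phi), np.sin(theta + phi)]))
```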

This will be the kind of linear operator (“kind of” because $\theta$ is parameterized by a non-linear function) that will advance the system in time. Here the expressive power of a neural net is constrained to fit a very particular form. Additionally, a magnitude is needed to describe damped oscillators; this will also be given by the auxiliary network, essentially parameterizing a complex number in polar form.

Finally, we need to add the time displacement, so the Koopman operator (for a latent space of 2 dimensions) will look like:

\[\mathbf{K}(\theta, r, \Delta t) = r\begin{bmatrix} \cos(\theta \cdot \Delta t) & -\sin(\theta \cdot \Delta t) \\ \sin(\theta \cdot \Delta t) & \cos(\theta \cdot \Delta t) \end{bmatrix},\]

Where $(\theta, r) = f_{\text{aux}}(z(t))$ and $z(t)$ is the latent space representation of the system at time $t$. Note that the latent space can be of any dimension, but care must be taken with the fact that rotation matrices work in 2D.
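To make the formula concrete, here is a minimal NumPy sketch of $\mathbf{K}$ for a 2D latent space. In the actual model $\theta$ and $r$ come from the auxiliary network; here they are fixed numbers picked for illustration, and `koopman_operator` is a hypothetical name.

```python
import numpy as np


def koopman_operator(theta, r, dt):
    """K(theta, r, dt) = r * R(theta * dt): a rotation by theta * dt scaled by r."""
    angle = theta * dt
    return r * np.array([[np.cos(angle), -np.sin(angle)],
                         [np.sin(angle), np.cos(angle)]])


z = np.array([1.0, 0.0])  # a latent state (made-up values)
K = koopman_operator(theta=np.pi / 2, r=0.9, dt=1.0)
z_next = K @ z            # quarter turn, shrunk by the magnitude r

print(z_next)
print(np.abs(np.linalg.eigvals(K)))  # both eigenvalues have modulus r
```

The eigenvalues of this block are $r e^{\pm i \theta \Delta t}$, which is the complex pair in polar form that the auxiliary network is parameterizing.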

Network architecture, objective functions and training.

I simplify the objective function proposed by the paper by just taking an autoencoder reconstruction loss and a linear dynamics loss. The autoencoder reconstruction loss is simply the MSE of the reconstructed input without advancing it in time:

\[\mathcal{L}_{\text{AE}} = \frac{\sum_i ||x_t^i - f_{\text{dec}}(f_{\text{enc}}(x_t^i))||^2}{N}.\]

The linear dynamic loss:

\[\mathcal{L}_{\text{dyn}} = \frac{\sum_i ||x_{\Delta t}^i - f_{\text{dec}}\bigl(\mathbf{K}(f_{\text{aux}}(f_{\text{enc}}(x_0^i)), \Delta t) \cdot f_{\text{enc}}(x_0^i)\bigr)||^2}{N}.\]

Here $\mathbf{K}$ is the matrix that advances $x_0$ (a vector of angle and velocity of the pendulum at time 0) by $\Delta t$ in time. The training data, therefore, consists of tuples $(t, \theta_0, \dot{\theta}_0, \theta_t, \dot{\theta}_t)$, with $t$ playing the role of $\Delta t$. The network takes initial conditions and returns the position and velocity at time $t$ and looks like:

Koopman_autoencoder(
  (encoder): encoder(
    (encoder): Sequential(
      (0): Linear(in_features=2, out_features=64, bias=True)
      (1): GELU(approximate='none')
      (2): Linear(in_features=64, out_features=64, bias=True)
      (3): GELU(approximate='none')
      (4): Linear(in_features=64, out_features=64, bias=True)
      (5): GELU(approximate='none')
      (6): Linear(in_features=64, out_features=2, bias=True)
    )
  )
  (decoder): decoder(
    (decoder): Sequential(
      (0): Linear(in_features=2, out_features=64, bias=True)
      (1): GELU(approximate='none')
      (2): Linear(in_features=64, out_features=64, bias=True)
      (3): GELU(approximate='none')
      (4): Linear(in_features=64, out_features=64, bias=True)
      (5): GELU(approximate='none')
      (6): Linear(in_features=64, out_features=2, bias=True)
    )
  )
  (K): K(
    (auxiliary_network): Sequential(
      (0): Linear(in_features=2, out_features=64, bias=True)
      (1): GELU(approximate='none')
      (2): Linear(in_features=64, out_features=64, bias=True)
      (3): GELU(approximate='none')
      (4): Linear(in_features=64, out_features=2, bias=True)
    )
  )
)

After training I get quite decent extrapolation performance (definitely superior to the naive approach), although it is not perfect:

Checking the latent space

Probably the most interesting plot here is the next one:

In phase space, the system describes an ellipse. This can be appreciated also above in the vector field plot. To be able to advance the system using a linear model, the coordinate system has to be transformed such that trajectories can be described by a circle (spiral circle in this case because of the friction). Going from an ellipse to a circle seems simple enough, particularly since the ellipse presents also this cycle behavior. It would be more interesting to check if this kind of approach could work with a system whose phase space is very far from cyclic.

In any case, it is quite satisfying to see that the autoencoder learns exactly the mapping that one would expect, very cool paper!

The code for making these figures and so on is available here.

See the first reference for the proof. ↩

Optimal Separating Hyperplanes

2024-01-16T00:00:00+00:00

References:

The following notes are mostly based on the following sources:

[1] Elements of Statistical Learning

[2] Numerical optimization, Nocedal

[3] Khan Academy (Lagrange multipliers, by 3blue1brown)

[4] Lagrangian Duality for Dummies

Introduction

Now that everyone is looking into Large Language Models (me included) I wanted to get back to the basics, to the more mathematically precise. For a while, I wanted to implement Support Vector Machines (SVMs) from scratch and this podcast motivated me to do it. Eventually, I want to get into the invariances that Vladimir Vapnik talks about, which, as I see it, quite resemble the ideas in geometric deep learning.

In any case, SVMs are not trivial: when I went through them in class (just one lecture!) I did not get anything, and Elements of Statistical Learning is not very explicit. So, in this post I start with the optimal separating hyperplanes problem, focusing on the geometry. I leave SVMs for a second post.

So, we are looking for a hyperplane that optimally separates data into two classes. We aim to find the hyperplane that maximizes the margin between the two classes. This works only when the data are linearly separable. However, we can always expand the basis where our data lives to be able to separate it, at the cost of possibly overfitting.

Geometrical details.

First, let’s define the separating hyperplane:

\[\mathbf{\beta}^T\mathbf{x} + \beta_0 = 0\]

We are interested in the set: $H = \{\mathbf{x} \in \mathbb{R}^n : \mathbf{\beta}^T\mathbf{x} + \beta_0 = 0\}$. Note that $\mathbf{\beta}$ is a vector of n dimensions and $\beta_0$ a scalar. These points are equidistant from the two hyperplanes $S1 = \{\mathbf{x} \in \mathbb{R}^n : \mathbf{\beta}^T\mathbf{x} + \beta_0 = 1\}$ and $S2 = \{\mathbf{x} \in \mathbb{R}^n : \mathbf{\beta}^T\mathbf{x} + \beta_0 = -1\}$. Those hyperplanes will pass through the support vectors (more on this later).

This is very similar to logistic regression. In logistic regression, we construct a function $\mathbb{R}^n \rightarrow [0, 1]$ and then we classify points based on whether they are above or below some probability threshold (e.g. 0.5). Indeed, by fixing a threshold we define a hyperplane. In SVMs, we construct a function $\mathbb{R}^n \rightarrow \mathbb{R}$ and then we classify points based on whether they are above 1 or below -1. However, we expect SVMs to generalize better since the margin in this classification is maximized.

To plot this, say for $\mathbf{x} \in \mathbb{R}^2$, $\mathbf{\beta} = [\beta_1, \beta_2] = [1, 2]$ and $\beta_0 = 1$, we just need to realize that $\mathbf{\beta}^T\mathbf{x} + \beta_0 = 0$ is the same as $\beta_1 x_1 + \beta_2 x_2 + \beta_0 = 0$. So given $x_1$ we can find $x_2$ and vice versa:

\[x_2 = \frac{-\beta_0 - \beta_1 x_1}{\beta_2}\]

The same idea applies to higher dimensions.
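As a quick sketch of this recipe, the snippet below computes points on the hyperplane for the example values given above. The helper `x2_on_hyperplane` and its `level` argument (which also allows drawing the hyperplanes at levels 1 and -1) are illustrative additions.

```python
import numpy as np

beta = np.array([1.0, 2.0])  # example values from the text
beta_0 = 1.0


def x2_on_hyperplane(x1, beta, beta_0, level=0.0):
    """Solve beta[0] * x1 + beta[1] * x2 + beta_0 = level for x2."""
    return (level - beta_0 - beta[0] * x1) / beta[1]


x1 = np.linspace(-3.0, 3.0, 7)
points = np.column_stack([x1, x2_on_hyperplane(x1, beta, beta_0)])

# Every generated point should satisfy the hyperplane equation
print(np.allclose(points @ beta + beta_0, 0.0))
```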

Now, we want to find $\mathbf{\beta}$ and $\beta_0$ that maximize the margin between the two hyperplanes so first we need to define this distance. To do so we use the normal (perpendicular vector, often of unit length) to the hyperplanes.

To find the normal to a hyperplane we first define a vector parallel to the hyperplane: any vector that goes from one point in the hyperplane to another. For two points $\mathbf{x}_1, \mathbf{x}_2 \in H$ this is $\mathbf{v} = \mathbf{x}_1 - \mathbf{x}_2$. What is important, however, is the direction, which is given by $\mathbf{v}$. The normal $\mathbf{w}$ to the hyperplane must be such that $\mathbf{w}^T(\mathbf{x}_1 - \mathbf{x}_2)=0 \quad \forall \quad \mathbf{x}_1, \mathbf{x}_2 \in H$. This $\mathbf{w}$ is given by $\frac{\mathbf{\beta}}{\lVert\beta\rVert}$ (dividing by the norm to get unit length): by the definition of $H$, $\mathbf{\beta}^T\mathbf{x}_1 + \beta_0 = \mathbf{\beta}^T\mathbf{x}_2 + \beta_0 = 0$, so $\mathbf{\beta}^T(\mathbf{x}_1 - \mathbf{x}_2) = 0$.

To find the distance between two hyperplanes, S1 and S2, we focus on a given point $\mathbf{x}_1 \in S1$. Let’s find the point in S2 in the direction of the perpendicular from $\mathbf{x}_1$. The perpendicular line that passes through $\mathbf{x}_1$ is given by $\mathbf{x}_1 + m\frac{\mathbf{\beta}}{\lVert\beta\rVert}$. For $m$ being… the margin!

Again, we divide $\mathbf{\beta}$ by its norm to get a unit vector there. Clearly, for $m = 0$, the line passes through $\mathbf{x}_1$. Which is the corresponding point in S2?

We know that $\mathbf{\beta}^T(x_1 + m\frac{\mathbf{\beta}}{\lVert\beta\rVert}) + \beta_0 = -1$ for some $m$, which is precisely the number we want to find. So, with some algebra:

\[\mathbf{\beta}^T x_1 + m\frac{\lVert\beta\rVert^2}{\lVert\beta\rVert} + \beta_0 = -1\] \[\mathbf{\beta}^T x_1 + m\lVert\beta\rVert + \beta_0 = -1\] \[m = \frac{-\mathbf{\beta}^T x_1 - \beta_0 - 1}{\lVert\beta\rVert}\]

And from the definition of S1, we know $\beta^T x_1 = 1 - \beta_0$,

\[m = \frac{-1 + \beta_0 - \beta_0 - 1}{\lVert\beta\rVert} = \frac{-2}{\lVert\beta\rVert}, \qquad |m| = \frac{2}{\lVert\beta\rVert} \propto \frac{1}{\lVert\beta\rVert}\]

And that is the margin: it is inversely proportional to the norm of $\beta$. The sign only reflects the direction of travel (from S1 towards S2, against $\beta$), so we only care about its absolute value. It is relevant to notice that this function will be maximized at the same point as $\frac{1}{\lVert\beta\rVert^2}$, that is, where $\frac{1}{2}\lVert\beta\rVert^2$ is minimized; this is important because it will make the optimization problem easier.

To me, that result was not intuitive, let’s see this in action:

Yep, it seems to work… Intuitively, given a direction, we can think of the increase/decrease in $\mathbf{\lVert x \rVert}$ that is required to move from one hyperplane to the other (e.g. from S1 to S2). The bigger the sensitivity to $\mathbf{x}$ the less we need to move to get a change of hyperplane. This is why minimizing the norm of $\mathbf{\beta}$ is equivalent to maximizing the margin.
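The same check can be done numerically. This is a small sketch using the example values $\mathbf{\beta} = [1, 2]$ and $\beta_0 = 1$ from earlier; the point `x_s1` is just one convenient point on S1.

```python
import numpy as np

beta = np.array([1.0, 2.0])
beta_0 = 1.0
norm = np.linalg.norm(beta)
unit_normal = beta / norm

# A point on S1, i.e. beta^T x + beta_0 = 1 (here x1 = 0 and we solve for x2)
x_s1 = np.array([0.0, (1.0 - beta_0) / beta[1]])

# Signed step along the unit normal that lands on S2 (beta^T x + beta_0 = -1)
m = (-beta @ x_s1 - beta_0 - 1.0) / norm
x_s2 = x_s1 + m * unit_normal

print(beta @ x_s2 + beta_0)                   # value of the hyperplane function at x_s2
print(np.linalg.norm(x_s2 - x_s1), 2 / norm)  # distance travelled versus 2 / ||beta||
```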

Optimal Separating Hyperplane problem formulation.

Now we have the geometrical background to frame the problem:

We want the hyperplane with the largest margin, and we can get it by minimizing $\frac{1}{2}\lVert\beta\rVert^2$. However, we need to make sure that this hyperplane separates the data correctly. We can do this by subjecting the minimization to the constraint $y_i(\mathbf{\beta}^T\mathbf{x}_i + \beta_0) \geq 1 \quad \forall \quad i = 1, \dots, n$, where $y_i$ is the class of $\mathbf{x}_i$, which is either 1 or -1. This constraint is equivalent to saying that the point is on the right side of the hyperplane and with enough margin. Negative values imply wrong classification; values between 0 and 1 imply that the point is between the two hyperplanes, which is not what we want. Finally, the problem is:

\[\min_{\mathbf{\beta}, \beta_0} \frac{1}{2}\lVert\beta\rVert^2\] \[\text{s.t.} \quad y_i(\mathbf{\beta}^T\mathbf{x}_i + \beta_0) \geq 1 \quad \forall \quad i = 1, \dots, n\]

This is the same formulation given in Elements of Statistical Learning (they arrive here in a very confusing way imho). The solution to this problem is far from trivial and requires going rapidly over some optimization theory.

Optimization process.

Constrained optimization (Lagrangian).

The idea of the Lagrangian is based on the fact that the gradient of the function we are optimizing and the gradient of the constraint are proportional at the optimum. The proportionality constant is called the Lagrange multiplier. The Lagrangian is just the way of packing up that information in a way that, when optimizing the Lagrangian w.r.t the original variables and the Lagrange multiplier, we are just finding the proportionality constant and satisfying the constraint. Nevertheless, for inequality constraints, as is the case here, the Lagrangian is not (directly) enough. We need to introduce the idea of the KKT conditions.

KKT conditions.

This is a generalization of the method of Lagrange multipliers (Lagrangian). The Karush-Kuhn-Tucker (KKT) conditions are just first-order necessary conditions for a constrained problem, they follow relatively intuitively. They tell you that the Lagrangian must be at a stationary point and the way this has to happen. Equality constraints must be satisfied, inequality constraints that are not active must have a 0 Lagrange multiplier and therefore the “second” part of the Lagrangian is going to add to 0 (either a restriction is active or the Lagrange multiplier is 0). The KKT conditions come as follows:

\[\nabla f(x^*) + \sum_{i \in A} \lambda_i^* \nabla c_i(x^*) = 0\]

This just tells us that the gradient of the Lagrangian must be equal to 0, which means that the gradient of the objective function is a linear combination of the gradients of the active constraints (proportional, kind of as before). And now, how this must happen:

\[c_i(x^*) = 0, i \in E\]

For E the set of equality constraints.

\[c_i(x^*) \leq 0, i \in I\]

For I the set of inequality constraints.

\[\lambda_i^* \geq 0, i \in I\] \[\lambda_i^* c_i(x^*) = 0, i \in I \cup E\]

Satisfying this is enough to have a first-order optimality condition, which should be enough for convex problems. Importantly, only the active constraints (those satisfied with equality) can have a nonzero Lagrange multiplier. In the case of optimal separating hyperplanes, those will be the points corresponding to the support vectors. As we will see later, $\beta$ is entirely defined as a weighted combination of the support vectors.
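A toy problem helps to see the conditions at work. The sketch below uses a made-up one-dimensional example (not from the references): minimize $f(x) = x^2$ subject to $c(x) = 1 - x \leq 0$. The solution is $x^* = 1$ with multiplier $\lambda^* = 2$, and `kkt_check` is a hypothetical helper that evaluates each condition under the sign convention used above.

```python
def kkt_check(x, lam, tol=1e-9):
    """Evaluate the KKT conditions for min x^2 subject to 1 - x <= 0."""
    grad_f = 2.0 * x  # gradient of the objective f(x) = x^2
    c = 1.0 - x       # inequality constraint c(x) <= 0, i.e. x >= 1
    grad_c = -1.0     # gradient of the constraint
    return {
        "stationarity": abs(grad_f + lam * grad_c) <= tol,
        "primal_feasibility": c <= tol,
        "dual_feasibility": lam >= -tol,
        "complementary_slackness": abs(lam * c) <= tol,
    }


print(kkt_check(1.0, 2.0))  # the constrained optimum: every condition holds
print(kkt_check(2.0, 4.0))  # stationary and feasible, but lam * c != 0
print(kkt_check(0.0, 0.0))  # unconstrained optimum, which violates x >= 1
```

The second candidate shows why complementary slackness is needed: the constraint is inactive at $x = 2$, so its multiplier would have to be 0, and then stationarity fails.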

Now the issue is, how do we solve this? We have defined some optimality conditions but this is harder than unconstrained problems, what can we do about this?

In Elements of Statistical Learning, they propose to solve this by first simplifying the problem through the dual form and then using a “standard” constrained optimization algorithm. Let’s see how this dual form helps.

Dual-Primal forms.

The idea here is to approximate an infinite penalty for breaking a constraint by a finite, linear penalty. This is done by introducing a new variable $\mathbf{\alpha}$, which is the Lagrange multiplier for the inequality constraints. In essence, we would have the original problem if $\alpha = \infty$, and we have a lower bound when $\alpha < \infty$. We can then write the Lagrangian as:

\[\min_{\mathbf{\beta}, \beta_0} \max_{\mathbf{\alpha}} \frac{1}{2}\lVert\beta\rVert^2 - \sum_{i=1}^n \alpha_i (y_i[\mathbf{\beta}^T\mathbf{x}_i + \beta_0] - 1)\]

The sign before the Lagrange multipliers here comes from the fact that if we allow for values equal to or bigger than 1, then $\alpha = 0$, but we have to infinitely penalize values smaller than 1, which correspond to points that are either misclassified or between the two hyperplanes. For some intuition, note that in the latter case, $y_i[\mathbf{\beta}^T\mathbf{x}_i + \beta_0] - 1$ becomes negative, so big positive values of $\alpha$ will make the Lagrangian big.

That is hard. But we can reverse the order:

\[\max_{\mathbf{\alpha}} \min_{\mathbf{\beta}, \beta_0} \frac{1}{2}\lVert\beta\rVert^2 - \sum_{i=1}^n \alpha_i (y_i[\mathbf{\beta}^T\mathbf{x}_i + \beta_0] - 1)\]

This is what is known as the dual form, which is much more tractable. Now, this will only be the same in some cases (strong duality). Luckily, this is the case for the optimal separating hyperplane problem (and for SVMs).

We can solve the minimization bit of the problem by taking the gradient w.r.t $\mathbf{\beta}$ and $\beta_0$ and setting it to 0 (first-order optimality condition). This will (easily) give us the following:

\[\mathbf{\beta} = \sum_{i=1}^n \alpha_i y_i \mathbf{x}_i\] \[\sum_{i=1}^n \alpha_i y_i = 0\]

And then plugging this in:

\[\max_{\mathbf{\alpha}} \sum_{i=1}^n \alpha_i - \frac{1}{2} \sum_{i=1}^n \sum_{k=1}^n \alpha_i \alpha_k y_i y_k \mathbf{x}_i^T \mathbf{x}_k\] \[\text{s.t.} \quad \sum_{i=1}^n \alpha_i y_i = 0 \text{ and } \alpha_i \geq 0 \quad \forall \quad i = 1, \dots, n\]

(Constraints to satisfy KKT conditions)

Which is easy to solve.
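For completeness, the substitution step can be written out. Expanding the Lagrangian and using the two stationarity conditions, $\mathbf{\beta} = \sum_i \alpha_i y_i \mathbf{x}_i$ and $\sum_i \alpha_i y_i = 0$, gives (a sketch of the algebra, using the same symbols as above):

```latex
\begin{aligned}
L(\beta, \beta_0, \alpha)
  &= \frac{1}{2}\lVert\beta\rVert^2 - \sum_{i=1}^n \alpha_i \left(y_i[\beta^T\mathbf{x}_i + \beta_0] - 1\right) \\
  &= \frac{1}{2}\beta^T\beta - \beta^T \sum_{i=1}^n \alpha_i y_i \mathbf{x}_i - \beta_0 \sum_{i=1}^n \alpha_i y_i + \sum_{i=1}^n \alpha_i \\
  &= \frac{1}{2}\beta^T\beta - \beta^T\beta - 0 + \sum_{i=1}^n \alpha_i \\
  &= \sum_{i=1}^n \alpha_i - \frac{1}{2}\sum_{i=1}^n\sum_{k=1}^n \alpha_i \alpha_k y_i y_k \mathbf{x}_i^T\mathbf{x}_k
\end{aligned}
```

The data only enters through the inner products $\mathbf{x}_i^T\mathbf{x}_k$, which is what makes the kernel trick possible later on.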

The derivatives of that function, let’s call it $L_D$, are easy to compute if we reframe the above problem in terms of outer products. For our current notation, this boils down to:

\[\frac{\partial L_D}{\partial \alpha_j} = 1-\sum_{i=1}^n\alpha_iy_jy_ix^T_jx_i\]

In code this would be:

import numpy as np

def Ld(alpha, X, y):
    """
    Lagrangian dual function
    """
    alpha_outer = np.outer(alpha, alpha)  # alpha_i * alpha_k
    y_outer = np.outer(y, y)              # y_i * y_k
    X_outer = np.dot(X, X.T)              # x_i^T x_k. Dual, kernel!

    my_sol = np.sum(alpha) - 0.5 * np.sum(alpha_outer * y_outer * X_outer)

    return -1 * my_sol   # -1 Because the opt. program works with minimization.

and the gradient:

def dLd(alpha, X, y):
    """
    Derivative of the Lagrangian dual function
    """
    y_outer = np.outer(y, y)
    X_outer = np.dot(X, X.T)
    # Entry j is 1 - sum_i alpha_i * y_j * y_i * x_j^T x_i
    my_grad = np.ones(alpha.shape) - np.sum(alpha[np.newaxis] * y_outer * X_outer, axis=1)

    return -1 * my_grad # Again the minimization issue...

Which is simpler! Now we can solve this by using an optimizer that can handle inequality constraints. (How these work is actually quite interesting and complicated but it gets quite out of scope right now…)
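Before handing the gradient to an optimizer, it is cheap to check it against finite differences. The sketch below restates the two functions compactly so that it runs on its own and compares them on random data (the data, the step size `eps` and the seed are arbitrary choices for this example).

```python
import numpy as np


def Ld(alpha, X, y):
    """Negative dual objective (negated because optimizers minimize)."""
    gram = X @ X.T
    return -(np.sum(alpha) - 0.5 * np.sum(np.outer(alpha, alpha) * np.outer(y, y) * gram))


def dLd(alpha, X, y):
    """Analytical gradient of Ld with respect to alpha."""
    gram = X @ X.T
    return -(np.ones_like(alpha) - np.sum(alpha[np.newaxis] * np.outer(y, y) * gram, axis=1))


rng = np.random.default_rng(0)
X = rng.normal(size=(6, 2))
y = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])
alpha = rng.uniform(size=6)

# Central finite differences, one coordinate of alpha at a time
eps = 1e-6
numeric = np.array([
    (Ld(alpha + eps * e, X, y) - Ld(alpha - eps * e, X, y)) / (2 * eps)
    for e in np.eye(len(alpha))
])
max_abs_diff = np.max(np.abs(numeric - dLd(alpha, X, y)))
print(f"max |numeric - analytic| = {max_abs_diff:.2e}")
```

Since the dual objective is quadratic in $\alpha$, central differences are exact up to rounding, so the reported difference should be tiny.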

Solving the problem.

from scipy.optimize import LinearConstraint, minimize

def opt_sep_hyperplane(X, y):
    """
    Solve the dual problem and recover the hyperplane (w, b)
    """
    # Define the constraints
    # 1. Alphas bigger or equal than 0 (bounds)
    my_bounds = [(0, np.inf)] * len(y)
    # 2. Sum of alphas times labels equal to 0 (linear constraint)
    # See: https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.LinearConstraint.html
    my_constraint = LinearConstraint(y, 0, 0)
    alpha0 = np.zeros(len(y))
    # Optimize
    res = minimize(Ld, alpha0, args=(X, y), jac=dLd,
                   constraints=my_constraint,
                   bounds=my_bounds,
                   options={'disp': True}, method='SLSQP')
    # Get the support vectors (the points with a nonzero multiplier)
    alphas = res.x
    idx_support_vectors = np.where(alphas > 1e-5)[0]
    w = np.sum((alphas * y).reshape(-1, 1) * X, axis=0)
    # The bias can be extracted from any support vector
    b = y[idx_support_vectors[0]] - np.dot(w, X[idx_support_vectors[0]])
    return idx_support_vectors, w, b

The definition of $\beta$ (in the code w), follows from the solution to the Lagrangian, same with the bias (b). We can see how $\beta$ is just a weighted combination of the support vectors. The bias is just the value of the hyperplane at one of the support vectors. People recommend taking the average of the bias over the support vectors, for numerical stability reasons.
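The averaged bias mentioned above could look like the following sketch. The function name `weights_and_mean_bias` is made up, and the multipliers are not the output of an optimizer: they come from a tiny example solved by hand (two support vectors at $(3, 1)$ and $(1, 1)$, for which $\alpha = 0.5$, $\mathbf{\beta} = (1, 0)$ and $\beta_0 = -2$).

```python
import numpy as np


def weights_and_mean_bias(alphas, X, y, tol=1e-5):
    """Recover w and a bias averaged over all support vectors."""
    support = np.where(alphas > tol)[0]
    w = np.sum((alphas * y).reshape(-1, 1) * X, axis=0)
    # Each support vector satisfies y_i = w^T x_i + b, so average y_i - w^T x_i
    b = np.mean(y[support] - X[support] @ w)
    return support, w, b


# Hand-solved toy problem; the third point is not a support vector
X = np.array([[3.0, 1.0], [1.0, 1.0], [5.0, 0.0]])
y = np.array([1.0, -1.0, 1.0])
alphas = np.array([0.5, 0.5, 0.0])

support, w, b = weights_and_mean_bias(alphas, X, y)
print(support, w, b)
print(y * (X @ w + b))  # all constraints y_i (w^T x_i + b) >= 1 hold
```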

Let’s try some toy data:

    import matplotlib.pyplot as plt
    import numpy as np

    np.random.seed(42)
    N = 50
    # Generate random points for two classes
    class_1_points = np.random.randn(N, 2) + np.array([2, 2])
    class_2_points = np.random.randn(N, 2) + np.array([-2, -2])

    # Combine points and assign labels
    X = np.vstack((class_1_points, class_2_points))
    y = np.hstack((np.ones(N), -np.ones(N)))

    # Plot the points with different markers for each class
    plt.scatter(X[y == 1, 0], X[y == 1, 1], label='Class 1', marker='o')
    plt.scatter(X[y == -1, 0], X[y == -1, 1], label='Class -1', marker='x')

    plt.xlabel('X-axis')
    plt.ylabel('Y-axis')
    plt.axhline(0, color='black', linewidth=0.5)
    plt.axvline(0, color='black', linewidth=0.5)
    plt.legend()
    plt.grid(color='gray', linestyle='--', linewidth=0.5)
    # Fit the hyperplane and highlight the support vectors
    idx_support_vectors, w, b = opt_sep_hyperplane(X, y)
    plt.scatter(X[idx_support_vectors, 0], X[idx_support_vectors, 1], c='r', marker='.', s=25)

    # Hyperplane is w[0] * x1 + w[1] * x2 + b = 0
    # Solve for x2 over the current x-limits of the plot
    x_min, x_max = plt.xlim()
    x_line = np.linspace(x_min, x_max, 100)
    # Separate names so that the labels in y are not overwritten
    y_line = (-w[0] * x_line - b) / w[1]
    plt.plot(x_line, y_line)
    # Hyperplanes at levels +1 and -1
    y_1 = (-w[0] * x_line - b + 1) / w[1]
    plt.plot(x_line, y_1, '--', c='g')
    y_m_1 = (-w[0] * x_line - b - 1) / w[1]
    plt.plot(x_line, y_m_1, '--', c='g')

A key point is that $\mathbf{\alpha}$ will be a sparse vector, since only a few points (the support vectors) are taken into consideration to construct the hyperplane.

Kernels.

def _RBF_kernel(X1, X2, gamma=1):
    m = X1.shape[0]

    # Compute pairwise squared Euclidean distances
    # This comes from the fact that ||x - y||^2 = ||x||^2 + ||y||^2 - 2 x^T y
    # Which is analogous to the elementary identity (a - b)^2 = a^2 + b^2 - 2ab
    dist_sq = np.sum(X1**2, axis=1).reshape((m, 1)) + np.sum(X2**2, axis=1) - 2 * np.dot(X1, X2.T)

    # Compute RBF kernel matrix
    K = np.exp(-gamma * dist_sq)

    return K

def kernel(X1, X2):
    """
    RBF kernel
    """
    return _RBF_kernel(X1, X2)

We can go one step further and enrich the basis of our data with an infinite dimensional basis, using, for example, the Radial Basis Function (RBF) (check this out). Since we construct a “similarity” matrix with the inner products of the data points, we could have any number of dimensions in those data points if we can find a way of computing the inner product (which is the definition of a Hilbert space). In the case of the RBF this is given by:

\[K(\mathbf{x}_i, \mathbf{x}_j) = \exp(-\gamma ||\mathbf{x}_i - \mathbf{x}_j||^2)\]

Where $\gamma$ is a hyperparameter. Which results in:

Some comments on how this is made are relevant, since it is not immediately obvious how to make predictions when using a kernel. It implies that we can only work with $\mathbf{X}\mathbf{X}^T$ forms. We cannot explicitly compute the weights of the hyperplane (we can, and need to, compute the bias, however). Luckily this is not a problem since, from the definition of the weights:

\[\mathbf{\beta} = \sum_{i=1}^n \alpha_i y_i \mathbf{x}_i\]

we can see that this is just a sum over the rows of the dataset weighted by the Lagrange multipliers. This means that $\mathbf{\beta}$ is just a weighted sum of the support vectors ($\alpha_i \neq 0$). We can then compute the prediction for a point $\mathbf{x}_j$ as:

\[\hat{y} = \text{sign}(\mathbf{\beta}^T \mathbf{x}_j + \beta_0)\]

Plugging in the definition of $\beta$ we get:

\[\hat{y} = \text{sign}(\sum_{i=1}^n \alpha_i y_i \mathbf{x}_i^T \mathbf{x}_j + \beta_0)\]

which we can turn into:

\[\hat{y} = \text{sign}(\sum_{i=1}^n \alpha_i y_i K(\mathbf{x}_i, \mathbf{x}_j) + \beta_0)\]

for $K$ being the kernel function. So we can calculate the bias term as:

\[\beta_0 = \frac{1}{N_S} \sum_{i \in S} (y_i - \sum_{j \in S} \alpha_j y_j K(\mathbf{x}_i, \mathbf{x}_j))\]

using the mean over the support vectors (or just using one support vector, as in the code). The class predictions are then:

\[\hat{y} = \text{sign}(\sum_{i=1}^n \alpha_i y_i K(\mathbf{x}_i, \mathbf{x}_j) + \beta_0)\]

    
    b_kernel = y[idx_support_vectors[0]] - np.sum((alphas * y).reshape(-1, 1) * kernel(X, X), axis=0)[idx_support_vectors[0]]
    prediction_fun = lambda X_test: np.sign(np.sum((alphas * y).reshape(-1, 1) * kernel(X, X_test), axis=0) + b_kernel)
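The bias formula with the mean over the support vectors can be sketched as below. This is an illustration rather than the post's code: `rbf_kernel`, `mean_bias` and `predict` are hypothetical names, and the multipliers come from a two-point problem solved by hand (by symmetry $\alpha_1 = \alpha_2 = 1 / (1 - K(\mathbf{x}_1, \mathbf{x}_2))$ and $\beta_0 = 0$) instead of from the optimizer.

```python
import numpy as np


def rbf_kernel(X1, X2, gamma=1.0):
    """RBF kernel matrix with entries exp(-gamma * ||x_i - x_j||^2)."""
    dist_sq = (np.sum(X1**2, axis=1)[:, None] + np.sum(X2**2, axis=1)[None, :]
               - 2.0 * X1 @ X2.T)
    return np.exp(-gamma * dist_sq)


def mean_bias(alphas, X, y, kernel, tol=1e-5):
    """beta_0 averaged over the support vectors, as in the formula above."""
    S = np.where(alphas > tol)[0]
    K_ss = kernel(X[S], X[S])
    return np.mean(y[S] - K_ss @ (alphas[S] * y[S]))


def predict(alphas, X, y, b, kernel, X_test):
    """sign(sum_i alpha_i y_i K(x_i, x_test) + beta_0) for each test point."""
    return np.sign(kernel(X_test, X) @ (alphas * y) + b)


# Two symmetric points, one per class, with hand-derived multipliers
X = np.array([[1.0, 0.0], [-1.0, 0.0]])
y = np.array([1.0, -1.0])
k12 = rbf_kernel(X[:1], X[1:])[0, 0]
alphas = np.full(2, 1.0 / (1.0 - k12))

b = mean_bias(alphas, X, y, rbf_kernel)
X_test = np.array([[2.0, 0.0], [-0.5, 3.0]])
pred = predict(alphas, X, y, b, rbf_kernel, X_test)
print(b, pred)
```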

Final comments.

The reflection about “modern” ML and traditional ML, representation learning vs kernels will have to wait until the end of the SVMs blog.

Here we learned about the optimal separating hyperplane problem, which is the basis of SVMs. We saw how to formulate the problem and how to solve it. We also saw how to use kernels to enrich the basis of our data and how to make predictions with them. That’s it for now!

Diffusion Models I. Approximating the gradient of the data distribution.

2023-02-16T00:00:00+00:00

Diffusion models are another approach to generative modelling. The algorithm became popular with the release of Dalle-2 and Stable Diffusion. However, the underlying idea has been around for some time already.

In these notes, I will not focus on the details of the current SOTA algorithms but on the mathematical foundations of the idea. I will not describe conditional (image) generation. I will focus on sampling from the unknown probability distribution of a given dataset, first 2D and finally MNIST. My idea here is to provide an intuition of how this stuff works. For a more rigorous treatment check the references!

As an exercise I translated the pytorch code I found to tensorflow so all code here is in the latter. (I am using python 3.9.13)

These notes are (mostly) based on:

[1] This excellent repository: https://github.com/acids-ircam/diffusion_models

[2] This awesome article: https://arxiv.org/abs/2208.11970

[3] The work from Yang Song, who accompanies his research with super helpful blog posts: https://yang-song.net/blog/2019/ssm/, https://yang-song.net/blog/2021/score/ with the paper https://arxiv.org/abs/1907.05600.

[4] Some classic papers on the topic: https://arxiv.org/pdf/2006.11239.pdf, https://www.jmlr.org/papers/volume6/hyvarinen05a/hyvarinen05a.pdf, https://www.iro.umontreal.ca/~vincentp/Publications/smdae_techreport.pdf, https://arxiv.org/abs/1505.04597, chapter 5 Deep Learning book, RefineNet

[5] Some other blogposts like: https://lilianweng.github.io/posts/2021-07-11-diffusion-models/

Introduction

The name diffusion already gives a clue about the underlying idea: reversing a diffusion process. To do so, we construct a function that can go from pure noise (the endpoint of the diffusion process) to the original coherently structured substance (the original point). In this sense, going from noise to coherent data, diffusion models are similar to GANs, but the similarities end there.

Another way of looking at this, motivated by the score based modelling point of view, is the idea of “navigating” a high dimensional space towards the areas where the coherence (w.r.t our data) within the dimensions is maximized. Or, what is the same, climbing a high dimensional probability distribution towards the peak areas. To clarify, high probability regions (the peaks) in this space are where the combination of the dimensions is more likely to render an observation belonging to our dataset. This is, gradient ascent w.r.t. the data distribution, starting from noise (random initialization) and generating an image. This is illustrated in the next GIF (which will be generated from scratch in these notes):

Surprisingly enough, it is possible to estimate gradients of a dataset even when we do not have an explicit probability distribution (if we had it there would be no point in doing this anyway).

As a metaphor, what we are going to train here is the compass of the “navigators”: a model that gives us the gradient w.r.t. the data distribution at any given point, so that we can find our way towards high-probability regions.

How can we estimate the gradient of a dataset? Let’s get to it!

Estimating the gradient of a dataset:

Imports and the data:

import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
from sklearn import datasets
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from tqdm import tqdm
import tensorflow_addons as tfa
plt.style.use('Solarize_Light2')

def get_batch(size, noise = 0.05, type = 'moons'):
    if type == 'moons':
        sample = datasets.make_moons(n_samples=size, noise=noise)[0]
    else:
        sample = datasets.make_circles(n_samples=size, factor=0.5, noise= noise)[0]
    return sample

data = get_batch(10**4, type = 'circles') # any value other than 'moons' selects the circles dataset
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(25, 15))
ax1.scatter(*data.T, alpha = 0.5, color = 'green', edgecolor = 'white', s = 40)
ax2.hist2d(*data.T, bins = 50);

For an unknown probability distribution $p(x)$ we want to estimate $\nabla \log p(x)$. So we can frame this as a regression problem with something like:

$$\frac{1}{2}\mathop{\mathbb{E}}_{x \sim p(x)}\big[|| \mathcal{F}_{\theta}(x) - \nabla_x \log p(x)||^2\big]$$

where $\mathcal{F}_{\theta}$ is a very flexible function, like a neural network with parameters $\theta$.

Yeah, but we don’t know $p(x)$! It turns out (using integration by parts and some reasonable assumptions) that, up to a constant that does not depend on $\theta$, the above objective can be reformulated as:

$$\mathop{\mathbb{E}}_{x \sim p(x)} \bigg[\text{tr}(\nabla_x \mathcal{F}_{\theta}(x)) + \frac{1}{2}||\mathcal{F}_{\theta}(x)||^2 \bigg]$$

(The trace arises in the multidimensional version.)

This expression no longer contains $p(x)$, so it can be estimated from samples alone. This objective, known as (vanilla) score matching, can already be used.
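
To make the objective above more concrete, here is a minimal 1D sketch (NumPy only, not part of the original notebook). It assumes Gaussian data and a hypothetical one-parameter model $\mathcal{F}_{\theta}(x) = -\theta x$, for which the objective is quadratic in $\theta$ and can be minimised in closed form. The true score of $\mathcal{N}(0, \sigma^2)$ is $-x/\sigma^2$, so the minimiser should land close to $1/\sigma^2$ without ever evaluating $p(x)$.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 2.0
x = rng.normal(0.0, sigma, size=200_000)  # samples from the "unknown" p(x)

# Hypothetical model: F_theta(x) = -theta * x, hence dF/dx = -theta.
# Objective: J(theta) = E[dF/dx + 0.5 * F(x)^2] = -theta + 0.5 * theta^2 * E[x^2]
def objective(theta, x):
    return np.mean(-theta + 0.5 * (theta * x) ** 2)

# J is quadratic in theta, so setting dJ/dtheta = 0 gives the minimiser directly.
theta_hat = 1.0 / np.mean(x ** 2)

print("estimated theta:", theta_hat)
print("1 / sigma^2    :", 1.0 / sigma ** 2)
```

The neural network version used below does the same thing, only with a far more flexible $\mathcal{F}_{\theta}$ and a gradient-based optimizer instead of a closed form.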

Let’s check if we can approximate the gradient of the above 2D circle’s dataset.

First, let’s define our $\mathcal{F}_{\theta}(x)$ using a fully connected network:

# The gradient
# Now, this takes the point (x_1, x_2) and returns the gradient w.r.t. (x_1, x_2) at that point.
F_model = tf.keras.Sequential([
    layers.Dense(128, input_shape = (2, ), activation= 'linear'),
    layers.Dense(128, activation= 'gelu'),
    layers.Dense(64, activation= 'gelu'),
    layers.Dense(32, activation= 'gelu'),
    layers.Dense(2, activation= 'linear') # TWO DIMENSIONS!
])

And $\nabla_x \mathcal{F}_{\theta}(x)$:

# Generate the Hessian (the Jacobian of F w.r.t. its input x)
@tf.function 
def Hessian(F, x):
    '''
    Computes jacobian of the gradient (F_model) w.r.t x.
    :param F: function R^N -> R^N
    :param x: tensor of shape [B, N]
    :return: Jacobian matrix of shape [B, N, N]

    '''
    with tf.GradientTape() as tape:
        tape.watch(x)
        my_gradient = F(x)
    # The forward pass is recorded; take the Jacobian outside the tape context
    hessian = tape.batch_jacobian(my_gradient, x)
    
    return hessian

Notice that the derivatives are w.r.t. the data x (not the parameters of the F_model network).
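
As a sanity check of what “the trace of the Jacobian w.r.t. the data” means, here is a small NumPy sketch that estimates it with central finite differences. The vector field is an assumption chosen for illustration: the exact score of a standard normal, $-x$, whose Jacobian is $-I$ and whose trace in 2D is therefore $-2$. The helper name `jacobian_trace_fd` is hypothetical and not part of the notebook.

```python
import numpy as np

def score_standard_normal(x):
    # Exact score of N(0, I): grad_x log p(x) = -x
    return -x

def jacobian_trace_fd(F, x, h=1e-5):
    """Central finite-difference estimate of tr(dF/dx) at one point x of shape (d,)."""
    d = x.shape[0]
    trace = 0.0
    for i in range(d):
        e = np.zeros(d)
        e[i] = h
        # Only the diagonal entries dF_i/dx_i contribute to the trace
        trace += (F(x + e)[i] - F(x - e)[i]) / (2 * h)
    return trace

x0 = np.array([0.3, -1.2])
trace_at_x0 = jacobian_trace_fd(score_standard_normal, x0)
print(trace_at_x0)
```

Automatic differentiation computes the same quantity exactly and for the whole batch at once, which is why it is used in the training code.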

Nice! Then we only need to compute the loss, as we specified above:

# Now, I have the Jacobian (the gradient) (B, 2) and the Hessian (B, 2, 2)
def score_matching(F, x):
    gradient = F(x)

    # Jacobian part
    norm_gradient = (tf.norm(gradient, axis = 1) ** 2) /2
    # Hessian part
    hessian = Hessian(F, x)

    tr_hessian = tf.cast(tf.linalg.trace(hessian), dtype = tf.float32)

    return tf.math.reduce_mean(tr_hessian + norm_gradient, axis = -1)

And training loop:

# Now training loop
optimizer = keras.optimizers.Adam(learning_rate= 1e-4)

loss_l = []

for t in range(2500): #Epochs
    with tf.GradientTape() as tape:
        loss = score_matching(F_model, data)
        
        if len(loss_l) > 1:
            if loss_l[-2] < loss_l[-1]: # trivial early stopping
                break

        model_grads = tape.gradient(loss, F_model.trainable_weights)

        optimizer.apply_gradients(zip(model_grads, F_model.trainable_weights))
        

        loss_l.append(loss)
        if ((t % 100) == 0):
            print(loss_l[-1])

F_model.save_weights('./Vanilla_score_weights.h5')

We can now plot the gradients that our F_model estimates:

def plot_gradients(model, data, plot_scatter = True):
    xx = np.stack(np.meshgrid(np.linspace(-1.5, 2.0, 50), np.linspace(-1.5, 2.0, 50)), axis = -1).reshape(-1, 2)
    scores = model(xx).numpy() # the gradients w.r.t each data point! 
    # This is, how much the DENSITY OF THE DATA increases at a given point of the xx (meshgrid)

    # Now that stuff is not good to visualize. Some scaling to make a nice plot:
    scores_norm = np.linalg.norm(scores, axis = -1, ord = 2, keepdims = True)
    scores_log1p = scores / (scores_norm + 1e-9) * np.log1p(scores_norm)
    plt.figure(figsize=(16,12))
    if (plot_scatter):
        plt.scatter(*data.T, alpha=0.3, color='red', edgecolor='white', s=40)
    plt.quiver(*xx.T, *scores_log1p.T, width=0.002, color='black')
    plt.xlim(-1.5, 2.0)
    plt.ylim(-1.5, 2.0)

plot_gradients(F_model, data)

As we can see, this surprisingly works! That’s far from evident just looking at the loss function! To sample, we can use a combination of the gradient and some noise: Langevin dynamics.

def langevin_dynamics(F, x, n_steps, eps = 0.7e-2, decay = .9, temperature = 1):
    # Just a naive langevin dynamics "sampler"
    x_sequence = [x]
    for s in range(n_steps):
        z_t = np.random.normal(size = x.shape)
        gradient = F(x).numpy()

        x = x + eps * (gradient +  (temperature * z_t ))
        x_sequence.append(x)
        eps *= decay
        
    return np.array(x_sequence).reshape(n_steps + 1, 2)
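
To see why following the score plus noise produces samples, here is a sketch of the textbook (unadjusted) Langevin update, $x \leftarrow x + \frac{\epsilon}{2}\nabla_x \log p(x) + \sqrt{\epsilon}\, z$, run on a case where the score is known exactly. The 1D Gaussian target, the step size and the number of steps are assumptions picked for illustration. This is not the exact update of the function above, which uses a decaying step and a temperature.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 3.0, 0.5  # assumed target distribution: N(mu, sigma^2)

def score(x):
    # Exact score of N(mu, sigma^2): d/dx log p(x) = (mu - x) / sigma^2
    return (mu - x) / sigma ** 2

eps = 0.01                          # step size
x = rng.normal(size=5000) * 5.0     # 5000 independent chains, started far from the target

for _ in range(1000):
    z = rng.normal(size=x.shape)
    x = x + 0.5 * eps * score(x) + np.sqrt(eps) * z

print("sample mean:", x.mean(), "sample std:", x.std())
```

With a small step size the chains end up distributed approximately as the target, so the sample mean and standard deviation should be close to `mu` and `sigma`.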

Now, there is a very important detail. This method does not work well in high-dimensional spaces, where most of the space is empty. This is especially relevant for images, which live on a low-dimensional manifold of the pixel space. Away from that manifold $p(x) = 0$, so $\log p(x)$ becomes $-\infty$ and its gradient is not defined. A way to solve this is to add noise to the data to fill the empty space and have reliable gradients everywhere. Solving the data sparsity problem in a high-dimensional space through noise is one of the key ideas in Diffusion Models.

Now, while this is cool and I think it helps in understanding the connection with the unknown $p(x)$, there is an even more surprising, simpler and more efficient approach. We can approximate the “gradient” of the log data distribution with the gradient of the log of a Gaussian density

\[q_{\sigma}(\tilde{x}|x)\]

with mean at $x$ and standard deviation $\sigma$ w.r.t. a noisy data point $\tilde{x}$.

This makes intuitive sense: the “gradient” should point us towards a combination of parameters which renders a less noisy data point. This is called denoising score matching. Using Gaussian noise and a Gaussian density as the kernel, the loss function looks like:

$$Loss(\theta| \sigma) = \mathop{\mathbb{E}}_{x \sim p(x),\, \tilde{x} \sim q_{\sigma}(\tilde{x}|x)} \bigg[\frac{1}{2}\bigg|\bigg|\mathcal{F}_{\theta}(\tilde{x}, \sigma) - \frac{x - \tilde{x}}{\sigma^2} \bigg|\bigg|^2_2 \bigg]$$

Because:

$$\nabla_{\tilde{x}}\log q_{\sigma}(\tilde{x}|x) = \frac{x- \tilde{x}}{\sigma^2}.$$
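
A quick way to convince yourself that regressing on $\frac{x - \tilde{x}}{\sigma^2}$ recovers a score is to try it where everything is known. The sketch below (NumPy only) assumes 1D data $x \sim \mathcal{N}(0, s^2)$ and a hypothetical linear model $\mathcal{F}(\tilde{x}) = w\,\tilde{x}$ fitted by least squares. The noisy data is distributed as $\mathcal{N}(0, s^2 + \sigma^2)$, whose score is $-\tilde{x}/(s^2 + \sigma^2)$, so the fitted slope should approach $-1/(s^2 + \sigma^2)$.

```python
import numpy as np

rng = np.random.default_rng(0)
s, sigma = 1.0, 0.5   # assumed data std and noise std
n = 200_000

x = rng.normal(0.0, s, size=n)
x_tilde = x + sigma * rng.normal(size=n)   # noisy data point
target = (x - x_tilde) / sigma ** 2        # gradient of log q_sigma(x_tilde | x)

# Least-squares fit of the hypothetical linear model F(x_tilde) = w * x_tilde
w = np.sum(x_tilde * target) / np.sum(x_tilde ** 2)
expected = -1.0 / (s ** 2 + sigma ** 2)    # slope of the score of N(0, s^2 + sigma^2)

print("fitted slope  :", w)
print("expected slope:", expected)
```

Note that what is learned is the score of the noise-perturbed distribution, which only approaches the score of the data as $\sigma$ shrinks. This is one reason to train on several noise levels.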

Same as before, however, we want a reliable “gradient” everywhere. The solution here is to add different levels of noise $\sigma$:

# Different sigma values
sigma_begin = 1
sigma_end = 0.01
num_noises = 10
sigmas = np.exp(np.linspace(np.log(sigma_begin), np.log(sigma_end), num_noises))
plt.plot(sigmas)
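
The `np.exp(np.linspace(np.log(...)))` construction above yields a geometric sequence: consecutive noise levels differ by a constant ratio. The following sketch checks this and shows that, for these values, `np.geomspace` gives the same result (it is offered here only as an equivalent alternative).

```python
import numpy as np

sigma_begin, sigma_end, num_noises = 1.0, 0.01, 10

# Same construction as above: evenly spaced in log-space
sigmas = np.exp(np.linspace(np.log(sigma_begin), np.log(sigma_end), num_noises))

# Ratio between consecutive noise levels (constant for a geometric sequence)
ratios = sigmas[1:] / sigmas[:-1]

# np.geomspace builds the same sequence directly
same_as_geomspace = np.allclose(sigmas, np.geomspace(sigma_begin, sigma_end, num_noises))

print(ratios)
print(same_as_geomspace)
```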

And this looks like:

Here we have the noise added to the initial data and plotted on top in different color/shape.

This works best if we let our model (our approximation to the gradient) know at which noise level $\sigma$ we are. So this time our model (our fully connected NN) will also take an embedding of the $\sigma$ level. And we will insist on it, feeding a representation of the noise level index at every layer.

Input_data = keras.Input(shape=(2,))
labels_input = keras.Input(shape=(1,))

d = layers.Dense(128, activation= 'linear')(Input_data)
l = tf.keras.layers.Embedding(num_noises, 128)(labels_input)

d = keras.layers.Multiply()([d, l])
d = tf.keras.activations.gelu(d)

for i in (128, 128):
    d = layers.Dense(i, activation= 'linear')(d)
    l =  tf.keras.layers.Embedding(num_noises, i)(labels_input)
    
    d = keras.layers.Multiply()([d, l])
    d = tf.keras.activations.gelu(d)

output = keras.layers.Dense(2)(d)
output = keras.layers.Flatten()(output)

F_model = keras.Model([Input_data, labels_input], output)

Which is already looking relatively fancy! And this is just 2D!

And we generate some $\sigma$ levels (indexes) to use during training:

labels = np.random.randint(0, num_noises, data.shape[0])

Now we are ready to write down our Noise Conditional Loss function:

def conditional_noise_loss(F, x, labels = labels, sigmas = sigmas):
    
    # One sigma per sample, shaped (B, 1, ..., 1) so that it broadcasts over any data shape
    used_sigmas = sigmas[labels].reshape((-1,) + (1,) * (x.ndim - 1))

    # Generate noise for a given level (label)
    noise = np.random.normal(size = x.shape) * used_sigmas
    
    perturbed_x = x + noise

    # \frac{x - \tilde{x}}{\sigma^2}
    target =  tf.constant((x - perturbed_x) / (used_sigmas ** 2), dtype =tf.float32)

    # Our approximation to the gradient now takes 2 inputs:
        # The noisy x.
        # The noise (index) level.
    gradient = F([perturbed_x, labels]) # takes the label as embedding!

    # Squared L2 norm per sample (summed over all non-batch axes), weighted by sigma^2
    squared_norm = tf.reduce_sum(tf.reshape(gradient - target, (x.shape[0], -1)) ** 2, axis = 1)
    weights = tf.constant(sigmas[labels] ** 2, dtype = tf.float32)
    loss = 1/2 * squared_norm * weights
    
    return tf.math.reduce_mean(loss)

And we can train this:

# Now training loop
optimizer = keras.optimizers.Adam(learning_rate= 1e-3)

loss_l = []
epochs = tqdm(range(5000))
for t in epochs: 
    with tf.GradientTape() as tape:
        loss = conditional_noise_loss(F_model, data)

        model_grads = tape.gradient(loss, F_model.trainable_weights)

        optimizer.apply_gradients(zip(model_grads, F_model.trainable_weights))
        

        loss_l.append(loss)
        epochs.set_description("Loss: %s" % loss_l[-1].numpy())
        
F_model.save_weights('./Denoising_conditional_weights.h5')

And once it is done we can again visualize the gradients:

And to sample, we can use a fancier version of the ‘langevin_dynamics’ function we used before. It does the same but looping over different noise levels.

def ald_sampling(F, sigmas, num_noises, iter, step_size):
    '''
    Sampling and visualization.

    '''
    # Plot distribution landscape; F needs a noise level, so we use the lowest one
    plot_gradients(lambda xx: F([xx, np.full((len(xx), 1), num_noises - 1)]), data)

    x_t = np.random.normal(size = (1, 2)) # Initial sample

    samples = [] # Placeholder

    # Loop over noise levels:
    for noise_level in range(num_noises):
        alpha = step_size * (sigmas[noise_level]**2 / sigmas[-1]**2)
        # noise level inner sampling:
        for t in range(iter):
            z = np.random.normal(size = (1, 2))
            gradient = F([x_t, np.array([[noise_level]])]).numpy()
            
            x_t = x_t + (alpha/2) * gradient + np.sqrt(alpha) * z
            samples.append(np.ravel(x_t))

    # Plot (given noise level) samples
    color = np.array([[i] * iter for i in sigmas]).ravel()

    plt.scatter(*np.array(samples).T, s=250, c = color)

    samples = np.array(samples)
    # Draw arrows
    deltas = (samples[1:] - samples[:-1]) # Difference
    
    for i, arrow in enumerate(deltas):
        plt.arrow(samples[i,0], samples[i,1], arrow[0], arrow[1],
                    width=1e-4, head_width=2e-2, color="green", linewidth=0.2)
    plt.colorbar(fraction=0.01, pad=0.01)
    plt.show()
    return samples

samples = ald_sampling(F_model, sigmas, num_noises, 20, 0.0001)

Now we can see how when the level of noise decreases the samples converge to the actual data distribution. The colour indicates the amount of noise, from maximum (yellow) to minimum (purple).
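
The step size used inside the sampler is rescaled at every noise level as $\alpha_i = \text{step\_size} \cdot \sigma_i^2 / \sigma_L^2$, where $\sigma_L$ is the smallest noise level. The short sketch below, using the same `sigmas` and `step_size` values as the call above, shows what this buys us: the magnitude of the injected noise, $\sqrt{\alpha_i}$, stays proportional to $\sigma_i$, so every level takes steps on its own scale.

```python
import numpy as np

sigmas = np.geomspace(1.0, 0.01, 10)  # same noise levels as above
step_size = 1e-4                      # same value passed to ald_sampling above

# Per-level step size, as computed inside the sampler
alphas = step_size * sigmas ** 2 / sigmas[-1] ** 2

# Ratio between the injected noise scale sqrt(alpha_i) and sigma_i
noise_to_sigma = np.sqrt(alphas) / sigmas

print(alphas)
print(noise_to_sigma)
```

The largest noise level gets the largest steps (useful to move quickly across the space), and the last level uses exactly `step_size`.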

Estimating the gradient for MNIST:

MNIST is way more challenging, but the underlying principles are the same!

Data loading and scaling:

def scale_image(image):
    return (image - (255/2)) / (255/2) # -1 to 1 to make it easier

data_mnist = keras.datasets.mnist.load_data(path="mnist.npz")[0][0]
data_mnist = scale_image(data_mnist[..., np.newaxis])

And now… we really need to go fancy with the model. An arbitrary network does not work; we need something with a proper inductive bias. Since we are concerned with the gradient at the pixel level (each of our dimensions) but still need to take information over the whole picture, a network designed for image segmentation is ideal. An option is RefineNet (the images are from the paper).

The idea is to first downsample the data using a ResNet to 1/4, 1/8, 1/16 and 1/32 (in our case we begin in 1/1). The stride is typically set to 2, thus reducing the feature map resolution to one-half when passing from one block to the next.

Afterwards, it applies a multi-path refinement (as shown in the image). The key point here is that the downsampling allows us to get general information about the picture while, at the same time, eventually focusing on the pixel level. Each RefineNet block takes a representation of the downsampled version and a higher-resolution version until, in our case, reaching the pixel level.

Nevertheless, there is still the issue of how to encode the $\sigma$ level information. One option is to use “conditional instance normalization”. Instance normalization basically consists of normalizing the feature maps of each image separately.

What this does is the following. Let $\mu_k$ and $s_k$ denote the mean and standard deviation of the k-th feature map of x (an image), and let $i$ be the index of the noise level:

\[z_k = \gamma[i, k]\frac{x_k - \mu_k}{s_k} + \beta[i, k]\]

Where $\gamma$ and $\beta$ are learnable parameters. These parameters are embeddings of the noise level. The dimensionality of the embedding is such that there is a scalar for each channel. Hence, given k, $\gamma$ and $\beta$ are scalars. Basically, we are doing sort of the same as before, scaling the output of the convolutional layers based on the embedding of the noise level.
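
Before the Keras layer, here is a plain NumPy sketch of the formula above, meant only to make the shapes explicit. The function name `conditional_instance_norm`, the small `eps` added for numerical stability and the toy `gamma`/`beta` tables are assumptions for illustration, not part of the original notebook.

```python
import numpy as np

def conditional_instance_norm(x, noise_level, gamma, beta, eps=1e-5):
    """
    x:           feature maps, shape (B, H, W, C)
    noise_level: integer noise index per image, shape (B,)
    gamma, beta: lookup tables, shape (num_noises, C)
    """
    # Mean and std of every feature map of every image (over height and width)
    mu = x.mean(axis=(1, 2), keepdims=True)
    s = x.std(axis=(1, 2), keepdims=True)
    x_norm = (x - mu) / (s + eps)

    # One scalar per (image, channel), selected by the noise index
    g = gamma[noise_level][:, None, None, :]   # (B, 1, 1, C)
    b = beta[noise_level][:, None, None, :]    # (B, 1, 1, C)
    return g * x_norm + b

rng = np.random.default_rng(0)
x = rng.normal(3.0, 2.0, size=(4, 28, 28, 8))
levels = np.array([0, 3, 3, 9])

gamma = np.ones((10, 8))    # toy tables: identity scale ...
beta = np.zeros((10, 8))
beta[9] = 0.5               # ... and a shift only for noise level 9

z = conditional_instance_norm(x, levels, gamma, beta)
print(z.shape)
```

In the layer below, `gamma` and `beta` are `Embedding` tables learned during training instead of fixed arrays.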

Let’s code it. First, conditional instance normalization:

class CIN(tf.keras.layers.Layer):
    def __init__(self, num_noises, num_features):
        super().__init__()
        self.num_features = num_features
        self.num_noises = num_noises
        self.instance_norm = tfa.layers.InstanceNormalization()
        
        self.gamma = tf.keras.layers.Embedding(input_dim = self.num_noises,
                                            output_dim = self.num_features,
                                            embeddings_initializer = tf.keras.initializers.RandomNormal(1., 0.02))

        self.beta = tf.keras.layers.Embedding(input_dim = self.num_noises,
                                            output_dim = self.num_features,
                                            embeddings_initializer = 'Zeros')
        
    def call(self, image, noise_level):
        
        image_norm = self.instance_norm(image) # (B, height, width, num_features)

        # Scalars
        my_gamma = tf.expand_dims(self.gamma(noise_level), axis = 1) # (B, 1, 1, num_features)
        
        my_beta = tf.expand_dims(self.beta(noise_level), axis = 1)# (B, 1, 1, num_features)

        z = my_gamma * image_norm + my_beta # (B, height, width, num_features)

        return z

Let’s construct a ResNet:

# norm -> non-linear -> conv -> norm -> non-linear -> conv -> Downsample by 2

class ResNetBlock(tf.keras.layers.Layer):
    def __init__(self, output_features, num_noises, downsampling = True):
        super().__init__()

        self.downsampling = downsampling
        
        self.act = tf.keras.layers.ELU()

        self.embd1 = CIN(num_noises, output_features)
        self.embd2 = CIN(num_noises, output_features)

        self.conv1 = tf.keras.layers.Conv2D(output_features,
                                                        kernel_size = 3,
                                                        padding = 'SAME')

        self.conv2 = tf.keras.layers.Conv2D(output_features,
                                                        kernel_size = 3,
                                                        padding = 'SAME')

        self.down =  tf.keras.layers.Conv2D(output_features,
                                                        kernel_size = 3,
                                                        strides = 2,
                                                        padding = 'SAME')
       
    def call(self, image, noise_level):
        
        h = self.embd1(image, noise_level)
        h = self.act(h)
        h = self.conv1(h)

        h = self.embd2(h, noise_level)
        h = self.act(h)
        h = self.conv2(h)

        if self.downsampling:
            return self.down(image + h)
        else:
            return image + h

Ok, now let’s deal with RefineNet:

It consists of a residual convolutional unit, multi-resolution fusion and chained residual pooling.

# Residual convolutional Unit
class RCU(tf.keras.layers.Layer):
    def __init__(self, input_features, num_noises):
        super().__init__()
        
        self.Embedding1 = CIN(num_noises, input_features)
        self.Convolution1 = tf.keras.layers.Conv2D(input_features,
                                                    kernel_size = 3,
                                                    activation = 'ELU',
                                                    padding = 'SAME')

        self.Embedding2 = CIN(num_noises, input_features)
        self.Convolution2 = tf.keras.layers.Conv2D(input_features, 
                                                    kernel_size = 3,
                                                    padding = 'SAME')

        self.first_act = tf.keras.layers.ELU()

    def call(self, image, noise_level):
        
        res = image

        x = self.first_act(image)
        x = self.Embedding1(x, noise_level)
        x = self.Convolution1(x)
        x = self.Embedding2(x, noise_level)
        x = self.Convolution2(x)
        
       
        return res + x 

# Now the multi resolution thing:
class MRF(tf.keras.layers.Layer):
    def __init__(self, im_in, input_features, num_noises, shape_target):
        super().__init__()
        
        self.shape_target = shape_target
        self.im_in = im_in
        self.embeddings = []
        self.Conv = []


        for i in range(im_in):
            self.embeddings.append(CIN(num_noises, input_features))
            self.Conv.append(tf.keras.layers.Conv2D(input_features,
                                                        kernel_size = 3,
                                                        padding = 'SAME'))
            
    
    def call(self, images, noise_level):

        
        if  self.im_in == 1:
            h = self.embeddings[0](images[0], noise_level)
            h = self.Conv[0](h)
            
            h = tf.image.resize(h, self.shape_target[:2]) 
            
            return h

        else:
            
            h1 = self.embeddings[0](images[0], noise_level)
            h1 = self.Conv[0](h1)
            
            h1 = tf.image.resize(h1, self.shape_target[:2]) # Resizes, if needed, to target
            #Upsmaples using bilinear interpolation
            h2 = self.embeddings[1](images[1], noise_level)
            h2 = self.Conv[1](h2)
            
            h2 = tf.image.resize(h2, self.shape_target[:2]) 
            sums = h1 + h2

            return sums

# Chained residual pooling
class CRP(tf.keras.layers.Layer):
    def __init__(self, input_features, num_noises, n_blocks = 2):
        super().__init__()
       
        self.embeddings = []
        self.conv = []
        self.avg_pool = []

        self.n_blocks = n_blocks
        for i in range(n_blocks):
            self.embeddings.append(CIN(num_noises, input_features))
            self.avg_pool.append(tf.keras.layers.AveragePooling2D(pool_size = 5,
                                                                  padding = 'SAME',
                                                                  strides = 1))
            self.conv.append(tf.keras.layers.Conv2D(input_features,
                                                        kernel_size = 3,
                                                        padding = 'SAME'))

            self.first_act = tf.keras.layers.ELU()
    
    def call(self, image, noise_level):

        x = self.first_act(image)
        
        sum = x

        for i in range(self.n_blocks):
            x = self.embeddings[i](x, noise_level)
            x = self.avg_pool[i](x)
            x = self.conv[i](x)

            sum = x + sum

        return sum

So a block of RefineNet:

class RefineNetBlock(tf.keras.layers.Layer):
    def __init__(self, im_in, input_features, num_noises, shape_target):
        super().__init__()

        self.RCUBig1 = RCU(input_features, num_noises)
        self.RCUBig2 = RCU(input_features, num_noises)

        if im_in == 2:
            self.RCUSmall1 = RCU(input_features, num_noises)
            self.RCUSmall2 = RCU(input_features, num_noises)

        self.MRF = MRF(im_in, input_features, num_noises, shape_target)
        self.CRP = CRP(input_features, num_noises)

        self.final_conv = RCU(input_features, num_noises)

    def call(self, image_big, image_small, noise_level):

        image_big_processed = self.RCUBig1(image_big, noise_level)
        image_big_processed = self.RCUBig2(image_big_processed, noise_level)
        
        if image_small is not None:
            image_small_processed = self.RCUSmall1(image_small, noise_level)
            image_small_processed = self.RCUSmall2(image_small_processed, noise_level)

            x = self.MRF([image_big_processed, image_small_processed], noise_level)
        else:
            x = self.MRF([image_big_processed], noise_level)
        
        x = self.CRP(x, noise_level)
        x = self.final_conv(x, noise_level)

        return x

Indeed, the network is quite complex (and this is nothing!). Let’s put everything together.

Naturally, we do not want to work with the image in 1/4 but in 1/1 so in the first ResNet we do not downsample the output. Hence, the downsampling process goes (28, 28) -> (14, 14) -> (7, 7) -> (4, 4). We do not have to worry about the upsampling process since it is taken care of by the bilinear interpolation, which is much more flexible than deconvolutions.
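
The sequence of resolutions quoted above follows from how a stride-2 convolution with `'SAME'` padding behaves: the output size is the input size divided by the stride, rounded up. A tiny sketch (the helper name `same_padding_output` is hypothetical):

```python
import math

def same_padding_output(size, stride=2):
    # With 'SAME' padding, the output size is ceil(input size / stride)
    return math.ceil(size / stride)

sizes = [28]
for _ in range(3):  # three downsampling ResNet blocks
    sizes.append(same_padding_output(sizes[-1]))

print(sizes)  # [28, 14, 7, 4]
```

The last step goes from 7 to 4 (not 3.5) because of the rounding, which is why the targets passed to the RefineNet blocks below are (4, 4), (7, 7), (14, 14) and (28, 28).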

The only thing left is to construct the model and train! All the rest is the same as before in the 2D case!

def make_model(n_filters, num_noises):
    Input_image = keras.Input(shape=(28, 28, 1))
    Input_label = keras.Input(shape=(1,))
   
    res1 = ResNetBlock(n_filters, num_noises= num_noises, downsampling = False)(Input_image, Input_label)
    res2 = ResNetBlock(n_filters, num_noises= num_noises)(res1, Input_label)
    res3 = ResNetBlock(n_filters, num_noises= num_noises)(res2, Input_label)
    res4 = ResNetBlock(n_filters, num_noises= num_noises)(res3, Input_label)

    RefineNet_4 = RefineNetBlock(im_in= 1,
                                input_features = n_filters,
                                num_noises= num_noises,
                                shape_target= (4, 4, 1))(image_big = res4,
                                                        image_small = None,
                                                        noise_level = Input_label)

    RefineNet_3 = RefineNetBlock(im_in= 2,
                                input_features = n_filters,
                                num_noises= num_noises,
                                shape_target= (7, 7, 1))(image_big = res3,
                                                        image_small = RefineNet_4,
                                                        noise_level = Input_label)

    RefineNet_2 = RefineNetBlock(im_in= 2,
                                input_features = n_filters,
                                num_noises= num_noises,
                                shape_target= (14, 14, 1))(image_big = res2,
                                                        image_small = RefineNet_3,
                                                        noise_level = Input_label)

    RefineNet_1 = RefineNetBlock(im_in= 2,
                                input_features = n_filters,
                                num_noises= num_noises,
                                shape_target= (28, 28, 1))(image_big = res1,
                                                        image_small = RefineNet_2,
                                                        noise_level = Input_label)


    #And eventually just a linear combination of the features to map the dimensionality of the input:

    final_conv = tf.keras.layers.Conv2D(1, 1, strides= 1)(RefineNet_1)

    F_model = tf.keras.Model([Input_image, Input_label], final_conv)

    return F_model

F_model = make_model(64, 10)

This takes a bit longer to train, so the weights are available here.

The training loop:

# Now training loop 
optimizer = keras.optimizers.Adam(learning_rate= 1e-3)

loss_l = []
batch_size = 32

# Epochs loop
for t in range(50): 
    epoch_loss = []
    # Batches loop:
    for b in tqdm(range(0, (data_mnist.shape[0] - batch_size), batch_size)):
        
        data = data_mnist[b: b + batch_size]

        labels = np.random.randint(0, num_noises, data.shape[0])
        
        with tf.GradientTape() as tape:
            loss = conditional_noise_loss(F_model, data, labels = labels)

        model_grads = tape.gradient(loss, F_model.trainable_weights)

        optimizer.apply_gradients(zip(model_grads, F_model.trainable_weights))

        # Keep the loss of every batch of this epoch
        epoch_loss.append(float(loss))
    
    # Mean loss over all the batches of the epoch
    loss_l.append(np.mean(epoch_loss))

And a sampler adapted to the MNIST:

def ald_sampling_mnist(F, sigmas, num_noises, iter, step_size, num_samples = 10):

    x_t = np.random.uniform(low = -1., high = 1, size = (num_samples, 28, 28, 1)) # Initial sample

    samples = [] # Placeholder

    # Loop over noise levels:
    for noise_level in tqdm(range(num_noises)):
        #print(f'noise level {noise_level}')
        alpha = step_size * (sigmas[noise_level]**2 / sigmas[-1]**2)
        # noise level inner sampling:
        for t in range(iter):
            z = np.random.normal(size = (num_samples, 28, 28, 1))
            gradient = F([x_t, np.array([[noise_level] * num_samples]).T]).numpy()
            
            x_t = x_t + (alpha/2) * gradient + np.sqrt(alpha) * z
            samples.append(x_t)

    return samples

Run it:

iter = 100
num_samples = 64
samples = ald_sampling_mnist(F_model, sigmas, num_noises, iter = iter, step_size = 2e-5, num_samples = num_samples)

And now some plotting…

plt.figure(figsize=(10, int(num_samples * 1.2)))

for row in range(samples[0].shape[0]):
    for j, i in enumerate(range(0, len(samples), iter)):
        plt.subplot(samples[0].shape[0], len(samples) // iter, 1 + j + row * (len(samples) // iter))
        plt.imshow(samples[i][row] * -1, interpolation='nearest', cmap='Greys')
        plt.grid(False) # the `b` keyword was removed in recent matplotlib versions
        plt.axis('off')

plt.subplots_adjust(wspace=0, hspace=0)
plt.show()

What you see here is the sampling process: from noise to coherent images, following the procedure described above. This is no different from the 2D example, but it is way cooler, since you can see the sampler “navigating” this $28 \times 28 = 784$-dimensional space until it reaches a peak in the probability distribution of our data or, what is the same, a coherent handwritten digit. Having understood this, the outputs of SOTA algorithms like Stable Diffusion are even more staggering. Amazing time to be alive, isn’t it?

And this generates the GIF:

from moviepy.editor import ImageSequenceClip

images = []
height = int(np.sqrt(samples[0].shape[0]))

for step in tqdm(range(0, len(samples), 5)):
    plt.figure(figsize=(8, 8))
    for i in range(samples[step].shape[0]):
        plt.subplot(height, height, 1 + i)
        plt.imshow(samples[step][i] * -1, interpolation='nearest', cmap='Greys')
        plt.grid(False) # the `b` keyword was removed in recent matplotlib versions
        plt.axis('off')
    plt.subplots_adjust(wspace=0.05, hspace=0.05)
    plt.savefig(f'./Gif_folder/{step}.png')
    images.append(f'./Gif_folder/{step}.png')
    plt.close()

clip = ImageSequenceClip(images, fps = 50)
clip.write_gif('MNIST_SAMPLING.gif')

And that’s all for part 1. The next part is on the actual diffusion model, which is really similar but still has some differences.