The seminal paper Understanding Deep Learning Requires Rethinking Generalization (Zhang et al.) includes a study in Section 5 on the implicit regularization of (stochastic) gradient descent in a linear model. Here, I explore this study and present a proof that I originally wrote in this Statistics Stack Exchange post.

Setup (Zhang et al., Section 5)

Suppose we have a linear regression problem $$ y = Xw $$ with output $y \in \mathbb{R}^n$, matrix $X \in \mathbb{R}^{n \times d}$ with $n$ observations and $d$ features, and we aim to learn parameters $w \in \mathbb{R}^d$ under the squared error $\|y - Xw\|_2^2$. We suppose here that $n<d$ and that $X$ has rank $n$. Thus, this system is underdetermined, and there are infinitely many solutions for $w$. If we learn $w$ using (stochastic) gradient descent, the full-batch update with step size $\lambda$ is: $$ \begin{aligned} w_{t+1} &= w_t - \lambda \nabla_{w_t} \|y - Xw_t\|_2^2 \\ &= w_t + 2\lambda X^T (y - Xw_t) \\ &= w_t - \lambda X^T e_t \end{aligned} $$ for error $e_t = 2(Xw_t - y)$. Now, suppose we initialize $w_0 = 0$. Then, because each update adds a vector of the form $X^T(\cdot)$, i.e. a linear combination of the rows of $X$, at any update step $t$ we have that $w_t$ lies in the row span of $X$: $w_t = X^T \alpha$ for some coefficient vector $\alpha$ (property A). The same holds for the single-sample SGD update, since each per-example gradient is a multiple of a single row of $X$.
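
To make property A concrete, here is a minimal numpy sketch (the sizes, seed, step size, and iteration count are arbitrary choices for illustration, not anything from the paper). It runs the full-batch update above from $w_0 = 0$ and checks that the result is unchanged by projection onto the row space of $X$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 20                        # underdetermined: fewer observations than features
X = rng.standard_normal((n, d))     # has rank n with probability 1
y = rng.standard_normal(n)

# Projection onto the row space of X: P = X^T (X X^T)^{-1} X
P = X.T @ np.linalg.solve(X @ X.T, X)

w = np.zeros(d)                           # property A requires w_0 = 0
lam = 0.5 / np.linalg.norm(X, 2) ** 2     # conservative step size: 1 / lambda_max(2 X^T X)
for _ in range(5000):
    e = 2 * (X @ w - y)             # error e_t = 2(X w_t - y)
    w = w - lam * X.T @ e           # update w_{t+1} = w_t - lam * X^T e_t

print(np.allclose(P @ w, w))        # True: the iterate stays in the row span of X
print(np.linalg.norm(X @ w - y))    # small residual: w (approximately) interpolates
```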

We consider an optimal solution $w$, where $Xw = y$. Combining this with property A, we have $XX^T \alpha = y$, which has a unique solution for $\alpha$, since $XX^T$ has shape $(n, n)$ and rank $n$ and is therefore invertible. Zhang et al. relate this to the kernel trick: $XX^T$ is the Gram (kernel) matrix of the data, and the coefficients $\alpha$ are found by solving this $n \times n$ kernel system.
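
A quick numerical illustration of this step, under the same arbitrary setup as the sketch above: solve the $n \times n$ system $XX^T \alpha = y$ and confirm that $w = X^T \alpha$ interpolates the data.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 20
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

# X X^T is n x n with rank n, hence invertible, so alpha is unique
alpha = np.linalg.solve(X @ X.T, y)
w = X.T @ alpha                 # a solution lying in the row span of X

print(np.allclose(X @ w, y))    # True: w satisfies Xw = y exactly
```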

Furthermore, this solution is the minimum $\ell_2$-norm solution of $Xw=y$! This is the key proposition for which I derive a proof.

This is surprising: in this underdetermined problem, where there are infinitely many solutions, SGD (with zero initialization) finds the minimum-norm solution, without any explicit regularization.
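
As an empirical sanity check (again a sketch with arbitrary sizes, seed, and iteration budget), we can compare the gradient descent iterate started from zero with the minimum-norm solution that `np.linalg.lstsq` returns for an underdetermined system:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 20
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

# Gradient descent on ||y - Xw||^2 from w_0 = 0
w = np.zeros(d)
lam = 0.5 / np.linalg.norm(X, 2) ** 2
for _ in range(20000):
    grad = 2 * X.T @ (X @ w - y)
    w = w - lam * grad

# For n < d, np.linalg.lstsq returns the minimum-norm least-squares solution
w_min_norm, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(w, w_min_norm, atol=1e-6))    # True
```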

Proof

We will prove the following: the optimal $w$ of the constrained optimization problem $$ \min_w \|w\|_2^2 \quad \textrm{subject to} \quad y = Xw $$ is in the row span of $X$; that is, we have $w = X^T \alpha$ for some coefficients $\alpha$. From above, we know there is exactly one solution of $Xw = y$ of this form; therefore SGD initialized at zero (assuming it converges to a solution of $Xw = y$) finds the same $w$ as the constrained optimization problem, i.e. the minimum-$\ell_2$-norm solution.

We start by using Lagrange multipliers to convert this to an unconstrained problem: $$ \min_w \|w\|_2^2 + \lambda^T (y - Xw) $$ with a vector of Lagrange multipliers $\lambda \in \mathbb{R}^n$ (not to be confused with the step size above). The optimal $w$ is found from the system of equations obtained by taking the gradient of this Lagrangian with respect to $w$ (first line) and with respect to $\lambda$ (second line), and setting both to $0$: $$ \begin{cases} 2w - X^T \lambda &= 0 \\ y - Xw &= 0 \end{cases} $$ Take the first line and left-multiply by $X$: $$ 2Xw - XX^T \lambda = 0. $$ Using $y = Xw$ (the second line), we have: $$ \begin{aligned} 2y &= XX^T \lambda \\ 2(XX^T)^{-1} y &= (XX^T)^{-1}(XX^T) \lambda \\ 2(XX^T)^{-1} y &= \lambda \end{aligned} $$ where $XX^T$ is invertible because it has shape $n \times n$ and rank $n$. Substituting into the first optimality condition $2w - X^T \lambda = 0$, we have: $$ w = X^T (XX^T)^{-1} y, $$ where $X^T (XX^T)^{-1}$ is the right pseudo-inverse of $X$. This equation shows $w$ is in the row span of $X$: $w = X^T \alpha$ for $\alpha = (XX^T)^{-1} y$. This completes the proof.
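
To connect the closed form to the familiar pseudo-inverse (a small check, same arbitrary setup as above): for a full-row-rank $X$, the Moore-Penrose pseudo-inverse equals the right pseudo-inverse $X^T(XX^T)^{-1}$, so $X^T(XX^T)^{-1}y$ should agree with `np.linalg.pinv(X) @ y`:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 20
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

w_closed = X.T @ np.linalg.solve(X @ X.T, y)   # w = X^T (X X^T)^{-1} y
w_pinv = np.linalg.pinv(X) @ y                 # Moore-Penrose pseudo-inverse solution

print(np.allclose(w_closed, w_pinv))           # True
print(np.allclose(X @ w_closed, y))            # True: the constraint y = Xw holds
```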

Remarks

This result critically depends on initializing $w$ at $0$ (or, more generally, anywhere in the row span of $X$), which isn't what's typically done in practice. It also isn't special to SGD: the stochastic part plays no role in the argument, so it holds equally for full-batch gradient descent.
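
To illustrate the first remark, here is a sketch (same arbitrary setup, with a random nonzero initialization) showing that gradient descent still interpolates from a nonzero start, but the solution it finds generally has a larger $\ell_2$ norm than the one found from zero:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 20
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

def gd(w0, steps=20000):
    """Full-batch gradient descent on ||y - Xw||^2 from initialization w0."""
    w = w0.copy()
    lam = 0.5 / np.linalg.norm(X, 2) ** 2
    for _ in range(steps):
        w = w - lam * 2 * X.T @ (X @ w - y)
    return w

w_zero = gd(np.zeros(d))               # zero init: the minimum-norm solution
w_rand = gd(rng.standard_normal(d))    # nonzero init: a different interpolating solution

print(np.allclose(X @ w_rand, y))      # True: still fits the data exactly
print(np.linalg.norm(w_zero), np.linalg.norm(w_rand))   # the second norm is larger here
```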