Volume 54, pp. 610-619, 2021.

On the regularization effect of stochastic gradient descent applied to least-squares

Stefan Steinerberger

Abstract

We study the behavior of the stochastic gradient descent method applied to $\|Ax -b \|_2^2 \rightarrow \min$ for invertible matrices $A \in \mathbb{R}^{n \times n}$. We show that there is an explicit constant $c_{A}$ depending (mildly) on $A$ such that $$ \mathbb{E}\left\| Ax_{k+1}-b\right\|^2_{2} \leq \left(1 + \frac{c_{A}}{\|A\|_F^2}\right) \left\|A x_k -b \right\|^2_{2} - \frac{2}{\|A\|_F^2} \left\|A^T A (x_k - x)\right\|^2_{2},$$ where $x$ denotes the solution of $Ax = b$. This is a curious inequality since the last term involves one additional matrix multiplication applied to the error $x_k - x$ compared to the remaining terms: if the projection of $x_k - x$ onto the subspace of singular vectors corresponding to large singular values is large, then the stochastic gradient descent method leads to fast regularization. For symmetric matrices, this inequality has an extension to higher-order Sobolev spaces. This explains a (known) regularization phenomenon: an energy cascade from large singular values to small singular values acts as a regularizer.
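The following is a minimal numerical sketch, not taken from the paper: it assumes the SGD variant in question is the randomized Kaczmarz iteration (consistent with the paper's keywords), in which row $i$ of $A$ is sampled with probability $\|a_i\|_2^2/\|A\|_F^2$. The matrix size, random seed, and the name `x_true` (standing in for the solution $x$) are illustrative choices. The script simply tracks the residual $\|Ax_k - b\|_2^2$ together with the quantity $\|A^T A (x_k - x)\|_2^2$ appearing in the inequality above.

```python
# Sketch: randomized Kaczmarz as SGD on ||Ax - b||_2^2, with row i sampled
# with probability ||a_i||_2^2 / ||A||_F^2 (an assumption, not the paper's code).
import numpy as np

rng = np.random.default_rng(0)
n = 50
A = rng.standard_normal((n, n))       # invertible with probability 1
x_true = rng.standard_normal(n)       # plays the role of the solution x
b = A @ x_true

row_norms_sq = np.sum(A**2, axis=1)
probs = row_norms_sq / row_norms_sq.sum()   # ||a_i||^2 / ||A||_F^2

x = np.zeros(n)
for k in range(2001):
    i = rng.choice(n, p=probs)
    a_i = A[i]
    # Kaczmarz step: project x onto the hyperplane <a_i, x> = b_i
    x += (b[i] - a_i @ x) / row_norms_sq[i] * a_i
    if k % 500 == 0:
        residual = np.linalg.norm(A @ x - b) ** 2
        high_freq = np.linalg.norm(A.T @ A @ (x - x_true)) ** 2
        print(f"k={k:5d}  ||Ax_k - b||^2 = {residual:.3e}  "
              f"||A^T A (x_k - x)||^2 = {high_freq:.3e}")
```

In such experiments one typically observes the second printed quantity decaying quickly relative to the residual, which is the energy cascade from large to small singular values described above.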


Key words

stochastic gradient descent, Kaczmarz method, least-squares, regularization

AMS subject classifications

65F10, 65K10, 65K15, 90C06, 93E24
