Volume 54, pp. 610-619, 2021.
On the regularization effect of stochastic gradient descent applied to least-squares
Stefan Steinerberger
Abstract
We study the behavior of the stochastic gradient descent method applied to $\|Ax -b \|_2^2 \rightarrow \min$ for invertible matrices $A \in \mathbb{R}^{n \times n}$. We show that there is an explicit constant $c_{A}$ depending (mildly) on $A$ such that $$ \mathbb{E}\left\| Ax_{k+1}-b\right\|^2_{2} \leq \left(1 + \frac{c_{A}}{\|A\|_F^2}\right) \left\|A x_k -b \right\|^2_{2} - \frac{2}{\|A\|_F^2} \left\|A^T A (x_k - x)\right\|^2_{2},$$ where $x$ denotes the solution of $Ax = b$. This is a curious inequality since the last term involves one additional matrix multiplication applied to the error $x_k - x$ compared to the remaining terms: if the projection of $x_k - x$ onto the subspace of singular vectors corresponding to large singular values is large, then the stochastic gradient descent method leads to fast regularization. For symmetric matrices, this inequality has an extension to higher-order Sobolev spaces. This explains a (known) regularization phenomenon: an energy cascade from large singular values to small singular values acts as a regularizer.
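To make the statement concrete, the following is a minimal numerical sketch (not code from the paper): it runs the randomized Kaczmarz iteration, a standard instance of stochastic gradient descent for $\|Ax-b\|_2^2$ in which the $i$-th row is sampled with probability $\|a_i\|_2^2/\|A\|_F^2$, and tracks the error $x_k - x$ in the basis of right-singular vectors of $A$. One typically observes that the components along large singular values decay first, consistent with the energy cascade described above; the problem size, seed, and variable names are illustrative assumptions.

```python
import numpy as np

# Minimal sketch (not the paper's code): the randomized Kaczmarz iteration,
# one standard instance of SGD for ||Ax - b||_2^2, with the i-th row sampled
# with probability ||a_i||^2 / ||A||_F^2.  The error x_k - x is tracked in the
# basis of right-singular vectors of A to illustrate the energy cascade from
# large to small singular values.  Problem size and seed are illustrative.

rng = np.random.default_rng(0)
n = 100
A = rng.standard_normal((n, n))            # invertible with probability 1
x_true = rng.standard_normal(n)
b = A @ x_true

U, s, Vt = np.linalg.svd(A)                # singular values s are in decreasing order
row_norms_sq = np.sum(A**2, axis=1)
probs = row_norms_sq / row_norms_sq.sum()  # ||a_i||^2 / ||A||_F^2

x = np.zeros(n)
for k in range(20001):
    if k % 5000 == 0:
        coeff = Vt @ (x - x_true)            # error coefficients in the singular basis
        large = np.linalg.norm(coeff[:10])   # along the 10 largest singular values
        small = np.linalg.norm(coeff[-10:])  # along the 10 smallest singular values
        print(f"k={k:6d}  residual={np.linalg.norm(A @ x - b):.3e}  "
              f"large-sv error={large:.3e}  small-sv error={small:.3e}")
    i = rng.choice(n, p=probs)
    a_i = A[i]
    x = x + (b[i] - a_i @ x) / row_norms_sq[i] * a_i   # Kaczmarz / SGD step
```

With this sampling the update equals, in expectation, a gradient step of $\|Ax-b\|_2^2$ with step size $1/(2\|A\|_F^2)$, which is why the normalization $\|A\|_F^2$ appears in the inequality above.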
Key words
stochastic gradient descent, Kaczmarz method, least-squares, regularization
AMS subject classifications
65F10, 65K10, 65K15, 90C06, 93E24