Panagiotis (Panos) Toulis
Harvard University,
Department of Statistics
ptoulis at fas dot harvard dot edu


What is implicit stochastic gradient descent?

Implicit stochastic gradient descent is a modification of the classical SGD algorithm, as follows. Let $\theta_\star$ be the parameters of a model that produces observations $y_t$. Implicit SGD defines the iterative estimation procedure $\theta_t = \theta_{t-1} + a_t C \nabla \ell(\theta_t; y_t)$, where $a_t$ is the learning rate -- usually $a_t = \alpha / (\alpha + t)$ -- and $C$ is an appropriate positive semi-definite condition matrix (e.g., the identity). The update in this procedure is implicit because $\theta_t$ appears on both sides of the equation. In general such updates are hard to perform; however, for a large family of generalized linear models (GLMs) the implicit update can be computed very fast (see Algorithm 1, Toulis (2014)).

The proof is straightforward. In all generalized linear models the log-likelihood (equivalently, the negative loss) has the form $\ell(\theta_t; y_t) \equiv \ell(\theta_t^\intercal x_t; y_t)$ for a vector of covariates (features) $x_t$. Thus, $\nabla \ell(\theta_t^\intercal x_t; y_t) = \ell'(\theta_t^\intercal x_t; y_t) \cdot x_t = \frac{\ell'(\theta_t^\intercal x_t; y_t)}{\ell'(\theta_{t-1}^\intercal x_t; y_t)} \nabla \ell(\theta_{t-1}^\intercal x_t; y_t)$. The implicit update (assume $C = \mathbf{I}$ for simplicity) is then $\theta_{t} = \theta_{t-1} + a_t \nabla \ell(\theta_t^\intercal x_t; y_t) = \theta_{t-1} + a_t \xi_t \nabla \ell(\theta_{t-1}^\intercal x_t; y_t)$, where $\xi_t = \ell'(\theta_{t-1}^\intercal x_t + a_t \xi_t z_t; y_t) / \ell'(\theta_{t-1}^\intercal x_t; y_t)$ and $z_t = \nabla \ell(\theta_{t-1}^\intercal x_t; y_t)^\intercal x_t = \ell'(\theta_{t-1}^\intercal x_t; y_t) ||x_t||^2$. Thus the implicit update also uses the gradient at $\theta_{t-1}$, but scaled by $\xi_t \in \mathbf{R}$. This scalar is itself defined through an implicit equation, but that equation is one-dimensional and therefore much easier to solve (narrow search bounds are also available).
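As a concrete illustration of the one-dimensional search, here is a minimal sketch (not the author's reference implementation) of one implicit GLM update, solving for $\xi_t$ by bisection over $[0, 1]$ -- a valid bracket when the log-likelihood is concave in the linear predictor. The function names and the logistic-regression example at the end are assumptions for illustration.

```python
import numpy as np

def implicit_glm_update(theta, x, y, a, lprime):
    """One implicit SGD step for a GLM with log-likelihood l(theta'x; y).

    lprime(eta, y) is the derivative of the log-likelihood with respect
    to the linear predictor eta = theta'x.
    """
    eta = theta @ x
    g = lprime(eta, y)                  # l'(theta_{t-1}' x_t; y_t)
    if g == 0.0:
        return theta                    # zero gradient: no update
    z = g * (x @ x)                     # z_t = l'(...) * ||x_t||^2
    # xi_t solves xi = l'(eta + a*xi*z; y) / l'(eta; y).  For a concave
    # log-likelihood the root lies in [0, 1], since f(0) = -1 < 0 <= f(1):
    f = lambda xi: xi - lprime(eta + a * xi * z, y) / g
    lo, hi = 0.0, 1.0
    for _ in range(60):                 # bisection on the bracket [lo, hi]
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if f(mid) < 0 else (lo, mid)
    xi = 0.5 * (lo + hi)
    return theta + a * xi * g * x

# Assumed example: logistic regression, where l'(eta; y) = y - sigmoid(eta)
sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))
theta = implicit_glm_update(np.zeros(2), np.array([1.0, 2.0]), 1.0, 0.5,
                            lambda eta, y: y - sigmoid(eta))
```

For the Gaussian model ($\ell'(\eta; y) = y - \eta$) the fixed point is available in closed form, $\xi_t = (1 + a_t ||x_t||^2)^{-1}$, which gives a quick correctness check for the bisection.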
Furthermore, as $a_t \to 0$, a Taylor approximation gives $\xi_t \approx (1 - a_t ||x_t||^2 \ell''(\theta_{t-1}^\intercal x_t; y_t))^{-1} = (1 + a_t \mathrm{trace}\, \mathcal{I}(\theta_{t-1}; x_t, y_t))^{-1}$, where $\mathcal{I}(\theta_{t-1}; x_t, y_t) = -\nabla^2 \ell(\theta_{t-1}^\intercal x_t; y_t)$ is the Hessian of the negative log-likelihood, also known as the observed Fisher information at the data point $(x_t, y_t)$. The implicit updates therefore use second-order information: they combine the computational efficiency of first-order methods with the stability of second-order ones, and they are easy to implement and essentially costless to perform.
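To see the quality of this approximation, a small numerical sketch (all specific values below are illustrative assumptions) compares the exact $\xi_t$, computed by fixed-point iteration on its one-dimensional equation, with the Taylor formula for logistic regression, where $\mathrm{trace}\, \mathcal{I}(\theta; x, y) = ||x||^2 \sigma(\eta)(1 - \sigma(\eta))$ for $\eta = \theta^\intercal x$:

```python
import numpy as np

sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))

def exact_xi(eta, y, a, norm2, iters=100):
    """Solve xi = l'(eta + a*xi*z; y) / l'(eta; y) by fixed-point iteration
    (contractive for these values, since a * norm2 * |l''| < 1)."""
    g = y - sigmoid(eta)                 # l'(eta; y) for logistic regression
    z = g * norm2                        # z_t = l'(eta; y) * ||x||^2
    xi = 1.0
    for _ in range(iters):
        xi = (y - sigmoid(eta + a * xi * z)) / g
    return xi

# Assumed illustrative values of eta = theta'x, y, and ||x||^2
eta, y, norm2 = 0.3, 1.0, 2.0
fisher_trace = norm2 * sigmoid(eta) * (1.0 - sigmoid(eta))  # ||x||^2 * (-l'')
for a in (0.5, 0.1, 0.01):
    print(a, exact_xi(eta, y, a, norm2), 1.0 / (1.0 + a * fisher_trace))
```

The two quantities agree closely already for moderate learning rates, and the gap vanishes as $a_t \to 0$, consistent with the Taylor argument.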
In the same paper, we establish that $\theta_t \to \theta_\star$ and also give asymptotic variance results. In particular, if $t a_t \to \alpha > 0$, then for both standard and implicit SGD, $t \cdot \mathbf{E}(\theta_t - \theta_\star) (\theta_t - \theta_\star)^\intercal \to \alpha^2 (2\alpha C \mathcal{I}(\theta_\star) - \mathbb{I})^{-1} C \mathcal{I}(\theta_\star) C^\intercal, \quad \quad (1)$ where $\mathcal{I}(\theta) = \mathbf{E} (\nabla \ell(\theta; y) \nabla \ell(\theta; y)^\intercal) = - \mathbf{E}(\nabla^2 \ell(\theta; y))$ is the Fisher information matrix of the model and $\mathbb{I}$ is the identity matrix; form (1) is proved here (note: $C$ needs to commute with $\mathcal{I}(\theta_\star)$). Result (1) reveals that SGD procedures incur an information loss by a factor $\alpha^2 (2\alpha C \mathcal{I}(\theta_\star) - \mathbb{I})^{-1}$. To reduce this loss, the condition matrix needs to account for the Fisher information; in the general case, $C$ needs to account for the curvature of the regression function of the statistic used in the SGD iterate. When that statistic is the gradient of the log-likelihood (also known as the score function), the curvature coincides with the Fisher information under typical regularity conditions. Optimal estimation is obtained only when $C = \mathcal{I}(\theta_\star)^{-1}$, in which case the variance of $\theta_t$ is $(1/t)\, \mathcal{I}(\theta_\star)^{-1}$, i.e., the variance of the asymptotically optimal MLE. I discuss more details in my tutorial on Stochastic Gradient Descent (in progress).
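A quick numerical check of the limit in (1) -- using an arbitrary, assumed positive-definite Fisher matrix -- confirms that the optimal choice $C = \mathcal{I}(\theta_\star)^{-1}$ (which trivially commutes with $\mathcal{I}(\theta_\star)$) with $\alpha = 1$ recovers $\mathcal{I}(\theta_\star)^{-1}$, and that the identity condition matrix incurs an information loss:

```python
import numpy as np

def sgd_limit_cov(alpha, C, fisher):
    """Right-hand side of (1): alpha^2 (2 alpha C I - Id)^{-1} C I C'."""
    d = fisher.shape[0]
    M = 2.0 * alpha * C @ fisher - np.eye(d)   # must be positive definite
    return alpha**2 * np.linalg.inv(M) @ C @ fisher @ C.T

fisher = np.array([[2.0, 0.5],
                   [0.5, 1.0]])                # assumed example I(theta_star)
optimal = sgd_limit_cov(1.0, np.linalg.inv(fisher), fisher)
print(np.allclose(optimal, np.linalg.inv(fisher)))   # True
```

With $C = \mathbb{I}$ the same function returns a covariance that dominates $\mathcal{I}(\theta_\star)^{-1}$ in the positive-semidefinite order, matching the information-loss interpretation of (1).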

Why implicit?

Implicit SGD has significant stability advantages over classical SGD. Intuitively, it is a shrunken version of the typical (explicit) SGD update, with a shrinkage factor that depends on the observed Fisher information. In the limit the two methods are statistically equivalent, but in small-to-moderate samples implicit SGD is significantly more stable, and robust to misspecification of the learning rate and to outliers. Implicit SGD can also be viewed as a stochastic proximal algorithm, since it solves $\theta_t = \arg \max_{\theta} \{ a_t \ell(\theta; y_t) - \frac{1}{2} ||\theta-\theta_{t-1}||^2 \}$. This formulation has a Bayesian interpretation in which $\theta_t$ is the posterior mode when the prior distribution of $\theta_t$ is normal, $\mathcal{N}(\theta_{t-1}, a_t \mathbb{I})$, and the observation $y_t$ has likelihood $\exp(\ell(\theta; y_t))$; note that the prior concentrates around $\theta_{t-1}$ as $t \to \infty$, since $a_t \to 0$. Naturally, there are also connections to proximal methods in optimization.
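The stability claim can be sketched numerically. For the Gaussian linear model the implicit update has the closed form $\theta_t = \theta_{t-1} + \frac{a_t}{1 + a_t ||x_t||^2} (y_t - \theta_{t-1}^\intercal x_t) x_t$, i.e., exactly the explicit step shrunk by $\xi_t = (1 + a_t ||x_t||^2)^{-1}$. In the simulation below (the synthetic data, seed, and deliberately aggressive $\alpha$ are illustrative assumptions), explicit SGD is destabilized by the large learning rate while implicit SGD converges:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 5, 500
theta_star = np.ones(d)
X = rng.normal(size=(n, d))                   # assumed synthetic covariates
y = X @ theta_star + rng.normal(size=n)       # Gaussian linear model

def run_sgd(implicit, alpha=100.0):
    theta = np.zeros(d)
    for t in range(1, n + 1):
        a = alpha / (alpha + t)               # a_t = alpha / (alpha + t)
        x = X[t - 1]
        r = y[t - 1] - theta @ x              # residual = l'(theta'x; y)
        if implicit:
            # closed-form implicit step: explicit step shrunk by xi_t
            theta = theta + a / (1.0 + a * (x @ x)) * r * x
        else:
            theta = theta + a * r * x         # explicit SGD step
    return theta

err_implicit = np.linalg.norm(run_sgd(True) - theta_star)
err_explicit = np.linalg.norm(run_sgd(False) - theta_star)
print(err_implicit, err_explicit)
```

The automatic shrinkage keeps every implicit step bounded regardless of $a_t$ or $||x_t||$, which is the intuition behind the robustness to learning-rate misspecification mentioned above.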

Relevant theory and code