## What is implicit stochastic gradient descent?

Let $z_t = (x_t, y_t)$ be i.i.d. observations of a model with parameters $\theta_\star$.
To estimate $\theta_\star$ through classical stochastic gradient descent (SGD) amounts to the procedure
*learning rate* of the procedure (e.g., $\gamma_t = \gamma / (\gamma_0 + t))$,
and $C$ is an appropriate symmetric psd condition matrix.
This approach is well-studied and has been used extensively in applications.
However, one major issue is stability: classical SGD is heavily affected by misspecifications of the learning rate $\gamma_t$,
and can often explode. Currently, this issue is solved in an ad-hoc way (e.g., projected SGD, gradient clipping, robust transformations, etc.)
**Implicit SGD** is a modification of the classical algorithm that aims to solve this problem in a principled way.
Often, we can write the method as follows
*implicit* because $\theta_t$ appears on both sides of the equation.
In general such updates are hard to compute, however for a large family of statistical models (e.g., generalized linear models)
the implicit update can be computed very fast (see Algorithm 1, Toulis et.al. (2014)).
Running implicit SGD is then equivalent to running
*curvature* of the regression function of the statistic used in the SGD iterate;
in the typical cases shown above, this curvature coincides with the Fisher information. *Optimal* estimation is obtained only when $C = \mathcal{I}(\theta_\star)^{-1}$, in which case
the variance of $\theta_t$ is $(1/t) \mathcal{I}(\theta_\star)^{-1}$ i.e., the variance of the asymptotically optimal MLE.
I discuss more details in my tutorial on Stochastic Gradient Descent (under progress).

## Why implicit?

Implicit SGD has significant stability advantages over the classical SGD.
Intuitively, it is a *shrinked* version of the typical (explicit) SGD and the shrinkage factor
depends on the observed Fisher information. Thus, implicit SGD is significantly more stable than explicit SGD.
In the limit, both methods are statistically equivalent, but in small-to-moderate samples, implicit SGD is more stable,
and robust to misspecifications of the learning rate, or outliers. Implicit SGD can also be viewed as a stochastic proximal algorithm
since it solves
**"Implicit stochastic approximation" ** (under review)
( pdf)
I show how implicit SGD is a special instance of a more general implicit stochastic approximation procedure, much like
how classical SGD is a special instance of stochastic approximation.

## Relevant theory and code

- Implicit Stochastic Approximation. (Panos Toulis, Edoardo M. Airoldi,
**"Implicit stochastic approximation"**( pdf) -
For an implementation of implicit SGD procedures take a look at the
**sgd**R package here. - Extensive theory and methods are in the paper,

Panos Toulis, Edoardo M. Airoldi,**"Implicit stochastic gradient descent for principled estimation with large datasets"**(under review) ( pdf) - Implicit stochastic gradient descent was originally introduced for generalized linear models (GLM)
in the following paper.

Panos Toulis, Jason Rennie, Edoardo Airoldi,**"Statistical analysis of stochastic gradient methods for generalized linear models"**, ICML, Beijing, China, 2014 ( pdf) **R code**implementing SGD and implicit SGD for any Generalized Linear Model (GLM) is currently maintained in the following Github link. For the experiments presented in the paper, look for the file`"examples/icml2014.R"`

.**C++ code**implementing implicit online learning methods for SVM is available as a standalone C++ source file and a MAKEFILE. To compile you will need Leon Bottou's SGD 2.0 code. Then download the implicit C++ file and its MAKEFILE and save them in the "svm/" folder in Bottou's codebase. You should be able to compile and run the "svmimplicit" program.