## What is implicit stochastic gradient descent?

Implicit stochastic gradient descent is a modification of the
typical SGD algorithm, as follows. Let $\theta_\star$ be the parameters of a model that produces
observations $y_t$. Then, implicit SGD defines the following iterative estimation procedure

$$\theta_t = \theta_{t-1} + a_t C \nabla \ell(\theta_t; y_t),$$

where $\ell$ is the log-likelihood, $a_t > 0$ is the *learning rate* of the procedure -- usually $a_t = \alpha / (\alpha + t)$ --
and $C$ is an appropriate positive semidefinite (psd) condition matrix (e.g., the identity).
The update in this procedure is *implicit* because $\theta_t$ appears on both sides of the equation.
In general such updates are hard to perform; however, for a large family of generalized linear models (GLMs)
the implicit update can be computed very fast (see Algorithm 1 in Toulis et al. (2014)).
The proof is straightforward. In all (generalized) linear models the log-likelihood (equivalently, the negative loss) is of the
form $\ell(\theta_t; y_t) \equiv \ell(\theta_t^\intercal x_t; y_t)$ for a vector of covariates (features) $x_t$.
Thus, $\nabla \ell(\theta_t^\intercal x_t; y_t) = \ell'(\theta_t^\intercal x_t; y_t) \cdot x_t = \frac{\ell'(\theta_t^\intercal x_t; y_t)}{\ell'(\theta_{t-1}^\intercal x_t; y_t)} \nabla \ell(\theta_{t-1}^\intercal x_t; y_t)$.
The implicit update (assume $C = \mathbf{I}$ for simplicity) is then

$$\theta_t = \theta_{t-1} + a_t \lambda_t \nabla \ell(\theta_{t-1}^\intercal x_t; y_t),$$

where the scalar $\lambda_t = \ell'(\theta_t^\intercal x_t; y_t) / \ell'(\theta_{t-1}^\intercal x_t; y_t)$ satisfies a one-dimensional fixed-point equation, because $\theta_t - \theta_{t-1}$ lies along $x_t$; it can thus be computed by a fast one-dimensional search.
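As an illustrative sketch of this one-dimensional search (the function name and interface here are hypothetical, not the paper's Algorithm 1), consider logistic regression with the learning rate $a_t = \alpha/(\alpha + t)$ from above. The scalar step $s_t$ solves $s = a_t \, \ell'(\theta_{t-1}^\intercal x_t + s \|x_t\|^2; y_t)$, which can be bracketed and found by bisection:

```python
import numpy as np

def implicit_sgd_logistic(X, y, alpha=1.0, n_passes=1):
    """Implicit SGD for logistic regression (a GLM): at each step, solve
    the scalar fixed point s = a_t * l'(eta + s * ||x||^2; y) by bisection.
    Illustrative sketch only, not the authors' released code."""
    n, p = X.shape
    theta = np.zeros(p)
    t = 0
    for _ in range(n_passes):
        for i in range(n):
            t += 1
            a_t = alpha / (alpha + t)   # learning rate a_t = alpha / (alpha + t)
            x, yi = X[i], y[i]
            eta = theta @ x             # linear predictor theta' x
            norm2 = x @ x
            # For logistic regression, l'(u; y) = y - 1/(1 + exp(-u)).
            def resid(s):
                return a_t * (yi - 1.0 / (1.0 + np.exp(-(eta + s * norm2)))) - s
            # l'(u; y) lies in (y - 1, y), so the root lies in [a_t*(y-1), a_t*y];
            # resid is decreasing in s, so bisection converges.
            lo, hi = a_t * (yi - 1.0), a_t * yi
            for _ in range(50):
                mid = 0.5 * (lo + hi)
                if resid(mid) > 0.0:
                    lo = mid
                else:
                    hi = mid
            theta = theta + 0.5 * (lo + hi) * x   # theta_t = theta_{t-1} + s_t * x_t
    return theta
```

The key point is that, despite the update being implicit in $\theta_t \in \mathbb{R}^p$, only a one-dimensional root-finding problem needs to be solved per observation.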

In the same paper, we establish that $\theta_t \to \theta_\star$ and also give asymptotic variance results.
In particular, if $t a_t \to \alpha > 0$, then for both standard and implicit SGD,

$$t \, \mathrm{Var}(\theta_t) \to \alpha^2 \left(2 \alpha C \mathcal{I}(\theta_\star) - \mathbf{I}\right)^{-1} C \mathcal{I}(\theta_\star) C^\intercal,$$

where $\mathcal{I}(\theta_\star)$ is the *curvature* of the regression function of the statistic used in the SGD iterate;
when we use the gradient of the log-likelihood (also known as the score function), this curvature coincides with the Fisher information under typical
regularity conditions. *Optimal* estimation is obtained only when $C = \mathcal{I}(\theta_\star)^{-1}$ (with $\alpha = 1$), in which case
the variance of $\theta_t$ is $(1/t) \mathcal{I}(\theta_\star)^{-1}$, i.e., the variance of the asymptotically optimal MLE.
I discuss more details in my tutorial on Stochastic Gradient Descent (in progress).

## Why implicit?

Implicit SGD has significant stability advantages over classical SGD.
Intuitively, it is a *shrunken* version of the typical (explicit) SGD update, where the shrinkage factor
depends on the observed Fisher information. As a result, implicit SGD is significantly more stable than explicit SGD.
In the limit, both methods are statistically equivalent, but in small-to-moderate samples implicit SGD is more stable
and more robust to misspecification of the learning rate and to outliers. Implicit SGD can also be viewed as a stochastic proximal algorithm,
since it solves

$$\theta_t = \arg\max_{\theta} \left\{ a_t \ell(\theta; y_t) - \frac{1}{2} \|\theta - \theta_{t-1}\|^2 \right\}.$$
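The shrinkage can be made fully explicit for squared-error loss (linear regression), where the implicit update has a closed form. A minimal sketch, illustrative and not the authors' released code:

```python
import numpy as np

def explicit_update(theta, x, y, a_t):
    """Standard (explicit) SGD step: theta + a_t * (y - theta'x) * x."""
    return theta + a_t * (y - theta @ x) * x

def implicit_update(theta, x, y, a_t):
    """Implicit SGD step for squared-error loss, in closed form.

    Solving theta_t = theta + a_t * (y - theta_t'x) * x for theta_t gives
    the explicit step shrunk by the factor 1 / (1 + a_t * ||x||^2).
    """
    shrink = 1.0 / (1.0 + a_t * (x @ x))
    return theta + shrink * a_t * (y - theta @ x) * x
```

Even for a very large learning rate $a_t$, the implicit step stays bounded because the shrinkage factor decays like $1/a_t$; this is one way to see the robustness to learning-rate misspecification mentioned above.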

## Relevant theory and code

- Extensive theory and methods are in the paper:

  Panos Toulis, Edoardo M. Airoldi, **"Implicit stochastic gradient descent for principled estimation with large datasets"** (under review) (pdf)

- Implicit stochastic gradient descent was originally introduced for generalized linear models (GLMs) in the following paper:

  Panos Toulis, Jason Rennie, Edoardo Airoldi, **"Statistical analysis of stochastic gradient methods for generalized linear models"**, ICML, Beijing, China, 2014 (pdf)

- **R code** implementing SGD and implicit SGD for any generalized linear model (GLM) is currently maintained in the following Github link. For the experiments presented in the paper, look for the file `examples/icml2014.R`.

- **C++ code** implementing implicit online learning methods for SVMs is available as a standalone C++ source file and a Makefile. To compile, you will need Leon Bottou's SGD 2.0 code. Then download the implicit C++ file and its Makefile and save them in the `svm/` folder of Bottou's codebase. You should then be able to compile and run the `svmimplicit` program.