1 Introduction
Many optimization questions arising in machine learning can be cast as a finite-sum optimization problem of the form
$$\min_{x \in \mathbb{R}^d} f(x), \qquad f(x) := \frac{1}{k}\sum_{i=1}^{k} f_i(x),$$
where each $f_i : \mathbb{R}^d \to \mathbb{R}$. Most neural network problems also fall under a similar structure, where each function
$f_i$ is typically non-convex. A well-studied algorithm for solving such problems is Stochastic Gradient Descent (SGD), which uses updates of the form
$$x_{t+1} = x_t - \alpha_t \nabla f_{i_t}(x_t),$$
where $\alpha_t$ is a step size and $f_{i_t}$ is a function chosen uniformly at random from $\{f_1, \ldots, f_k\}$ at time $t$. Often in neural networks, “momentum” is added to the SGD update to yield the two-step update
$$v_{t+1} = \mu v_t - \alpha_t \nabla f_{i_t}(x_t) \quad \text{followed by} \quad x_{t+1} = x_t + v_{t+1}.$$
This algorithm is typically called the Heavy-Ball (HB) method (or sometimes classical momentum), with $\mu$ called the momentum parameter [polyak1987introduction]. In the context of neural nets, another popular variant of SGD is Nesterov’s Accelerated Gradient (NAG), which can also be thought of as a momentum method [sutskever2013importance], and which has updates of the form
$$v_{t+1} = \mu v_t - \alpha_t \nabla f_{i_t}(x_t + \mu v_t) \quad \text{followed by} \quad x_{t+1} = x_t + v_{t+1}$$
(see Algorithm 1 for more details).

Momentum methods like HB and NAG have been shown to have superior convergence properties compared to gradient descent in the deterministic setting, both for convex and non-convex functions [nesterov1983method, polyak1987introduction, zavriev1993heavy, ochs2016local, o2017behavior, jin2017accelerated]. While (to the best of our knowledge) there is no clear theoretical justification in the stochastic case for the benefits of NAG and HB over regular SGD in general [yuan2016influence, kidambi2018on, wiegerinck1994stochastic, orr1994momentum, yang2016unified, gadat2018stochastic], unless one considers specialized function classes [loizou2017momentum], in practice these momentum methods, and in particular NAG, have repeatedly been shown to have good convergence and generalization on a range of neural net problems [sutskever2013importance, lucas2018aggregated, kidambi2018on].
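As a concrete illustration, the SGD, HB, and NAG updates above can be sketched as follows (a minimal sketch; the function and variable names are ours, and `grad` stands in for the stochastic first-order oracle $\nabla f_{i_t}$):

```python
import numpy as np

def sgd_step(x, grad, alpha):
    # SGD: x_{t+1} = x_t - alpha * grad(x_t)
    return x - alpha * grad(x)

def hb_step(x, v, grad, alpha, mu):
    # Heavy-Ball: v_{t+1} = mu*v_t - alpha*grad(x_t), then x_{t+1} = x_t + v_{t+1}
    v = mu * v - alpha * grad(x)
    return x + v, v

def nag_step(x, v, grad, alpha, mu):
    # NAG: same two-step form, but the gradient is taken at the
    # "look-ahead" point x_t + mu*v_t
    v = mu * v - alpha * grad(x + mu * v)
    return x + v, v
```

On a simple quadratic (where `grad = lambda z: z`), all three iterations contract toward the minimizer at the origin, which makes the shared two-step structure of HB and NAG easy to compare side by side.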
The performance of NAG (as well as of HB and SGD), however, is typically quite sensitive to the selection of its hyperparameters: step size, momentum, and batch size [sutskever2013importance]. Thus, “adaptive gradient” algorithms such as RMSProp (Algorithm 2) [tieleman2012lecture] and ADAM (Algorithm 3) [kingma2014adam] have become very popular for optimizing deep neural networks [melis2017state, xu2015show, denkowski2017stronger, gregor2015draw, radford2015unsupervised, bahar2017empirical, kiros2015skip]. The reason for their widespread popularity seems to be that they are believed to be easier to tune than SGD, NAG or HB. Adaptive gradient methods use as their update direction a vector which is the image, under a linear transformation (often called the “diagonal preconditioner”) constructed out of the history of the gradients, of a linear combination of all the gradients seen so far. It is generally believed that this “preconditioning” makes these algorithms much less sensitive to the selection of their hyperparameters. A precursor to RMSProp and ADAM was outlined in
[duchi2011adaptive].

Despite their widespread use in neural net problems, adaptive gradient methods like RMSProp and ADAM lack theoretical justification in the non-convex setting, even with exact/deterministic gradients [bernstein2018signsgd]. Further, there are important motivations to study the behavior of these algorithms in the deterministic setting, because of use-cases where the amount of noise is controlled during optimization, either by using larger batches [martens2015optimizing, de2017automated, babanezhad2015stop] or by employing variance-reducing techniques [johnson2013accelerating, defazio2014saga].

Further, works like [wilson2017marginal] and [keskar2017improving] have shown cases where SGD (no momentum) and HB (classical momentum) generalize much better than RMSProp and ADAM with stochastic gradients. [wilson2017marginal]
also show that ADAM generalizes poorly for large enough nets and that RMSProp generalizes better than ADAM on a couple of neural network tasks (most notably in the character-level language modeling task). But in general it is not clear, and to the best of our knowledge no heuristics are known, to decide whether these insights about relative performance (in generalization or training) between algorithms hold for other models or carry over to the full-batch setting.
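To make the “diagonal preconditioner” idea described above concrete, here is a minimal sketch of the AdaGrad-style update of [duchi2011adaptive], on which RMSProp and ADAM build (the names and constants here are ours and purely illustrative, not the paper's algorithms):

```python
import numpy as np

def adagrad_step(x, g_sq_sum, grad, alpha, eps=1e-8):
    # Accumulate the coordinate-wise squares of all gradients seen so far.
    # The "diagonal preconditioner" is diag(1 / sqrt(accumulated squares)):
    # it rescales each coordinate of the gradient individually.
    g = grad(x)
    g_sq_sum = g_sq_sum + g**2
    x = x - alpha * g / (np.sqrt(g_sq_sum) + eps)
    return x, g_sq_sum
```

The key point the sketch shows is that the update direction is not the raw gradient but its image under a history-dependent diagonal linear map, which is what is believed to reduce sensitivity to the step-size choice.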
A summary of our contributions
In this work we try to shed some light on the open questions described above about adaptive gradient methods, in the following two ways.

To the best of our knowledge, this work gives the first convergence guarantees for adaptive-gradient-based standard neural-net training heuristics. Specifically, we show run-time bounds for deterministic RMSProp and ADAM to reach approximate criticality on smooth non-convex functions, as well as for stochastic RMSProp under an additional assumption.
Recently, [reddi2018convergence] have shown, in the setting of online convex optimization, that there are certain sequences of convex functions on which ADAM and RMSProp fail to converge to asymptotically zero average regret. We contrast our findings with the results in [reddi2018convergence]. Their counterexample for ADAM is constructed in the stochastic optimization framework and is incomparable to our result about deterministic ADAM. Our proof of convergence to approximate critical points establishes the key conceptual point that for adaptive gradient algorithms one cannot transfer intuitions about convergence from online setups to their more common use case in offline setups.

Our second contribution is an empirical investigation into adaptive gradient methods, where our goals are different from what our theoretical results probe. We test the convergence and generalization properties of RMSProp and ADAM, and we compare their performance against NAG, on a variety of autoencoder experiments on MNIST data, in both full-batch and mini-batch settings. In the full-batch setting, we demonstrate that ADAM with very high values of the momentum parameter $\beta_1$ matches or outperforms carefully tuned NAG and RMSProp in terms of getting lower training and test losses. We show that as the autoencoder size keeps increasing, RMSProp fails to generalize fairly quickly. In the mini-batch experiments we see exactly the same behaviour for large enough nets. We further validate this behavior on an image classification task on CIFAR-10 using a VGG-9 convolutional neural network, the results of which we present in Appendix LABEL:vgg9_sec.

We note that recently it has been shown by [lucas2018aggregated] that there are problems where NAG generalizes better than ADAM even after tuning its momentum parameter. In contrast, our experiments reveal controlled setups where tuning ADAM's $\beta_1$ closer to $1$ than is usual practice helps close the generalization gap with NAG and HB which exists at standard values of $\beta_1$.
Remark.
Much after this work was completed, we came to know of a related paper [li2018convergence] which analyzes convergence rates of a modification of AdaGrad (not RMSProp or ADAM). After the initial version of our work was made public, a few other analyses of adaptive gradient methods have also appeared, such as [chen2018convergence], [zhou2018convergence] and [zaheer2018adaptive].
2 Notations and Pseudocodes
Firstly, we define the smoothness property that we assume in our proofs for all our non-convex objectives. This is a standard assumption in the optimization literature.
Definition 1.
($L$-smoothness) If $f : \mathbb{R}^d \to \mathbb{R}$ is at least once differentiable, then we call it $L$-smooth for some $L > 0$ if for all $x, y \in \mathbb{R}^d$ the following inequality holds:
$$f(y) \le f(x) + \langle \nabla f(x), y - x \rangle + \frac{L}{2}\lVert y - x \rVert^2.$$
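As a worked example, the quadratic $f(x) = \tfrac{1}{2}x^\top A x$ with $A \succeq 0$ is $L$-smooth with $L = \lambda_{\max}(A)$; the snippet below (an illustrative check, with names of our choosing) numerically verifies the inequality $f(y) \le f(x) + \langle \nabla f(x), y - x\rangle + \tfrac{L}{2}\lVert y - x\rVert^2$ at random pairs of points:

```python
import numpy as np

rng = np.random.default_rng(0)
A = np.array([[2.0, 0.0], [0.0, 1.0]])   # f(x) = 0.5 * x^T A x
f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x
L = np.max(np.linalg.eigvalsh(A))        # largest eigenvalue of A

# Check the smoothness upper bound at random pairs (x, y)
for _ in range(100):
    x, y = rng.normal(size=2), rng.normal(size=2)
    ub = f(x) + grad(x) @ (y - x) + 0.5 * L * np.dot(y - x, y - x)
    assert f(y) <= ub + 1e-12
```

For this quadratic the gap is exactly $\tfrac{1}{2}(y-x)^\top A (y-x) \le \tfrac{L}{2}\lVert y-x\rVert^2$, so the bound is tight along the top eigendirection of $A$.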
We need one more definition, that of the square root of the Penrose inverse of diagonal matrices.
Definition 2.
(Square root of the Penrose inverse) If $A = \sum_{i=1}^{d} a_i e_i e_i^\top$ with $a_i \ge 0$, then we define $A^{-\frac{1}{2}} := \sum_{i : a_i > 0} \frac{1}{\sqrt{a_i}} e_i e_i^\top$, where $\{e_i\}_{i=1}^{d}$ is the standard basis of $\mathbb{R}^d$.
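For diagonal matrices this operation is simple to compute coordinate-wise: invert the square root only on the strictly positive diagonal entries and map zero entries to zero. A minimal sketch (the function name is ours):

```python
import numpy as np

def sqrt_penrose_inverse(A):
    # A is assumed diagonal with nonnegative entries.
    # Map each diagonal entry a_i > 0 to 1/sqrt(a_i), and a_i = 0 to 0.
    d = np.diag(A)
    safe = np.where(d > 0, d, 1.0)           # placeholder to avoid div-by-zero
    out = np.where(d > 0, 1.0 / np.sqrt(safe), 0.0)
    return np.diag(out)
```

This is exactly the map used by adaptive gradient methods to turn the accumulated squared-gradient statistics into a preconditioner, with the zero entries handled in the Penrose sense.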
Now we list the pseudocodes used for NAG, RMSProp and ADAM in our theory and experiments:
Nesterov’s Accelerated Gradient (NAG) Algorithm
RMSProp Algorithm
ADAM Algorithm
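As a rough guide to the two adaptive algorithms named above, the commonly used forms of the RMSProp [tieleman2012lecture] and ADAM [kingma2014adam] updates can be sketched as follows (a hedged sketch with default constants of our choosing; details such as the $\epsilon$ placement may differ from Algorithms 2 and 3 as stated in this paper):

```python
import numpy as np

def rmsprop_step(x, v, grad, alpha, beta2=0.9, eps=1e-8):
    # v_t = beta2 * v_{t-1} + (1 - beta2) * g_t^2 (coordinate-wise);
    # the update is preconditioned by diag(1 / (sqrt(v_t) + eps)).
    g = grad(x)
    v = beta2 * v + (1 - beta2) * g**2
    return x - alpha * g / (np.sqrt(v) + eps), v

def adam_step(x, m, v, t, grad, alpha, beta1=0.9, beta2=0.999, eps=1e-8):
    # Exponential moving averages of the gradient (m) and its square (v),
    # with the standard bias-correction factors 1 - beta^t.
    g = grad(x)
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g**2
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    return x - alpha * m_hat / (np.sqrt(v_hat) + eps), m, v
```

Note that RMSProp is the special case of this ADAM sketch with $\beta_1 = 0$ and no bias correction, which is why results for the two algorithms are often proved in a common framework.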
3 Convergence Guarantees for ADAM and RMSProp
Previously it has been shown in [rangamani2017critical] that mini-batch RMSProp can, off the shelf, do autoencoding on deep autoencoders trained on MNIST data, while getting similar results using non-adaptive gradient descent methods requires much tuning of the step-size schedule. Here we give the first result about convergence to criticality for stochastic RMSProp, albeit under a certain technical assumption about the training set (and hence on the first-order oracle). Towards that we need the following definition.
Definition 3.
The sign function
We define the function $\mathrm{sign} : \mathbb{R} \to \{-1, 1\}$ such that it maps $x \mapsto \frac{x}{|x|}$ for $x \neq 0$, and $0 \mapsto 1$.
Theorem 3.1.
(Stochastic RMSProp converges to criticality; proof in subsection LABEL:sec:supp_rmspropS) Let $f$ be $L$-smooth and of the form $f = \frac{1}{k}\sum_{i=1}^{k} f_i$ s.t. (a) each $f_i$ is at least once differentiable, (b) the gradients are s.t. for all $x$ and all coordinates $j$, $\mathrm{sign}\left((\nabla f_i(x))_j\right)$ is the same for all $i \in \{1, \ldots, k\}$, (c) $\sigma$ is an upper bound on the norm of the gradients of the $f_i$, and (d) $f$ has a minimizer, i.e., there exists $x_*$ such that $f(x_*) = \min_{x} f(x)$. Let the gradient oracle be s.t. when invoked at some $x_t$ it uniformly at random picks $i_t \sim \mathrm{Unif}\{1, \ldots, k\}$ and returns $\nabla f_{i_t}(x_t)$. Then, corresponding to any $\epsilon > 0$ and a starting point $x_1$ for Algorithm 2, we can define $T$ (depending on $\epsilon$, $L$, $\sigma$ and $f(x_1) - f(x_*)$) s.t. we are guaranteed that the iterates of Algorithm 2 using an appropriately chosen constant step length will find an $\epsilon$-critical point in at most $T$ steps, in the sense that $\min_{t = 1, \ldots, T} \mathbb{E}\left[\lVert \nabla f(x_t) \rVert^2\right] \le \epsilon^2$.∎
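To illustrate the kind of oracle the theorem's sign assumption allows, consider functions $f_i(x) = \tfrac{1}{2} c_i \lVert x \rVert^2$ with $c_i > 0$: at every $x$ their gradients $c_i x$ agree in sign coordinate-wise. The sketch below (constants are ours and not the theorem's prescribed step length) runs the standard RMSProp recursion against such a randomized oracle:

```python
import numpy as np

rng = np.random.default_rng(1)
# f_i(x) = 0.5 * c_i * ||x||^2 with c_i > 0: the gradients c_i * x agree
# in sign coordinate-wise at every x, matching the theorem's assumption.
c = np.array([0.5, 1.0, 2.0])
oracle = lambda x: c[rng.integers(len(c))] * x   # gradient of a random f_i

x = np.array([1.0, -2.0])
v = np.zeros_like(x)
alpha, beta2, eps = 0.01, 0.9, 1e-8
for _ in range(3000):
    g = oracle(x)
    v = beta2 * v + (1 - beta2) * g**2   # RMSProp second-moment estimate
    x = x - alpha * g / (np.sqrt(v) + eps)
```

Despite the stochasticity of the oracle, every returned gradient points toward the common minimizer at the origin, so the iterates approach a small neighborhood of a critical point, as the theorem's guarantee suggests.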