Artificial Intelligence in Plain English

New AI, ML and Data Science articles every day. Follow to join our 3.5M+ monthly readers.


Averaging Weights Leads to Wider Optima and Better Generalization


Wider optima are goooooood. Source

Article Overview

Prerequisites

What’s up with ensembling?

Width of a local optimum


Why ensembling techniques work like a charm

Multiple loss surfaces, when working together, can act as one surface with a flatter optimum.

The general explanation for the importance of width is that the surfaces of train loss and test error are shifted with respect to each other and it is thus desirable to converge to the modes of broad optima, which stay approximately optimal under small perturbations.

Introducing the paper

We show that simple averaging of multiple points along the trajectory of SGD, with a cyclical or constant learning rate, leads to better generalization than conventional training. We also show that this Stochastic Weight Averaging (SWA) procedure finds much flatter solutions than SGD, and approximates the recent Fast Geometric Ensembling (FGE) approach with a single model.

From FGE paper, showing that the optima obtained by SGD can be joined by a curve.

The Nub

The Working

We show that SGD with cyclical and constant learning rates traverses regions of weight space corresponding to high-performing networks. We find that while these models are moving around this optimal set they never reach its central points. We show that we can move into this more desirable space of points by averaging the weights proposed over SGD iterations. SGD generally converges to a point near the boundary of the wide flat region of optimal points. SWA on the other hand is able to find a point centered in this region, often with slightly worse train loss but with substantially better test error. SWA is extremely easy to implement and has virtually no computational overhead compared to the conventional training schemes.
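As a toy illustration of this behavior (not the paper's experiment), the sketch below runs noisy SGD on a simple quadratic loss. The final iterate keeps bouncing around the boundary region near the optimum, while the average of the iterates lands much closer to its centre:

```python
import numpy as np

# Toy example: noisy SGD on the quadratic loss ||w||^2, whose optimum is w* = 0.
rng = np.random.default_rng(0)
w = np.array([5.0, -5.0])
lr = 0.1
iterates = []
for _ in range(2000):
    grad = 2 * w + rng.normal(scale=1.0, size=2)  # noisy gradient of ||w||^2
    w = w - lr * grad
    iterates.append(w.copy())

w_sgd = iterates[-1]                       # final SGD point: still jittering around w*
w_swa = np.mean(iterates[-1000:], axis=0)  # average of later iterates, as in SWA
# ||w_swa|| is typically much smaller than ||w_sgd||
```

The averaging cancels most of the gradient noise, which is the intuition behind SWA finding a point nearer the centre of the flat region.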

The cyclical learning rate schedule they used in the paper, where i is the training iteration.
The circles above represent many local optima obtained by the model when trained with a version of cyclic learning rate.
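The paper's schedule decreases the learning rate linearly from an upper bound to a lower bound within each cycle of c iterations. A minimal sketch, where alpha1, alpha2, and c are hyperparameters you would pick yourself:

```python
def cyclic_lr(i, c, alpha1, alpha2):
    """Cyclical learning rate from the SWA paper.

    i: 1-based training iteration
    c: cycle length in iterations
    alpha1, alpha2: upper and lower learning-rate bounds (alpha1 > alpha2)
    """
    t = ((i - 1) % c + 1) / c            # position within the current cycle, in (0, 1]
    return (1 - t) * alpha1 + t * alpha2  # linear decay from alpha1 towards alpha2

# e.g. with c=10, alpha1=0.05, alpha2=0.01 the rate drops to 0.01 at the
# end of each cycle and then jumps back up
```

At the end of each cycle (when the rate hits alpha2) a weight snapshot is taken for averaging.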
Yes, very straightforward :)
The update rule for the final model, where n_models is the number of cycles completed so far, i.e. the number of weight snapshots (local optima) averaged within a total of n training iterations.
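That running average can be sketched directly; the snapshot values below are made up purely for illustration:

```python
import numpy as np

def swa_update(w_swa, w, n_models):
    """Fold a new weight snapshot w into the running average w_swa,
    which currently averages n_models snapshots."""
    return (w_swa * n_models + w) / (n_models + 1)

# usage: average the snapshots taken at the end of each cycle
snapshots = [np.array([1.0, 2.0]), np.array([3.0, 4.0]), np.array([5.0, 6.0])]
w_swa = snapshots[0]
for n, w in enumerate(snapshots[1:], start=1):
    w_swa = swa_update(w_swa, w, n)
# w_swa is now the element-wise mean of all snapshots: [3.0, 4.0]
```

Because only the running average is stored, the memory overhead is a single extra copy of the weights, regardless of how many models are averaged.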

Note

Results

It’s just too damn easy for SWA to beat previous state-of-the-art results without any significant computational overhead.

My Observation

While being able to train a DNN with a fixed learning rate is a surprising property of SWA, for practical purposes we recommend initializing SWA from a model pretrained with conventional training (possibly for a reduced number of epochs), as it leads to faster and more stable convergence than running SWA from scratch.
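In PyTorch, this pretrain-then-average recipe maps onto the built-in torch.optim.swa_utils helpers. A sketch, where the data, model, learning rates, and epoch counts are all placeholders:

```python
import torch
from torch.optim.swa_utils import AveragedModel, SWALR, update_bn

# Dummy regression data stands in for a real dataset (illustrative only).
X, y = torch.randn(64, 10), torch.randn(64, 1)
loader = torch.utils.data.DataLoader(
    torch.utils.data.TensorDataset(X, y), batch_size=16)

model = torch.nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# Phase 1: conventional pretraining (here just a few epochs).
for epoch in range(5):
    for xb, yb in loader:
        optimizer.zero_grad()
        loss = torch.nn.functional.mse_loss(model(xb), yb)
        loss.backward()
        optimizer.step()

# Phase 2: SWA initialized from the pretrained weights.
swa_model = AveragedModel(model)           # keeps the running weight average
swa_scheduler = SWALR(optimizer, swa_lr=0.05)  # anneals to a constant SWA rate
for epoch in range(5):
    for xb, yb in loader:
        optimizer.zero_grad()
        loss = torch.nn.functional.mse_loss(model(xb), yb)
        loss.backward()
        optimizer.step()
    swa_model.update_parameters(model)  # fold this epoch's weights into the average
    swa_scheduler.step()

update_bn(loader, swa_model)  # recompute BatchNorm statistics for the averaged model
```

The final update_bn pass matters for networks with BatchNorm, since the averaged weights were never used during training and their activation statistics are stale.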

Conclusion



Written by Rishik C. Mourya

2nd yr Undergrad | ML Engineer at Nevronas.in | Working at Skylarklabs.ai | Web Developer | When you know nothing matters, the universe is yours — Rick Sanchez
