Learning rate (LR) is one of the most important hyperparameters to tune and holds the key to faster, more effective training of neural networks. Simply put, the LR decides how much of the loss gradient is applied to the current weights to move them in the direction of lower loss.
new_weight = existing_weight - learning_rate * gradient
The step is simple. But as research has shown, a great deal can be done to improve this step alone, and doing so has a profound influence on training.
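To make the update rule concrete, here is a minimal sketch on a toy one-parameter problem. The quadratic loss, the function names and the three learning rates are assumptions chosen purely for illustration; they are not from any paper or library.

```python
def sgd_step(weight, gradient, learning_rate):
    """Apply one plain gradient-descent update to a single weight."""
    return weight - learning_rate * gradient


def toy_gradient(weight):
    """Gradient of the toy loss L(w) = (w - 3)^2, whose minimum is at w = 3."""
    return 2.0 * (weight - 3.0)


def run(learning_rate, steps=20, start=0.0):
    """Run a fixed number of updates and return the final weight."""
    weight = start
    for _ in range(steps):
        weight = sgd_step(weight, toy_gradient(weight), learning_rate)
    return weight


small = run(learning_rate=0.01)  # still far from 3 after 20 steps
good = run(learning_rate=0.3)    # ends very close to 3
large = run(learning_rate=1.1)   # overshoots further on every step and diverges
```

The same update behaves very differently depending on the LR alone: too small and progress is slow, too large and every step overshoots the minimum.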
Note that CLR (Cyclical Learning Rates, introduced below) is very similar to Stochastic Gradient Descent with Warm Restarts (SGDR). The SGDR paper says, “CLR is closely-related to our approach in its spirit and formulation but does not focus on restarts.” The fastai library uses SGDR as the annealing schedule (with the idea of an LR finder from CLR).
Neural networks are full of parameters that need to be trained to accomplish a certain task. Training the parameters typically means finding and setting appropriate values for them, so that they minimize a loss function with each batch of training.
Traditionally, there have been broadly two approaches to setting the LR during training.
One LR for all parameters
Typically seen in SGD, a single LR is set at the beginning of training, along with an LR decay strategy (step, exponential, etc.). This single LR is used to update all parameters. It is gradually decayed with each epoch, on the assumption that over time we get close to the desired minimum, at which point we need to slow down the updates so as not to overshoot it.
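The two decay strategies named above can be sketched as follows. The function names, the drop factor, the drop interval and the decay rate are hypothetical defaults picked for illustration, not recommended values.

```python
import math


def step_decay(initial_lr, epoch, drop=0.5, epochs_per_drop=10):
    """Multiply the LR by `drop` once every `epochs_per_drop` epochs."""
    return initial_lr * drop ** (epoch // epochs_per_drop)


def exponential_decay(initial_lr, epoch, decay_rate=0.05):
    """Shrink the LR smoothly by a factor of exp(-decay_rate) per epoch."""
    return initial_lr * math.exp(-decay_rate * epoch)


# Both schedules only ever go down, and every parameter shares the same value.
step_schedule = [step_decay(0.1, epoch) for epoch in range(30)]
exp_schedule = [exponential_decay(0.1, epoch) for epoch in range(30)]
```

Note that both schedules are fixed before training starts, which is exactly what makes them hard to choose well.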
There are many challenges to this approach:
- A good initial LR can be difficult to choose in advance (as depicted in the figure above).
- An LR schedule (the mechanism that decays the LR over time) is also difficult to set in advance, and a fixed schedule does not adapt to the dynamics of the data.
- The same LR gets applied to all parameters, which might be learning at different rates.
- It is very hard to get out of a saddle point. See below.
Adaptive LR for each parameter
Improved optimizers like AdaGrad, AdaDelta, RMSprop and Adam alleviate many of the above challenges by adapting the learning rate for each parameter being trained. With AdaDelta, we do not even need to set a default learning rate, as it has been eliminated from the update rule.
Cyclical Learning Rate
CLR was proposed by Leslie Smith in 2015. It is an approach to LR adjustment where the value is cycled between a lower bound and an upper bound. By nature, it is seen as a competitor to the adaptive LR approaches and is hence used mostly with SGD. But it is possible to use it along with the improved optimizers (mentioned above) with per-parameter updates.
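As an illustration of what "adapting the learning rate for each parameter" means, here is a simplified AdaGrad-style step in plain Python. It is a sketch of the general idea only: the function name, the list-based state and the placement of `eps` are assumptions, and real library implementations differ in detail.

```python
import math


def adagrad_step(weights, gradients, grad_sq_sums, learning_rate=0.1, eps=1e-8):
    """One simplified AdaGrad-style update.

    `grad_sq_sums` holds the running sum of squared gradients for each
    parameter and is updated in place.
    """
    new_weights = []
    for i, (w, g) in enumerate(zip(weights, gradients)):
        grad_sq_sums[i] += g * g
        # Parameters with a history of large gradients get a smaller effective LR.
        effective_lr = learning_rate / (math.sqrt(grad_sq_sums[i]) + eps)
        new_weights.append(w - effective_lr * g)
    return new_weights


weights = [1.0, 1.0]
gradients = [10.0, 0.1]   # one steep direction, one shallow direction
state = [0.0, 0.0]
updated = adagrad_step(weights, gradients, state)
```

In this toy case the gradients differ by a factor of 100, yet both weights move by roughly the same amount, because each parameter's effective LR is scaled by its own gradient history.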
CLR is computationally cheaper than the optimizers mentioned above. As the paper says:
“Adaptive learning rates are fundamentally different from CLR policies, and CLR can be combined with adaptive learning rates, as shown in Section 4.1. In addition, CLR policies are computationally simpler than adaptive learning rates. CLR is likely most similar to the SGDR method that appeared recently.”
Why it works
As far as intuition goes, conventional wisdom says we have to keep decreasing the LR as training progresses so that we converge with time.
However, counterintuitively, it might be useful to periodically vary the LR between a lower and a higher threshold. The reasoning is that the periodic higher learning rates help the model escape any local minima or saddle points it enters. In fact, Dauphin et al. argue that the difficulty in minimizing the loss arises from saddle points rather than poor local minima. If the saddle point happens to be an elaborate plateau, the gradients there are small, so updates made with a low learning rate may never be large enough to leave it (or will take an enormous amount of time). That’s where periodic higher learning rates help, with more rapid traversal of the surface.
A second benefit is that the optimal LR appropriate for the error surface of your model will in all probability lie between the lower and higher bounds as discussed above. Hence we do get to use the best LR when amortized over time.
Epoch, iterations, cycles and stepsize
These terms have specific meanings in this algorithm. Understanding them will make it easy to plug them into the equations.
Let us consider a training dataset with 50,000 instances.
An epoch is one run of your training algorithm across the entire training set. If we set a batch size of 100, we get 500 batches in 1 epoch, or 500 iterations. The iteration count is accumulated over epochs, so that in epoch 2 we get iterations 501 to 1000 for the same 500 batches, and so on.
With that in mind, a cycle is the number of iterations over which we want the learning rate to go from a base learning rate to a max learning rate, and back. A stepsize is half of a cycle. Note that a cycle, in this case, need not fall on the boundary of an epoch, though in practice it does.
In the above diagram, we set a base lr and max lr for the algorithm, demarcated by the red lines. The blue line suggests the way learning rate is modified (in a triangular fashion), with the x-axis being the iterations. A complete up and down of the blue line is one cycle. And stepsize is half of that.
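The bookkeeping for these terms can be sketched with the running example of 50,000 instances and a batch size of 100. The stepsize of two epochs and the helper name `locate` are assumptions made only for this illustration.

```python
dataset_size = 50_000
batch_size = 100
iterations_per_epoch = dataset_size // batch_size  # 500 iterations per epoch

stepsize = 2 * iterations_per_epoch  # assumed for illustration: 1000 iterations
cycle_length = 2 * stepsize          # one full up-and-down: 2000 iterations


def locate(iteration):
    """Return the (epoch, cycle) that a 1-based global iteration falls in."""
    epoch = (iteration - 1) // iterations_per_epoch + 1
    cycle = (iteration - 1) // cycle_length + 1
    return epoch, cycle
```

With these numbers, iteration 501 is the first iteration of epoch 2 but still lies in cycle 1, while iteration 2001 opens both epoch 5 and cycle 2.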
Calculating the LR
As we gather from the above, the following need to be fed into the algorithm for it to work:
- base lr (the lower bound)
- max lr (the upper bound)
- number of iterations that we want in a stepsize (half of a cycle)
Later we will see that the optimal values of these can be programmatically derived. Below is a piece of code which demonstrates the way LR is calculated:
    import numpy as np
    import matplotlib.pyplot as plt


    def get_triangular_lr(iteration, stepsize, base_lr, max_lr):
        """Given the inputs, calculates the lr that should be applicable for this iteration"""
        cycle = np.floor(1 + iteration / (2 * stepsize))
        x = np.abs(iteration / stepsize - 2 * cycle + 1)
        lr = base_lr + (max_lr - base_lr) * np.maximum(0, (1 - x))
        return lr


    # Demo of how the LR varies with iterations
    num_iterations = 10000
    stepsize = 1000
    base_lr = 0.0001
    max_lr = 0.001
    lr_trend = list()

    for iteration in range(num_iterations):
        lr = get_triangular_lr(iteration, stepsize, base_lr, max_lr)
        # Update your optimizer to use this learning rate in this iteration
        lr_trend.append(lr)

    plt.plot(lr_trend)
If you are a PyTorch user, note that, at the time of writing, there is an open pull request to add this learning rate scheduler to PyTorch.
Deriving the optimal base lr and max lr
An optimal lower and upper bound of the learning rate can be found by letting the model run for a few epochs, letting the learning rate increase linearly and monitoring the accuracy.
We run a single complete step by setting stepsize equal to num_iterations (this makes the LR increase linearly and stop when num_iterations is reached). We also set base lr to a minimum value and max lr to a maximum value that we deem fit.
The accuracy plot will show accuracy rising as we increase the learning rate, then plateauing at some point and starting to decrease again. Note the LR at which accuracy starts to increase, and also the LR at which it starts stagnating. These are good points to set as base lr and max lr.
Alternatively, you can note the LR where accuracy peaks, and use that as max lr. Set base lr to 1/3 or 1/4 of this.
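The alternative rule (peak accuracy as max lr, a fraction of it as base lr) can be sketched as below. The helper names are hypothetical, and the accuracy readings are made-up numbers used only to exercise the code; they are not results from the paper or from any real run.

```python
def linear_ramp_lr(iteration, num_iterations, min_lr, max_lr):
    """LR for the range test: rises linearly from min_lr to max_lr."""
    fraction = iteration / num_iterations
    return min_lr + (max_lr - min_lr) * fraction


def bounds_from_range_test(lrs, accuracies, divisor=3.0):
    """Use the LR at peak accuracy as max lr and max lr / divisor as base lr."""
    peak_index = max(range(len(accuracies)), key=accuracies.__getitem__)
    max_lr = lrs[peak_index]
    return max_lr / divisor, max_lr


# Made-up accuracy readings, for illustration only.
lrs = [linear_ramp_lr(i, 8, 0.0, 0.02) for i in range(9)]
accuracies = [0.10, 0.22, 0.41, 0.55, 0.60, 0.61, 0.58, 0.40, 0.15]
base_lr, max_lr = bounds_from_range_test(lrs, accuracies)
```

In practice the accuracy curve is noisy, so smoothing it before picking the peak is usually worth considering; this sketch skips that step to keep the rule itself visible.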
Deriving the optimal cycle length (or stepsize)
The paper suggests, after experimentation, setting the stepsize to 2-10 times the number of iterations in an epoch. In the previous example, with 500 iterations per epoch, a stepsize anywhere from 1000 to 5000 would do. The paper found little difference between setting the stepsize to 2 times the number of iterations in an epoch and setting it to 8 times.
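As a worked version of that arithmetic, the helper below turns a dataset size and batch size into the suggested stepsize range. The function name and its default factors are assumptions for illustration; the 2 and 10 simply mirror the range quoted above.

```python
import math


def stepsize_range(dataset_size, batch_size, low_factor=2, high_factor=10):
    """Return the (smallest, largest) suggested stepsize, in iterations."""
    # A final partial batch still counts as one iteration, hence the ceiling.
    iterations_per_epoch = math.ceil(dataset_size / batch_size)
    return low_factor * iterations_per_epoch, high_factor * iterations_per_epoch


low, high = stepsize_range(dataset_size=50_000, batch_size=100)
```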
In addition to the triangular profile used above, the author also experimented with other functional forms.
triangular2: Here the difference between base lr and max lr is halved after every cycle.
exp_range: Here the max lr is decayed exponentially with each iteration.
The amplitude is adjusted either at the end of each mini batch, or at the end of a cycle. These showed improvements in comparison with fixed learning rate and exponentially decaying learning rate respectively in the paper.
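The three profiles can be sketched in one function by scaling the triangular wave's amplitude. This follows one common formulation in which only the distance above base lr shrinks; the function name, the `mode` strings as arguments and the default `gamma` are assumptions for illustration, not the paper's reference code.

```python
import math


def cyclical_lr(iteration, stepsize, base_lr, max_lr, mode="triangular", gamma=0.9999):
    """LR for a 0-based iteration under one of three cyclical profiles."""
    cycle = math.floor(1 + iteration / (2 * stepsize))
    x = abs(iteration / stepsize - 2 * cycle + 1)
    triangle = max(0.0, 1.0 - x)  # 0 at the cycle edges, 1 at the peak

    if mode == "triangular":
        scale = 1.0                       # constant amplitude
    elif mode == "triangular2":
        scale = 1.0 / (2 ** (cycle - 1))  # amplitude halves after each cycle
    elif mode == "exp_range":
        scale = gamma ** iteration        # amplitude decays every iteration
    else:
        raise ValueError(f"unknown mode: {mode}")

    return base_lr + (max_lr - base_lr) * triangle * scale


# Peak of the second cycle under triangular2: halfway between base lr and max lr.
second_peak = cyclical_lr(3000, 1000, 0.0001, 0.001, mode="triangular2")
```

For triangular2 the scale changes once per cycle, while for exp_range it changes on every iteration, which matches the two adjustment points described above.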
CLR may provide quicker convergence on certain neural net tasks and architectures, hence it is something to try out.
In the above test, CLR took 25,000 iterations to reach an accuracy of 81%, which took 70,000 iterations using traditional LR techniques.
In another test, CLR with Nesterov optimizer converged much quicker than Adam.
CLR brings in a novel technique to manage the learning rate and can be used with SGD or with the advanced optimizers. CLR is one technique that should be in every deep learning practitioner’s toolbox.
- Cyclical Learning Rates for Training Neural Networks, Smith
- An overview of gradient descent optimization algorithms, Ruder
- RMSProp and equilibrated adaptive learning rates for non-convex optimization, Dauphin, de Vries, Chung, Bengio
- SGDR: Stochastic Gradient Descent with Warm Restarts, Loshchilov, Hutter