As part of nurture.ai’s NIPS paper implementation challenge, I implemented and validated the paper ‘Training Deep Networks without Learning Rates Through Coin Betting’ using PyTorch. (github)
This paper caught my attention due to its promise of getting rid of the learning rate hyper-parameter during model training.
The paper says: “In this paper, we propose a new stochastic gradient descent procedure for deep networks that does not require any learning rate setting. Contrary to previous methods, we do not adapt the learning rates nor we make use of the assumed curvature of the objective function. Instead, we reduce the optimization process to a game of betting on a coin and propose a learning-rate-free optimal algorithm for this scenario.”
Let us revisit the purpose of the learning rate. It defines the size of the step taken against the gradient, i.e. towards lower loss.
new_weight = existing_weight - learning_rate * gradient
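As a concrete toy example of that update rule (all numbers below are made up for illustration):

```python
# One vanilla gradient descent step with made-up numbers
existing_weight = 0.5
gradient = 0.2        # slope of the loss at the current weight
learning_rate = 0.1   # the hyper-parameter this paper wants to eliminate
new_weight = existing_weight - learning_rate * gradient
print(new_weight)
```

Tuning that single constant is what the coin-betting scheme below avoids.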
In the absence of a learning rate, the optimizer bets on the direction of the gradient and its magnitude. If it is correct in its prediction, it is rewarded. If it is wrong, it is penalized and compelled to self-correct.
The central idea of the paper is that of coin betting/gambling.
- You (the optimizer) start with an initial amount of money, epsilon.
- At every time instant (iteration) and for each parameter, the optimizer makes a bet on the sign and magnitude of that parameter’s gradient in the next iteration. This bet is denoted by the term wi. A -ve sign indicates tails, and a +ve sign indicates heads.
- The optimizer has to make do with the amount of money that was given to it initially. It cannot borrow any more money.
- In the next iteration, when the actual result comes in: if the optimizer loses, it loses the betted amount; if it wins, it gets the betted amount back and, in addition, the same amount as a reward. The advantage of winning is that its corpus of money grows, so it can bet more in the next iteration.
- A couple of terms are introduced: Wealth and Reward
Wealth increases if wi (the bet) and gi (the actual gradient) agree in sign - both positive or both negative - which indicates a correct prediction. (In the paper’s formulation the coin outcome is actually the negative gradient, so the bet is effectively on the direction of descent.) The reward obtained is all the wealth minus the initial corpus of money.
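The betting rules above can be simulated directly. This is just the gambling game, not the optimizer; the coin-flip sequence and betting fraction are made up:

```python
# Toy simulation of the betting rules: bet a fixed fraction of current
# wealth on heads each round. +1 = heads (bet wins), -1 = tails (bet loses).
epsilon = 1.0                                  # initial corpus of money
wealth = epsilon
outcomes = [+1, +1, -1, +1, -1, +1, +1, +1]    # made-up coin flips
beta = 0.5                                     # fraction of wealth staked

for c in outcomes:
    stake = beta * wealth      # cannot exceed current wealth: no borrowing
    if c > 0:
        wealth += stake        # correct bet: stake returned plus equal reward
    else:
        wealth -= stake        # wrong bet: stake is lost
    reward = wealth - epsilon  # reward = wealth minus the initial corpus

print(round(wealth, 4))
```

Each win multiplies wealth by 1.5 and each loss by 0.5, so a gambler who is right more often than wrong sees its corpus, and hence its maximum possible bet, compound.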
With these, the optimizer can make a bet for the next iteration like so:

w_{t+1} = beta_{t+1} * Wealth_t

where the beta term denotes the fraction of current wealth the optimizer is willing to bet in the next iteration. Its sign, +ve or -ve, determines whether it is calling heads (+ve gradient) or tails (-ve gradient). It is drawn from [-1, 1].
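To see how this plays out as an optimizer, here is a rough sketch of the paper’s per-parameter COCOB-Backprop update applied to a made-up 1-D quadratic, f(w) = (w - 3)^2 / 2. This is my reading of the paper’s Algorithm 2, not the repo’s code, and the objective, initialization, and iteration count are chosen purely for illustration; note that the quantity bet on is the negative gradient:

```python
# Sketch of a COCOB-Backprop-style update on a toy 1-D problem.
alpha = 100.0   # damping constant used by the paper
w1 = 0.0        # initial weight
w = w1
L = 1e-8        # running max of |gradient| (small init to avoid div-by-zero)
G = 0.0         # running sum of |gradient|
reward = 0.0    # wealth accumulated beyond the initial corpus
theta = 0.0     # running sum of negative gradients

for _ in range(100):
    g = w - 3.0                                # gradient of the toy objective
    L = max(L, abs(g))
    G += abs(g)
    reward = max(reward + (w - w1) * -g, 0.0)  # grows when the bet was right
    theta += -g
    # bet: direction from theta, scaled by the current "money" (L + reward)
    w = w1 + theta / (L * max(G + L, alpha * L)) * (L + reward)

print(round(w, 2))
```

No learning rate appears anywhere: the step sizes emerge from the accumulated wealth, and early steps are tiny because the corpus is still small.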
Here is a result of using this optimizer on the MNIST task:
More details at my github repo: https://github.com/anandsaha/nips.cocob.pytorch