Resource: backed up on GitHub, including the PDF version with my iPad annotations.
The authors are Milad Alizadeh, Javier Fernandez-Marqués, Nicholas D. Lane & Yarin Gal from Oxford. Was this empirical study their way of moving into BNNs? But why nothing more from them afterwards? Honestly, if I managed to publish a single ICLR paper I would call it a success and could happily walk away.
Summary: the paper is decent and reasonably insightful, and it even has some of the results I was hoping for? Yet I still feel that empirical-study work is a bit thin. Rating: 3.5/5, I suppose. The paper is rather scattered and touches on many points. Roughly: the Adam optimiser works well for end-to-end training; gradient & weight clipping slow things down too much early in training, so they can be set aside at first (use a vanilla STE) and the standard binary STE brought back in the later stages, or training can be split into two stages by fine-tuning a pretrained model. Because the plateau that accuracy reaches during training still sits some distance below the best achievable accuracy, early stopping should not be used.
The paper's main contribution is a set of insights:
(1) gradient clipping, which blocks the gradient once a weight's magnitude exceeds 1:
Gradient clipping stops gradient flow if the weight's magnitude is larger than 1. This effectively means gradients are computed with respect to a hard tanh function.
(2) weight clipping, which clamps the weights back into a fixed range after the gradient update (a small sketch covering both forms of clipping follows below):
Weight clipping is applied to weights after gradients have been applied to keep them within a range.
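A minimal PyTorch sketch of both forms of clipping around a sign-binarised weight; the tensor shapes, learning rate, and dummy loss are placeholders of mine, not the paper's setup:

```python
import torch

class BinarizeSTE(torch.autograd.Function):
    """Sign binarisation with a straight-through estimator (STE)."""

    @staticmethod
    def forward(ctx, w):
        ctx.save_for_backward(w)
        return torch.sign(w)

    @staticmethod
    def backward(ctx, grad_output):
        (w,) = ctx.saved_tensors
        # (1) gradient clipping: block the gradient wherever |w| > 1,
        # i.e. differentiate a hard tanh instead of the sign function
        return grad_output * (w.abs() <= 1).to(grad_output.dtype)

# (2) weight clipping: after the optimiser step, project the real-valued
# ("latent") weights back into [-1, 1]
w = torch.randn(10, 10, requires_grad=True)   # hypothetical weight tensor
opt = torch.optim.Adam([w], lr=1e-3)
loss = BinarizeSTE.apply(w).sum()             # stand-in for a real loss
loss.backward()
opt.step()
with torch.no_grad():
    w.clamp_(-1.0, 1.0)
```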
Here is a summary of Section 3, which is quite useful:
It first introduces four kinds of optimisers, a categorisation I was not familiar with:
(1) history-free optimisers such as mini-batch SGD that do not take previous jumps or gradients into account,
(2) momentum optimisers that maintain and use a running average of previous jumps such as Momentum and Nesterov (Sutskever et al., 2013),
(3) adaptive optimisers that adjust the learning rate for each parameter separately such as AdaGrad (Duchi et al., 2011) and AdaDelta (Zeiler, 2012), and finally,
(4) optimisers that combine elements from the categories above such as ADAM, which combines momentum with an adaptive learning rate.
The first kind is the plainest: history-free optimisers that ignore previous jumps and gradients, e.g. mini-batch SGD. The second keeps a running average of past jumps, e.g. Momentum and Nesterov. The third adapts the learning rate for each parameter separately, e.g. AdaGrad and AdaDelta. The fourth combines all of the above, e.g. ADAM.
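The four families map directly onto standard PyTorch optimisers; a quick sketch for reference (the model and hyperparameter values are placeholders, not the paper's settings):

```python
import torch

model = torch.nn.Linear(784, 10)   # placeholder model for illustration
lr = 0.01

# (1) history-free: plain mini-batch SGD
sgd = torch.optim.SGD(model.parameters(), lr=lr)

# (2) momentum-based: keep a running average of previous updates
momentum = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
nesterov = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9, nesterov=True)

# (3) adaptive: per-parameter learning rates
adagrad = torch.optim.Adagrad(model.parameters(), lr=lr)
adadelta = torch.optim.Adadelta(model.parameters())

# (4) combined: ADAM = momentum + adaptive learning rate
adam = torch.optim.Adam(model.parameters(), lr=1e-3)
```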
The excerpts below are important and very insightful.
Our first observation is that vanilla SGD generally fails in optimising binary models using STE. We note that reducing SGD's stochasticity (by increasing batch size) improves performance initially. However, it still fails to obtain the best possible accuracy. SGD momentum and Nesterov optimisers perform better than SGD when they are carefully fine-tuned. However, they perform significantly slower compared to optimising non-binary models and have to be used for many more epochs than normally used for CIFAR-10 and MNIST datasets. Similar to SGD, increasing momentum rate improves training speed significantly but results in worse final model accuracy.
In short, vanilla SGD essentially cannot optimise a BNN trained with STE. Reducing SGD's stochasticity by increasing the batch size helps at first, but falls apart later. SGD with momentum and Nesterov perform somewhat better than plain SGD, but converge much more slowly (compared with non-binary models) and need many more epochs. As with SGD, raising the momentum rate speeds up training but costs final accuracy.
A possible hypothesis is that early stages of training binary models require more averaging for the optimiser to proceed in presence of the binarisation operation. On the other hand, in the late stages of the training, we rely on noisier sources to increase the exploration power of the optimiser. This is reinforced by our observation that binary models are often trained long after the training or validation accuracy stop showing improvements. Reducing the learning rate in these epochs does not improve things either. Yet, the best validations are often found in these epochs. In other words, using early stopping for training binary models would terminate the training early on and would result in suboptimal accuracies.
A plausible hypothesis: early in training, more averaging is needed (a larger batch size, say?), while later in training noisier updates are needed to increase exploration. One observation: a BNN still has to be trained for a long while after accuracy plateaus, and lowering the LR at that point does not help either (yet the best models tend to appear in exactly those epochs), so early stopping must not be used.
For SGD and Momentum the training speed is not affected much, but ADAM is sensitive to this restriction.
Reducing the momentum rate in BN can help to cancel the effect of long training. The effect is small but consistent.
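Batch-norm momentum is just a constructor argument in most frameworks. A hedged PyTorch illustration follows; note that PyTorch's `momentum` is the weight given to the current batch's statistics (the opposite convention to a Keras/TF-style decay), so which direction corresponds to the paper's "reducing the momentum rate" depends on the framework the authors used, which I am not sure of:

```python
import torch

# Default BatchNorm2d uses momentum=0.1, i.e. the weight put on the current
# batch's statistics when updating the running mean/variance. The adjusted
# value below is only an illustration, not the paper's setting.
bn_default = torch.nn.BatchNorm2d(128)                  # momentum=0.1
bn_adjusted = torch.nn.BatchNorm2d(128, momentum=0.01)  # smoother running stats
```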
In BinaryConnect Courbariaux et al. (2015) propose scaling learning rates of each convolutional or fully connected layer by the inverse of Xavier initialisation's variance value. The same value is also used as the range in weight clipping after gradient update.
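A sketch of that per-layer recipe; exactly which Xavier/Glorot quantity is meant (the variance versus the uniform-initialisation limit) is my assumption here, and the layer sizes and base learning rate are placeholders:

```python
import math
import torch

base_lr = 1e-3
layers = [torch.nn.Linear(784, 256), torch.nn.Linear(256, 10)]

param_groups, clip_range = [], {}
for layer in layers:
    fan_out, fan_in = layer.weight.shape
    s = math.sqrt(6.0 / (fan_in + fan_out))   # Glorot uniform limit (assumed)
    # learning rate scaled by the inverse of the per-layer Xavier value
    param_groups.append({"params": layer.parameters(), "lr": base_lr / s})
    clip_range[layer] = s                     # same value reused as clip range

optimiser = torch.optim.Adam(param_groups)

# after each optimiser.step(), clip the real-valued weights of every layer:
#   with torch.no_grad():
#       for layer in layers:
#           layer.weight.clamp_(-clip_range[layer], clip_range[layer])
```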
While weight and gradient clipping help achieve better accuracy, our hypothesis is that they are only required in the later stages of training where the noise added by clipping weights and gradients increases the exploration of the optimiser.
While we can quickly get to the point where training and validation accuracies stagnate, there is a small gap between the achieved accuracy and the best possible one. This gap can only be filled by continuing training for many epochs.
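A minimal sketch of the delayed-clipping / two-stage idea described above (the epoch split, tensor sizes, and dummy loss are placeholders of mine): stage 1 uses a vanilla STE with no weight clipping, stage 2 switches the clipped STE and weight clipping back on for the final stretch of training.

```python
import torch

w = torch.randn(256, 784, requires_grad=True)   # latent real-valued weights
optimiser = torch.optim.Adam([w], lr=1e-3)
num_epochs, switch_epoch = 500, 400

for epoch in range(num_epochs):
    clip = epoch >= switch_epoch                 # stage 2 starts here
    mask = (w.abs() <= 1).float() if clip else torch.ones_like(w)
    w_bin = (torch.sign(w) - w * mask).detach() + w * mask  # sign fwd, STE bwd
    loss = w_bin.sum()                           # stand-in for the real loss
    optimiser.zero_grad()
    loss.backward()
    optimiser.step()
    if clip:                                     # weight clipping after the update
        with torch.no_grad():
            w.clamp_(-1.0, 1.0)
```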
They also argue that the "last mile" of performance has little to do with the STE's capacity and instead comes down to stochastic exploration of the parameter space:
last mile of model performance has little dependence on the STE's capability and mostly relies on a stochastic exploration of the parameter space
To be filled in later.