Resource: backed up on GitHub, including the iPad-annotated PDF version.
Authors: Asit Mishra, Eriko Nurvitadhi, Jeffrey J Cook, and Debbie Marr of the Intel Accelerator Architecture Lab (judging by the names, seemingly Indian authors).
Summary: The paper is decent. It makes its point in a very plain way, and I believe I understood it (first 5.0 on comprehension, slightly excited). The background is that most quantization methods at the time focused on the weights (hereafter W), and there was (presumably) little work on quantizing the activations (hereafter ACT). The authors' starting point is that during training ACT dominates and consumes most of the memory, so ACT should be quantized. (I find this motivation rather weak: in practice inference runs on one image at a time (mini-batch-like deployment scenarios are conceivable, but I have not seen them yet), so this memory-consumption analysis matters little; rather than telling this story, it would be more natural to say that binary W with FP ACT still amounts to FP ops, which does little for acceleration.) The proposed fix is to widen the channel count; the example given is that 4-bit ACT, 2-bit W with 2x channels matches the accuracy of the FP model. There are also evaluations on GPU/FPGA/ASIC (not sure whether simulated or actually built), which offer some insight (the GPU speedup is very limited; the ASIC, thanks to fully custom DIY-style support, shows the biggest speedup, even beyond the theoretical value), plus some marginal improvements to the quantization scheme.
Rating: 3.0/5.0
Comprehension: 5.0/5.0
The paper's contributions:
It makes a somewhat odd claim: prior methods that only binarize the weights mainly speed up the inference step only when the batch size is small:
Further, most prior works target reducing the precision of the model parameters (network weights).
This primarily benefits the inference step only when batch sizes are small.
(2021/5/25 update) On reflection this is actually easy to understand: when the batch size is large, memory is mostly ACT, so binarizing only the weights brings little gain.
This is the story the paper tells, and ACT memory optimization really is not a great angle to lead with.
This figure is somewhat useful. It says the training-time memory footprint is the sum of ACT and W plus the maximum of the input gradient maps (δZ) and the maximum of the back-propagated gradients (δX). The max part is understandable (once a gradient has been passed on, the previous step's gradient can be dropped, though do we not need to keep a copy for the parameter update?), but what is the difference between δZ and δX? Gradients with respect to ACT and W? For inference, the space needed is just the largest IFM block and the largest OFM block, which makes sense.
Original text:
The total memory requirements for training phase is the sum of memory required for the activation maps, weights
and the maximum of input gradient maps (δZ) and maximum of back-propagated gradients (δX). During inference,
memory is allocated for input (IFM) and output feature maps (OFM) required by a single layer, and these memory
allocations are reused for other layers. The total memory allocation during inference is then the maximum of IFM
and maximum of OFM required across all the layers plus the sum of all W-tensors.
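To check that I read this accounting right, here is a minimal sketch of the rule as quoted (helper names are my own and per-layer byte counts are assumed to be already known; nothing here is from the paper's code):

```python
from dataclasses import dataclass

@dataclass
class LayerMem:
    act: int   # activation map (ACT) bytes for this layer
    w: int     # weight (W) bytes for this layer
    d_z: int   # input gradient map (dZ) bytes
    d_x: int   # back-propagated gradient (dX) bytes
    ifm: int   # input feature map bytes (inference)
    ofm: int   # output feature map bytes (inference)

def train_memory(layers):
    # sum of ACT + sum of W + max over layers of dZ + max over layers of dX
    return (sum(l.act for l in layers) + sum(l.w for l in layers)
            + max(l.d_z for l in layers) + max(l.d_x for l in layers))

def inference_memory(layers):
    # max IFM + max OFM (buffers reused across layers) + sum of all W
    return (max(l.ifm for l in layers) + max(l.ofm for l in layers)
            + sum(l.w for l in layers))
```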
Another claimed benefit of ACT quantization / wider channels that I did not fully follow (my attempted reading is sketched after the quote):
Apart from other benefits of reduced precision activations as mentioned earlier, widening filter maps also
improves the efficiency of underlying GEMM calls for convolution operations since compute accelerators are
typically more efficient on a single kernel consisting of parallel computation on large data-structures as
opposed to many small sized kernels.
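My guess at what this means, using im2col lowering as an assumed example (the numbers below are mine, not the paper's): a convolution becomes one GEMM whose dimensions grow with the channel count, so widening turns many small kernels into one larger, better-utilized one.

```python
# Hypothetical layer sizes; im2col turns a conv into a single GEMM of
# (M = C_out) x (K = C_in * kh * kw) x (N = H_out * W_out).
def gemm_dims(c_in, c_out, kh, kw, h_out, w_out):
    return c_out, c_in * kh * kw, h_out * w_out

print(gemm_dims(64, 64, 3, 3, 56, 56))     # (64, 576, 3136)   baseline layer
print(gemm_dims(128, 128, 3, 3, 56, 56))   # (128, 1152, 3136) 2x-wide layer: one bigger GEMM
```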
After quantization a scaling factor is applied; in the binary case the scaling factor follows the BWN scheme.
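A minimal sketch of what I take BWN-style binarization with a scaling factor to be (per-tensor alpha = mean(|W|); whether it is applied per tensor or per filter is a detail I have not verified against the paper):

```python
import numpy as np

def binarize_bwn(w: np.ndarray):
    # W is approximated by alpha * sign(W), with alpha = mean(|W|) as the scaling factor.
    alpha = np.abs(w).mean()
    b = np.sign(w)          # note: exact zeros map to 0 rather than +/-1 in this sketch
    return alpha, b

w = np.random.randn(64, 3, 3, 3).astype(np.float32)
alpha, b = binarize_bwn(w)
w_hat = alpha * b           # reduced-precision stand-in for w
```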
TTQ and DoReFa use more involved range-arranging tricks, whereas this paper just uses a simple clip:
TTQ and DoRefa schemes involve division operation and computing a maximum value in the input tensor.
...
We avoid each of these costly operations and propose a simpler quantization scheme (clipping followed by rounding)
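My reading of "clipping followed by rounding" as a sketch, assuming inputs are clipped to [0, 1] before uniform rounding (the exact clip range and where the scaling factor enters are my assumptions, not verified):

```python
import numpy as np

def clip_round_quant(x: np.ndarray, k: int) -> np.ndarray:
    # Clip to [0, 1], then round onto a uniform k-bit grid with step 1 / (2^k - 1).
    # No division by a data-dependent max, unlike the TTQ / DoReFa schemes quoted above.
    levels = 2 ** k - 1
    x = np.clip(x, 0.0, 1.0)
    return np.round(x * levels) / levels

a = np.random.rand(8).astype(np.float32)
print(clip_round_quant(a, k=4))   # e.g. the 4-bit ACT setting
```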
On hardware efficiency (from the GPU/FPGA/ASIC evaluation):
in practice the efficiency gains from reducing precision depend on whether the underlying hardware can take
advantage of such low-precisions
Reducing the precision simplifies the design of compute units and lower buffering requirements on FPGA board.
Compute-precision reduction leads to significant improvement in throughput due to smaller hardware designs
(allowing more parallelism) and shorter circuit delay (allowing higher frequency).
...
ASIC allows for a truly customized hardware implementation.
reducing the precision allows custom-designed compute units and lower buffering requirements to provide
significant improvement in throughput.