ASP-DAC21 Tutorial
2022/1/5
Talk1 - by Zidong Du
- Dedicated NN processors
- Mentions the view that just as there are GPUs for image processing and DSPs for signal processing, there should be a __ for intelligence processing
- Huge market
- Two lines of research work (DNN-related architecture work):
- DianNao
- Cambricon Series
- Training is critical to AI applications
- time-consuming
- Low bit width
- smaller size of memory accesses
- faster computing
- smaller area of hardware
- Existing hardware vendors already partially support low-bit inference (quantization is widely used in inference)
- There is a figure on quantization in the backward pass that I did not fully understand (top_diff & bottom_diff):

- Three relations between the data distribution and the quantization range (see the sketch below):
- Green: the range fits the distribution well; both rounding error and clipping error are small
- Red: rounding error is relatively large
- Blue: clipping error is relatively large
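- A minimal numpy sketch of this trade-off (my own illustration, not from the talk): the clipping range determines the step size, so a range that is too tight inflates the clipping error, while a range that is too wide inflates the rounding error.

```python
import numpy as np

def quantize(x, clip, n_bits=8):
    """Symmetric uniform quantization into n_bits with clipping range [-clip, clip]."""
    scale = clip / (2 ** (n_bits - 1) - 1)          # step size implied by the range
    x_clipped = np.clip(x, -clip, clip)
    x_q = np.round(x_clipped / scale) * scale        # quantize-dequantize
    rounding_err = np.abs(x_q - x_clipped).mean()    # error inside the range
    clipping_err = np.abs(x - x_clipped).mean()      # error from saturating the tails
    return rounding_err, clipping_err

x = np.random.randn(100_000)                         # toy bell-shaped distribution
for clip in (0.5, 4.0, 16.0):                        # too tight / reasonable / too wide
    r, c = quantize(x, clip)
    print(f"clip={clip:4.1f}  rounding_err={r:.4f}  clipping_err={c:.4f}")
```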
- Quantization Algorithm
- Existing algorithms determine the quantization range from statistics, which requires extra GPU/CPU accesses, so there is no real speedup in practice
- Quantized Training on GPU - hence INT8 training on GPUs is even slower than FP32 training
- Lack of software/hardware support (for quantization)
- Lack of effective algorithmic support for low-bit training (gradients are quantized as well)
- Weight gradients vary significantly
- Gradients can differ by 2 orders of magnitude across layers and by 3 orders of magnitude across epochs -> statistical info is needed to determine the quantization range
- Gradient magnitudes change quickly across epochs -> dynamic quantization is needed (see the sketch below)
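- A rough numpy sketch of why dynamic, statistics-based quantization needs two passes over the data (max-abs is used here as the statistic; the exact statistic in the talk may differ):

```python
import numpy as np

def quantize_grad(grad, n_bits=8):
    """Dynamic per-tensor quantization of a gradient.
    Pass 1 gathers the statistic theta over the whole tensor; pass 2 re-reads the
    data to quantize it.  On CPU/GPU these two passes are the overhead the talk
    blames for INT8 training being slower than FP32."""
    qmax = 2 ** (n_bits - 1) - 1
    theta = np.abs(grad).max()                         # pass 1: statistics
    scale = theta / qmax
    q = np.clip(np.round(grad / scale), -qmax, qmax)   # pass 2: quantization
    return q.astype(np.int8), scale

# gradients two orders of magnitude apart still use the full INT8 range,
# but only because the scale is recomputed per tensor
for mag in (1e-2, 1e-4):
    g = np.random.randn(1000) * mag
    q, s = quantize_grad(g)
    print(f"|g|~{mag:.0e}  scale={s:.2e}  q range: {q.min()}..{q.max()}")
```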
- CPU + ACC (hardware) - Dynamic on-the-fly Quantization Training
- The FP forward and backward passes are relatively simple
- The forward function f() is easy to understand
- g() -> computing gradients on neurons; does this mean the gradients w.r.t. the activations?
- k() -> computing gradients on weights
- h() -> update weights
- The quantized forward and backward passes are more complex
- In the forward pass the CPU has to quantize W and I, which slows things down and consumes more data-access bandwidth
- A simple fix is to add a statistics unit and a quantization unit on the ACC side:
- Still not the best solution, because statistics-based quantization needs the statistics first, which means extra data movement (two-pass data accesses: statistics + quantization)
- Besides moving data to the statistics unit and the quantization unit, there is also the cost of moving the quantized intermediate results back to DDR
- Local Dynamic Quantization
- Ordinary statistics-based quantization has to compute a global statistic $\theta$, which becomes a bottleneck
- The fix is to compute $\theta$ locally (splitting the data into blocks), which also reduces the rounding error -> this makes the scale granularity finer, but doesn't that mean more scales to store? (see the sketch below)
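- A small numpy sketch of the local (block-wise) idea, under my own assumption about the block size: each block keeps its own $\theta$ and step size, so no global reduction is needed, at the cost of storing one scale per block (exactly the concern in the question above).

```python
import numpy as np

def local_quantize(x, block=256, n_bits=8):
    """Block-wise ('local') dynamic quantization: each block of `block` values gets
    its own max-abs statistic theta and step size.  Small-valued blocks no longer
    share a coarse global scale, so their rounding error drops."""
    qmax = 2 ** (n_bits - 1) - 1
    blocks = x.reshape(-1, block)
    scales = np.abs(blocks).max(axis=1, keepdims=True) / qmax   # one theta per block
    q = np.clip(np.round(blocks / scales), -qmax, qmax).astype(np.int8)
    return q, scales                                            # per-block scale metadata

x = np.concatenate([np.random.randn(256) * 1e-3, np.random.randn(256)])
q, s = local_quantize(x, block=256)
print(s.ravel())    # two very different step sizes instead of a single global one
```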
- Multi-way Quantization (E^2BQM)
- Multi-way quantization: selects how to quantize according to different criteria (bit-width, loss)
- In-place Weight Update
- Weight updates are performed on the DRAM side
- Cambricon-Q - DNN training architecture
- (Did not listen closely, and did not fully follow) The yellow boxes are where the three tricks above are integrated; the three components (SQU, QBC, NDPO) are:
- SQU - on-the-fly statistics counting & quantization
- QBC - handles neighboring data that may be split into two independent quantization processes with different parameters
- NDPO - weight update
- Cambricon-Q datapath:

- Conclusion
- Cambricon-Q: A Hybrid Architecture for Efficient Training
- incorporates 3 units
- targeting DNN training with on-the-fly statistics-based quantization
- Quantized Training
- Statistic info
- Dynamic quant
- High-precision weight update
- Existing platform
- fake quant
- 2-pass data accesses
Talk2 - by Haojin Yang
- The classic BNN/QNN story:
- Deep learning models are expensive
- models are large, computation is heavy
- model training is not environmentally friendly
- DL on Mobile Devices
- QAT & BNN basics and benefits - nothing new here
- Im2col for GEMM: maps the input image into columns (see the sketch below)

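- A minimal im2col sketch in numpy (stride 1, no padding, single image in CHW layout), just to make the "convolution as GEMM" mapping concrete:

```python
import numpy as np

def im2col(x, kh, kw):
    """Unfold a (C, H, W) image into a (C*kh*kw, out_h*out_w) matrix so that a
    convolution becomes a single GEMM with the (out_c, C*kh*kw) weight matrix.
    Stride 1, no padding, for brevity."""
    c, h, w = x.shape
    out_h, out_w = h - kh + 1, w - kw + 1
    cols = np.empty((c * kh * kw, out_h * out_w), dtype=x.dtype)
    idx = 0
    for ci in range(c):
        for i in range(kh):
            for j in range(kw):
                cols[idx] = x[ci, i:i + out_h, j:j + out_w].reshape(-1)
                idx += 1
    return cols

x = np.arange(2 * 4 * 4, dtype=np.float32).reshape(2, 4, 4)
w = np.random.randn(3, 2 * 3 * 3).astype(np.float32)   # 3 output channels, 3x3 kernels
y = w @ im2col(x, 3, 3)                                 # conv as GEMM
print(y.shape)                                          # (3, 4): 3 channels x 2x2 outputs
```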
- BNN inference
- Binarizing activations row-wise, weights column-wise
- Nice figure comparing FP inference and binary inference:

- Due to hardware constraints, a small amount of integer addition is still needed (see the XNOR-popcount sketch below)
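- A sketch of the XNOR + popcount trick for a binary dot product; the `2 * popcount - n` step at the end is the residual integer arithmetic the hardware still has to perform:

```python
def binary_dot(a_bits: int, w_bits: int, n: int) -> int:
    """Dot product of two {-1,+1} vectors packed as n-bit integers
    (bit 1 encodes +1, bit 0 encodes -1):
        dot = 2 * popcount(XNOR(a, w)) - n
    The final multiply/subtract is the leftover integer arithmetic after the
    XNOR-popcount stage."""
    mask = (1 << n) - 1
    same = ~(a_bits ^ w_bits) & mask        # XNOR, limited to the n valid bits
    return 2 * bin(same).count("1") - n     # popcount, then map back to +/-1 arithmetic

# example: a = [+1, -1, +1, +1], w = [+1, +1, -1, +1]  ->  dot = 1 - 1 - 1 + 1 = 0
a = 0b1011
w = 0b1101
print(binary_dot(a, w, 4))   # 0
```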
- Challenges of Binary Neural Networks
- Loss of accuracy
- Need for a tailor-made optimizer for BNNs
- Balancing accuracy and energy consumption
- Lack of support for solid inference acceleration on heterogeneous hardware
- Classic BNNs (addressing the first challenge above: reducing the accuracy loss)
- XNOR-Net
- Some over-claims:
- The 58x speedup claim is overstated (apparently the baseline it is compared against is rarely used in practice)
- It cannot actually run in real time on a CPU
- Some important details left unstated:
- 1x1 FP downsampling layers are used
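- A short numpy sketch of XNOR-Net-style weight binarization, W ≈ α·sign(W) with the per-filter scaling factor α = mean(|W|) (the paper's closed-form choice); the layer shapes are my own simplification:

```python
import numpy as np

def binarize_weights(w):
    """XNOR-Net weight binarization: W ~= alpha * sign(W), where alpha = mean(|W|)
    per output filter (the L2-optimal closed-form scale).  w has shape (out_c, ...)."""
    flat = w.reshape(w.shape[0], -1)
    alpha = np.abs(flat).mean(axis=1)        # one scalar per output filter
    b = np.sign(flat)
    b[b == 0] = 1                            # sign(0) -> +1 by convention
    return alpha[:, None] * b                # binary weights scaled back

w = np.random.randn(8, 3, 3, 3)              # 8 filters of a 3x3x3 conv
w_b = binarize_weights(w)
print(np.abs(w.reshape(8, -1) - w_b).mean()) # mean per-element approximation error
```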
- ABC-Net & GroupNet
- Approximates FP values with many binary bases
- Computationally complex, and not necessarily faster in practice
- Bi-real Net
- Binary/real-valued information flow design (dense shortcut connections, BN)
- approx sign func (gradient sketched below)
- 2-stage training
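- A sketch of the ApproxSign idea: the forward pass keeps the hard sign, while the backward pass uses the derivative of Bi-Real Net's piecewise-polynomial approximation (2 − 2|x| inside (−1, 1)) instead of the plain straight-through estimator:

```python
import numpy as np

def approx_sign_grad(x):
    """Derivative of the piecewise-polynomial ApproxSign used for the backward pass:
    d/dx ApproxSign(x) = 2 - 2|x| for |x| < 1, and 0 otherwise."""
    return np.where(np.abs(x) < 1, 2 - 2 * np.abs(x), 0.0)

def binarize_activation(x):
    # the forward pass still uses the hard sign; the backward pass would
    # multiply the incoming gradient by approx_sign_grad(x)
    return np.where(x >= 0, 1.0, -1.0)

x = np.linspace(-2, 2, 9)
print(binarize_activation(x))
print(approx_sign_grad(x))
```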
- BinaryDenseNet

- ReActNet
- based on MobileNet V1
- channel-wise reshaping & shifting (RSign/RPReLU; sketched below)
- training tricks - KD,etc.
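- A forward-only numpy sketch of the channel-wise shifting/reshaping ops (RSign and RPReLU); in ReActNet the α, β, γ, ζ parameters are learnable per channel, which is omitted here:

```python
import numpy as np

def rsign(x, alpha):
    """ReActNet RSign: sign with a learnable per-channel threshold alpha,
    i.e. the channel-wise shift applied before binarization."""
    return np.where(x > alpha, 1.0, -1.0)

def rprelu(x, gamma, beta, zeta):
    """ReActNet RPReLU: shift by gamma, apply a per-channel PReLU slope beta on the
    negative side, then shift again by zeta ('reshaping' the activation distribution)."""
    shifted = x - gamma
    return np.where(shifted > 0, shifted, beta * shifted) + zeta

x = np.random.randn(4, 8, 1, 1)                      # N, C, H, W
alpha = np.zeros((1, 8, 1, 1)); alpha[0, 0] = 0.3    # per-channel (learnable) thresholds
print(rsign(x, alpha).shape)
print(rprelu(x, gamma=0.0, beta=0.25, zeta=0.0).shape)
```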
- BNN Optimizer
- Relies on latent weights
- Problems
- Mismatch of the optimization objective
- unnecessary computation
- Progressive Binarization
- (This paper seems good, worth a read?) TPAMI'21 - Gradient Matters: Designing Binarized Neural Networks via Enhanced Information-Flow
- Gradually transitions from 32-bit to 1-bit (see the sketch below)
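- A generic toy illustration of a progressive transition (my own example, not necessarily the scheme used in the TPAMI'21 paper): a soft surrogate such as tanh(t·x) is annealed towards sign(x) as the temperature t grows during training.

```python
import numpy as np

def soft_binarize(x, t):
    """A generic progressive-binarization surrogate (illustrative only):
    tanh(t * x) is close to a scaled identity for small t and approaches
    sign(x) as the temperature t is annealed upward."""
    return np.tanh(t * x)

x = np.linspace(-1, 1, 5)
for t in (1, 5, 50):
    print(t, np.round(soft_binarize(x, t), 2))
```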
- Balancing Accuracy and Energy Consumption
- BoolNet
- BNN frameworks

- Future Directions
- Narrow the gap to full precision counterparts
- Dedicated architecture design
- Optimizer for BNN
- The low-dimensional space is mapped to the high-dimensional space for optimization
- Adjusts the representation space of the binary network
- Algorithm-Hardware co-design
- Open platform for BNN performance evaluation on accelerators
Talk3 - by Kai Han
- PTQ
- The main problem is still determining $s_x$ and $s_w$:

- ACIQ
- OCS (see the sketch below)
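- A toy numpy sketch of the OCS (outlier channel splitting) idea for one linear layer, as I understand it: duplicate the input channel holding the outlier weight and halve its weights, so the output is unchanged while the weight range (and hence the quantization step) shrinks.

```python
import numpy as np

def outlier_channel_split(w, x):
    """OCS-style trick for one linear layer: the input channel holding the
    largest-magnitude weight is duplicated and its weights are halved, so
    W' @ x' == W @ x exactly but the weight range shrinks.
    (The real method iterates this and can also target activation outliers.)"""
    c = np.unravel_index(np.abs(w).argmax(), w.shape)[1]   # column with the outlier
    w_split = np.hstack([w, w[:, c:c + 1] / 2])            # append a halved copy...
    w_split[:, c] /= 2                                     # ...and halve the original
    x_split = np.concatenate([x, x[c:c + 1]])              # duplicate that input channel
    return w_split, x_split

w = np.random.randn(4, 6); w[2, 3] = 10.0                  # plant an outlier weight
x = np.random.randn(6)
w2, x2 = outlier_channel_split(w, x)
print(np.allclose(w @ x, w2 @ x2), np.abs(w).max(), np.abs(w2).max())
```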
- PTQ for ViT
- The quantization problem for ViT: self-attention information is lost after quantization
- Solutions:
- Ranking-aware quantization (To preserve the functionality)
- Pearson coefficient (maximize the similarity of the features)
- Bias Correction (a brief introduction to what bias correction does, 23:45-25:00)
- The goal is to reduce the bias error: if the expectation of the output error is non-zero, the output mean shifts, and this distribution shift makes subsequent layers behave abnormally (Q: presumably a non-zero quantization error like this cannot be corrected by BN?)
- To be fair, the bias correction part is presented well, accounting for the quantization errors of both W and X:

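- A simple empirical form of bias correction for one linear layer (only the weight-quantization error is handled in this sketch; the slide also accounts for the activation error):

```python
import numpy as np

def bias_correction(w, w_q, x_calib, bias):
    """Empirical bias correction: the quantization error e = (W_q - W) shifts the
    layer-output mean by E[e @ x]; subtracting that expectation from the bias
    restores the output mean so later layers see an unshifted distribution."""
    err = (w_q - w) @ x_calib.T                # (out, n_samples) output error
    return bias - err.mean(axis=1)             # fold -E[error] into the bias

w = np.random.randn(16, 32)
w_q = np.round(w * 8) / 8                      # crude 'quantized' weights for the demo
x_calib = np.random.randn(1024, 32) + 0.5      # calibration activations (non-zero mean)
b = np.zeros(16)
b_corr = bias_correction(w, w_q, x_calib, b)
y_fp = w @ x_calib.T
y_q = w_q @ x_calib.T + b_corr[:, None]
print(np.abs(y_q.mean(axis=1) - y_fp.mean(axis=1)).max())   # ~0 after correction
```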
- Mixed-precision

- QAT
- Attributes the poor performance of 4-bit QAT to the weights being unmatched to the low-bit setting
- DoReFa-Net
- PACT
- LSQ (see the sketch after this list)
- BNN
- FDA-BNN
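- An LSQ-style fake-quantization sketch in PyTorch, showing the two tricks the method relies on: a straight-through estimator for round() and a learnable step size whose gradient is scaled by 1/sqrt(numel · Qp) as recommended in the paper.

```python
import torch

def lsq_fake_quant(x, s, n_bits=4):
    """LSQ-style fake quantization with a learnable step size s.
    round() passes gradients straight through (STE), while s receives the LSQ
    gradient, scaled by g = 1/sqrt(numel * Qp)."""
    qn, qp = -(2 ** (n_bits - 1)), 2 ** (n_bits - 1) - 1
    g = 1.0 / (x.numel() * qp) ** 0.5
    s = s * g + (s - s * g).detach()            # forward value s, gradient scaled by g
    v = torch.clamp(x / s, qn, qp)
    v = v + (v.round() - v).detach()            # straight-through estimator for round()
    return v * s

x = torch.randn(64, requires_grad=True)
s = torch.tensor(0.1, requires_grad=True)
loss = lsq_fake_quant(x, s, n_bits=4).pow(2).sum()
loss.backward()
print(s.grad)                                   # the step size is trained jointly
```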
- Arch for Low-bit
- Wider Channel
- NAS + Quant
- NAS + BNN - BATS
- Conclusion
