Quantization

Tutorial: https://www.youtube.com/watch?v=0VdNflU08yA

There are two main types of network quantization:

  • Post-Training Quantization (PTQ)
    The goal is to quantize a pretrained network, both weights and activations, without fine-tuning, while maintaining its accuracy.
  • Quantization-Aware Training (QAT)
    The weights and activations are quantized during training, using the straight-through estimator (STE) to propagate gradients through the non-differentiable quantization functions (see the sketch after this list).
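
Since quantization functions are piecewise constant, their true gradient is zero almost everywhere; STE sidesteps this by treating them as the identity in the backward pass. A minimal PyTorch sketch of this "fake quantization" trick (the function name and the int8 range are illustrative choices, not from the tutorial):

```python
import torch

def fake_quantize_ste(x, scale, zero_point, qmin=-128, qmax=127):
    # Forward: quantize, then dequantize ("fake quantization").
    q = torch.clamp(torch.round(x / scale) + zero_point, qmin, qmax)
    x_hat = (q - zero_point) * scale
    # Straight-through estimator: the forward value is x_hat, but the
    # gradient flows through as if round/clamp were the identity.
    return x + (x_hat - x).detach()

w = torch.randn(4, 4, requires_grad=True)
w_q = fake_quantize_ste(w, scale=0.1, zero_point=0)
w_q.sum().backward()  # gradients reach w unchanged
```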

The two common quantization functions are as follows:
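
The functions themselves appear only in the figures referenced below; a standard formulation consistent with the α and β notation used there would be the asymmetric (affine) scheme, with the symmetric scheme as the special case β = 0:

$$
x_q = \mathrm{clamp}\!\left(\mathrm{round}\!\left(\frac{x}{\alpha}\right) + \beta,\; -128,\; 127\right),
\qquad
\hat{x} = \alpha \, (x_q - \beta)
$$

Here α is the scale and β the zero-point.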

Let’s say Y = WX + B. Then W, B, and X are all quantized using such quantization functions, which map 32-bit floating-point numbers to 8-bit integers. Different granularities can be used, for example per-channel or per-layer. The result Y, however, consists of integers with more bits than 8, usually 32-bit, because the products are accumulated at higher precision. To dequantize Y back to floating point, we need the corresponding quantization parameters α and β (as shown in the figures); these can be obtained by sampling some values from Y and computing the parameters from them as an approximation. A sketch of this pipeline follows.
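
A rough NumPy sketch of the integer matmul pipeline, assuming symmetric per-tensor quantization and omitting the bias for brevity (all names and the simple max-based calibration are illustrative, not from the tutorial):

```python
import numpy as np

def quantize(x, alpha):
    # Symmetric int8 quantization: scale, round, clamp to [-128, 127].
    return np.clip(np.round(x / alpha), -128, 127).astype(np.int8)

W = np.random.randn(16, 32).astype(np.float32)
X = np.random.randn(32, 8).astype(np.float32)

# Per-tensor scales from the observed value range (simple calibration).
alpha_w = np.abs(W).max() / 127
alpha_x = np.abs(X).max() / 127

W_q = quantize(W, alpha_w)
X_q = quantize(X, alpha_x)

# int8 x int8 products are accumulated in int32, so Y_q has more than
# 8 bits, as described above.
Y_q = W_q.astype(np.int32) @ X_q.astype(np.int32)

# Dequantize. For the symmetric case the combined scale is exactly
# alpha_w * alpha_x; with zero-points, or when Y must be requantized
# to int8 for the next layer, its parameters are instead estimated
# from sampled values of Y.
Y = Y_q.astype(np.float32) * (alpha_w * alpha_x)

print(np.abs(Y - W @ X).max())  # quantization error vs. float matmul
```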