Quantization
Tutorial: https://www.youtube.com/watch?v=0VdNflU08yA
There are two main types of network quantization:
- Post-Training Quantization (PTQ)
The goal is to quantize a pretrained network without fine-tuning: both weights and activations are quantized, with the goal of maintaining accuracy.
- Quantization-Aware Training (QAT)
Quantize the weights and activations during training, using the straight-through estimator (STE) to propagate gradients through the non-differentiable quantization functions (see the sketch after this list).
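A minimal PyTorch sketch of STE-based fake quantization, assuming symmetric per-tensor int8; the class name `FakeQuantSTE` and the scale choice are illustrative, not from the tutorial:

```python
# A minimal sketch of QAT-style "fake quantization" with a straight-through
# estimator (STE). Symmetric per-tensor int8 is assumed; the class name and
# scale choice are illustrative.
import torch

class FakeQuantSTE(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, scale, qmin, qmax):
        # Forward: quantize to the integer grid, then dequantize back to float.
        q = torch.clamp(torch.round(x / scale), qmin, qmax)
        return q * scale

    @staticmethod
    def backward(ctx, grad_output):
        # STE: treat round/clamp as the identity, so the gradient passes
        # through unchanged to x; scale, qmin, qmax get no gradient.
        return grad_output, None, None, None

# Usage: insert fake quantization into the forward pass during training.
x = torch.randn(4, 8, requires_grad=True)
scale = x.detach().abs().max() / 127  # simple per-tensor scale
y = FakeQuantSTE.apply(x, scale, -128, 127)
y.sum().backward()
print(x.grad)  # all ones: gradients flow as if quantization were the identity
```

The point of the STE is that the forward pass sees quantized values, while the backward pass treats the quantizer as the identity, which is what lets gradients propagate through the rounding step.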
The two common quantization functions are as follows:
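A standard formulation of the two most common quantization functions, symmetric and affine (asymmetric) quantization, assuming these are the functions the figures show, with α the scale, β the zero point, and b the bit width (8 for int8):

$$\text{symmetric:}\quad q=\operatorname{clip}\left(\operatorname{round}\left(\tfrac{x}{\alpha}\right),\,-2^{b-1},\,2^{b-1}-1\right),\qquad x\approx\alpha\,q$$

$$\text{affine:}\quad q=\operatorname{clip}\left(\operatorname{round}\left(\tfrac{x}{\alpha}\right)+\beta,\,0,\,2^{b}-1\right),\qquad x\approx\alpha\,(q-\beta)$$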
Let’s say Y = WX + B; then W, B, and X are all quantized with quantization functions that map 32-bit floating-point numbers to 8-bit integers. Different granularities can be used, for example per-channel or per-layer. The result Y, however, is an integer with more bits than 8, usually 32-bit, because the products are accumulated in a wider accumulator. To dequantize Y back to floating point, we need the corresponding quantization parameters alpha and beta (as shown in the figures). These can be obtained by sampling some values from Y and estimating the parameters from them as an approximation (calibration).
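A minimal NumPy sketch of this pipeline, under the assumption of symmetric per-tensor quantization (the helper `quantize_sym` and names like `s_w`, `s_x` are illustrative, and the bias is omitted for brevity): int8 weights and activations, int32 accumulation, exact dequantization via the product of the input scales, and finally the sampling-based calibration of the output parameters described above.

```python
# A sketch of the int8 pipeline above, assuming symmetric per-tensor
# quantization for W and X (helper names are illustrative; bias omitted).
import numpy as np

def quantize_sym(t, num_bits=8):
    """Symmetric quantization: float32 -> int8, returns (q, scale)."""
    qmax = 2 ** (num_bits - 1) - 1               # 127 for int8
    scale = np.abs(t).max() / qmax               # per-tensor scale
    q = np.clip(np.round(t / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

W = np.random.randn(16, 32).astype(np.float32)
X = np.random.randn(32, 4).astype(np.float32)
qW, s_w = quantize_sym(W)
qX, s_x = quantize_sym(X)

# int8 x int8 products are accumulated in int32: Y has more bits than 8.
Y_int32 = qW.astype(np.int32) @ qX.astype(np.int32)

# With symmetric quantization, exact dequantization uses s_w * s_x.
Y = Y_int32.astype(np.float32) * (s_w * s_x)
print(np.abs(Y - W @ X).max())  # quantization error vs. float32 reference

# Calibration as described above: sample values of Y and estimate the output
# quantization parameters alpha (scale) and beta (zero point) from them.
samples = np.random.choice(Y.ravel(), size=32, replace=False)
alpha = (samples.max() - samples.min()) / 255    # scale for an 8-bit output
beta = int(np.round(-samples.min() / alpha))     # zero point
qY = np.clip(np.round(Y / alpha) + beta, 0, 255).astype(np.uint8)
Y_approx = alpha * (qY.astype(np.float32) - beta)
```

Because the calibration only sees a sample of Y, values outside the sampled range get clipped, which is why it is an approximation rather than an exact recovery.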