Quantization

Tutorial: https://www.youtube.com/watch?v=0VdNflU08yA

There are two main types of network quantization:

  • Post-Training Quantization (PTQ)
    The goal is to quantize a pretrained network, both weights and activations, without fine-tuning, while maintaining its accuracy.
  • Quantization-Aware Training (QAT)
    The weights and activations are quantized during training, using the straight-through estimator (STE) to propagate gradients through the non-differentiable quantization functions (see the sketch after this list).
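
Since quantization functions are piecewise constant, their true gradient is zero almost everywhere; STE sidesteps this by treating them as the identity in the backward pass. A minimal PyTorch sketch of this "fake quantization" trick (the function name and the int8 range are illustrative choices, not from the tutorial):

```python
import torch

def fake_quantize_ste(x, scale, zero_point, qmin=-128, qmax=127):
    # Forward: quantize, then dequantize ("fake quantization").
    q = torch.clamp(torch.round(x / scale) + zero_point, qmin, qmax)
    x_hat = (q - zero_point) * scale
    # Straight-through estimator: the forward value is x_hat, but the
    # gradient flows through as if round/clamp were the identity.
    return x + (x_hat - x).detach()

w = torch.randn(4, 4, requires_grad=True)
w_q = fake_quantize_ste(w, scale=0.1, zero_point=0)
w_q.sum().backward()  # gradients reach w unchanged
```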

The two common quantization functions are as follows:
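
The functions themselves appear only in the figures referenced below; a standard formulation consistent with the α and β notation used there would be the asymmetric (affine) scheme, with the symmetric scheme as the special case β = 0:

$$
x_q = \mathrm{clamp}\!\left(\mathrm{round}\!\left(\frac{x}{\alpha}\right) + \beta,\; -128,\; 127\right),
\qquad
\hat{x} = \alpha \, (x_q - \beta)
$$

Here α is the scale and β the zero-point.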

Let’s say Y = WX + B. Then W, B, and X are all quantized using such quantization functions, which map 32-bit floating-point numbers to 8-bit integers. Different granularities can be used, for example per-channel or per-layer. The result Y, however, consists of integers with more bits than 8, usually 32-bit, because the products are accumulated at higher precision. To dequantize Y back to floating point, we need the corresponding quantization parameters α and β (as shown in the figures); these can be obtained by sampling some values from Y and computing the parameters from them as an approximation. A sketch of this pipeline follows.
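
A rough NumPy sketch of the integer matmul pipeline, assuming symmetric per-tensor quantization and omitting the bias for brevity (all names and the simple max-based calibration are illustrative, not from the tutorial):

```python
import numpy as np

def quantize(x, alpha):
    # Symmetric int8 quantization: scale, round, clamp to [-128, 127].
    return np.clip(np.round(x / alpha), -128, 127).astype(np.int8)

W = np.random.randn(16, 32).astype(np.float32)
X = np.random.randn(32, 8).astype(np.float32)

# Per-tensor scales from the observed value range (simple calibration).
alpha_w = np.abs(W).max() / 127
alpha_x = np.abs(X).max() / 127

W_q = quantize(W, alpha_w)
X_q = quantize(X, alpha_x)

# int8 x int8 products are accumulated in int32, so Y_q has more than
# 8 bits, as described above.
Y_q = W_q.astype(np.int32) @ X_q.astype(np.int32)

# Dequantize. For the symmetric case the combined scale is exactly
# alpha_w * alpha_x; with zero-points, or when Y must be requantized
# to int8 for the next layer, its parameters are instead estimated
# from sampled values of Y.
Y = Y_q.astype(np.float32) * (alpha_w * alpha_x)

print(np.abs(Y - W @ X).max())  # quantization error vs. float matmul
```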