Let’s start with a very simple example and build it up.

## Example-1: Symmetric uint8 quantization

Let’s say we wish to map the floating point range [0.0 .. 1000.0] to the quantized range [0 .. 255]. The range [0 .. 255] is the set of values that can fit in an unsigned 8-bit integer.

To perform this transformation, we want to rescale the floating point range so that the following is true:

Floating point 0.0 = Quantized 0

Floating point 1000.0 = Quantized 255

This is called symmetric quantization because the floating point value 0.0 maps exactly to the quantized value 0.

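Hence, we define a scale, which is equal to

$$\text{scale} = \frac{x_{max} - x_{min}}{q_{max} - q_{min}}$$

Where, $x_{min}$ and $x_{max}$ are the lower and upper limits of the floating point range (0.0 and 1000.0 here), and $q_{min}$ and $q_{max}$ are the lower and upper limits of the quantized range (0 and 255 here).

In this case, scale = 1000.0 / 255 ≈ 3.9216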

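To convert from a floating point value to a quantized value, we can simply divide the floating point value by the scale and round the result to an integer. For example, the floating point value 500.0 corresponds to the quantized value

$$\frac{500.0}{3.9216} \approx 127.5$$

which rounds to 128 when halves are rounded up.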
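As a rough sketch of the steps above, here is how the scale computation and the divide-by-scale step might look in Python. The function names `compute_scale` and `quantize_symmetric` are made up for this illustration, and rounding to the nearest integer and clamping to the quantized range are assumptions of the sketch rather than something the text prescribes.

```python
def compute_scale(x_min: float, x_max: float, q_min: int, q_max: int) -> float:
    """Width of one quantization step, measured in floating point units."""
    return (x_max - x_min) / (q_max - q_min)


def quantize_symmetric(x: float, scale: float, q_min: int = 0, q_max: int = 255) -> int:
    """Divide by the scale, round to an integer, and clamp into [q_min, q_max]."""
    q = round(x / scale)  # assumption: round to nearest integer
    return max(q_min, min(q_max, q))


scale = compute_scale(0.0, 1000.0, 0, 255)
print(round(scale, 4))                     # 3.9216
print(quantize_symmetric(0.0, scale))      # 0
print(quantize_symmetric(250.0, scale))    # 64
print(quantize_symmetric(1000.0, scale))   # 255
```

Values that land exactly halfway between two integers (such as 500.0 here) depend on the rounding mode you pick, so it is worth fixing that choice explicitly in a real implementation.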

In this simple example, the 0.0 of the floating point range maps exactly to the 0 in the quantized range. This is called symmetric quantization. Let’s see what happens when this is not the case.

## Example-2: Affine uint8 quantization

Let’s say we wish to map the floating point range [-20.0 .. 1000.0] to the quantized range [0 .. 255].

In this case, we have a different scaling factor since our $x_{min}$ is different.

$$\text{scale} = \frac{1000.0 - (-20.0)}{255 - 0} = \frac{1020.0}{255} = 4.0$$

Let’s see what the floating point number 0.0 is represented by in the quantized range if we apply only the scaling factor to 0.0:

$$\frac{0.0}{4.0} = 0$$

Well, this doesn’t quite seem right since, according to the diagram above, we would have expected the floating point value -20.0 to map to the quantized value 0.

This is where the concept of zero-point comes in. **The zero-point acts as a bias for shifting the scaled floating point value and corresponds to the value in the quantized range that represents the floating point value 0.0.** In our case, the scaled representation of -20.0 is -20.0 / 4.0 = -5, and the zero-point is its negative, which is -(-5) = 5. The zero-point is always the negative of the scaled representation of the minimum floating point value, since the minimum will always be negative or zero. We’ll find out more about why this is the case in the section that explains example 4.

Whenever we quantize a value, we will always add the zero-point to this scaled value to get the actual quantized value in the valid quantization range. In case we wish to quantize the value -20.0, we compute it as the scaled value of -20.0 plus the zero-point, which is -5 + 5 = 0. Hence, quantized(-20.0, scale=4, zp=5) = 0.
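The scale-then-shift procedure can be sketched in a few lines of Python. The helper names (`affine_params`, `quantize`, `dequantize`) are hypothetical, and the rounding and clamping behaviour is an assumption of this sketch; the `dequantize` helper is only there to show that the zero-point really does map back to 0.0.

```python
def affine_params(x_min: float, x_max: float, q_min: int = 0, q_max: int = 255):
    """Return (scale, zero_point) for mapping [x_min, x_max] onto [0, 255]."""
    scale = (x_max - x_min) / (q_max - q_min)
    # Negative of the scaled representation of the minimum value.
    zero_point = -round(x_min / scale)
    return scale, zero_point


def quantize(x: float, scale: float, zero_point: int, q_min: int = 0, q_max: int = 255) -> int:
    """Scale the value, add the zero-point, and clamp into the quantized range."""
    q = round(x / scale) + zero_point
    return max(q_min, min(q_max, q))


def dequantize(q: int, scale: float, zero_point: int) -> float:
    """Approximate inverse of quantize()."""
    return (q - zero_point) * scale


scale, zp = affine_params(-20.0, 1000.0)
print(scale, zp)                      # 4.0 5
print(quantize(-20.0, scale, zp))     # 0
print(quantize(0.0, scale, zp))       # 5
print(quantize(1000.0, scale, zp))    # 255
print(dequantize(zp, scale, zp))      # 0.0
```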

## Example-3: Affine int8 quantization

What happens if our quantized range is a signed 8-bit integer instead of an unsigned 8-bit integer? Well, the range is now [-128 .. 127].

In this case, -20.0 in the float range maps to -128 in the quantized range, and 1000.0 in the float range maps to 127 in the quantized range.

The way we calculate the zero-point is that we compute it as if the quantized range is [0 .. 255] and then offset it by -128, so the zero-point in the new range is

$$\text{zero-point} = 5 - 128 = -123$$

Hence, the zero-point for the new range is -123, and quantized(-20.0, scale=4, zp=-123) = -5 + (-123) = -128.
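To make the offset concrete, here is a small Python sketch of the signed case. The names `zero_point_int8` and `quantize_int8` are invented for this example, and the rounding and clamping are assumptions of the sketch.

```python
def zero_point_int8(x_min: float, scale: float) -> int:
    """Compute the zero-point as if the range were [0, 255], then shift by -128."""
    zero_point_uint8 = -round(x_min / scale)
    return zero_point_uint8 - 128


def quantize_int8(x: float, scale: float, zero_point: int) -> int:
    """Scale, add the zero-point, and clamp into the signed range [-128, 127]."""
    q = round(x / scale) + zero_point
    return max(-128, min(127, q))


scale = (1000.0 - (-20.0)) / (127 - (-128))   # 1020 / 255 = 4.0
zp = zero_point_int8(-20.0, scale)
print(zp)                                     # -123
print(quantize_int8(-20.0, scale, zp))        # -128
print(quantize_int8(0.0, scale, zp))          # -123
print(quantize_int8(1000.0, scale, zp))       # 127
```

Note that the scale stays at 4.0, because the signed range still spans 255 steps; only the zero-point moves.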

So far, we’ve looked at examples where the floating point range includes the value 0.0. In the next set of examples, we’ll take a look at what happens when the floating point range doesn’t include the value 0.0.

## The importance of 0.0

Why is it important for the floating point value 0.0 to be represented in the floating point range?

When using a padded convolution, we expect the border pixels to be padded using the value 0.0 in the most common case. Hence, it’s important for 0.0 to be represented in the floating point range. Similarly, if the value X is going to be used for padding in your network, you need to make sure that the value X is represented in the floating point range and that quantization is aware of this.
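One way to picture this: in the quantized domain, padding with 0.0 means padding with the zero-point, not with the integer 0. The sketch below assumes the scale and zero-point from Example-2 and uses a made-up helper, `pad_quantized`, on a one-dimensional row of quantized values.

```python
def pad_quantized(row, pad, zero_point):
    """Pad a 1-D list of quantized values on both sides with the zero-point."""
    border = [zero_point] * pad
    return border + list(row) + border


def dequantize(q, scale, zero_point):
    """Map a quantized value back to its floating point approximation."""
    return (q - zero_point) * scale


scale, zero_point = 4.0, 5       # assumed: the parameters from Example-2
row = [0, 7, 255]                # some already-quantized values
padded = pad_quantized(row, pad=1, zero_point=zero_point)
print(padded)
# [5, 0, 7, 255, 5]
print([dequantize(q, scale, zero_point) for q in padded])
# [0.0, -20.0, 8.0, 1000.0, 0.0]
```

Had we padded with the integer 0 instead, the border would have decoded to -20.0 rather than 0.0.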

## Example-4: The untold story — skewed floating point range

Now, let’s take a look at what happens if 0.0 isn’t part of the floating point range.

In this example, we’re trying to quantize the floating point range [40.0 .. 1000.0] into the quantized range [0 .. 255].

Since we can’t represent the value 0.0 in the floating point range, we need to extend the lower limit of the range to 0.0.

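We can see that some part of the quantized range is wasted. To determine how much, let’s compute the quantized value that the floating point value 40.0 maps to. The extended range [0.0 .. 1000.0] has the same scale as in Example-1 (about 3.9216) and a zero-point of 0, so

$$\frac{40.0}{3.9216} \approx 10.2$$

which rounds to 10.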

Hence, we’re wasting the range [0 .. 9] in the quantized range, which is about 3.92% of the range. This could significantly affect the model’s accuracy post-quantization.
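The arithmetic behind this can be checked with a short Python sketch. The function name `wasted_low_values` is hypothetical, and it only counts unused values at the low end of the quantized range, which is the case discussed here.

```python
def wasted_low_values(x_min, x_max, q_min=0, q_max=255):
    """Count quantized values below the one that the real minimum maps to."""
    # Extend the floating point range so that it contains 0.0.
    ext_min = min(x_min, 0.0)
    ext_max = max(x_max, 0.0)
    scale = (ext_max - ext_min) / (q_max - q_min)
    zero_point = q_min - round(ext_min / scale)
    lowest_used = round(x_min / scale) + zero_point
    return lowest_used - q_min, scale, zero_point


unused, scale, zero_point = wasted_low_values(40.0, 1000.0)
print(zero_point)                 # 0
print(unused)                     # 10  (quantized values 0 .. 9)
print(f"{unused / 255:.2%}")      # 3.92%
```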

This skewing is necessary if we wish to make sure that the value 0.0 in the floating point range can be represented in the quantized range.

Another reason for including the value 0.0 in the floating point range is that it lets us efficiently check whether a quantized value corresponds to 0.0 in the floating point range. Think of operators such as ReLU, which clip all values below 0.0 in the floating point range to 0.0.

It is important for us to be able to **represent the zero-point using the same data type** (signed or unsigned int8) **as the quantized values**. This enables us to perform these comparisons quickly and efficiently.
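For instance, a ReLU over quantized values can be written as a plain integer comparison against the zero-point, with no conversion back to floating point. This is a minimal sketch that assumes the scale and zero-point from Example-2; `quantized_relu` is an illustrative name, not a library function.

```python
def quantized_relu(q_values, zero_point):
    """Clip every quantized value below the zero-point up to the zero-point."""
    return [max(q, zero_point) for q in q_values]


scale, zero_point = 4.0, 5          # assumed: the parameters from Example-2
q_values = [0, 3, 5, 7, 255]        # quantized forms of -20.0, -8.0, 0.0, 8.0, 1000.0
out = quantized_relu(q_values, zero_point)
print(out)
# [5, 5, 5, 7, 255]
print([(q - zero_point) * scale for q in out])
# [0.0, 0.0, 0.0, 8.0, 1000.0]
```

Because the zero-point has the same integer type as the values, the comparison stays entirely in integer arithmetic.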

Next, let’s take a look at how activation normalization helps with model quantization. We’ll specifically focus on how the standardization of the activation values allows us to use the entire quantized range effectively.

This post originally appeared on TechToday.