Efficient Quantization and Multi-Precision Design of Arithmetic Components for Deep Learning
    Details
    Presenter(s)
    Yu-Che Yen
    Affiliation
    National Sun Yat-sen University
    Abstract

    We present a quantization algorithm that finds the per-layer bit-widths of deep neural network (DNN) models, and we then propose multi-precision designs for two fundamental DNN arithmetic operations: multiplication and non-linear activation function computation. The multi-precision multiplier design with truncated partial-product bits supports four precision modes (4-bit, 8-bit, 12-bit, and 16-bit) on shared hardware resources, so that power consumption in the low-precision modes can be reduced by turning off the unneeded circuits. For the evaluation of non-linear activation functions, we compare three different approaches under various precision requirements and observe that the best design method depends on the required bit-accuracy.
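
    The following minimal Python sketch illustrates the general idea of a multi-precision multiply with truncated partial-product bits. The function name, the operand-range checks, and the simple LSB-zeroing rule are assumptions made for illustration; they are not the presented hardware design.

        def truncated_multiply(a: int, b: int, precision: int, trunc_bits: int) -> int:
            """Signed multiply in one of the four precision modes, dropping the
            lowest `trunc_bits` result bits to model truncated partial products.
            (Illustrative sketch only, not the presented multiplier circuit.)"""
            assert precision in (4, 8, 12, 16), "supported precision modes"
            lo, hi = -(1 << (precision - 1)), (1 << (precision - 1)) - 1
            assert lo <= a <= hi and lo <= b <= hi, "operands must fit the mode"
            product = a * b
            # Zeroing the low-order result bits approximates removing the low
            # partial-product columns; a real design would typically add a
            # compensation term to reduce the truncation bias.
            return (product >> trunc_bits) << trunc_bits

        # Example: 8-bit mode with 4 truncated LSBs.
        print(truncated_multiply(93, -57, precision=8, trunc_bits=4))  # -5312 (exact: -5301)

    In hardware, the corresponding saving comes from sharing one multiplier array across all four modes and turning off the partial-product rows and columns that a low-precision mode does not use, as the abstract describes.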

    Slides
    • Efficient Quantization and Multi-Precision Design of Arithmetic Components for Deep Learning (application/pdf)