Affiliation: Georgia Institute of Technology
In this work, we observe that one of the key factors behind the low utilization of processing elements (PEs) is the sparsity of input activations caused by a default DRAM bandwidth utilization (BU) of only 18.75%. To address this, we design two on-chip preprocessing modules that merge up to five input frames into one DRAM transfer, raising BU to 93.75% and facilitating parallel img2col buffering. The merged activations are then processed simultaneously by the neuron engine, which reduces system latency and significantly increases PE utilization. Specifically, multiple batch-inference modes, including intra-PE, temporal, and spatial sharing, are proposed to pipeline with the preprocessing modules, resulting in a 65%-75% reduction in system latency. The on-chip preprocessing modules incur physical overheads of 31.0% in area and 28.7% in power consumption in a 40 nm SMIC standard-cell library.
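The quoted bandwidth figures are consistent with each input frame occupying 3 of every 16 bytes of a DRAM burst (3/16 = 18.75%), so merging five frames fills 15/16 = 93.75% of the burst. The sketch below illustrates this arithmetic only; the 16-byte burst size and 3-bytes-per-frame figure are assumptions chosen to match the stated percentages, not values from the work itself.

```python
# Illustrative sketch (assumed parameters): each input frame is taken to
# occupy 3 bytes of a 16-byte DRAM burst, matching the stated 18.75%
# baseline bandwidth utilization (BU).
BURST_BYTES = 16      # hypothetical DRAM burst size
BYTES_PER_FRAME = 3   # hypothetical payload per frame (3/16 = 18.75%)

def bandwidth_utilization(frames_merged: int) -> float:
    """BU of one DRAM transfer carrying `frames_merged` merged frames."""
    used = min(frames_merged * BYTES_PER_FRAME, BURST_BYTES)
    return used / BURST_BYTES

print(bandwidth_utilization(1))  # 0.1875 -> 18.75% (single frame per transfer)
print(bandwidth_utilization(5))  # 0.9375 -> 93.75% (up to five frames merged)
```

Under these assumed numbers, merging five frames per transfer is exactly the point at which utilization saturates at 15/16 of the burst, which would explain the "up to five" merging limit.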