High-Speed Power-Efficient Coarse-Grained Convolver Architecture Using Depth-First Compression Scheme

Abstract

Convolutional neural networks (CNNs) have been playing an important role in various applications, e.g., computer vision. Since CNN computations require numerous multiply-accumulate (MAC) operations, how to get them done efficiently is a crucial issue for CNN hardware accelerators. In this paper, we propose a high-speed power-efficient convolver architecture for CNN acceleration. A 3×3 convolver is asked to produce an output every cycle and is commonly accomplished by summing up the results of nine parallel multiplications, which requires ten carry-propagation adders (CPAs) in total. However, the proposed coarse-grained convolver can break the boundary between multipliers and reduce all partial products in a more global way. Consequently, it requires only one CPA to generate the final outcome. It also features a globally delay-optimized partial product reduction tree and a depth-first compression scheme for both area and power minimization. The proposed convolver has been implemented using TSMC 40nm technology. Compared to a conventional 3×3 convolver baseline design, our design can reduce area and power by 15.8% and 26.5% respectively at the clock rate of 1GHz.