Details
- Affiliation: Nanjing University
Transformers have become an indispensable staple in deep learning. However, efficient deployment of Transformer-based models is challenging due to their substantial computation and memory demands. To address this issue, we present STA, an efficient sparse Transformer accelerator on FPGA that exploits N:M fine-grained structured sparsity. Our design features not only a unified computing engine capable of performing both sparse and dense matrix multiplications with high computational efficiency, but also a scalable softmax module that eliminates the latency of intermediate off-chip data communication. Experimental results show that our implementation achieves the lowest latency compared to CPU, GPU, and prior FPGA-based accelerators. Moreover, compared with state-of-the-art FPGA-based accelerators, it achieves up to 12.28× and 51.00× improvements in energy efficiency and MAC efficiency, respectively.
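
For intuition, the sketch below illustrates the N:M fine-grained structured sparsity pattern the accelerator exploits: within every group of M consecutive weights, only the N largest-magnitude values are retained and the rest are zeroed. This is a minimal illustrative example only; the function name `prune_n_m` and the 2:4 default are assumptions for demonstration, not code from the paper.

```python
import numpy as np

def prune_n_m(weights: np.ndarray, n: int = 2, m: int = 4) -> np.ndarray:
    """Illustrative N:M structured pruning along rows (assumed 2:4 by default).

    Each group of M consecutive weights keeps its N largest-magnitude values,
    so every group carries exactly N non-zeros plus their positions.
    """
    rows, cols = weights.shape
    assert cols % m == 0, "column count must be a multiple of M"
    groups = weights.reshape(rows, cols // m, m)      # split each row into groups of M
    order = np.argsort(np.abs(groups), axis=-1)       # rank weights in each group by magnitude
    mask = np.zeros_like(groups, dtype=bool)
    np.put_along_axis(mask, order[..., -n:], True, axis=-1)  # keep the top-N per group
    return (groups * mask).reshape(rows, cols)

if __name__ == "__main__":
    w = np.random.randn(4, 8).astype(np.float32)
    w_sparse = prune_n_m(w, n=2, m=4)                 # 2:4 pattern -> 50% sparsity
    print(w_sparse)
```

With a fixed N:M pattern, each group's non-zero count is constant, which is what allows a hardware engine to schedule sparse and dense matrix multiplications on the same datapath with predictable utilization.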