    Details

    Presenter(s)
        Fang Chao (Nanjing University)

    Author(s)
        Fang Chao (Nanjing University)
        Shouliang Guo (Nanjing University)
        Wei Wu (Nanjing University)
        Jun Lin (Nanjing University)
        Zhongfeng Wang (Nanjing University, China)
        Ming Kai Hsu (Kuaishou Technology)
        Lingzhi Liu (Kuaishou Technology)
    Abstract

    Transformers have become an indispensable staple in deep learning. However, efficient deployment of Transformer-based models is challenging due to their substantial computation and memory demands. To address this issue, we present STA, an efficient sparse Transformer accelerator on FPGA that exploits N:M fine-grained structured sparsity. Our design features not only a unified computing engine that performs both sparse and dense matrix multiplications with high computational efficiency, but also a scalable softmax module that eliminates the latency of intermediate off-chip data communication. Experimental results show that our implementation achieves the lowest latency compared to CPU, GPU, and prior FPGA-based accelerators. Moreover, compared with the state-of-the-art FPGA-based accelerators, it achieves up to 12.28× and 51.00× improvements in energy efficiency and MAC efficiency, respectively.

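    For readers unfamiliar with the sparsity pattern named in the abstract, the sketch below shows what N:M fine-grained structured sparsity means at the weight level: in every group of M consecutive weights, at most N values are kept and the rest are zeroed. This is only a minimal NumPy illustration of the pruning pattern under the common 2:4 configuration; the function name prune_n_m and the example shapes are hypothetical, and none of this reflects the STA hardware design or the authors' training flow.

        import numpy as np

        def prune_n_m(weights: np.ndarray, n: int = 2, m: int = 4) -> np.ndarray:
            """Zero all but the N largest-magnitude values in every group of M
            consecutive weights along the last axis (N:M structured sparsity)."""
            assert weights.shape[-1] % m == 0, "last dimension must be a multiple of M"
            flat = weights.reshape(-1, m)                    # view weights as blocks of M
            keep = np.argsort(np.abs(flat), axis=1)[:, -n:]  # indices of the N largest magnitudes
            mask = np.zeros_like(flat, dtype=bool)
            np.put_along_axis(mask, keep, True, axis=1)      # mark the survivors in each block
            return (flat * mask).reshape(weights.shape)

        # Example: prune a 4x8 weight matrix to the common 2:4 pattern.
        rng = np.random.default_rng(0)
        w = rng.standard_normal((4, 8))
        w_24 = prune_n_m(w, n=2, m=4)
        # Every block of 4 consecutive weights now holds at most 2 non-zeros.
        print((w_24.reshape(-1, 4) != 0).sum(axis=1))

    Because the non-zeros are bounded per fixed-size block, the hardware can index them with a small, regular amount of metadata, which is what makes this pattern more amenable to an FPGA computing engine than unstructured sparsity.
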
    Slides
    • An Efficient Hardware Accelerator for Sparse Transformer Neural Networks (PDF)