Video s3
    Details
    Author(s)
    Nazim Altar Koca (Nanyang Technological University)
    Anh Tuan Do (Agency for Science, Technology and Research)
    Chip Hong Chang (Nanyang Technological University)
    Abstract

    Self-attention networks have become state-of-the-art models for natural language processing (NLP) problems. The Softmax function turns out to be a severe throughput and latency bottleneck, taking up a considerable fraction of the run-time of the overall Transformer network. In this paper, we propose a hardware-efficient Softmax approximation that can be used as a direct plug-in substitute in pretrained Transformer networks to accelerate NLP tasks without compromising their accuracy. Experimental results on an FPGA implementation show that our design outperforms a vanilla Softmax built from Xilinx IPs, with 15x fewer LUTs, 55x fewer registers, and 23x lower latency at a similar clock frequency, and less than 1% accuracy drop on major language benchmark tasks. We also propose a pruning method that reduces the number of Softmax inputs for NLP problems with long input sequences. It was validated on the CoLA task, achieving a further 25% reduction in latency.
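
    The abstract does not detail the approximation itself. As a rough illustration of the kind of drop-in Softmax replacement described, the sketch below uses a common hardware-friendly trick: computing the exponential in base 2 and approximating the fractional part with a shift-and-add friendly linear term. The function name and the specific approximation are illustrative assumptions, not the authors' design.

    import numpy as np

    def softmax_base2_approx(x, axis=-1):
        """Illustrative hardware-friendly softmax (not the paper's exact design).

        Replaces e^x with 2^(x*log2(e)) and approximates 2^f for the
        fractional part f in [0, 1) by (1 + f), which maps to a shift-and-add
        in hardware; the integer part becomes an exact power-of-two shift.
        """
        # Subtract the row maximum for numerical stability (standard trick).
        x = x - np.max(x, axis=axis, keepdims=True)
        # Convert to a base-2 exponent: e^x = 2^(x * log2(e)).
        z = x * np.log2(np.e)
        # Split into integer and fractional parts: 2^z = 2^i * 2^f.
        i = np.floor(z)
        f = z - i
        # 2^f ~ 1 + f on [0, 1); 2^i is an exact power of two (a shift).
        approx_exp = (1.0 + f) * np.power(2.0, i)
        return approx_exp / np.sum(approx_exp, axis=axis, keepdims=True)

    # Usage: compare against the exact softmax on random attention scores.
    scores = np.random.randn(2, 8)
    exact = np.exp(scores - scores.max(-1, keepdims=True))
    exact /= exact.sum(-1, keepdims=True)
    print(np.max(np.abs(softmax_base2_approx(scores) - exact)))  # small error

    Because the approximation is applied row-wise and preserves the normalization step, it can in principle be swapped in wherever a standard softmax is called, which is the sense in which such a design acts as a plug-in substitute for a pretrained network.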