Hardware-Efficient Softmax Approximation for Self-Attention Networks

Abstract

Self-attention networks have become state-of-the-art models for natural language processing (NLP) problems. The Softmax function turns out to be a severe throughput and latency bottleneck, taking up a considerable share of the overall run-time of Transformer networks. In this paper, we propose a hardware-efficient Softmax approximation that can be used as a direct plug-in substitute in pretrained Transformer networks to accelerate NLP tasks without compromising accuracy. Experimental results on an FPGA implementation show that our design outperforms a vanilla Softmax built from Xilinx IPs, using 15x fewer LUTs and 55x fewer registers and achieving 23x lower latency at a similar clock frequency, with less than 1% accuracy drop on the main language benchmark tasks. We also propose a pruning method that reduces the input entropy of Softmax for NLP problems with a large number of inputs. Validated on the CoLA task, it achieves a further 25% reduction in latency.