Maximising Parallel Memory Access for Low Latency FPGA Designs

Presenter(s)

Stewart Denholm

Affiliation: Imperial College London

Author(s)

Stewart Denholm

Affiliation: Imperial College London

Wayne Luk

Affiliation: Imperial College

Abstract

Memory-based computing stores pre-computed function results in memory to be read at runtime. FPGAs group together multiple block memories (BRAMs) to form this memory, which is then accessed as a single monolithic device. We introduce a novel ring-based architecture that leverages parallel accesses to these constituent BRAMs, benefiting low latency applications that rely on: highly complex functions; numerical precision via iterative computation; or many parallel data-paths accessing a shared memory resource. The implemented function's performance is independent of its complexity, enabling significant latency reductions for compute-bound operations. We assess common functions (sqrt, power, trigonometric, hyperbolic functions) on the Xilinx Alveo U280 FPGA. Our function-agnostic memory-compute core can serve 1024 parallel function calls at 300 MHz and reduces latency by 4.4-29x compared with traditional FPGA implementations.
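
As a rough software illustration of the memory-based computing idea described in the abstract, the sketch below pre-computes a function into a table, interleaves that table across several banks (standing in for BRAMs), and answers calls with a single indexed read. It is not the ring-based architecture itself. The helper names (`build_table`, `split_into_banks`, `lookup`), the input range, the 10-bit address width, and the choice of four banks are all assumptions made for this example.

```python
import math

def build_table(func, lo, hi, address_bits):
    """Pre-compute func at 2**address_bits evenly spaced points in [lo, hi)."""
    size = 1 << address_bits
    step = (hi - lo) / size
    return [func(lo + i * step) for i in range(size)]

def split_into_banks(table, num_banks):
    """Interleave entries across banks: entry i is stored in bank i % num_banks."""
    return [table[b::num_banks] for b in range(num_banks)]

def lookup(banks, x, lo, hi, address_bits):
    """Quantise x to a table address, then read one entry from one bank."""
    size = 1 << address_bits
    index = int((x - lo) / (hi - lo) * size)
    index = min(max(index, 0), size - 1)  # clamp out-of-range inputs
    bank, offset = index % len(banks), index // len(banks)
    return banks[bank][offset]

# Hypothetical parameters chosen only for illustration.
ADDRESS_BITS = 10
NUM_BANKS = 4
LO, HI = 0.0, 4.0

sqrt_banks = split_into_banks(build_table(math.sqrt, LO, HI, ADDRESS_BITS), NUM_BANKS)
tanh_banks = split_into_banks(build_table(math.tanh, LO, HI, ADDRESS_BITS), NUM_BANKS)

# The read path is identical for both functions: one address calculation
# and one bank access, regardless of how costly the function is to compute.
sqrt_result = lookup(sqrt_banks, 1.0, LO, HI, ADDRESS_BITS)
tanh_result = lookup(tanh_banks, 2.0, LO, HI, ADDRESS_BITS)
```

Because `lookup` does the same work for `sqrt` and `tanh`, the sketch shows in miniature why a table-backed function's read cost does not depend on the function's complexity. Accuracy is bounded by the table resolution, which this example fixes at 1024 entries.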