    Details
    Presenter(s)
    Stewart Denholm, Imperial College London
    Author(s)
    Stewart Denholm, Imperial College London
    Wayne Luk, Imperial College
    Abstract

    Memory-based computing stores pre-computed function results in memory, to be read at runtime. On FPGAs, multiple block memories (BRAMs) are grouped together to form this memory and are accessed as a single monolithic device. We introduce a novel ring-based architecture that leverages parallel accesses to these constituent BRAMs, benefiting low-latency applications that rely on highly complex functions, numerical precision via iterative computation, or many parallel data-paths accessing a shared memory resource. The implemented function's performance is independent of its complexity, enabling significant latency reductions for compute-bound operations. We assess common functions (sqrt, power, trigonometric, and hyperbolic functions) on the Xilinx Alveo U280 FPGA. Our function-agnostic memory-compute core can serve 1024 parallel function calls at 300 MHz and reduces latency by 4.4x to 29x versus traditional FPGA implementations.
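
The core idea of memory-based computing, as the abstract describes it, can be illustrated in software with a minimal sketch. It is not the authors' ring-based architecture or their hardware design. The table depth (`ADDR_BITS`), the input range (`X_MAX`), and the helper names `build_table` and `lookup` are all hypothetical choices made for illustration. The sketch only shows why a lookup costs the same regardless of how complex the stored function is.

```python
import math

ADDR_BITS = 10   # hypothetical table depth: 2**10 = 1024 entries
X_MAX = 4.0      # hypothetical input range [0, X_MAX)


def build_table(fn, addr_bits=ADDR_BITS, x_max=X_MAX):
    """Pre-compute fn at evenly spaced points across [0, x_max)."""
    size = 1 << addr_bits
    step = x_max / size
    return [fn(i * step) for i in range(size)]


def lookup(table, x, x_max=X_MAX):
    """Quantise x to a table address and read back the stored result.

    The work done here is one address calculation and one read,
    whichever function was used to fill the table.
    """
    size = len(table)
    addr = min(int(x / x_max * size), size - 1)
    return table[addr]


# Two functions of different computational cost share the same lookup path.
sqrt_table = build_table(math.sqrt)
tanh_table = build_table(math.tanh)
```

With these assumed parameters, `lookup(sqrt_table, 1.0)` reads the entry pre-computed for x = 1.0. Inputs that fall between table points are rounded down to the nearest stored point, so the accuracy depends on the table depth rather than on the function being evaluated.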

    Slides
    • Maximising Parallel Memory Access for Low Latency FPGA Designs (application/pdf)