Video s3
    Details
    Presenter(s)
    Arne Symons
    Affiliation
    KU Leuven
    Country
    Belgium
    Abstract

    The scheduling, or temporal mapping, of a neural network (NN) on a given hardware (HW) accelerator strongly impacts its execution energy and latency. Unfortunately, the mapping space is huge and varies greatly with the NN-HW combination. Many design space exploration (DSE) frameworks aim to explore this vast mapping space automatically. Yet state-of-the-art (SotA) frameworks suffer from being slow (e.g., exhaustive search), inflexible across a wide range of HW architectures (e.g., no support for uneven mapping), or unable to guarantee global optimality (e.g., because they rely on user-defined constraints or on random sampling). Moreover, existing frameworks are typically unable to predict the required CPU run-time and peak CPU memory in advance, and are thus unable to trade off search time against optimality in a deterministic manner. This work proposes LOMA, a fast auto-scheduling methodology based on loop-order-based memory allocation, which overcomes the above bottlenecks. LOMA's capabilities are demonstrated at scale by finding the optimal schedule of the complete MobileNetV3 and NASNet networks.
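To make the idea of a loop-order-based mapping space concrete, the toy sketch below enumerates all orderings of a layer's loop dimensions and scores each with a crude memory-spill cost. This is only an illustration of why the search space explodes and why loop order matters; the loop sizes, memory capacity, and cost model are all hypothetical and are not LOMA's actual algorithm or cost model.

```python
from itertools import permutations

# Hypothetical loop dimensions of one NN layer (names and sizes are assumptions).
LOOP_SIZES = {"K": 64, "C": 32, "OX": 16, "OY": 16}
L1_CAPACITY = 1024  # hypothetical innermost-memory size, in elements

def cost(order):
    """Toy cost: data volume exceeding L1 for each loop-nest suffix.

    The running footprint is the product of the sizes of the loops at and
    below the current level; whatever does not fit in L1 is counted as
    'spilled' traffic to a higher memory level.
    """
    spilled = 0
    footprint = 1
    for dim in reversed(order):  # walk from the innermost loop outward
        footprint *= LOOP_SIZES[dim]
        if footprint > L1_CAPACITY:
            spilled += footprint - L1_CAPACITY
    return spilled

def best_schedule():
    # Exhaustive search over all loop orders: 4! = 24 candidates here, but
    # factorial growth is exactly what makes exhaustive DSE slow at scale.
    return min(permutations(LOOP_SIZES), key=cost)

if __name__ == "__main__":
    order = best_schedule()
    print(order, cost(order))
```

In this toy model the best orders place the largest loop (`K`) outermost so the inner footprint stays within L1 as long as possible, mirroring how a schedule's loop order drives memory allocation and hence energy and latency.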