Low-precision GeMMs with optimized data formats have played a key role in making DNNs more memory- and compute-efficient. Recently trending formats include block-scaled representations, born of tight HW-SW co-optimization, which compress network size by sharing exponents per block. Prior work mostly focuses on deploying such block-scaled GeMMs on custom accelerators for high efficiency, at the cost of flexibility and ease of deployment. In this work, we instead focus on optimizing flexible block-scaled GeMMs on programmable vector processors using commercially available vector instruction sets (ARM SVE). The programming model is also vector-length agnostic, which reduces development effort when exploring different deployment targets. We introduce efficient intrinsics-based GeMM microkernels with fused requantization to maximize kernel performance. Additionally, we optimize further by utilizing 2D block shapes alongside conventional 1D blocks and demonstrate their impact against baseline implementations. Lastly, we present accuracy-speedup tradeoffs for various block-scaled GeMM configurations on DNN inference and training workloads.
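To illustrate the idea of a block-scaled, vector-length-agnostic kernel on ARM SVE, here is a minimal sketch of a dot-product building block, assuming an int8 format with one fp32 scale shared per 1D block. The function name, parameters, and block layout are hypothetical and only illustrative; they are not taken from the talk's actual microkernels.

```c
// Hypothetical block-scaled dot product: a and b hold int8 values with one
// fp32 scale per block of `block_size` elements (a_scales, b_scales).
// Compile with e.g. gcc/clang -O2 -march=armv8.2-a+sve
#include <arm_sve.h>
#include <stdint.h>

float block_scaled_dot(const int8_t *a, const float *a_scales,
                       const int8_t *b, const float *b_scales,
                       int64_t k, int64_t block_size) {
    float acc = 0.0f;
    for (int64_t blk = 0; blk * block_size < k; ++blk) {
        const int8_t *ab = a + blk * block_size;
        const int8_t *bb = b + blk * block_size;
        int64_t len = k - blk * block_size;          // handle a partial last block
        if (len > block_size) len = block_size;

        svint32_t vsum = svdup_s32(0);
        // Vector-length-agnostic loop over one block: the predicate from
        // svwhilelt masks the tail, so the code runs unchanged on any SVE width.
        for (int64_t i = 0; i < len; i += svcntw()) {
            svbool_t pg = svwhilelt_b32(i, len);
            svint32_t va = svld1sb_s32(pg, ab + i);  // sign-extend int8 -> int32
            svint32_t vb = svld1sb_s32(pg, bb + i);
            vsum = svmla_s32_m(pg, vsum, va, vb);    // integer multiply-accumulate
        }
        // Apply the shared per-block scales once per block (the "block-scaled" part),
        // then accumulate in fp32.
        int64_t isum = svaddv_s32(svptrue_b32(), vsum);
        acc += (float)isum * a_scales[blk] * b_scales[blk];
    }
    return acc;
}
```

In a full microkernel this inner product would be tiled over output rows and columns, and the final fp32 accumulator would be requantized back to the low-precision output format in the same loop nest (the fused requantization mentioned above).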
20/9/2024 11:00 - 12:00
ESAT B91.200