Research goal: Matrix multiplication is a routine heavily used in Artificial Intelligence workloads. This project focuses on accelerating matrix multiplication with a specialized GEMM (General Matrix Multiplication) accelerator. The accelerator complies with the SNAX interface of a RISC-V manager core, forming an efficient heterogeneous system for emerging Artificial Intelligence workloads. It is written in Chisel 5.0.0, offers many design-time configuration parameters, and is connected to the SNAX core through SystemVerilog.
Microarchitecture Description: The GEMM Accelerator comes in three versions: Base GEMM, Block GEMM, and Batch GEMM. The Base GEMM is the datapath of the accelerator: a Mesh of compute Tiles, where each Tile is a hardware module implementing the dot product of two vectors. 2D tile data are broadcast to every compute Tile. The Mesh size, the Tile size, and the datatype can all be configured at design time to accommodate different application scenarios. The Block GEMM temporally unrolls a block matrix multiplication over the Base GEMM in hardware, so it supports matrix multiplications of any size. The Batch GEMM further temporally unrolls a batch of matrix multiplications over the Block GEMM in hardware to save configuration cycles.
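The Tile/Mesh/Block hierarchy above can be illustrated with a small behavioral model. This is a sketch of the computation only, not the RTL; the parameter names and the 8x8 mesh with tile size 8 are illustrative assumptions, not necessarily the accelerator's actual default configuration.

```python
# Behavioral sketch of the GEMM datapath hierarchy (not the RTL).
# MESH_ROW, MESH_COL, TILE_SIZE model the design-time parameters;
# the values below are assumptions for illustration.
MESH_ROW, MESH_COL, TILE_SIZE = 8, 8, 8

def tile_dot(a_vec, b_vec):
    """One compute Tile: the dot product of two TILE_SIZE vectors."""
    return sum(x * y for x, y in zip(a_vec, b_vec))

def base_gemm(a_blk, b_blk):
    """Base GEMM: a MESH_ROW x MESH_COL mesh of Tiles computes one
    (MESH_ROW, TILE_SIZE) x (TILE_SIZE, MESH_COL) product per call."""
    return [[tile_dot(a_blk[i], [b_blk[k][j] for k in range(TILE_SIZE)])
             for j in range(MESH_COL)] for i in range(MESH_ROW)]

def block_gemm(A, B, M, K, N):
    """Block GEMM: temporally unrolls an (M, K) x (K, N) multiplication
    into (M/MESH_ROW) * (K/TILE_SIZE) * (N/MESH_COL) Base GEMM calls,
    accumulating partial sub-block products (sizes assumed divisible)."""
    C = [[0] * N for _ in range(M)]
    for bi in range(M // MESH_ROW):
        for bj in range(N // MESH_COL):
            for bk in range(K // TILE_SIZE):
                a_blk = [[A[bi * MESH_ROW + i][bk * TILE_SIZE + k]
                          for k in range(TILE_SIZE)] for i in range(MESH_ROW)]
                b_blk = [[B[bk * TILE_SIZE + k][bj * MESH_COL + j]
                          for j in range(MESH_COL)] for k in range(TILE_SIZE)]
                partial = base_gemm(a_blk, b_blk)
                for i in range(MESH_ROW):
                    for j in range(MESH_COL):
                        C[bi * MESH_ROW + i][bj * MESH_COL + j] += partial[i][j]
    return C
```

A Batch GEMM would simply loop this over several (A, B) pairs with one configuration step, which is where the saved configuration cycles come from.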
Recent results for GEMM Accelerator: Preliminary experimental results show that the GEMM Accelerator achieves a 500x speedup over the RISC-V Snitch core. In the absence of data contention, the GEMM Accelerator reaches 96.24% utilization when computing a (32,64) x (64,32) matrix multiplication under the current hardware configuration.
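A utilization number like this can be understood as ideal compute cycles divided by total cycles. The model below is a back-of-the-envelope sketch: the 8x8 mesh, tile size 8, and the 5-cycle overhead are assumptions chosen to show how a figure such as 96.24% can arise, not the accelerator's measured breakdown.

```python
# Assumed design-time parameters (illustrative, not the verified config).
MESH_ROW, MESH_COL, TILE_SIZE = 8, 8, 8

def utilization(M, K, N, overhead_cycles):
    """One Base GEMM call per (MESH_ROW, TILE_SIZE) x (TILE_SIZE, MESH_COL)
    sub-block; utilization = ideal compute cycles / total cycles."""
    compute = (M // MESH_ROW) * (K // TILE_SIZE) * (N // MESH_COL)
    return compute / (compute + overhead_cycles)

# (32,64) * (64,32): 4 * 8 * 4 = 128 compute cycles; with an assumed
# 5-cycle configuration/drain overhead, 128 / 133 = 96.24%.
print(f"{utilization(32, 64, 32, overhead_cycles=5):.2%}")
```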
The GEMM Accelerator is open source at: https://github.com/KULeuven-MICAS/snax-gemm. The documentation can be found at: https://github.com/KULeuven-MICAS/snax-gemm/blob/main/README.md.
The paper has been accepted at ASP-DAC 2025 and can also be found at https://arxiv.org/abs/2411.09543.