Integer GEMM Accelerator for SNAX

Xiaoling Yi and Marian Verhelst | Hardware-efficient AI and ML

Research goal: Matrix multiplication is a routine heavily used in Artificial Intelligence workloads. This project focuses on accelerating matrix multiplication with a specialized GEMM accelerator. The accelerator is compliant with the SNAX RISC-V manager core, building an efficient heterogeneous system for emerging Artificial Intelligence workloads. It is written in Chisel 5.0.0 with many design-time configuration parameters and is connected to the SNAX core through SystemVerilog.

Microarchitecture Description: The GEMM Accelerator is available in three versions: Base GEMM, Block GEMM, and Batch GEMM. The Base GEMM is the datapath of the accelerator: a Mesh of compute Tiles. Each 2D input data tile is broadcast to the compute Tiles, and each compute Tile is a hardware module implementing the dot product of two vectors. The Mesh size, the Tile size, and the datatype can all be configured at design time to accommodate different application scenarios. The Block GEMM temporally unrolls a block matrix multiplication over the Base GEMM in hardware, so that it can support matrix multiplication of any matrix size. The Batch GEMM further temporally unrolls a batch of matrix multiplications over the Block GEMM in hardware to save configuration cycles.
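As an illustration only (not the accelerator's actual Chisel RTL), the Base/Block hierarchy can be sketched as nested loops: the innermost dot product corresponds to one compute Tile, the spatial mesh produces one output block of partial sums per step, and the Block GEMM loops temporally over the blocks of an arbitrarily sized matrix. The mesh and tile dimensions below are hypothetical stand-ins for the design-time parameters.

```python
# Behavioral sketch of the Base/Block GEMM hierarchy as nested loops.
# MESH_M, MESH_N, TILE_K are hypothetical design-time parameters.
MESH_M, MESH_N, TILE_K = 2, 2, 4

def tile_dot(a_vec, b_vec):
    """One compute Tile: dot product of two TILE_K-element vectors."""
    return sum(x * y for x, y in zip(a_vec, b_vec))

def base_gemm(a_block, b_block):
    """Base GEMM: a MESH_M x MESH_N mesh of Tiles, evaluated spatially."""
    return [[tile_dot(a_block[i], b_block[j]) for j in range(MESH_N)]
            for i in range(MESH_M)]

def block_gemm(a, b, M, K, N):
    """Block GEMM: temporally unroll base_gemm over an (M,K) x (K,N) matmul.
    Assumes M, N are multiples of the mesh dims and K of TILE_K."""
    c = [[0] * N for _ in range(M)]
    for mi in range(0, M, MESH_M):          # temporal loop over row blocks
        for ni in range(0, N, MESH_N):      # temporal loop over column blocks
            for ki in range(0, K, TILE_K):  # accumulate along the K dimension
                a_blk = [a[mi + i][ki:ki + TILE_K] for i in range(MESH_M)]
                b_blk = [[b[ki + t][ni + j] for t in range(TILE_K)]
                         for j in range(MESH_N)]
                part = base_gemm(a_blk, b_blk)
                for i in range(MESH_M):
                    for j in range(MESH_N):
                        c[mi + i][ni + j] += part[i][j]
    return c
```

The Batch GEMM version would simply wrap one more temporal loop around `block_gemm`, reusing one configuration across the whole batch.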

Recent results for GEMM Accelerator: Preliminary experimental results show that the GEMM Accelerator achieves roughly 500x the performance of the RISC-V Snitch core. In the absence of data contention, the GEMM Accelerator achieves 96.24% utilization when computing a (32,64) * (64,32) matrix multiplication under the current hardware configuration.
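For intuition, utilization can be modeled as ideal compute cycles divided by measured cycles. The numbers below are assumptions for illustration: an 8x8 mesh with 8-deep tiles and a handful of configuration-overhead cycles are hypothetical, not confirmed parameters of the actual configuration.

```python
# Back-of-envelope utilization model; the 8x8 mesh with 8-deep tiles
# and the 5-cycle overhead are assumed values, not from the source.
MESH_M = MESH_N = TILE_K = 8

def ideal_cycles(M, K, N):
    """Cycles if the mesh produces one output block of partial sums per cycle."""
    return (M // MESH_M) * (N // MESH_N) * (K // TILE_K)

def utilization(M, K, N, measured_cycles):
    return ideal_cycles(M, K, N) / measured_cycles

ideal = ideal_cycles(32, 64, 32)                      # 4 * 4 * 8 = 128 cycles
print(ideal)                                          # 128
print(round(100 * utilization(32, 64, 32, 133), 2))   # 96.24
```

Under these assumed parameters, about 5 overhead cycles on top of 128 ideal cycles would reproduce a figure near the reported 96.24%.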

The GEMM Accelerator is available as open source at: https://github.com/KULeuven-MICAS/snax-gemm. The documentation can be found at: https://github.com/KULeuven-MICAS/snax-gemm/blob/main/README.md.

Microarchitecture of GEMM Accelerator

Publications about this research topic

The paper has been accepted at ASP-DAC 2025 and is also available at https://arxiv.org/abs/2411.09543.

