Decoupled Control Flow and Memory Orchestration in the Vortex GPGPU

Giuseppe Sarda, Marian Verhelst

Hardware-efficient AI and ML

Vortex, a newly proposed open-source GPGPU platform based on the RISC-V ISA, offers a viable alternative for GPGPU research to the widely used modeling platforms based on commercial GPUs. Much like the push from the RISC-V movement for CPUs, Vortex can enable a myriad of fresh research directions for GPUs. However, as a young hardware platform, it lacks the performance competitiveness necessary for wide adoption.

In particular, Vortex underperforms on regular, memory-intensive kernels such as linear algebra routines, which form the basis of many applications, including Machine Learning. For such kernels, we identified control flow management overhead and memory orchestration as the main causes of performance degradation on Vortex.

To overcome these problems, this research proposes:

A hardware control manager to accelerate branching and predication in regular loop execution.
Decoupled memory streaming lanes to further hide memory latency with useful computation.
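
To make the motivation for the two proposals concrete, the sketch below counts dynamic instructions for a regular loop such as y[i] = a * x[i] + y[i]. It is a toy model and not the Vortex implementation: the per-iteration instruction mix, the LoopBody type, and the dynamic_instructions function are all hypothetical, and the numbers it produces are illustrative rather than derived from the reported results. It assumes that a hardware control manager removes the compare-and-branch instructions from the instruction stream, and that decoupled streaming lanes remove the loads, stores, and address updates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LoopBody:
    """Hypothetical per-iteration instruction mix for y[i] = a * x[i] + y[i]."""
    compute: int = 1         # fused multiply-add
    memory: int = 3          # load x[i], load y[i], store y[i]
    address_update: int = 1  # advance the shared element index
    loop_control: int = 2    # compare against the bound, then branch


def dynamic_instructions(body, iterations,
                         hw_loop_control=False, streamed_memory=False):
    """Count instructions issued by the core under the assumptions above."""
    per_iteration = body.compute
    if not streamed_memory:
        # Without streaming lanes, the core issues every memory access itself.
        per_iteration += body.memory + body.address_update
    if not hw_loop_control:
        # Without a hardware loop manager, the core issues compare and branch.
        per_iteration += body.loop_control
    return per_iteration * iterations


if __name__ == "__main__":
    body = LoopBody()
    n = 1024
    print("baseline:          ", dynamic_instructions(body, n))
    print("hw loop control:   ", dynamic_instructions(body, n, hw_loop_control=True))
    print("both extensions:   ", dynamic_instructions(body, n, True, True))
```

In this simplified model only the compute instruction remains in the loop body once both extensions are enabled, which is the intuition behind hiding control and memory work from the main pipeline. The real savings depend on the kernel and on the microarchitecture.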

The evaluation across different kernels shows 8× faster execution, a 10× reduction in dynamic instruction count, and an improvement in area-normalized performance from 0.35 to 1.63 GFLOP/s/mm² (roughly 4.7×).

These enhancements can be integrated into application-level libraries in the future, to unleash Vortex as a competitive open-source GPGPU platform for the next generation of Machine Learning.

Get in touch

PhD student

Marian Verhelst

Academic staff

System-level integration of the extension units and their interaction with the GPGPU pipeline. The added Control Flow Manager is placed in the issue stage, while the Decoupled Memory Streaming Lanes are in the issue stage.

Other research topics in Hardware-efficient AI and ML

Anda: Unlocking Efficient LLM Inference with a Variable-Length Grouped Activation Data Format

Hardware-efficient AI and ML

Man Shi, Arne Symons, Robin Geens, and Chao Fang | Marian Verhelst

Massive parallelism for combinatorial optimisation problems

Hardware-efficient AI and ML

Toon Bettens and Sofie De Weer | Wim Dehaene and Marian Verhelst

Carbon-aware Design Space Exploration for AI Accelerators

Hardware-efficient AI and ML

Jiacong Sun | Georges Gielen and Marian Verhelst

Automated Causal CNN Scheduling Optimizer for Real-Time Edge Accelerators

Hardware-efficient AI and ML

Jun Yin | Marian Verhelst

A Scalable Heterogenous Multi-accelerator Platform for AI and ML

Hardware-efficient AI and ML

Ryan Antonio | Marian Verhelst

BitWave: Exploiting Column-Based Bit-Level Sparsity for Deep Learning Acceleration

Hardware-efficient AI and ML

Man Shi | Marian Verhelst

Uncertainty-Aware Design Space Exploration for AI Accelerators

Hardware-efficient AI and ML

Jiacong Sun | Georges Gielen and Marian Verhelst

Integer GEMM Accelerator for SNAX

Hardware-efficient AI and ML

Xiaoling Yi | Marian Verhelst

Improving GPGPU micro architecture for future AI workloads

Hardware-efficient AI and ML

Giuseppe Sarda | Marian Verhelst

SRAM based digital in memory compute macro in 16nm

Hardware-efficient AI and ML

Weijie Jiang | Wim Dehaene

Scalable large array nanopore readouts for proteomics and next-generation sequencing

Analog and power management circuits, Hardware-efficient AI and ML, Biomedical circuits and sensor interfaces

Sander Crols | Filip Tavernier and Marian Verhelst

Hardware-algorithm Co-design and Accelerator Architecture Exploration for hybrid DNN and DSP Workloads

Hardware-efficient AI and ML

Jun Yin | Marian Verhelst

Design space exploration of in-memory computing DNN accelerators

Hardware-efficient AI and ML

Pouya Houshmand and Jiacong Sun | Marian Verhelst

Multi-core architecture exploration for layer-fused deep learning acceleration

Hardware-efficient AI and ML

Arne Symons | Marian Verhelst

HW-algorithm co-design for Bayesian inference of probabilistic machine learning

Ultra-low power digital SoCs and memories, Hardware-efficient AI and ML

Shirui Zhao | Marian Verhelst

Design space exploration for machine learning acceleration

Hardware-efficient AI and ML

Arne Symons | Marian Verhelst

Efficient execution of irregular data flow graphs: Hardware/software co-optimization for probabilistic AI and sparse triangular systems

Hardware-efficient AI and ML

Marian Verhelst

Cross-layer Dataflow Optimization for DNN Accelerators Exploiting Multi-bank Memories

Hardware-efficient AI and ML

Man Shi | Marian Verhelst

Enabling Fast Exploration of the Depth-first Scheduling Space for DNN Accelerators

Hardware-efficient AI and ML

Arne Symons | Marian Verhelst

Optimized deployment of AI algorithms on rapidly-changing heterogeneous multi-core compute platforms

Ultra-low power digital SoCs and memories, Hardware-efficient AI and ML

Josse Van Delm | Marian Verhelst

High-throughput high-efficiency SRAM for neural networks

Ultra-low power digital SoCs and memories, Hardware-efficient AI and ML

Wim Dehaene and Marian Verhelst

Heterogeneous Multi-core System-on-Chips for Ultra Low Power Machine Learning Application at the Edge

Hardware-efficient AI and ML

Pouya Houshmand, Giuseppe Sarda, and Ryan Antonio | Marian Verhelst
