Efficient execution of irregular data flow graphs: Hardware/software co-optimization for probabilistic AI and sparse triangular systems

Marian Verhelst
Hardware-efficient AI and ML

Introduction / Objective

To meet the ever-present demand for smarter machines, increasing research efforts are focused on developing novel artificial intelligence (AI) models. However, despite their promising algorithmic properties, many of these models do not execute efficiently on existing hardware architectures such as GPUs and neural network processors. A salient example is the class of Probabilistic Circuits (PCs) used for neuro-symbolic AI, whose sparse, irregular, graph-based computational patterns are challenging to execute. This project takes on that challenge by developing a hardware/software co-optimized computation stack that enables such models in energy-constrained edge applications.
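To make this computational pattern concrete, the sketch below (illustrative Python, not the project's code; the Node class and evaluate function are hypothetical) evaluates a small probabilistic circuit as a DAG of sum and product nodes. The bottom-up traversal gathers child results through graph-dependent indices, which is exactly the irregular, data-dependent access pattern that maps poorly onto GPUs and dense neural-network processors.

```python
# Illustrative sketch: a probabilistic circuit is a directed acyclic graph of
# sum and product nodes over input leaves. Evaluation is a bottom-up traversal
# with irregular, graph-dependent memory accesses.
from dataclasses import dataclass, field

@dataclass
class Node:
    op: str                                        # "leaf", "sum", or "prod"
    children: list = field(default_factory=list)   # indices of child nodes
    weights: list = field(default_factory=list)    # mixture weights (sum nodes)
    value: float = 0.0                             # leaf probability

def evaluate(nodes: list[Node]) -> float:
    """Evaluate a topologically ordered probabilistic circuit bottom-up."""
    out = [0.0] * len(nodes)
    for i, n in enumerate(nodes):                  # nodes assumed in topological order
        if n.op == "leaf":
            out[i] = n.value
        elif n.op == "prod":
            p = 1.0
            for c in n.children:                   # irregular gathers from earlier results
                p *= out[c]
            out[i] = p
        else:  # "sum"
            out[i] = sum(w * out[c] for w, c in zip(n.weights, n.children))
    return out[-1]                                 # root is assumed to be the last node

# Tiny example: P = 0.3*(x0*x1) + 0.7*(x0*x2)
nodes = [
    Node("leaf", value=0.9), Node("leaf", value=0.5), Node("leaf", value=0.2),
    Node("prod", children=[0, 1]), Node("prod", children=[0, 2]),
    Node("sum", children=[3, 4], weights=[0.3, 0.7]),
]
print(evaluate(nodes))  # 0.3*0.45 + 0.7*0.18 = 0.261
```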


Research Methodology

To address the execution bottlenecks of PCs (and of similar irregular data flow graphs in general), the following contributions are made across the hardware/software stack:

Application: The most suitable data representation is identified by developing analytical error and energy models of customized fixed-point and floating-point formats. A novel representation based on the posit format is also investigated (a posit-decoding sketch follows this list).
Compilation: Optimized mapping algorithms are developed to parallelize the workloads on general-purpose multithreaded CPUs and on the dedicated hardware architectures, minimizing synchronization and communication overheads (a level-scheduling sketch follows this list).
Hardware: Two generations of a dedicated DAG Processing Unit (DPU) are developed, each incorporating a dedicated spatial datapath, a targeted interconnection network, precision-scalable arithmetic units, and a custom memory hierarchy.
Implementation: The hardware innovations are realized and validated through an optimized physical implementation of the first-generation DPU as a chip in 28nm CMOS technology.
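As a reference point for the data-representation study, the sketch below decodes a standard non-negative posit bit pattern into its real value. This is a minimal sketch: the customized posit variants studied in the project differ, and the decode_posit function and its defaults are illustrative assumptions. Since PC node values are probabilities, the negative half of the encoding and the special cases are ignored.

```python
def decode_posit(bits: int, nbits: int = 8, es: int = 0) -> float:
    """Decode a non-negative standard posit bit pattern into a float (no special cases)."""
    if bits == 0:
        return 0.0
    body = bits & ((1 << (nbits - 1)) - 1)      # strip the sign bit (assumed 0)
    # Regime: run of identical leading bits, terminated by the opposite bit.
    first = (body >> (nbits - 2)) & 1
    run, i = 0, nbits - 2
    while i >= 0 and ((body >> i) & 1) == first:
        run += 1
        i -= 1
    k = run - 1 if first == 1 else -run
    i -= 1                                      # skip the regime terminator bit
    # Exponent: up to es bits; missing bits are treated as zero.
    e = 0
    for _ in range(es):
        e <<= 1
        if i >= 0:
            e |= (body >> i) & 1
            i -= 1
    # Fraction: all remaining bits.
    nfrac = i + 1
    frac = body & ((1 << nfrac) - 1) if nfrac > 0 else 0
    useed = 2 ** (2 ** es)
    mant = 1.0 + (frac / (1 << nfrac) if nfrac > 0 else 0.0)
    return (useed ** k) * (2.0 ** e) * mant

# Example: with nbits=8 and es=0, the pattern 0b00101000 encodes 0.625.
print(decode_posit(0b00101000, nbits=8, es=0))
```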
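The sketch below illustrates the basic level-scheduling idea behind such parallelization: group DAG nodes into layers with no mutual dependencies, execute each layer with parallel threads, and place a synchronization barrier between consecutive layers. The project's GraphOpt compiler goes further, using constrained optimization to form coarser, load-balanced super-layers; the level_schedule function here is only an illustrative stand-in.

```python
# Illustrative level scheduling (topological layering) of a DAG workload.
from collections import defaultdict, deque

def level_schedule(num_nodes: int, edges: list[tuple[int, int]]) -> list[list[int]]:
    """Group DAG nodes (edges are (src, dst)) into dependency-free layers."""
    indeg = [0] * num_nodes
    succ = defaultdict(list)
    for src, dst in edges:
        indeg[dst] += 1
        succ[src].append(dst)
    frontier = deque(i for i in range(num_nodes) if indeg[i] == 0)
    layers = []
    while frontier:
        layer = list(frontier)                  # nodes in a layer are independent
        frontier = deque()
        for n in layer:
            for m in succ[n]:                   # release successors whose inputs are ready
                indeg[m] -= 1
                if indeg[m] == 0:
                    frontier.append(m)
        layers.append(layer)                    # a barrier separates consecutive layers
    return layers

# The 6-node circuit from the earlier sketch: leaves 0-2, products 3-4, sum 5.
edges = [(0, 3), (1, 3), (0, 4), (2, 4), (3, 5), (4, 5)]
print(level_schedule(6, edges))   # [[0, 1, 2], [3, 4], [5]]
```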


Results & Conclusions

The cohesive hardware/software optimizations achieve higher throughput than CPUs and GPUs, while operating at an order of magnitude higher energy efficiency. The main findings can be summarized as follows:

• An 8-bit posit representation can be customized to reach the same accuracy as 32-bit floating point for PCs.
• The optimized mapping algorithms achieve a 2× speedup for multithreaded CPU execution.
• The 28nm DPU prototype achieves speedups of 5× and 20× over a CPU and a GPU, respectively, while operating below 0.25 W.

These results demonstrate that the project contributes important building blocks for the efficient execution of PCs and similar workloads based on irregular data flow graphs.

Get in touch
Marian Verhelst
Academic staff

Publications about this research topic

Articles in international journals

  1. N. Shah, L. I. G. Olascoaga, S. Zhao, W. Meert and M. Verhelst, "DPU: DAG Processing Unit for Irregular Graphs With Precision-Scalable Posit Arithmetic in 28 nm," in IEEE Journal of Solid-State Circuits (JSSC), vol. 57, no. 8, pp. 2586-2596, August 2022.
  2. N. Shah, W. Meert and M. Verhelst, "GraphOpt: Constrained-Optimization-Based Parallelization of Irregular Graphs," in IEEE Transactions on Parallel and Distributed Systems (TPDS), vol. 33, no. 12, pp. 3321-3332, December 2022.

Articles in international conference proceedings

  1. N. Shah, W. Meert and M. Verhelst, "DPU-v2: Energy-efficient execution of irregular directed acyclic graphs," 2022 55th IEEE/ACM International Symposium on Microarchitecture (MICRO), 2022, pp. 1288-1307.
  2. N. Shah, L. I. G. Olascoaga, S. Zhao, W. Meert and M. Verhelst, "9.4 PIU: A 248GOPS/W Stream-Based Processor for Irregular Probabilistic Inference Networks Using Precision-Scalable Posit Arithmetic in 28nm," 2021 IEEE International Solid-State Circuits Conference (ISSCC), 2021, pp. 150-152.
  3. N. Shah, L. I. G. Olascoaga, W. Meert and M. Verhelst, "PROBLP: A framework for low-precision probabilistic inference," 2019 56th ACM/IEEE Design Automation Conference (DAC), 2019, pp. 1-6.
  4. N. Shah, L. I. G. Olascoaga, W. Meert and M. Verhelst, "Acceleration of probabilistic reasoning through custom processor architecture," 2020 Design, Automation & Test in Europe Conference & Exhibition (DATE), 2020, pp. 322-325.
  5. L. I. G. Olascoaga, W. Meert, N. Shah, M. Verhelst and G. Van den Broeck, "Towards hardware-aware tractable learning of probabilistic models," Advances in Neural Information Processing Systems 32 (NeurIPS), 2019.
  6. S. Zhao, N. Shah, W. Meert and M. Verhelst, "Discrete Samplers for Approximate Inference in Probabilistic Machine Learning," 2022 Design, Automation & Test in Europe Conference & Exhibition (DATE), 2022, pp. 1221-1226.
  7. L. I. G. Olascoaga, W. Meert, N. Shah, G. Van den Broeck and M. Verhelst, "Discriminative bias for learning probabilistic sentential decision diagrams," International Symposium on Intelligent Data Analysis (IDA), Springer, 2020, pp. 184-196.

Other research topics in Hardware-efficient AI and ML

Anda: Unlocking Efficient LLM Inference with a Variable-Length Grouped Activation Data Format
Hardware-efficient AI and ML
Man Shi, Arne Symons, Robin Geens, and Chao Fang | Marian Verhelst
Massive parallelism for combinatorial optimisation problems
Hardware-efficient AI and ML
Toon Bettens and Sofie De Weer | Wim Dehaene and Marian Verhelst
Carbon-aware Design Space Exploration for AI Accelerators
Hardware-efficient AI and ML
Jiacong Sun | Georges Gielen and Marian Verhelst
Decoupled Control Flow and Memory Orchestration in the Vortex GPGPU
Hardware-efficient AI and ML
Giuseppe Sarda | Marian Verhelst
Automated Causal CNN Scheduling Optimizer for Real-Time Edge Accelerators
Hardware-efficient AI and ML
Jun Yin | Marian Verhelst
A Scalable Heterogeneous Multi-accelerator Platform for AI and ML
Hardware-efficient AI and ML
Ryan Antonio | Marian Verhelst
Uncertainty-Aware Design Space Exploration for AI Accelerators
Hardware-efficient AI and ML
Jiacong Sun | Georges Gielen and Marian Verhelst
Integer GEMM Accelerator for SNAX
Hardware-efficient AI and ML
Xiaoling Yi | Marian Verhelst
Improving GPGPU microarchitecture for future AI workloads
Hardware-efficient AI and ML
Giuseppe Sarda | Marian Verhelst
SRAM-based digital in-memory compute macro in 16nm
Hardware-efficient AI and ML
Weijie Jiang | Wim Dehaene
Scalable large array nanopore readouts for proteomics and next-generation sequencing
Analog and power management circuits, Hardware-efficient AI and ML, Biomedical circuits and sensor interfaces
Sander Crols | Filip Tavernier and Marian Verhelst
Design space exploration of in-memory computing DNN accelerators
Hardware-efficient AI and ML
Pouya Houshmand and Jiacong Sun | Marian Verhelst
Multi-core architecture exploration for layer-fused deep learning acceleration
Hardware-efficient AI and ML
Arne Symons | Marian Verhelst
HW-algorithm co-design for Bayesian inference of probabilistic machine learning
Ultra-low power digital SoCs and memories, Hardware-efficient AI and ML
Shirui Zhao | Marian Verhelst
Design space exploration for machine learning acceleration
Hardware-efficient AI and ML
Arne Symons | Marian Verhelst
Enabling Fast Exploration of the Depth-first Scheduling Space for DNN Accelerators
Hardware-efficient AI and ML
Arne Symons | Marian Verhelst
Optimized deployment of AI algorithms on rapidly-changing heterogeneous multi-core compute platforms
Ultra-low power digital SoCs and memories, Hardware-efficient AI and ML
Josse Van Delm | Marian Verhelst
High-throughput high-efficiency SRAM for neural networks
Ultra-low power digital SoCs and memories, Hardware-efficient AI and ML
Wim Dehaene and Marian Verhelst
Heterogeneous Multi-core System-on-Chips for Ultra Low Power Machine Learning Application at the Edge
Hardware-efficient AI and ML
Pouya Houshmand, Giuseppe Sarda, and Ryan Antonio | Marian Verhelst

Want to work with us?

Get in touch or discover the ways we can collaborate.