Machine learning and artificial intelligence solutions are increasingly pervasive in today's society. They enable unprecedented capabilities in robotics, smart appliances, autonomous vehicles and wearables. Many traditional signal processing tasks, such as speech denoising or image segmentation, are increasingly being replaced by data-driven ML techniques. Traditionally, these training and inference workloads ran in the cloud, where powerful compute servers and abundant memory resources are available. Yet, we have recently seen a rapid shift towards edge and extreme-edge processing of machine intelligence workloads. This shift gives rise to a new class of devices, often denoted "edge AI" or "tinyML" devices. Over the last decade, the research team at MICAS has been exploring improved hardware architectures, chip implementations and hardware-algorithm co-optimization techniques for hardware-efficient AI solutions.
The impressive progress in this field comes with drastic increases in model sizes and complexity. As such, enabling powerful ML algorithms within a constrained memory, latency and/or energy budget poses several exciting challenges. Execution efficiency can be obtained by customizing processor architectures to the models of interest. Yet, the speed at which new models emerge impedes such tight co-optimization and requires hardware platforms to remain flexible towards future developments. The challenge is hence to strike the right balance between customization and flexibility. Our MICAS team continued to work on several innovations towards this goal.
New processor architectures have to be developed to accelerate the targeted workloads, as existing CPUs and GPUs fail to achieve sufficient efficiency. New NPU (neural processing unit), TPU (tensor processing unit) and IMC (in-memory computing) designs are being developed and offer significant speed-ups. Yet, we are at a point where single-core solutions no longer suffice, and new multi-accelerator systems have to be explored.
Our vision to achieve efficient execution for a multitude of diverse ML workloads is to combine different accelerator cores in heterogeneous multi-core processing platforms. The Diana platform, taped out in 2021, was the first heterogeneous multi-core system developed in our lab, combining a RISC-V CPU, a digital AI accelerator and an analog in-memory AI accelerator. In 2023, we continued with the design of various AI accelerators for bit-sparse DNN inference and for evaluating emerging probabilistic graphical models. In 2024, we are focusing our efforts on a RISC-V-based processor architecture template, denoted "SNAX", enabling the easy integration of a wide variety of ML accelerators in a RISC-V framework.
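To give a conceptual flavor of such heterogeneous execution, the sketch below assigns each layer of a toy network to the core with the lowest estimated energy among those that support it. This is purely illustrative: the core names, supported operator sets and energy numbers are hypothetical, and this is not the actual Diana or SNAX software stack.

```python
# Illustrative sketch of layer-to-core dispatch on a heterogeneous
# multi-core AI platform. All names and cost numbers are hypothetical.

# Candidate cores, loosely mirroring a Diana-like platform:
# a RISC-V CPU, a digital DNN accelerator, and an analog IMC array.
CORES = {
    "riscv_cpu":   {"supports": {"any"},            "energy_per_mac_pj": 10.0},
    "digital_npu": {"supports": {"conv", "matmul"}, "energy_per_mac_pj": 0.5},
    "analog_imc":  {"supports": {"matmul"},         "energy_per_mac_pj": 0.05},
}

def assign_core(layer_type: str, macs: int) -> str:
    """Pick the core with the lowest estimated energy that supports the layer."""
    candidates = [
        (spec["energy_per_mac_pj"] * macs, name)
        for name, spec in CORES.items()
        if layer_type in spec["supports"] or "any" in spec["supports"]
    ]
    return min(candidates)[1]

# Toy network: (layer type, number of multiply-accumulate operations)
network = [("conv", 90_000_000), ("matmul", 4_000_000), ("softmax", 10_000)]
for layer_type, macs in network:
    print(layer_type, "->", assign_core(layer_type, macs))
```

In practice, such assignment decisions also interact with data-movement and scheduling costs, which is exactly what the compilation and mapping tools discussed below address.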
In parallel, we are developing integrated compilation flows, which allow smooth customization for heterogeneous platforms consisting of a diverse mix of accelerators. A first flow based on TVM, called "HTVM", has been rolled out and deployed for the Diana and GAP9 chips. Currently, the flow is being migrated to MLIR to enable increased flexibility and customization.
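The sketch below illustrates the general style of such a TVM-based flow, using TVM's standard "Bring Your Own Codegen" (BYOC) partitioning passes to offload supported operators to an accelerator backend. Note that the target name "my_accel", the model file and the input shape are placeholders; this is not HTVM's actual API.

```python
# Minimal sketch of a TVM "Bring Your Own Codegen" (BYOC) style flow,
# similar in spirit to how a flow like HTVM offloads operators to an
# accelerator. "my_accel" and the input details are placeholders.
import onnx
import tvm
from tvm import relay

onnx_model = onnx.load("model.onnx")
mod, params = relay.frontend.from_onnx(
    onnx_model, shape={"input": (1, 3, 224, 224)}
)

# Standard BYOC partitioning passes: annotate operators supported by the
# accelerator backend, merge them into regions, and split those regions
# into separate functions handed to the accelerator's code generator.
# (A real backend would first register which operators "my_accel"
# supports, e.g. via tvm.ir.register_op_attr; omitted here.)
mod = relay.transform.AnnotateTarget("my_accel")(mod)
mod = relay.transform.MergeCompilerRegions()(mod)
mod = relay.transform.PartitionGraph()(mod)

# Remaining operators fall back to the host (e.g. a RISC-V CPU).
with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(mod, target="c", params=params)
```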
The degrees of freedom in designing such ML accelerators are vast, and it is infeasible, time-wise, to develop each candidate at the RTL level to assess its relative performance. When migrating from single-core to multi-accelerator heterogeneous systems, both the design space and the scheduling or mapping space increase drastically once more. Moreover, the optimal hardware architecture is tightly interwoven with the optimal execution schedule when mapping different workloads onto the hardware, requiring co-optimization. To enable this, rapid modeling and design/scheduling-space exploration (DSE) frameworks have been developed at MICAS, called ZigZag (for single-core systems) and Stream (for multi-core systems). ZigZag and Stream are available open source and are continuously expanded by our team. In 2023, our tool suite was extended with ZigZag-IMC to also model in-memory computing architectures.
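To give a flavor of what such a DSE framework automates, the toy example below exhaustively enumerates tilings of a matrix multiplication and estimates off-chip memory traffic for each under a simplistic two-level memory model. This is an illustrative sketch, not ZigZag's actual cost model; all sizes and assumptions are hypothetical.

```python
# Toy design/scheduling-space exploration in the spirit of ZigZag:
# enumerate tile sizes for an M x K x N matrix multiply and estimate
# DRAM traffic under a simple two-level (DRAM + SRAM) memory model.
from itertools import product

M, K, N = 256, 256, 256       # workload dimensions
SRAM_WORDS = 16 * 1024        # on-chip buffer capacity (words)

def divisors(n):
    return [d for d in range(1, n + 1) if n % d == 0]

best = None
for tm, tk, tn in product(divisors(M), divisors(K), divisors(N)):
    # Tiles of A (tm x tk), B (tk x tn) and C (tm x tn) must fit on chip.
    if tm * tk + tk * tn + tm * tn > SRAM_WORDS:
        continue
    # Each tile of A is reloaded once per N-tile, each tile of B once per
    # M-tile; C is written out once (output-stationary inner loops assumed).
    dram_words = (M * K) * (N // tn) + (K * N) * (M // tm) + M * N
    if best is None or dram_words < best[0]:
        best = (dram_words, (tm, tk, tn))

print(f"best tiling (tm, tk, tn) = {best[1]}, DRAM traffic = {best[0]} words")
```

Real frameworks like ZigZag and Stream additionally model multi-level memory hierarchies, spatial unrolling and energy per access, and prune the search space far more intelligently than this brute-force loop.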
All frameworks are available fully open-source on GitHub, via the links in the text above.
In 2023, Prof. Verhelst's BINGO ERC project launched. BINGO tackles the discrepancy between the slow development cycle of processor chips (many months to years) and the fast-paced evolution of ML algorithms (hours to weeks). This bottleneck, also known as the "hardware lottery", holds back innovation, severely impacts embedded AI execution efficiency, and narrows the market to a few large companies. The BINGO vision to break this innovation deadlock is to enable heterogeneous compute platform customization for a given AI workload in a matter of days (100x faster), through rapid selection and assembly of prefabricated co-processor chiplets. A new team at MICAS will work to realize this vision over the coming five years of the BINGO project.