Hardware-efficient AI and ML

Machine learning and artificial intelligence solutions are increasingly pervasive in today’s society. They enable previously unseen capabilities in robotics, smart appliances, autonomous vehicles and wearables. Many traditional signal processing tasks, such as speech denoising or image segmentation, are increasingly replaced by data-driven ML techniques. Traditionally, these training and inference workloads ran in the cloud, where powerful compute servers and abundant memory resources are available. Yet, we have recently seen a rapid shift towards edge and extreme-edge processing of machine intelligence workloads. This opens up a new class of devices, also referred to as “edge AI” or “tinyML”. Over the last decade, the research team at MICAS has been exploring improved hardware architectures, chip implementations and hardware-algorithm co-optimization techniques for hardware-efficient AI solutions.

Research challenges

The impressive progress in this field comes with drastic increases in model size and complexity. As such, enabling powerful ML algorithms within a constrained memory, latency and/or energy budget comes with several exciting challenges. Execution efficiency can be obtained by customizing processor architectures to the models of interest. Yet, the speed at which new models emerge impedes such tight co-optimization and requires hardware platforms to remain flexible towards future developments. The challenge is hence to strike the right balance between customization and flexibility. Our MICAS team has continued to work on several innovations towards this goal.

 

Multi-core ML platforms and custom compilation infrastructure

New processor architectures have to be developed to accelerate the targeted workloads, as existing CPUs and GPUs fail to achieve sufficient efficiency. New NPU (neural processing unit), TPU (tensor processing unit) and IMC (in-memory computing) designs are being developed and offer significant speed-ups. Yet, we are at a point where single-core solutions no longer suffice, and new multi-accelerator systems have to be explored.

Our vision to achieve efficient execution for a multitude of diverse ML workloads is to combine different accelerator cores in heterogeneous multi-core processing platforms. The Diana platform, taped out in 2021, was the first heterogeneous multi-core system developed in our lab, combining a RISC-V CPU, a digital AI accelerator and an analog in-memory AI accelerator. In 2023, we continued with the design of various AI accelerators for bit-sparse DNN inference and for evaluating emerging probabilistic graphical models. In 2024, we focus our efforts on a RISC-V based processor architecture template, denoted as "SNAX", enabling the easy integration of a wide variety of ML accelerators in a RISC-V framework.

In parallel, we are developing integrated compile flows that allow smooth customization for heterogeneous platforms consisting of a diverse mix of accelerators. A first flow based on TVM, called "HTVM", has been rolled out and deployed for the Diana and GAP9 chips. Currently, the flow is being migrated to MLIR to enable increased flexibility and customization.
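To make the idea of compiling for a mix of accelerators concrete, the following is a minimal sketch of one step such a flow typically has to perform: deciding which core runs which operator, with the CPU as fallback. It is not the HTVM or MLIR implementation; the accelerator names, the supported-operator sets and the `partition` helper are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Accelerator:
    """Hypothetical description of one accelerator core."""
    name: str
    supported_ops: frozenset


def partition(graph, accelerators, fallback="cpu"):
    """Assign each operator to the first accelerator that supports it.

    `graph` is a list of (layer_name, op_type) pairs in execution order.
    `accelerators` is ordered by preference. Operators that no accelerator
    supports are assigned to the fallback core.
    """
    assignment = []
    for layer_name, op_type in graph:
        target = next(
            (acc.name for acc in accelerators if op_type in acc.supported_ops),
            fallback,
        )
        assignment.append((layer_name, target))
    return assignment


# Assumed platform: an in-memory-compute core preferred for convolutions,
# and a digital core that covers a wider set of operators.
PLATFORM = [
    Accelerator("analog_imc", frozenset({"conv2d"})),
    Accelerator("digital_npu", frozenset({"conv2d", "dense", "relu"})),
]

# Assumed toy network.
NETWORK = [
    ("conv1", "conv2d"),
    ("relu1", "relu"),
    ("fc", "dense"),
    ("prob", "softmax"),
]

if __name__ == "__main__":
    for layer_name, target in partition(NETWORK, PLATFORM):
        print(f"{layer_name:6s} -> {target}")
```

A real flow would also have to consider data layout, precision and the cost of moving tensors between cores. The greedy first-match rule here only illustrates the dispatch-with-fallback structure.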

 

Design/mapping space exploration for multi-accelerator platforms: ZigZag and Stream

The degrees of freedom in designing such ML accelerators are very large, and it is infeasible, time-wise, to develop each candidate at RTL level to assess their relative performance. When migrating from single-core to multi-accelerator heterogeneous systems, the design space as well as the scheduling or mapping space again increases drastically. Moreover, the optimal hardware architecture is tightly interwoven with the optimal execution schedule when mapping different workloads onto the hardware, which requires co-optimization. To enable this, rapid modeling and design/scheduling space exploration (DSE) frameworks are developed at MICAS, called ZigZag (for single-core) and Stream (for multi-core). ZigZag and Stream are available open source and are continuously expanded by our team. In 2023, our tool suite was extended with ZigZag-IMC to also model in-memory computing architectures.

All frameworks are available fully open source on GitHub, via the links in the text above.
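The interplay between hardware choices and mapping choices described above can be illustrated with a deliberately small sketch. It does not use the ZigZag or Stream APIs. The workload, the candidate hardware parameters, the tiling options and every cost constant are assumptions made up for this example, and the analytical cost model is far simpler than what a real DSE framework evaluates.

```python
from itertools import product

# Toy workload (assumed): one matrix multiplication C[M,N] += A[M,K] * B[K,N].
M, K, N = 64, 64, 64

# Hypothetical hardware design space.
PE_COUNTS = [16, 64, 256]            # number of MAC units
BUFFER_WORDS = [2048, 8192, 16384]   # on-chip buffer capacity, in words

# Hypothetical mapping space: edge length of a square output tile.
TILE_SIZES = [8, 16, 32, 64]

# Illustrative cost constants, not measured values.
AREA_PER_PE = 16
AREA_PER_WORD = 1
AREA_BUDGET = 12000
ENERGY_PER_MAC = 1
ENERGY_PER_DRAM_WORD = 100
DRAM_WORDS_PER_CYCLE = 16


def evaluate(pes, buffer_words, tile):
    """Cost of one (hardware, mapping) pair, or None if it is infeasible."""
    # One output tile needs a tile x K slice of A, a K x tile slice of B
    # and the tile x tile block of C in the on-chip buffer.
    footprint = 2 * tile * K + tile * tile
    if footprint > buffer_words:
        return None  # mapping does not fit in this buffer
    if pes * AREA_PER_PE + buffer_words * AREA_PER_WORD > AREA_BUDGET:
        return None  # hardware exceeds the area budget
    macs = M * K * N
    # Every output tile moves its full footprint to or from off-chip memory.
    dram_words = (M // tile) * (N // tile) * footprint
    # Latency is bounded by either compute or off-chip bandwidth.
    latency = max(macs / pes, dram_words / DRAM_WORDS_PER_CYCLE)
    energy = macs * ENERGY_PER_MAC + dram_words * ENERGY_PER_DRAM_WORD
    return {"latency": latency, "energy": energy, "edp": latency * energy}


def explore():
    """Exhaustively search hardware and mapping together, minimizing EDP."""
    best = None
    for pes, buffer_words, tile in product(PE_COUNTS, BUFFER_WORDS, TILE_SIZES):
        cost = evaluate(pes, buffer_words, tile)
        if cost is None:
            continue
        if best is None or cost["edp"] < best["edp"]:
            best = {"pes": pes, "buffer_words": buffer_words, "tile": tile, **cost}
    return best


if __name__ == "__main__":
    print(explore())
```

With these made-up constants, the search selects 64 PEs with an 8192-word buffer and 32 x 32 tiles rather than the 256-PE option. Under the area budget, the larger array only fits alongside the smallest buffer, which forces small tiles and more off-chip traffic. This is the kind of hardware/schedule coupling that makes joint exploration necessary.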

 

Kick-starting the Bingo ERC project!

In 2023, Prof. Verhelst’s Bingo ERC project launched. Bingo tackles the discrepancy between the slow development cycle of processor chips (many months to years) and the fast-paced evolution of ML algorithms (hours to weeks). This bottleneck, also known as the “hardware lottery”, holds back innovation, severely impacts embedded AI execution efficiency, and narrows the market to a few large companies. The BINGO vision to break this innovation deadlock is to enable heterogeneous compute platform customization for a given AI workload in a matter of days (100x faster), through rapid selection and assembly of prefabricated co-processor chiplets. A new team at MICAS will enable that vision over the coming 5 years of the Bingo project.

 

Current research topics

Automated Causal CNN Scheduling Optimizer for Real-Time Edge Accelerators
Hardware-efficient AI and ML
Jun Yin | Marian Verhelst
A Scalable Heterogenous Multi-accelerator Platform for AI and ML
Hardware-efficient AI and ML
Ryan Antonio | Marian Verhelst
Uncertainty-Aware Design Space Exploration for AI Accelerators
Hardware-efficient AI and ML
Jiacong Sun | Georges Gielen and Marian Verhelst
Integer GEMM Accelerator for SNAX
Hardware-efficient AI and ML
Xiaoling Yi | Marian Verhelst
Improving GPGPU micro architecture for future AI workloads
Hardware-efficient AI and ML
Giuseppe Sarda | Marian Verhelst
SRAM based digital in memory compute macro in 16nm
Hardware-efficient AI and ML
Weijie Jiang | Wim Dehaene
Scalable large array nanopore readouts for proteomics and next-generation sequencing
Analog and power management circuits, Hardware-efficient AI and ML, Biomedical circuits and sensor interfaces
Sander Crols | Filip Tavernier and Marian Verhelst
Design space exploration of in-memory computing DNN accelerators
Hardware-efficient AI and ML
Pouya Houshmand and Jiacong Sun | Marian Verhelst
Multi-core architecture exploration for layer-fused deep learning acceleration
Hardware-efficient AI and ML
Pouya Houshmand and Arne Symons | Marian Verhelst
HW-algorithm co-design for Bayesian inference of probabilistic machine learning
Ultra-low power digital SoCs and memories, Hardware-efficient AI and ML
Shirui Zhao | Marian Verhelst
Design space exploration for machine learning acceleration
Hardware-efficient AI and ML
Arne Symons | Marian Verhelst
Enabling Fast Exploration of the Depth-first Scheduling Space for DNN Accelerators
Hardware-efficient AI and ML
Arne Symons | Marian Verhelst
Optimized deployment of AI algorithms on rapidly-changing heterogeneous multi-core compute platforms
Ultra-low power digital SoCs and memories, Hardware-efficient AI and ML
Josse Van Delm | Marian Verhelst
High-throughput high-efficiency SRAM for neural networks
Ultra-low power digital SoCs and memories, Hardware-efficient AI and ML
Wim Dehaene and Marian Verhelst
Heterogeneous Multi-core System-on-Chips for Ultra Low Power Machine Learning Application at the Edge
Hardware-efficient AI and ML
Pouya Houshmand, Giuseppe Sarda, and Ryan Antonio | Marian Verhelst

Innovative chips

128KB high density digital in memory compute macro in 16nm FF
Technology: 16nm FF, TSMC
Published: ESSCIRC 2023
Application: Digital in memory compute for AI applications

Get in touch with our lead researchers

Wim Dehaene
Academic staff
Georges Gielen
Academic staff