Machine learning and artificial intelligence solutions are increasingly pervasive in today's society. They enable unprecedented capabilities in robotics, smart appliances, autonomous vehicles and wearables. Many traditional signal processing tasks, such as speech denoising or image segmentation, are increasingly being replaced by data-driven ML techniques. Traditionally, these training and inference workloads ran in the cloud, where powerful compute servers and abundant memory resources are available. Yet, we have recently seen a rapid shift towards edge and extreme-edge processing of machine intelligence workloads. This shift gives rise to a new class of devices, often denoted "edge AI" or "tinyML" devices. Over the last decade, the research team at MICAS has been exploring improved hardware architectures, chip implementations and hardware-algorithm co-optimization techniques for hardware-efficient AI solutions.
The impressive progress in this field comes with drastic increases in model sizes and complexity. As such, enabling powerful ML algorithms within a constrained memory, latency and/or energy budget poses several exciting challenges. Execution efficiency can be obtained by customizing processor architectures to the models of interest. Yet, the speed at which new models emerge impedes such tight co-optimization and requires hardware platforms to remain flexible towards future developments. The challenge is hence to strike the right balance between customization and flexibility. Our MICAS team continued to work on several innovations towards this goal.
New processor architectures have to be developed to accelerate the targeted workloads, as existing CPUs and GPUs fail to achieve sufficient efficiency. New NPU (neural processing unit), TPU (tensor processing unit) and IMC (in-memory computing) designs are being developed and offer significant speed-ups. Yet, we are at a point where single-core solutions no longer suffice, and new multi-accelerator systems have to be explored.
Our vision to achieve efficient execution for a multitude of diverse ML workloads is to combine different accelerator cores in heterogeneous multi-core processing platforms. The Diana platform, taped out in 2021, was the first heterogeneous multi-core system developed in our lab, combining a RISC-V CPU, a digital AI accelerator and an analog in-memory AI accelerator. In 2023, we continued with the design of various AI accelerators for bit-sparse DNN inference and for evaluating emerging probabilistic graphical models. In 2024, we are focusing our efforts on a RISC-V-based processor architecture template, denoted "SNAX", enabling the easy integration of a wide variety of ML accelerators in a RISC-V framework.
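To give a conceptual flavor of such heterogeneous execution, the sketch below assigns each layer of a toy network to the core with the lowest estimated energy among those that support it. This is purely illustrative: the core names, supported operator sets and energy numbers are hypothetical, and this is not the actual Diana or SNAX software stack.

```python
# Illustrative sketch of layer-to-core dispatch on a heterogeneous
# multi-core AI platform. All names and cost numbers are hypothetical.

# Candidate cores, loosely mirroring a Diana-like platform:
# a RISC-V CPU, a digital DNN accelerator, and an analog IMC array.
CORES = {
    "riscv_cpu":   {"supports": {"any"},            "energy_per_mac_pj": 10.0},
    "digital_npu": {"supports": {"conv", "matmul"}, "energy_per_mac_pj": 0.5},
    "analog_imc":  {"supports": {"matmul"},         "energy_per_mac_pj": 0.05},
}

def assign_core(layer_type: str, macs: int) -> str:
    """Pick the core with the lowest estimated energy that supports the layer."""
    candidates = [
        (spec["energy_per_mac_pj"] * macs, name)
        for name, spec in CORES.items()
        if layer_type in spec["supports"] or "any" in spec["supports"]
    ]
    return min(candidates)[1]

# Toy network: (layer type, number of multiply-accumulate operations)
network = [("conv", 90_000_000), ("matmul", 4_000_000), ("softmax", 10_000)]
for layer_type, macs in network:
    print(layer_type, "->", assign_core(layer_type, macs))
```

In practice, such assignment decisions also interact with data-movement and scheduling costs, which is exactly what the compilation and mapping tools discussed below address.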
In parallel, we are developing integrated compilation flows, which allow smooth customization for heterogeneous platforms consisting of a diverse mix of accelerators. A first flow based on TVM, called "HTVM", has been rolled out and deployed for the Diana and GAP9 chips. Currently, the flow is being migrated to MLIR to enable increased flexibility and customization.
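The sketch below illustrates the general style of such a TVM-based flow, using TVM's standard "Bring Your Own Codegen" (BYOC) partitioning passes to offload supported operators to an accelerator backend. Note that the target name "my_accel", the model file and the input shape are placeholders; this is not HTVM's actual API.

```python
# Minimal sketch of a TVM "Bring Your Own Codegen" (BYOC) style flow,
# similar in spirit to how a flow like HTVM offloads operators to an
# accelerator. "my_accel" and the input details are placeholders.
import onnx
import tvm
from tvm import relay

onnx_model = onnx.load("model.onnx")
mod, params = relay.frontend.from_onnx(
    onnx_model, shape={"input": (1, 3, 224, 224)}
)

# Standard BYOC partitioning passes: annotate operators supported by the
# accelerator backend, merge them into regions, and split those regions
# into separate functions handed to the accelerator's code generator.
# (A real backend would first register which operators "my_accel"
# supports, e.g. via tvm.ir.register_op_attr; omitted here.)
mod = relay.transform.AnnotateTarget("my_accel")(mod)
mod = relay.transform.MergeCompilerRegions()(mod)
mod = relay.transform.PartitionGraph()(mod)

# Remaining operators fall back to the host (e.g. a RISC-V CPU).
with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(mod, target="c", params=params)
```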
The degrees of freedom in designing such ML accelerators are vast, and it is infeasible, time-wise, to develop each candidate at the RTL level to assess its relative performance. When migrating from single-core to multi-accelerator heterogeneous systems, both the design space and the scheduling or mapping space increase drastically once more. Moreover, the optimal hardware architecture is tightly interwoven with the optimal execution schedule when mapping different workloads onto the hardware, requiring co-optimization. To enable this, rapid modeling and design/scheduling-space exploration (DSE) frameworks have been developed at MICAS, called ZigZag (for single-core systems) and Stream (for multi-core systems). ZigZag and Stream are available open source and are continuously expanded by our team. In 2023, our tool suite was extended with ZigZag-IMC to also model in-memory computing architectures.
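To give a flavor of what such a DSE framework automates, the toy example below exhaustively enumerates tilings of a matrix multiplication and estimates off-chip memory traffic for each under a simplistic two-level memory model. This is an illustrative sketch, not ZigZag's actual cost model; all sizes and assumptions are hypothetical.

```python
# Toy design/scheduling-space exploration in the spirit of ZigZag:
# enumerate tile sizes for an M x K x N matrix multiply and estimate
# DRAM traffic under a simple two-level (DRAM + SRAM) memory model.
from itertools import product

M, K, N = 256, 256, 256       # workload dimensions
SRAM_WORDS = 16 * 1024        # on-chip buffer capacity (words)

def divisors(n):
    return [d for d in range(1, n + 1) if n % d == 0]

best = None
for tm, tk, tn in product(divisors(M), divisors(K), divisors(N)):
    # Tiles of A (tm x tk), B (tk x tn) and C (tm x tn) must fit on chip.
    if tm * tk + tk * tn + tm * tn > SRAM_WORDS:
        continue
    # Each tile of A is reloaded once per N-tile, each tile of B once per
    # M-tile; C is written out once (output-stationary inner loops assumed).
    dram_words = (M * K) * (N // tn) + (K * N) * (M // tm) + M * N
    if best is None or dram_words < best[0]:
        best = (dram_words, (tm, tk, tn))

print(f"best tiling (tm, tk, tn) = {best[1]}, DRAM traffic = {best[0]} words")
```

Real frameworks like ZigZag and Stream additionally model multi-level memory hierarchies, spatial unrolling and energy per access, and prune the search space far more intelligently than this brute-force loop.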
All frameworks are available fully open-source on GitHub, via the links in the text above.
In 2023, Prof. Verhelst's BINGO ERC project launched. BINGO tackles the discrepancy between the slow development cycle of processor chips (many months to years) and the fast-paced evolution of ML algorithms (hours to weeks). This bottleneck, also known as the "hardware lottery", holds back innovation, severely impacts embedded AI execution efficiency, and narrows the market to a few large companies. The BINGO vision to break this innovation deadlock is to enable heterogeneous compute platform customization for a given AI workload in a matter of days (100x faster), through rapid selection and assembly of prefabricated co-processor chiplets. A new team at MICAS will work to realize this vision over the coming five years of the BINGO project.