Precision-Scalable Microscaling Hardware for Continual Learning at the Edge

Stef Cuyckens, Marian Verhelst
Hardware-efficient AI and ML

Research goals: Microscaling (MX) formats promise a single numerical framework that covers both low-bit-width inference and high-dynamic-range training, which is exactly what edge continual learning in robotics and next-generation NPUs need. Building on this, our work aims to create precision-scalable MX hardware that (i) enables efficient on-device learning for autonomous robots by supporting all six standardized MX data types in a single MAC array, and (ii) integrates such MX datapaths into a general-purpose NPU platform with streaming and control support for mixed-precision training and inference. Together, these two thrusts target a unified compute fabric in which one MAC array can fluidly switch between MXINT8 and the various MXFP modes, while the surrounding architecture and NPU integration sustain high throughput and energy efficiency across diverse workloads, from robotics policies to generic DNNs.
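
To make the shared-scale idea concrete, here is a minimal NumPy sketch (our illustration, not the hardware's quantization pipeline) of how an MX block is formed: every group of k = 32 elements stores one power-of-two scale plus low-precision elements, and only the element format changes across the six MX data types. The max_val entries follow the OCP MX specification; the helper name mx_quantize is ours, the scale rule shown is a simplified variant of the spec's, and element-level rounding is omitted.

```python
import numpy as np

# The six standardized MX data types (per the OCP MX specification). Each pairs
# one shared power-of-two scale per block of k = 32 elements with a low-precision
# element format; max_val is the largest magnitude the element format represents.
MX_FORMATS = {
    "MXINT8":     {"max_val": 1.984375},   # 8-bit two's complement, 127/64
    "MXFP8_E4M3": {"max_val": 448.0},
    "MXFP8_E5M2": {"max_val": 57344.0},
    "MXFP6_E2M3": {"max_val": 7.5},
    "MXFP6_E3M2": {"max_val": 28.0},
    "MXFP4_E2M1": {"max_val": 6.0},
}

def mx_quantize(x, fmt="MXFP8_E4M3", k=32):
    """Split x into blocks of k elements and pick one power-of-two scale per
    block so the block maximum fits the element format. Element-level rounding
    to the e/m grid is omitted for brevity."""
    max_val = MX_FORMATS[fmt]["max_val"]
    blocks = np.asarray(x, dtype=np.float64).reshape(-1, k)
    amax = np.maximum(np.abs(blocks).max(axis=1, keepdims=True), 2.0**-126)
    scale = 2.0 ** np.ceil(np.log2(amax / max_val))  # shared per-block scale
    return scale, blocks / scale                     # scale + scaled elements

scales, elems = mx_quantize(np.random.randn(64), "MXFP4_E2M1")
assert np.abs(elems).max() <= MX_FORMATS["MXFP4_E2M1"]["max_val"]
```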

Gap in the SotA: Existing continual-learning processors and MX accelerators fall short in two key ways. At the accelerator level, state-of-the-art systems such as DaCapo support only MXINT-like formats and rely on vector-based shared-exponent groups. This organization either forces two copies of the weights in memory (to separately serve the forward and backward passes) or requires storing the weights in full precision and quantizing them on the fly; both options are inefficient from a storage perspective and clash with the tight memory budgets of edge robots. At the MAC and NPU level, prior MX MACs are dominated by heavy, exponent-aware reduction trees, in which a large fraction of area and energy is spent on accumulation, while NPU streamers are provisioned for static worst-case bandwidth, leading to over-provisioned channels and bank contention when precision is reduced. These limitations prevent current MX solutions from delivering truly precision-scalable, system-efficient MX processing across both training and inference.
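
The following sketch, a hypothetical illustration rather than code from any cited system, shows why 1×32 vector groups are problematic for training: the forward pass y = Wx reduces along the rows of W, the backward pass reduces along its columns, and each row-wise group scatters across 32 different column-wise groups.

```python
import numpy as np

# Label each element of a 64x64 weight tile, then form 1x32 shared-exponent
# groups the way a vector-based MX scheme would: along the reduction axis.
N = 64
idx = np.arange(N * N).reshape(N, N)   # unique id per weight element

fwd_groups = idx.reshape(-1, 32)       # forward (y = W @ x) reduces along rows
bwd_groups = idx.T.reshape(-1, 32)     # backward (dx = W.T @ dy) along columns

# Map every element to the backward group it lands in.
bwd_group_of = np.empty(N * N, dtype=int)
for g, grp in enumerate(bwd_groups):
    bwd_group_of[grp] = g

# One forward group scatters across 32 distinct backward groups, so a single
# quantized copy cannot serve both passes: store two copies, or keep the
# weights in full precision and requantize on the fly.
print(len(set(bwd_group_of[fwd_groups[0]])))  # -> 32
```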

Results: Our first work introduces the first precision-scalable MX MAC unit that supports all six MX data types, using 2-bit sub-word multipliers and a unified integer-floating-point datapath. It organizes MX values in 64-element square blocks rather than 32-element vectors, making the forward and backward passes symmetric without duplicate storage or on-the-fly requantization. Implemented as an MX processing array and GeMM core in TSMC 16 nm at 400 MHz, the design achieves a substantial reduction in memory footprint and a multi-fold increase in effective training throughput over DaCapo at iso peak throughput, while maintaining comparable energy efficiency on several robotics learning workloads, enabling practical continual learning at the edge. The second work optimizes the MX MAC’s dominant reduction tree with a hybrid integer-floating-point accumulation scheme that relaxes accuracy where it is safe to do so, and integrates an 8×8 MAC array into the SNAX NPU platform with bandwidth-aware data streaming. The resulting system reaches 657, 1438–1675, and 4065 GOPS/W for MXINT8, MXFP8/6, and MXFP4 at throughputs of 64, 256, and 512 GOPS, respectively, improving MX MAC energy efficiency over the previous state of the art while providing a deployable NPU building block for MX-based continual learning.
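
A short sketch of the square-block argument (our illustration, with to_blocks as a hypothetical helper): transposing the matrix merely swaps block coordinates and transposes each 8×8 block in place, so every block keeps its 64 elements, and hence its shared scale, and a single stored copy serves both passes.

```python
import numpy as np

N = 64
W = np.random.randn(N, N)

def to_blocks(M, b=8):
    """Tile a square matrix into b x b blocks, indexed as [bi, bj, b, b]."""
    n = M.shape[0]
    return M.reshape(n // b, b, n // b, b).swapaxes(1, 2)

blocks, blocks_T = to_blocks(W), to_blocks(W.T)

# Transposing W only swaps block coordinates and transposes each 8x8 block
# in place: the 64 elements per block, and hence the block's shared scale,
# are identical in both orientations.
for bi in range(N // 8):
    for bj in range(N // 8):
        assert np.array_equal(blocks_T[bi, bj], blocks[bj, bi].T)
```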

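The sketch below illustrates the general principle behind hybrid integer-floating-point accumulation, assuming the standard MX dot-product structure; the paper's optimized reduction tree additionally relaxes accuracy where safe, which is not reproduced here, and mx_block_dot is a hypothetical helper of ours. Because all products within a block pair share one power-of-two scale, the inner reduction can be exact integer arithmetic, leaving only one floating-point accumulation per block pair.

```python
import numpy as np

def mx_block_dot(sa, ia, sb, ib):
    """Dot product of two MX-encoded vectors, given per-block power-of-two
    scales (sa, sb) and integer element mantissas (ia, ib of shape
    [num_blocks, k]). Every product inside a block pair shares the scale
    sa * sb, so the inner reduction is exact integer arithmetic; a single
    floating-point accumulation per block pair combines the results."""
    acc = 0.0
    for s1, a, s2, b in zip(sa, ia, sb, ib):
        block_sum = np.dot(a.astype(np.int64), b.astype(np.int64))  # integer tree
        acc += float(s1) * float(s2) * float(block_sum)             # FP across blocks
    return acc

# Example: two 64-element MX vectors as 2 blocks of 32 integer mantissas each.
rng = np.random.default_rng(0)
ia, ib = rng.integers(-8, 8, (2, 32)), rng.integers(-8, 8, (2, 32))
sa, sb = 2.0 ** rng.integers(-4, 4, 2), 2.0 ** rng.integers(-4, 4, 2)
print(mx_block_dot(sa, ia, sb, ib))
```

In hardware terms, this structure replaces exponent alignment at every node of the reduction tree with a cheap integer adder tree and a single aligned floating-point addition per block, which is where the area and energy savings come from.
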
Get in touch
Stef Cuyckens
PhD student
Marian Verhelst
Academic staff
Overview of our precision-scalable Microscaling (MX) work.

Publications about this research topic

Stef Cuyckens, Xiaoling Yi, Nitish Satya Murthy, Chao Fang, Marian Verhelst, "Efficient Precision-Scalable Hardware for Microscaling (MX) Processing in Robotics Learning," to appear in the 2025 ACM/IEEE International Symposium on Low Power Electronics and Design (ISLPED 2025).

Stef Cuyckens, Xiaoling Yi, Robin Geens, Joren Dumoulin, Martin Wiesner, Chao Fang, Marian Verhelst, "Precision-Scalable Microscaling Datapaths with Optimized Reduction Tree for Efficient NPU Integration," to appear in the 2026 IEEE Asia and South Pacific Design Automation Conference (ASP-DAC 2026).

Other research topics in Hardware-efficient AI and ML

Vertically-Integrated Logic Fabrics for Future 3D Computing Platforms
Hardware-efficient AI and ML
Jannes Willemen | Marian Verhelst
Anda: Unlocking Efficient LLM Inference with a Variable-Length Grouped Activation Data Format
Hardware-efficient AI and ML
Man Shi, Arne Symons, Robin Geens, and Chao Fang | Marian Verhelst
Massive parallelism for combinatorial optimisation problems
Hardware-efficient AI and ML
Toon Bettens and Sofie De Weer | Wim Dehaene and Marian Verhelst
Carbon-aware Design Space Exploration for AI Accelerators
Hardware-efficient AI and ML
Jiacong Sun | Georges Gielen and Marian Verhelst
Decoupled Control Flow and Memory Orchestration in the Vortex GPGPU
Hardware-efficient AI and ML
Giuseppe Sarda | Marian Verhelst
Automated Causal CNN Scheduling Optimizer for Real-Time Edge Accelerators
Hardware-efficient AI and ML
Jun Yin | Marian Verhelst
A Scalable Heterogeneous Multi-accelerator Platform for AI and ML
Hardware-efficient AI and ML
Ryan Antonio | Marian Verhelst
Uncertainty-Aware Design Space Exploration for AI Accelerators
Hardware-efficient AI and ML
Jiacong Sun and Fanchen Kong | Georges Gielen and Marian Verhelst
Integer GEMM Accelerator for SNAX
Hardware-efficient AI and ML
Xiaoling Yi | Marian Verhelst
Improving GPGPU microarchitecture for future AI workloads
Hardware-efficient AI and ML
Giuseppe Sarda | Marian Verhelst
SRAM-based digital in-memory compute macro in 16 nm
Hardware-efficient AI and ML
Weijie Jiang | Wim Dehaene
Scalable large array nanopore readouts for proteomics and next-generation sequencing
Analog and power management circuits, Hardware-efficient AI and ML, Biomedical circuits and sensor interfaces
Sander Crols | Filip Tavernier and Marian Verhelst
Design space exploration of in-memory computing DNN accelerators
Hardware-efficient AI and ML
Pouya Houshmand and Jiacong Sun | Marian Verhelst
Multi-core architecture exploration for layer-fused deep learning acceleration
Hardware-efficient AI and ML
Arne Symons | Marian Verhelst
HW-algorithm co-design for Bayesian inference of probabilistic machine learning
Ultra-low power digital SoCs and memories, Hardware-efficient AI and ML
Shirui Zhao | Marian Verhelst
Design space exploration for machine learning acceleration
Hardware-efficient AI and ML
Arne Symons | Marian Verhelst
Enabling Fast Exploration of the Depth-first Scheduling Space for DNN Accelerators
Hardware-efficient AI and ML
Arne Symons | Marian Verhelst
Optimized deployment of AI algorithms on rapidly-changing heterogeneous multi-core compute platforms
Ultra-low power digital SoCs and memories, Hardware-efficient AI and ML
Josse Van Delm | Marian Verhelst
High-throughput high-efficiency SRAM for neural networks
Ultra-low power digital SoCs and memories, Hardware-efficient AI and ML
Wim Dehaene and Marian Verhelst
Heterogeneous Multi-core System-on-Chips for Ultra-Low-Power Machine Learning Applications at the Edge
Hardware-efficient AI and ML
Pouya Houshmand, Giuseppe Sarda, and Ryan Antonio | Marian Verhelst

Want to work with us?

Get in touch or discover the ways we can collaborate.