XDMA: A Distributed DMA for Flexible and Efficient Data Movement in Heterogeneous Multi-Accelerator SoCs

Yunhao Deng and Fanchen Kong | Marian Verhelst
Hardware-efficient AI and ML

Research Goal
The growing demand for compute performance, together with advances in silicon technology, has driven the integration of multiple heterogeneous accelerators into single Systems-on-Chip (SoCs). This integration aims to deliver higher performance and energy efficiency for compute-intensive workloads. While data access between memory subsystems and accelerators has been extensively optimized, data exchange between accelerators remains largely overlooked, limiting the overall performance of heterogeneous SoCs.

Data copying across heterogeneous accelerators raises three interrelated challenges:

  1. Memory-boundedness of modern workloads.
    Modern workloads, such as large language models (LLMs), are increasingly memory-bound due to limited data reuse. Simply scaling compute resources is insufficient if the underlying data movement cannot keep up.
  2. Support for complex in-memory data layouts.
    In-memory data layouts must align with the diverse access patterns of different accelerators. Suboptimal layouts can increase inference latency by up to two orders of magnitude, because explicit data layout transformations are both energy- and latency-intensive. Although Direct Memory Access (DMA) engines offer high bandwidth utilization, they are typically efficient only for contiguous memory accesses. Supporting complex, accelerator-specific layouts therefore often requires additional software loops for address generation, causing excessive control overhead and underutilization of on-chip interconnect bandwidth (see the sketch after this list).
  3. Efficient point-to-multipoint (P2MP) data movement.
    When the same data must be copied to multiple destinations (e.g., broadcasting model parameters or shared activations), traditional DMA engines perform repeated read–write operations for each target. This leads to redundant traffic and poor energy efficiency. Addressing this P2MP requirement calls for multicast-like capabilities. However, standard interconnect protocols lack native multicast support, and existing P2MP solutions, such as multicast-capable Networks-on-Chip (NoCs), introduce significant hardware overhead and require protocol modifications, undermining scalability and compatibility with existing SoC fabrics.
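
To make the layout challenge above concrete, the sketch below shows the kind of software fallback that a contiguous-burst-only DMA engine forces onto a host core. It is written in C with purely illustrative names and a purely illustrative tiling scheme; it is not XDMA code. The point is that the inner body moves only one short contiguous run, so most of the core's effort goes into index arithmetic and loop control rather than into moving data.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Illustrative software fallback (not XDMA code): re-tile a row-major
 * H x W matrix of 64-bit words into contiguously stored TH x TW tiles,
 * assuming H is a multiple of TH and W is a multiple of TW.  A DMA engine
 * that only accepts contiguous bursts can at best cover the innermost
 * TW-element run; every surrounding loop level is control code that runs
 * on a core for each short burst. */
static void retile_sw(const uint64_t *src, uint64_t *dst,
                      size_t H, size_t W, size_t TH, size_t TW)
{
    size_t out = 0;
    for (size_t ty = 0; ty < H; ty += TH)            /* tile row          */
        for (size_t tx = 0; tx < W; tx += TW)        /* tile column       */
            for (size_t y = 0; y < TH; y++) {        /* row inside a tile */
                /* one short contiguous burst of TW elements */
                memcpy(&dst[out], &src[(ty + y) * W + tx],
                       TW * sizeof *src);
                out += TW;
            }
}
```

A hardware address generator collapses such a loop nest into a few stride/bound pairs that are programmed once per transfer, which is the approach described in the next section.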

Recent Results for XDMA
To address these three challenges, this project proposes XDMA, a distributed DMA architecture that enables flexible and efficient data movement within heterogeneous multi-accelerator SoCs. To tackle the in-memory layout problem, a data streaming engine integrates hardware-based address generators that replace software address-generation loops, reducing control overhead while sustaining high interconnect utilization. To support efficient SoC-level broadcasting, an application-layer broadcasting mechanism named Chainwrite relocates the multicast operation from the network routers to the DMA endpoints. Chainwrite preserves the point-to-point nature of data transfers while delivering identical data to an arbitrary number of destinations in a scalable, energy-efficient way.
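
The paragraph above combines two mechanisms, which the sketch below models at the descriptor level. Everything here is an assumption made for illustration: the struct and field names, the dimension count, and the chain length are invented, and the actual programming interface is the one defined by the open-sourced XDMA components and the publications listed below. The idea being illustrated is that a transfer is described once as (a) a small set of stride/bound pairs that the hardware address generator walks instead of a software loop, and (b) an ordered list of destination endpoints that Chainwrite serves with ordinary point-to-point writes.

```c
#include <stdint.h>

#define XDMA_MAX_DIMS 4   /* assumed depth of the hardware loop nest      */
#define XDMA_MAX_HOPS 8   /* assumed maximum length of a Chainwrite chain */

/* One level of the affine access pattern walked by the address generator:
 * the engine repeats the next-inner level `bound` times and advances the
 * address by `stride` bytes after each repetition. */
struct xdma_dim {
    uint32_t bound;
    int32_t  stride;
};

/* Illustrative descriptor: programmed once, after which the endpoints move
 * the data without per-element involvement from any core. */
struct xdma_desc {
    uint64_t        src_base;                    /* source base address        */
    struct xdma_dim src_pattern[XDMA_MAX_DIMS];  /* source access pattern      */
    uint64_t        dst_base[XDMA_MAX_HOPS];     /* Chainwrite hop addresses   */
    struct xdma_dim dst_pattern[XDMA_MAX_DIMS];  /* destination access pattern */
    uint8_t         n_dims;
    uint8_t         n_hops;                      /* 1 = plain point-to-point   */
};

/* Behavioral model of the address generator: returns the address of the
 * i-th element of a pattern, with dimension 0 as the innermost loop. */
static uint64_t xdma_addr(uint64_t base, const struct xdma_dim *dims,
                          uint8_t n_dims, uint64_t i)
{
    uint64_t addr = base;
    for (uint8_t d = 0; d < n_dims; d++) {
        addr += (int64_t)(i % dims[d].bound) * dims[d].stride;
        i /= dims[d].bound;
    }
    return addr;
}
```

Under this reading of Chainwrite, delivering the same buffer to N destinations reads the source once and issues one write per hop, each endpoint passing the data on to its successor, instead of the N reads and N writes of repeated unicast transfers; no multicast extension of the interconnect protocol is needed.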

The XDMA Frontend and the XDMA Backend are both open-sourced.

Get in touch
Yunhao Deng
PhD student
Fanchen Kong
PhD student
Marian Verhelst
Academic staff

Publications about this research topic

  • Fanchen Kong*, Yunhao Deng*, Xiaoling Yi, Ryan Antonio and Marian Verhelst, "XDMA: A Distributed, Extensible DMA Architecture for Layout-Flexible Data Movements in Heterogeneous Multi-Accelerator SoCs," ICCD 2025 
  • Yunhao Deng*, Fanchen Kong*, Xiaoling Yi, Ryan Antonio and Marian Verhelst, "XDMA2: A Distributed DMA for Efficient and Flexible Point-to-Multipoint Data Movement," DATE 2026
