Context: Deep learning has revolutionized applications in computer vision, natural language processing, and signal analysis. However, the growing size of networks and of their intermediate feature maps strains energy efficiency, latency, and memory footprint, particularly at the edge. Multi-core and heterogeneous hardware accelerators have emerged as a solution, but traditional layer-by-layer processing and layer pipelining fail to meet the demands of latency-critical edge applications due to high off-chip memory access overhead and underutilized compute resources.
Breakthrough with the Stream Framework: Our research introduces Stream, an open-source design space exploration (DSE) framework that pioneers fine-grained scheduling of layer-fused deep neural networks on heterogeneous dataflow accelerators. By integrating memory- and communication-aware latency modeling, fine-grained data dependency generation, and constraint-optimized workload allocation, Stream achieves up to a 2.2× energy-delay product (EDP) improvement over traditional layer-by-layer scheduling. Stream's innovations, validated on state-of-the-art hardware, demonstrate the potential of layer fusion to drastically reduce off-chip memory accesses, enhance on-chip data retention, and improve core utilization, setting new benchmarks for efficiency in edge AI applications.
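To make the intuition behind layer fusion concrete, the toy Python sketch below contrasts the peak number of intermediate tiles that must be buffered on-chip under classic layer-by-layer execution versus a depth-first, layer-fused execution order. It is a minimal illustration under simplifying assumptions, not Stream's actual API: every name (make_deps, peak_live_tiles) is hypothetical, each tile has a single consumer, and the wider overlapping dependency windows that Stream's fine-grained dependency generation handles are ignored.

```python
# Hypothetical toy sketch, not Stream's actual API: contrasts the on-chip
# buffering needs of layer-by-layer vs. layer-fused (depth-first) execution.

def make_deps(num_layers, tiles_per_layer):
    """Tile (l, i) consumes tile (l - 1, i). Real fine-grained dependency
    generation must handle wider, overlapping windows per operator."""
    return {(l, i): ([] if l == 0 else [(l - 1, i)])
            for l in range(num_layers) for i in range(tiles_per_layer)}

def peak_live_tiles(order, deps, num_layers):
    """Peak number of intermediate tiles alive under an execution order.
    Last-layer tiles are network outputs and are streamed out directly."""
    live, peak = set(), 0
    for node in order:
        if node[0] < num_layers - 1:   # only intermediates occupy buffers
            live.add(node)
        peak = max(peak, len(live))    # inputs are still held while computing
        for dep in deps[node]:
            live.discard(dep)          # a consumed tile can be freed
    return peak

L, T = 4, 8                            # 4 layers, 8 tiles per layer
deps = make_deps(L, T)
layer_by_layer = sorted(deps)                           # finish each layer first
layer_fused = sorted(deps, key=lambda n: (n[1], n[0]))  # advance tiles depth-first
print(peak_live_tiles(layer_by_layer, deps, L))  # 9: a full feature map + 1 tile
print(peak_live_tiles(layer_fused, deps, L))     # 2: one producer/consumer pair
```

In this toy setting, layer-by-layer execution must hold an entire intermediate feature map on-chip (or spill it off-chip), while the fused order keeps only a small producer/consumer window alive; this is the data-retention effect underlying the EDP gains described above.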