DNestMap: Mapping Deeply-Nested Loops on Ultra-Low Power CGRAs

Most real-world applications comprise deeply-nested loops with complex and often irregular control-flow structures that existing compilers cannot map to CGRAs. This leads to excessive data-transfer costs, as execution continuously alternates between the outer loop nests on the host processor and the innermost loop on the CGRA accelerator. Moreover, ultra-low power CGRAs can include only limited on-chip memory to cache the configuration bitstreams, and they must frequently swap configurations when there are multiple innermost loops. We introduce DNestMap, a partitioning and mapping tool for CGRAs that judiciously extracts the most beneficial code segments of multiple deeply-nested loops and caches them together statically in the configuration memory through spatio-temporal partitioning. DNestMap achieves a 1.58x performance improvement over dynamic caching of the configuration contexts of innermost loops on CGRAs with limited on-chip memory.
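To make the data-transfer argument concrete, here is a toy cost model. The function names and all numbers are hypothetical, and the model ignores the work done in the outer loop body itself. It only counts how often control and data cross the host-CGRA boundary when just the innermost loop is offloaded, compared with running the whole nest on the CGRA.

```python
# Toy cost model (hypothetical numbers) for a two-level loop nest.
# Outer-loop body work is ignored; only boundary crossings are counted.


def innermost_only_cost(outer_iters, inner_cycles, transfer_cycles):
    """Only the innermost loop runs on the CGRA, so control and data cross
    the host-CGRA boundary once per outer-loop iteration."""
    return outer_iters * (inner_cycles + transfer_cycles)


def whole_nest_cost(outer_iters, inner_cycles, transfer_cycles):
    """The whole nest runs on the CGRA, so the boundary is crossed once."""
    return outer_iters * inner_cycles + transfer_cycles


outer, inner, xfer = 1000, 200, 50
print(innermost_only_cost(outer, inner, xfer))  # 250000
print(whole_nest_cost(outer, inner, xfer))      # 200050
```

The gap grows linearly with the outer trip count, which is why offloading only the innermost loop becomes expensive for deep nests.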

Problem of Limited Configuration Memory

The configuration memory is the single most power-consuming element in ultra-low power CGRAs, which makes configuration memory space extremely valuable. However, if each nested loop is mapped in its best-performing version (i.e., each loop using all the PEs), the loops together may need more configuration memory than is available. This is further illustrated in the following animation.
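The space pressure can be sketched with a simple, assumed footprint model. A schedule that occupies `pes` PEs at initiation interval `ii` stores one configuration word per PE per cycle, so a full-array schedule also pays for slots that hold no-ops. The `Schedule` class, the 64-word capacity and the loop shapes are illustrative assumptions. Real CGRAs may also split configuration memory per PE instead of pooling it as this model does.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Schedule:
    """Hypothetical model of one mapped loop."""
    loop: str
    pes: int  # number of PEs the schedule occupies
    ii: int   # initiation interval in cycles

    @property
    def config_words(self) -> int:
        # Assumed footprint: one configuration word per occupied PE per cycle.
        return self.pes * self.ii


def fits(schedules, capacity_words):
    """Return (fits?, total words) for caching all schedules together."""
    total = sum(s.config_words for s in schedules)
    return total <= capacity_words, total


CAPACITY = 64  # hypothetical configuration memory size in words

# "Best" versions: each loop spreads over all 16 PEs of a 4x4 array.
best = [Schedule("L1", 16, 2), Schedule("L2", 16, 2), Schedule("L3", 16, 1)]
# Partitioned versions: fewer PEs and a higher II, but denser packing.
partitioned = [Schedule("L1", 8, 3), Schedule("L2", 6, 4), Schedule("L3", 4, 3)]

print(fits(best, CAPACITY))         # (False, 80)
print(fits(partitioned, CAPACITY))  # (True, 60)
```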

DNestMap Framework

DNestMap automates the selection of the loops that should be mapped to the CGRA, accounting for both data and control transfers. Moreover, to save space in the limited configuration memory, DNestMap creates multiple schedules that each use a subset of the PEs (i.e., spatio-temporal partitions) and pack instructions more densely, even though an individual schedule may not achieve the best performance on its own. It then produces a tight static packing inside the configuration memory that outperforms both dynamic caching methods (e.g., CacheInner2 in the figure) and static caching methods that employ all the PEs for each loop (e.g., StaticTemporal in the figure).
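The selection step can be approximated as follows. For every loop, choose either one candidate schedule or leave the loop on the host, so that the chosen schedules fit the configuration memory and total cycles are minimized. The sketch below solves this by exhaustive search over a hypothetical three-loop example. It is a simplified stand-in, not DNestMap's actual formulation. It reuses an assumed `pes * ii` footprint model and ignores explicit data-transfer costs and spatial placement. `Version`, `LOOPS`, `HOST` and `best_static_packing` are all illustrative names.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Version:
    """One candidate CGRA schedule for a loop (hypothetical model)."""
    pes: int  # PEs occupied by the schedule
    ii: int   # initiation interval in cycles

    @property
    def config_words(self):
        # Assumed footprint: one configuration word per PE per cycle of II.
        return self.pes * self.ii


HOST = None  # marker meaning "leave this loop on the host processor"

# Hypothetical loops: (trip count, cycles if run on the host, CGRA candidates).
LOOPS = {
    "L1": (1000, 9000, [Version(16, 2), Version(8, 3)]),
    "L2": (500, 6000, [Version(16, 2), Version(6, 4)]),
    "L3": (2000, 7000, [Version(16, 1), Version(4, 3)]),
}


def cycles(loop, choice):
    trips, host_cycles, _ = LOOPS[loop]
    # Steady-state estimate: one iteration issued every II cycles.
    return host_cycles if choice is HOST else trips * choice.ii


def best_static_packing(capacity):
    """Exhaustively pick one option per loop so the plan fits `capacity` words."""
    names = list(LOOPS)
    options = [LOOPS[n][2] + [HOST] for n in names]
    best = None
    for combo in product(*options):
        words = sum(c.config_words for c in combo if c is not HOST)
        if words > capacity:
            continue  # cannot be cached statically in configuration memory
        total = sum(cycles(n, c) for n, c in zip(names, combo))
        if best is None or total < best[0]:
            best = (total, dict(zip(names, combo)))
    return best


total, plan = best_static_packing(64)
print(total)  # 7000
print({n: "host" if v is HOST else (v.pes, v.ii) for n, v in plan.items()})
# {'L1': (8, 3), 'L2': (6, 4), 'L3': (16, 1)}
```

With 64 words, the all-PE versions of the three loops (80 words) do not fit. The best plan mixes two partitioned schedules with one full-array schedule. That plan beats the best plan restricted to full-array versions, which reaches 10000 cycles on this toy data. Exhaustive search is only practical for a handful of loops; a real tool would need a more scalable formulation.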

Results

Further reading:

[DAC] DNestMap: Mapping Deeply-Nested Loops on Ultra-Low Power CGRAs
Manupa Karunaratne, Cheng Tan, Aditi Kulkarni, Tulika Mitra, Li-Shiuan Peh
55th ACM/IEEE Design Automation Conference, June 2018