4D-CGRA: Introducing Branch Dimension to Spatio-Temporal Application Mapping on CGRAs

CGRAs excel in-terms performance and energy consumption due to their statically determined execution of instructions that remove overheads of synchronization and conflict resolution on hardware with the help of an advanced compilation process.

However, most embedded applications consist of control-divergence — existence of if-(elseif)-else structures –, in the application kernels. Therefore, leading to reservation of resources in time-slots for the both of execution paths (See the right figure)

We introduce 4D-CGRA — that encourages mutuallyexclusive dataflows to map to the same set of resources but allows execution of the appropriate dataflows at runtime based on the branch outcomes. We achieve this by introducing an architecture-enabled new branch dimension corresponding to the branching decisions.

Shards

4D-CGRA uses partitions of the basicblocks of a control-divergent loop kernels and map them to different PEs. This is required because CGRA is supposed to extract most parallelism in each possible execution path. Moreover, some basicblocks are mutually exclusive to some of the basicblocks (See the figure below — e.g., BB2 and BB3). The shards belonging to mutually exclusive basicblocks are allowed to share PEs because only one of them will be executed. Therefore, such placement is performed in the new dimension — branch dimension.

4D-CGRA Architecture

We realized the shard-based execution model through implementing a novel PE architecture. The shards are placed as a contiguous set of instructions in the control memory of the PE. The selection of the correct shard is performed using a “tag” that is coupled with the data flows in the CGRA.

Every data that is produced in the PEs, generates a tag that corresponds to the execution path that the instruction belongs to. Such tags are distributed to other PEs. When a tag arrives at a PE, the tag is used to select the correct shard for execution. Therefore, two or more, mutually exclusive sets of instructions could be mapped to the same PE resource at the same time-step. This exactly what the 4D-CGRA compiler does.

4D-CGRA Compiler

The 4D compiler performs the mapping of instructions and their inter-dependencies onto a 4-dimensional resource graph (2-spatial , 1-temporal and 1-branch) as opposed to conventional 3-dimensional resource graph (2-spatial and 1-temporal). The goal is to maximize the utilization of a physical PE resource through mapping as many mutually exclusive instructions in the branch plane.

The high-level idea of the algorithm is to start with the longest execution path in the application that would most-likely determine the performance (Worst-case execution time determines performance in static schedules). Thereafter, the rest of the execution paths could utilize the branch planes to fit in the cycles of the longest path without delaying the longest execution path due to branching.

Results

We observed that the 4D-CGRA could achieve upto 2.3X better performance-per-watt (depending on the amount of branches) — while exhibiting average improvement of 43% performance-per-watt for a range of embedded application kernels. In order for a fair comparison, we scaled down the 4D-CGRA (and other recent baseline that handles control divergence) to match the area of a 4×4 generic CGRA architecture.

Embedded Computing Lab

4D-CGRA: Introducing Branch Dimension to Spatio-Temporal Application Mapping on CGRAs

Shards

4D-CGRA Architecture

4D-CGRA Compiler

Results

Further Reading :