CASCADE: High Throughput Data Streaming via Decoupled Access-Execute CGRA

CASCADE is a novel decoupled access-execute CGRA (coarse-grained reconfigurable array) design with full architecture and compiler support for high-throughput data streaming from an on-chip multi-bank memory. CASCADE offloads the address computations for multi-bank data memory accesses to custom-designed programmable hardware. An end-to-end, fully automated compiler synchronizes the conflict-free movement of data between the memory banks and the CGRA. Experimental evaluations show, on average, a 3x performance benefit and a 2.2x performance-per-watt improvement for CASCADE compared to an iso-area conventional CGRA that uses a bigger processing array in lieu of dedicated hardware for memory address generation.

Memory Access Bottleneck in CGRAs

A CGRA exploits pipeline parallelism in loop kernels by executing multiple independent operations in parallel. In memory-bound applications, the attainable parallelism is limited by memory access, because the CGRA needs to access multiple data elements in a single cycle. Multi-bank on-chip memories are typically paired with CGRAs to tackle this problem. However, having multiple banks does not solve the problem on its own: the placement of data inside the banks also matters (Figure: memory conflicts.gif). Conflict-free data placement ensures that data elements accessed in parallel reside in different banks, so it increases the number of memory accesses that can be served per cycle compared to naive data placement. This comes at a cost, because address generation becomes more complex when the data is placed in a conflict-free manner. In the conventional approach, address generation happens inside the CGRA, so the CGRA resources required for address generation increase. We observe that this address generation overhead can nullify the benefits of conflict-free data placement.
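The effect of data placement can be illustrated with a small software model. This is only a sketch under stated assumptions: the bank count, the row width, and the diagonal skewing function `skewed_bank` are illustrative choices, not the placement scheme CASCADE itself uses. It counts how many cycles four parallel accesses to one column of a row-major 2D array would take if each bank serves one access per cycle.

```python
from collections import Counter

NUM_BANKS = 4   # assumed number of memory banks (illustrative)
ROW_WIDTH = 4   # assumed row width of the 2D array (illustrative)

def naive_bank(i, j):
    # Naive placement: linearize row-major, then interleave across banks.
    return (i * ROW_WIDTH + j) % NUM_BANKS

def skewed_bank(i, j):
    # Hypothetical conflict-free placement: skew each row by its row index.
    return (i + j) % NUM_BANKS

def cycles_needed(banks):
    # Each bank serves one access per cycle, so the busiest bank
    # determines how many cycles the parallel accesses take.
    return max(Counter(banks).values())

# Four parallel accesses to column 1: A[0][1], A[1][1], A[2][1], A[3][1].
column_access = [(i, 1) for i in range(4)]

naive_cycles = cycles_needed([naive_bank(i, j) for i, j in column_access])
skewed_cycles = cycles_needed([skewed_bank(i, j) for i, j in column_access])

print(naive_cycles, skewed_cycles)
```

In this model, the naive placement maps all four elements to the same bank, so the accesses serialize, while the skewed placement spreads them across all four banks. The price is the extra arithmetic in `skewed_bank`, which stands in for the address generation overhead discussed above.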

CASCADE Architecture and Compiler

CASCADE offloads address generation to a custom programmable address generation unit called the stream engine, so the CGRA resources are dedicated to computation. The stream engine generates the addresses of data placed in the on-chip multi-bank memory in a conflict-free manner (Figure: architecture). It supports complex instructions that can generate multi-bank conflict-free addresses in a single cycle. The CASCADE compiler determines the conflict-free data placement at compile time by analyzing the memory access patterns. It also generates the hardware configurations for both the stream engine and the CGRA, including the software-pipelined CGRA schedule and the conflict-free address generation parameters (Figure: compiler).
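The decoupling described above can be sketched as a toy software model. Everything here is an assumption for illustration: `StreamConfig`, `stream_engine`, `run_kernel`, the base/stride parameters, and the cyclic placement are hypothetical names and choices, not the actual CASCADE instruction set or compiler output. The access side produces one (bank, offset) pair per stream per iteration, and the execute side only consumes operands and computes.

```python
from dataclasses import dataclass

NUM_BANKS = 4  # assumed number of memory banks (illustrative)

@dataclass
class StreamConfig:
    # Hypothetical compile-time parameters for one data stream.
    base: int    # logical index of the first element
    stride: int  # index step per loop iteration

def bank_and_offset(index):
    # Assumed cyclic placement: element k is in bank k % NUM_BANKS
    # at offset k // NUM_BANKS within that bank.
    return index % NUM_BANKS, index // NUM_BANKS

def stream_engine(configs, iterations):
    # Access side: emit all stream addresses for one iteration at a time.
    for it in range(iterations):
        addrs = [bank_and_offset(c.base + it * c.stride) for c in configs]
        banks = [b for b, _ in addrs]
        # The compiler is assumed to have chosen a conflict-free placement.
        assert len(set(banks)) == len(banks), "bank conflict"
        yield addrs

def run_kernel(data, configs, iterations):
    # Distribute the data across banks using the same cyclic placement.
    banks = [[] for _ in range(NUM_BANKS)]
    for k, value in enumerate(data):
        banks[k % NUM_BANKS].append(value)
    out = []
    for addrs in stream_engine(configs, iterations):
        operands = [banks[b][o] for b, o in addrs]
        # Execute side: pure computation, no address arithmetic.
        out.append(sum(operands))
    return out

# A 3-tap sum reading x[i], x[i+1], x[i+2] in every iteration.
data = list(range(10))
taps = [StreamConfig(0, 1), StreamConfig(1, 1), StreamConfig(2, 1)]
result = run_kernel(data, taps, 8)
print(result)
```

In this sketch, the loop body in `run_kernel` never computes an address, which mirrors the idea that CGRA resources are left free for computation while a separate unit handles the address stream.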

Embedded Computing Lab
