HyCUBE: A CGRA with Reconfigurable Single-cycle Multi-hop Interconnect

Coarse-grained reconfigurable arrays (CGRAs) excel in terms of performance and energy consumption because instruction execution is determined statically: an advanced compilation process removes the hardware overheads of synchronization and conflict resolution.

However, the static arrangement of the interconnect among PEs can limit the compiler’s ability to create better schedules for the PEs of a CGRA. HyCUBE therefore introduces an interconnect that is reconfigurable on a cycle-by-cycle basis, which addresses this limitation and brings up to 2.3x performance improvement.

Conventional CGRAs usually arrange their processing elements (PEs) so that each PE is connected to its neighbors. Once execution begins, each PE can read data from neighboring PEs (or from itself) and produce data to be read by neighboring PEs (or by itself). Consider the following illustrative example:

HyCUBE can connect a PE to different PEs in different cycles, subject to the constraint that each link carries only a single data value per cycle. This improves the schedule by allowing as many dependent instructions as possible to be placed in the very next cycle, as illustrated below:
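The scheduling benefit can be sketched with a small timing model. This is only an illustration under stated assumptions, not the HyCUBE compiler's cost model: PEs are assumed to sit on a 4x4 mesh numbered in row-major order, a value is assumed to travel at most `max_hops_per_cycle` links per cycle, and the function names and the value 4 used in the usage note are hypothetical. Setting `max_hops_per_cycle=1` models a neighbor-to-neighbor CGRA in which every extra hop costs one cycle.

```python
def hops(src, dst, cols=4):
    """Manhattan distance between two PEs on a row-major mesh (assumed layout)."""
    sr, sc = divmod(src, cols)
    dr, dc = divmod(dst, cols)
    return abs(sr - dr) + abs(sc - dc)


def consumer_cycle(producer_cycle, distance, max_hops_per_cycle=1):
    """Earliest cycle in which a dependent instruction can execute.

    Assumption: the value needs ceil(distance / max_hops_per_cycle) cycles
    to arrive, and at least one cycle even when it stays on the same PE.
    """
    cycles = max(1, -(-distance // max_hops_per_cycle))  # ceiling division
    return producer_cycle + cycles


if __name__ == "__main__":
    producer = 12
    for consumer in (13, 9, 14, 3):
        d = hops(producer, consumer)
        n2n = consumer_cycle(0, d, max_hops_per_cycle=1)
        multi = consumer_cycle(0, d, max_hops_per_cycle=4)
        print(f"PE{producer} -> PE{consumer}: {d} hops, N2N cycle {n2n}, multi-hop cycle {multi}")
```

Under these assumptions, a consumer two hops away from its producer executes one cycle later in the neighbor-to-neighbor model than in the multi-hop model, which is the kind of slack the compiler can exploit when placing dependent instructions.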

 

HyCUBE’s ISA: enabling cycle-by-cycle reconfigurable interconnectivity

Each instruction of a HyCUBE PE carries configuration bits that configure the PE’s crossbar switch. The PEs thus operate in concert to establish multi-hop connectivity within a single cycle. In the following example, PE12, PE13, PE9 and PE14 all cooperate to establish the link PE12 -> (PE9, PE14).
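The cooperation among PEs can be sketched as a small simulation of one cycle. This is a simplified model, not HyCUBE's actual instruction encoding: the 4x4 row-major numbering, the port names (`N`, `E`, `S`, `W`, `ALU`, `OPERAND`) and the dictionary layout of the crossbar settings are all assumptions made for illustration. Each PE's setting maps an output port to the input it selects, and the function follows those selections from the producing PE to find who receives the value.

```python
COLS = 4
ROWS = 4
STEP = {"N": (-1, 0), "E": (0, 1), "S": (1, 0), "W": (0, -1)}
OPPOSITE = {"N": "S", "E": "W", "S": "N", "W": "E"}


def neighbor(pe, direction):
    """Return the PE reached by leaving `pe` in `direction`, or None at the edge."""
    r, c = divmod(pe, COLS)
    dr, dc = STEP[direction]
    nr, nc = r + dr, c + dc
    if 0 <= nr < ROWS and 0 <= nc < COLS:
        return nr * COLS + nc
    return None


def route(source, xbar):
    """Follow the per-PE crossbar settings of a single cycle.

    xbar[pe] maps an output ("N", "E", "S", "W" or "OPERAND") to the input
    it selects ("N", "E", "S", "W" or "ALU"). Returns the PEs that consume
    the source's value and the set of directed links the value occupies.
    """
    consumers, links = set(), set()
    frontier = [(source, "ALU")]  # (PE, input on which the value is present)
    while frontier:
        pe, present = frontier.pop()
        for out, selected in xbar.get(pe, {}).items():
            if selected != present:
                continue
            if out == "OPERAND":
                consumers.add(pe)
                continue
            nxt = neighbor(pe, out)
            if nxt is None or (pe, nxt) in links:
                continue  # off the array, or link already visited
            links.add((pe, nxt))
            frontier.append((nxt, OPPOSITE[out]))
    return sorted(consumers), links


# Hypothetical settings for the PE12 -> (PE9, PE14) example.
xbar = {
    12: {"E": "ALU"},          # drive own result east
    13: {"N": "W", "E": "W"},  # bypass the west input to north and east
    9: {"OPERAND": "S"},       # consume the value arriving from the south
    14: {"OPERAND": "W"},      # consume the value arriving from the west
}

if __name__ == "__main__":
    print(route(12, xbar))
```

In this model the link from PE12 to PE13 is shared by both destinations because it carries the same data value, which is consistent with the one-value-per-link constraint.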

HyCUBE Compiler

We used LLVM to implement the HyCUBE compiler, which takes annotated loops in C code directly and maps them onto a time-extended resource graph that represents the HyCUBE architecture.
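To give a rough idea of what a time-extended resource graph looks like, the sketch below builds one for a small mesh. It is an illustration only and not the HyCUBE compiler's data structure: the node encoding `(pe, t)`, the wrap-around at an initiation interval `ii` (as in modulo scheduling), and the `max_hops` reachability rule are assumptions, and link-sharing conflicts are ignored.

```python
def time_extended_graph(rows, cols, ii, max_hops=1):
    """Build a time-extended resource graph as an adjacency dictionary.

    Each node (pe, t) is one PE at one time step. An edge goes from (p, t)
    to (q, (t + 1) % ii) when q can read p's result in the next cycle,
    i.e. when q lies within `max_hops` mesh hops of p (p itself included).
    """
    graph = {}
    n = rows * cols
    for t in range(ii):
        for p in range(n):
            pr, pc = divmod(p, cols)
            successors = []
            for q in range(n):
                qr, qc = divmod(q, cols)
                if abs(pr - qr) + abs(pc - qc) <= max_hops:
                    successors.append((q, (t + 1) % ii))
            graph[(p, t)] = successors
    return graph


if __name__ == "__main__":
    n2n = time_extended_graph(4, 4, ii=2, max_hops=1)
    multi = time_extended_graph(4, 4, ii=2, max_hops=4)
    print(len(n2n[(0, 0)]), len(multi[(0, 0)]))
```

A mapper would place each operation of the loop body on a node and each data dependence on a path through this graph. Raising `max_hops` gives every node more successors, so more placements become legal for a dependent operation in the next time step.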

 

Results

We compared the performance-per-watt of HyCUBE, normalized to that of a neighbor-to-neighbor (N2N) connected CGRA that uses explicit instructions for data movement, as well as that of a CGRA with programmable routers (STDNOC) that allows concurrent communication and computation.

After the tape-out, we measured the performance of a few benchmarks running on standard platforms: ARM Cortex-M series and Xilinx Zynq FPGAs.

 

Further Reading

[DAC] HyCUBE: A CGRA with Reconfigurable Single-cycle Multi-hop Interconnect
Manupa Karunaratne, Aditi Kulkarni, Tulika Mitra, Li-Shiuan Peh
54th ACM/IEEE Design Automation Conference, June 2017

[A-SSCC] HyCUBE: a 0.9V 26.4 MOPS/mW, 290 pJ/op, Power Efficient Accelerator for IoT Applications
Bo Wang, Manupa Karunarathne, Aditi Kulkarni, Tulika Mitra, Li-Shiuan Peh
IEEE Asian Solid-State Circuits Conference, November 2019