Domain-specific architectures (DSAs), or hardware accelerators, are representative innovations that are leading computer architecture into a new golden age. In a heterogeneous system, these tailored processors (accelerators) are managed by the general-purpose CPUs and can work in parallel with them over a high-speed input/output (I/O) bus or a system-on-chip (SoC) bus. However, the high communication overhead makes such a loosely coupled architecture unsuitable for small-scale or low-latency tasks. Integrating accelerators into the CPU pipeline as functional units can significantly reduce the interaction latency, but because of the performance side effects on the CPU microarchitecture and the increased design and verification complexity of the processor, such a tightly coupled architecture is suitable only for very simple tasks. Moreover, the speedup (or utilization) of the accelerator remains limited, because specialized hardware accelerators and general-purpose CPUs follow different design principles.