Zhao Jiang, Alibaba DAMO Academy
In recent years, the rapid advancement of AI technologies, particularly deep learning, has significantly increased the demand for computing power, especially for matrix operations in various AI applications. Traditional AI acceleration techniques based on vector computing are now struggling to keep pace with these growing requirements. As a result, CPU manufacturers are gradually shifting their focus from vector computing to matrix computing. Unlike vector computing, matrix computing allows for greater data reuse, which reduces bandwidth demands and enhances energy efficiency by minimizing data transfers.
The XuanTie team has developed a custom Matrix-Multiply Extension (MME) based on the RISC-V architecture, and has also initiated the Attached Matrix Extension (AME) Task Group, bringing together community members to collaboratively define a standard matrix extension instruction set. In this blog, we will introduce the XuanTie Matrix Extension ISA and the XuanTie C907 processor, the first XuanTie IP to implement this matrix extension.
Matrix Extension Instruction Set
The design of the XuanTie MME borrows architectural concepts and implementation methods from the RISC-V Vector Extension (RVV). This makes computational power flexibly scalable within a single programming model. By selecting different hardware implementation parameters, the peak computational power for single-instruction matrix multiplication can be scaled across a range from 0.5 TOPS to 32 TOPS.
The MME architecture adopts a decoupled design from the vector extension, where the source and destination operands for matrix multiplication instructions are stored in independent matrix registers rather than in vector registers. This approach offers greater flexibility compared to architectures that integrate matrix operations with vector operations. MME’s independent programming model allows for customizable ratios of matrix to vector computational power, tailored to different application requirements, and enables parallel execution of matrix and vector operations. Additionally, the full separation of matrix and vector structures simplifies both hardware implementation and software development.
The XuanTie MME adds eight two-dimensional matrix registers. The recommended configuration uses four of these registers as accumulation registers to store intermediate results of matrix C, while the A and B operands each use two matrix registers. This achieves a data reuse rate of 2, which helps reduce memory bandwidth requirements.
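One way to read the reuse figure above is as a 2×2 register blocking: every loaded A tile is paired with every loaded B tile, and each pair gets its own accumulator. The sketch below only counts tile loads and tile multiplies under that assumption. The register names (a0, b0, and so on) are hypothetical labels for illustration, not names taken from the MME specification.

```python
# Hypothetical register roles following the recommended split:
# two registers hold tiles of A, two hold tiles of B, four accumulate tiles of C.
A_REGS = ["a0", "a1"]
B_REGS = ["b0", "b1"]


def blocked_schedule(a_regs, b_regs):
    """Pair every loaded A tile with every loaded B tile (one accumulator per pair)."""
    return [
        (a, b, f"c[{i}][{j}]")
        for i, a in enumerate(a_regs)
        for j, b in enumerate(b_regs)
    ]


schedule = blocked_schedule(A_REGS, B_REGS)
tile_loads = len(A_REGS) + len(B_REGS)  # operand tiles fetched from memory
tile_macs = len(schedule)               # tile multiply-accumulates issued

# Each tile multiply consumes two operand tiles, so the average number of
# uses per loaded tile is (2 * tile_macs) / tile_loads.
reuse = 2 * tile_macs / tile_loads

for a, b, c in schedule:
    print(f"{c} += {a} x {b}")
print("uses per loaded tile:", reuse)
```

With two A tiles and two B tiles, four tile multiplies are fed by four tile loads, so each loaded tile is used twice before it has to be replaced.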
Each matrix register in the XuanTie MME contains MLEN bits of state, organized as RLEN/32 rows of RLEN bits each. RLEN is a hardware implementation parameter and can take values of 128, 256, 512 bits, or higher. For instance, when RLEN is set to 512 bits and the operand type is FP32, a single register can store a 16×16 matrix. During hardware implementation, the appropriate RLEN can be selected based on computational power and bandwidth requirements; combined with instruction execution throughput, this determines how far computational capability scales. The MME also provides matrix configuration instructions, which let software define the size of the matrices involved in a computation and support tail processing.
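The register geometry described above reduces to a few integer divisions. The following sketch computes the tile shape for a given RLEN and element width, and shows how a dimension that is not a multiple of the tile size splits into full tiles plus a tail. It assumes elements are packed contiguously within a row; the function names are hypothetical helpers, not part of any XuanTie toolchain.

```python
def register_shape(rlen_bits, elem_bits):
    """Return (rows, cols) of one matrix register for a given element width."""
    rows = rlen_bits // 32         # the text gives RLEN/32 rows per register
    cols = rlen_bits // elem_bits  # each row holds RLEN bits of packed elements
    return rows, cols


def register_bits(rlen_bits):
    """Total state per register (MLEN) = number of rows * RLEN."""
    return (rlen_bits // 32) * rlen_bits


def tile_extents(total, tile):
    """Split one dimension into full tiles plus a shorter tail tile, if any."""
    full, tail = divmod(total, tile)
    return [tile] * full + ([tail] if tail else [])


print(register_shape(512, 32))  # FP32 at RLEN=512 -> (16, 16)
print(register_bits(512))       # -> 8192
print(tile_extents(40, 16))     # -> [16, 16, 8]
```

The last call illustrates tail processing: a dimension of 40 elements is covered by two full 16-element tiles and one 8-element tail, which is the size software would configure for the final iteration.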
The matrix multiply-accumulate operation is the primary focus of acceleration for the MME. The operational method of the MME extended matrix multiply-accumulate instruction is illustrated in the figure below: matrix A is multiplied by matrix B, and the result is then accumulated with matrix C, with the final result stored back into matrix C.
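As a reference for the arithmetic only, the multiply-accumulate described above can be written as a plain triple loop. This is a minimal sketch of the mathematical result C = C + A × B; it does not model how the hardware lays out operands in matrix registers or how it schedules the computation.

```python
def matmul_accumulate(a, b, c):
    """Update c in place with c += a @ b, where matrices are lists of rows."""
    m, k, n = len(a), len(b), len(b[0])
    assert all(len(row) == k for row in a), "inner dimensions must match"
    assert len(c) == m and all(len(row) == n for row in c), "c must be m x n"
    for i in range(m):
        for j in range(n):
            acc = c[i][j]  # start from the existing accumulator value
            for p in range(k):
                acc += a[i][p] * b[p][j]
            c[i][j] = acc
    return c


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
c = [[1, 0], [0, 1]]
matmul_accumulate(a, b, c)
print(c)  # -> [[20, 22], [43, 51]]
```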
MME supports matrix multiplication and accumulation operations with various data types, including float32, float16, bfloat16, int8, and int4. For floating-point data types, it offers 2x expanded precision for multiply-accumulate operations, such as accumulating bfloat16/float16 into float32. For integer operands, it provides 4x expanded precision support, such as accumulating int8 into int32. The list of supported operations is provided below. Additionally, the XuanTie MME also extends support for mixed-precision matrix multiplication and accumulation instructions to meet the needs of large model applications.
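To see why integer accumulation is widened, consider a single dot product of int8 values. The sketch below accumulates int8 products and checks the result against the int32 range. It illustrates the range argument only and makes no claim about how MME handles overflow (wrapping or saturation), which is not covered here.

```python
INT8_MIN, INT8_MAX = -128, 127
INT32_MIN, INT32_MAX = -2**31, 2**31 - 1


def dot_int8_to_int32(xs, ys, acc=0):
    """Accumulate int8 x int8 products, checking the sum fits in 32 bits."""
    for x, y in zip(xs, ys):
        assert INT8_MIN <= x <= INT8_MAX and INT8_MIN <= y <= INT8_MAX
        acc += x * y
    assert INT32_MIN <= acc <= INT32_MAX, "would overflow a 32-bit accumulator"
    return acc


k = 64  # example inner dimension
result = dot_int8_to_int32([127] * k, [127] * k)
print(result)             # 64 * 127 * 127 = 1032256
print(result > INT8_MAX)  # True: far outside the int8 range
```

A single product of two int8 values can already reach 16129, so even one term exceeds the int8 range, and summing 64 such terms needs about 20 bits. A 32-bit accumulator leaves ample headroom for long inner dimensions.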
MME provides matrix memory access instructions, supporting data load or store operations for matrix registers. These instructions load multiple rows from consecutive byte/half-word/word/double-word lines into matrix registers or store register data back into memory. MME also offers streaming load/store instructions to support stream data access, allowing the hardware to optimize this type of memory access for improved performance. In addition, instructions for loading/storing the entire matrix register are provided, which accelerate context switching between different computational tasks. The assembly format of the instruction is illustrated in the figure below, where rs1 serves as the base address and rs2 as the row stride.
# matrix load/store
mld<b/h/s/d> md, rs2, (rs1)
mst<b/h/s/d> ms3, rs2, (rs1)

# stream matrix load/store
msld<b/h/s/d> md, rs2, (rs1)
msst<b/h/s/d> ms3, rs2, (rs1)

# whole matrix load/store
mld<1/2/4/8>m<b/h/s/d> md, (rs1)
mst<1/2/4/8>m<b/h/s/d> ms3, (rs1)
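The base-plus-stride addressing used by the matrix load/store forms can be modeled in a few lines. In this sketch, base plays the role of rs1 and stride the role of rs2, both in bytes. The functions load_rows and store_rows are hypothetical helpers that mimic the addressing pattern on a flat byte array; they are not an emulation of the actual instructions.

```python
def load_rows(memory, base, stride, rows, row_bytes):
    """Gather `rows` rows of `row_bytes` bytes, each `stride` bytes apart."""
    return [
        bytes(memory[base + r * stride: base + r * stride + row_bytes])
        for r in range(rows)
    ]


def store_rows(memory, base, stride, tile):
    """Scatter the rows of `tile` back to memory with the same addressing."""
    for r, row in enumerate(tile):
        start = base + r * stride
        memory[start:start + len(row)] = row


# An 8x8 byte matrix stored row-major, so consecutive rows are 8 bytes apart.
memory = bytearray(range(64))

# Load a 4x4 sub-tile whose top-left element sits at byte offset 2.
tile = load_rows(memory, base=2, stride=8, rows=4, row_bytes=4)
print([list(row) for row in tile])
# -> [[2, 3, 4, 5], [10, 11, 12, 13], [18, 19, 20, 21], [26, 27, 28, 29]]

# Store the same tile at a different location, here with a 4-byte stride.
scratch = bytearray(16)
store_rows(scratch, base=0, stride=4, tile=tile)
```

Because the stride is independent of the row width, the same addressing pattern can pull a sub-tile out of a larger matrix without first copying it into a contiguous buffer.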
In addition to matrix multiplication-accumulation and memory access instructions, the XuanTie MME extension also provides a range of operations such as matrix addition, matrix subtraction, and matrix shifting. These operations enhance the versatility of matrix manipulation, addressing the needs of common AI operations like fusion, quantization, and re-quantization.
XuanTie C907 with Matrix Extension
The XuanTie C907 is a low-cost, high-efficiency multi-core RISC-V processor. It adopts a 9-stage, partially dual-issue, in-order pipeline and is primarily used in fields such as vision terminals, human-computer interaction, and wireless communication. The C907 is the first XuanTie processor equipped with the XuanTie Matrix extension, offering floating-point and integer matrix computation capabilities tailored for AI applications.
Based on the MME extension, XuanTie offers comprehensive full-stack software and hardware support. HHB (Heterogeneous Honey Badger tools collection) is a neural network model deployment toolkit provided by XuanTie. It includes tools for compilation optimization, performance analysis, process debugging, and result simulation, covering all necessary deployment stages. It supports the XuanTie CPU series processors, including the XuanTie C907 with its matrix extension engine.
As shown in the figure below, neural networks implemented with MME can achieve a 4-7x speedup compared to standard vector-based implementations.
Summary
Focusing on AI applications, the XuanTie MME offers several key features:
1 – Scalability: MME supports Register Length (RLEN) ranging from 128 bits to 1024 bits or more, allowing peak computational performance to scale from 0.5 TOPS to 32 TOPS.
2 – Binary code portability: MME ensures that changes to Matrix Length (MLEN) do not require rewriting or recompiling the binary code for execution.
3 – Decoupling: XuanTie MME separates the Matrix extension from the Vector extension, offering greater flexibility for custom chip designs.
With the MME extension, the XuanTie C907 processor integrates a matrix computation engine for the first time, delivering enhanced computational power for AI applications and offering full-stack software and hardware support. The XuanTie team has also open-sourced the MME design on GitHub, where more detailed information about the architecture is available. We invite the community to explore the design with us on GitHub, join the discussion, and share your feedback.
GitHub: https://github.com/XUANTIE-RV/riscv-matrix-extension-spec