
Introduction
Following the gap analysis done in the second half of 2023, the Vector Special Interest Group (SIG-Vector) has been working on specifying instructions to accelerate matrix operations. Two Task Groups were proposed to explore approaches that may be applicable to different markets.
The Attached Matrix Extension Task Group (AME-TG) has deep learning and other Artificial Intelligence-related workloads as its primary focus. The group should specify a set of instructions that is independent of other RISC-V extensions. In particular, this approach should allow embedded and other low-cost implementations to simplify the design by skipping the Vector extension while still offering AI acceleration instructions. However, this matrix extension will expand the architectural state with the introduction of Matrix Registers, which may be a problem in applications with frequent context switching. This approach is similar to how other architectures added matrix operations, like Intel’s AMX (Advanced Matrix Extensions) and Arm’s SME (Scalable Matrix Extension).
The Integrated Matrix Extension Task Group (IME-TG) primarily focuses on the HPC market and proposes developing an instruction set that reuses the Vector Registers introduced by the Vector extension to perform matrix operations. This approach reduces the impact of the extension on context switching and resembles how the POWER architecture added matrix operations with the MMA (Matrix-Multiply Assist) facility. It may also help applications that interleave matrix and vector operations by avoiding data movement between different types of registers.
In this work, we developed a QEMU TCG plugin called Vector-Matrix Profiler (VMP) to investigate the potential impact of data movement between Vector and Matrix Registers. Using POWER10’s MMA as a proxy for what a RISC-V IME implementation may look like, we used this plugin to instrument the execution of eight Convolutional Neural Networks (CNNs) optimized to use the POWER10 matrix operations and to profile the interaction between scalar, matrix, and vector instructions.
The results show that this type of workload has very little interaction between vector and matrix registers, indicating that the two convolution algorithms tested do not currently exploit this characteristic of the IME approach. The collected data also gives some insight into the type of vector operation that interacts with matrix data, which would be helpful in an AME implementation to avoid sending data back to memory.
Implementation
POWER (Performance Optimization With Enhanced RISC) is a RISC-like architecture with 32 General Purpose Registers (GPRs), 32 Floating-Point Registers (FPRs), and 64 Vector-Scalar Registers (VSRs). The VSRs have a fixed width of 128 bits, and the first 32 of them share their upper 64 bits with the FPRs, as shown in Figure 1.
Figure 1: FPR and VSR mapping
Power ISA version 3.1 introduced the Matrix-Multiply Assist (MMA) Facility, adding an IME-like set of instructions to perform rank-k outer products (with k up to 8, depending on the underlying element type). The instructions take the 128-bit VSRs as input, and the output is placed in one of the eight 512-bit accumulator registers. Each accumulator is associated with a group of 4 VSRs, as shown in Figure 2. A more detailed description of POWER10 MMA can be found in this paper and the Power ISA v3.1.
Figure 2: VSR and ACC mapping
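To make this data flow concrete, the snippet below is an illustrative sketch (not code from the profiled workloads or the compiler) using the GCC/Clang MMA built-ins, compiled with -mcpu=power10. It accumulates rank-1 FP32 outer products read from pairs of VSRs into a 512-bit accumulator and then disassembles the accumulator back into VSRs so the result can be stored:

```c
#include <altivec.h>

typedef vector unsigned char vec_t;

/* Computes a 4x4 FP32 tile of A*B: each xvf32gerpp reads two 128-bit VSRs
 * (4 floats each) and accumulates their outer product into one accumulator. */
void sgemm_4x4_tile(const float *a, const float *b, float c[4][4], int k)
{
    __vector_quad acc;
    __builtin_mma_xxsetaccz(&acc);                 /* zero the accumulator */

    for (int i = 0; i < k; i++) {
        vec_t va = (vec_t)vec_xl(0, &a[4 * i]);    /* 4-element column of A */
        vec_t vb = (vec_t)vec_xl(0, &b[4 * i]);    /* 4-element row of B    */
        __builtin_mma_xvf32gerpp(&acc, va, vb);    /* acc += outer(va, vb)  */
    }

    /* Move the accumulator contents back into four VSRs and store them. */
    __builtin_mma_disassemble_acc(c, &acc);
}
```

Note that the accumulator contents must be explicitly moved back into VSRs before any vector, scalar, or store instruction can use them, which is exactly the kind of register interaction VMP is designed to count.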
Eight CNN models were compiled with an ONNX-based toolchain, built on an MLIR/LLVM compiler, that outputs POWER10 MMA code. The compiler uses two convolution algorithms to generate POWER10 MMA binaries: (a) SConv direct convolution and (b) a baseline Im2Col+BLAS implementation, which we call BLAS for short. More details about the toolchain can be found in the SConv paper.
QEMU is a generic machine emulator and virtualizer. When working as a virtualizer, QEMU leverages hypervisor-based accelerators like KVM and Xen to run the guest code directly on the host CPU. When the guest CPU is emulated, the Tiny Code Generator (TCG) accelerator dynamically translates guest instructions into code that the host can execute. QEMU TCG also supports plugins that provide an API to subscribe to events during the translation and execution of guest instructions, allowing runtime instrumentation of the guest code.
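As a rough illustration of this plugin API (a minimal sketch, not the actual VMP source), a plugin exports a version symbol and an install function that registers callbacks; the ones VMP needs are described in the following paragraphs:

```c
#include <string.h>
#include <inttypes.h>
#include <glib.h>
#include <qemu-plugin.h>

QEMU_PLUGIN_EXPORT int qemu_plugin_version = QEMU_PLUGIN_VERSION;

static void vcpu_tb_trans(qemu_plugin_id_t id, struct qemu_plugin_tb *tb);
static void plugin_exit(qemu_plugin_id_t id, void *userdata);

QEMU_PLUGIN_EXPORT int qemu_plugin_install(qemu_plugin_id_t id,
                                           const qemu_info_t *info,
                                           int argc, char **argv)
{
    /* Called for every Translation Block before it is executed. */
    qemu_plugin_register_vcpu_tb_trans_cb(id, vcpu_tb_trans);
    /* Called when the guest program exits, to dump the counters. */
    qemu_plugin_register_atexit_cb(id, plugin_exit, NULL);
    return 0;
}
```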
Figure 3 shows the overall flow of the VMP tool. VMP takes the .so file generated by the compilation toolchain as input and loads a Python script to run inference on each model. QEMU’s user-mode emulation was used to execute the Python interpreter, so the analysis excludes model compilation, context switches, and any privileged code, but includes the ELF loader (ld*.so), the Python interpreter, and other libraries/modules that may be loaded at runtime (e.g., libc, libz, numpy, etc.).
Figure 3: The Vector-Matrix Profiler (VMP) toolchain flow
As shown in Figure 4, the TCG plugin registers a callback to inspect Translation Blocks (TBs). Each translated instruction is analyzed to identify its class and the registers it accesses. A callback is then registered for the execution of that instruction, with the “user data” pointer storing its class and input/output registers.
Figure 4: Overview of the Vector-Matrix Profiler (VMP) instruction instrumentation
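A sketch of that translation-time callback is shown below; classify_insn and parse_vsr_operands are hypothetical helpers standing in for the mnemonic/operand decoding that the real plugin performs:

```c
typedef struct {
    int insn_class;              /* one of the classes listed further below */
    int n_in, n_out;
    int in_vsr[4], out_vsr[4];   /* VSR numbers read and written            */
} InsnInfo;

static void vcpu_insn_exec(unsigned int vcpu_index, void *userdata);
static int classify_insn(const char *disas);                        /* hypothetical */
static void parse_vsr_operands(const char *disas, InsnInfo *info);  /* hypothetical */

static void vcpu_tb_trans(qemu_plugin_id_t id, struct qemu_plugin_tb *tb)
{
    size_t n = qemu_plugin_tb_n_insns(tb);

    for (size_t i = 0; i < n; i++) {
        struct qemu_plugin_insn *insn = qemu_plugin_tb_get_insn(tb, i);
        char *disas = qemu_plugin_insn_disas(insn);

        InsnInfo *info = g_new0(InsnInfo, 1);
        info->insn_class = classify_insn(disas);
        parse_vsr_operands(disas, info);
        g_free(disas);

        /* The InsnInfo pointer is handed back as "user data" every time
         * this instruction executes. */
        qemu_plugin_register_vcpu_insn_exec_cb(insn, vcpu_insn_exec,
                                               QEMU_PLUGIN_CB_NO_REGS, info);
    }
}
```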
A 64-element array stores the class of the instruction that generated the current value of each VSR. Two arrays with one element per class track the number of instructions and the number of VSR writes of each class. A two-dimensional array counts VSR reads, with one dimension for the Source Class (i.e., the class of the instruction that generated the value) and the other for the Destination Class (i.e., the class of the instruction reading the value).
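Under these assumptions, the plugin state could look like the sketch below (class names and array shapes follow the description above; the real plugin may organize them differently):

```c
enum {
    CLASS_UNKNOWN,      /* register not written yet (or unidentified source) */
    CLASS_GPR, CLASS_FLOAT, CLASS_IMMEDIATE, CLASS_LOADSTORE,
    CLASS_SCALAR, CLASS_CONVERSION, CLASS_MATRIX, CLASS_VECTOR,
    N_CLASSES
};

static int      vsr_source[64];                    /* class that produced each VSR value */
static uint64_t insn_count[N_CLASSES];             /* executed instructions per class    */
static uint64_t write_count[N_CLASSES];            /* VSR writes per class               */
static uint64_t read_count[N_CLASSES][N_CLASSES];  /* [Source][Destination] VSR reads    */
```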
Figure 5 shows the plugin behavior when an instruction is executed. The class, input, and output registers are recovered from the “user data” pointer, and the counter for the pair of Source and Destination classes is incremented for each input register. Then, the output registers are used to update the VSR array.
Figure 5: How VMP classifies instructions
The plugin keeps track of every access to VSRs. A write to a VSR updates the plugin state to indicate what class of instruction generated the value of each register. When another instruction reads the value of this register, the plugin increments a counter for the pair of Source and Destination classes.
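A sketch of the execution-time callback, combining the two updates just described:

```c
static void vcpu_insn_exec(unsigned int vcpu_index, void *userdata)
{
    InsnInfo *info = userdata;

    insn_count[info->insn_class]++;

    /* Attribute each VSR read to the (Source, Destination) class pair. */
    for (int i = 0; i < info->n_in; i++) {
        int src = vsr_source[info->in_vsr[i]];
        read_count[src][info->insn_class]++;
    }

    /* Record this instruction's class as the new source of every VSR it writes. */
    for (int i = 0; i < info->n_out; i++) {
        vsr_source[info->out_vsr[i]] = info->insn_class;
        write_count[info->insn_class]++;
    }
}
```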
VMP classifies the profiled instructions into the following classes (a simplified classification sketch, based on instruction mnemonics, follows the list):
- General Purpose: Instructions that move data between GPRs and VSRs. E.g., “Vector Insert Word from GPR using GPR-specified Right-Index” (vinswrx);
- Float: Instructions of the Floating-Point Facility, including load and store operations that target FPRs. E.g., “Load Floating-Point Double” (lfd), “Floating Multiply-Add” (fmadd), etc.;
- Immediate: Instructions that load immediate values (e.g., “VSX Vector Splat Immediate Word”, xxspltiw) or instructions that load a constant value according to an immediate field (e.g., “Load VSX Vector Special Value Quadword”, lxvkq);
- Load/Store: Load and store instructions that read from or write to VSRs. E.g., “Store Vector Indexed” (stvx), “Load VSX Vector Paired” (lxvp), etc.;
- Scalar: Instructions that are part of the Vector-Scalar Extension (VSX) Facility and only operate on the first element of the VSRs. E.g., “VSX Scalar Multiply-Add Type-A Double-Precision” (xsmaddadp), “VSX Scalar Divide Single-Precision” (xsdivsp), etc.;
- Conversion: Instructions that convert the type of the elements in a VSR. E.g., “VSX Vector Convert bfloat16 to Single-Precision format Non-signaling” (xvcvbf16spn), “VSX Scalar Convert with round to zero Double-Precision to Signed Word format” (xscvdpsxws);
- Matrix: Instructions of the Matrix Multiply Assist (MMA) Facility. E.g., “VSX Vector 16-bit Floating-Point GER (rank-2 update)” (xvf16ger2), “VSX Vector 64-bit Floating-Point GER (rank-1 update) Positive multiply, Positive accumulate” (xvf64gerpp);
- Vector: Instructions of the Vector Facility or the VSX Facility, except those classified under the Immediate, Load/Store, Conversion, Scalar, and Matrix classes. E.g., “Vector Multiply-Add Floating-Point” (vmaddfp), “Vector Gather every Nth Bit” (vgnb), “VSX Vector Multiply-Add Type-A Double-Precision” (xvmaddadp), etc.
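The classification sketch referenced above could be as simple as matching mnemonic prefixes in the disassembled text, as below; the real plugin can decode opcode fields directly and covers far more mnemonics than this fragment:

```c
static int classify_insn(const char *disas)
{
    /* Much-simplified: match a few mnemonic prefixes of each class. */
    if (strstr(disas, "ger") || g_str_has_prefix(disas, "xxmtacc") ||
        g_str_has_prefix(disas, "xxmfacc") || g_str_has_prefix(disas, "xxsetaccz")) {
        return CLASS_MATRIX;                       /* xvf16ger2, xvf64gerpp, ... */
    }
    if (g_str_has_prefix(disas, "xxsplti") || g_str_has_prefix(disas, "lxvkq")) {
        return CLASS_IMMEDIATE;                    /* xxspltiw, lxvkq, ...       */
    }
    if (g_str_has_prefix(disas, "lxv") || g_str_has_prefix(disas, "stxv") ||
        g_str_has_prefix(disas, "lvx") || g_str_has_prefix(disas, "stvx")) {
        return CLASS_LOADSTORE;                    /* lxvp, stvx, ...            */
    }
    if (g_str_has_prefix(disas, "xvcv") || g_str_has_prefix(disas, "xscv")) {
        return CLASS_CONVERSION;                   /* xvcvbf16spn, xscvdpsxws    */
    }
    if (g_str_has_prefix(disas, "xs")) {
        return CLASS_SCALAR;                       /* xsmaddadp, xsdivsp, ...    */
    }
    if (g_str_has_prefix(disas, "xv") || disas[0] == 'v') {
        return CLASS_VECTOR;                       /* xvmaddadp, vmaddfp, ...    */
    }
    return CLASS_UNKNOWN;
}
```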
At exit, the arrays counting the number of instructions, reads, and writes are written to the QEMU log.
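For completeness, a sketch of the exit callback, dumping the read matrix through the plugin logging API (the real output format may differ):

```c
static void plugin_exit(qemu_plugin_id_t id, void *userdata)
{
    GString *report = g_string_new("source_class,dest_class,vsr_reads\n");

    for (int src = 0; src < N_CLASSES; src++) {
        for (int dst = 0; dst < N_CLASSES; dst++) {
            g_string_append_printf(report, "%d,%d,%" PRIu64 "\n",
                                   src, dst, read_count[src][dst]);
        }
    }
    /* The per-class instruction and write counters would be dumped similarly. */
    qemu_plugin_outs(report->str);
    g_string_free(report, TRUE);
}
```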
Results
Figure 6: VSR reads heatmap
The data collected by VMP is available as an Excel file. Figure 6 presents the two-dimensional array of VSR reads as a heatmap for each model and algorithm tested. The following observations can be made from the data:
- One-third of the reads by matrix instructions use values from load instructions and two-thirds access values produced by other matrix instructions. There is no matrix instruction operating on the result of vector instructions;
- Except for the “mnist-8” model (which is very small), matrix instructions account for 80~90% of all VSR reads, with matrix instructions reading the results of other matrix instructions accounting for 50~60% of the total. As expected, percentages are higher for larger models;
- SConv consistently performs more reads by matrix instructions than the baseline BLAS implementation and has a higher percentage of matrix instructions;
- Most reads (~98%) of matrix results are done by other matrix instructions, with the remaining reads (~1%) done by vector instructions. By further specializing the plugin classes, we determined that vector multiply-add instructions (like xvmaddasp) were the only type of vector instruction consuming the data generated by matrix instructions;
- There is no direct store of matrix results. A vector multiply-add instruction always processes the VSR before its value is returned to memory, and these instructions account for a very small percentage of the instruction count (<5%);
- Some store and floating-point instructions read data from uninitialized registers (or unidentified sources). This might be related to the nonvolatile registers (i.e., “callee saved”) of the ELFv2 ABI. The few floating-point reads of matrix results can also be caused by the nonvolatile FPRs (f14 to f31).
These results show that the analyzed workloads perform relatively few operations mixing vector and matrix data compared to the amount of computation done with matrix instructions alone, indicating that the performance of existing AI/ML software optimized for an IME-like architecture does not rely on the IME approach’s use of a single type of register for both matrix and vector data.
Also, the only vector operations observed consuming matrix data were the vector multiply-add instructions. If the future AME extension provides instructions to execute this type of operation using only Matrix Registers, implementing these algorithms would not require data movement between register types.
Conclusion
These results show that the analyzed workloads perform relatively few operations mixing vector and matrix data compared to the amount of computation done with matrix instructions alone. In particular, it was observed that:
- One-third of the reads by matrix instructions use values from load instructions and two-thirds access values produced by other matrix instructions. There is no matrix instruction operating on the result of vector instructions.
- Except for very small models, matrix instructions account for 80~90% of all VSR reads, with matrix instructions reading the results of other matrix instructions accounting for 50~60% of the total. As expected, percentages are higher for larger models;
- Most reads (~98%) of matrix results are done by other matrix instructions, with the remaining reads (~1%) done by vector multiply-add instructions (e.g., xvmaddasp);
- There is no direct store of matrix results. A vector multiply-add instruction always processes the VSR before its value is returned to memory, which accounts for a very small percentage of the total instruction count (<5%).
Meet the Authors
Matheus Ferst is a software developer at the Embedded Computing Department of Instituto de Pesquisas Eldorado. He graduated in Computer Engineering from Universidade Tecnológica Federal do Paraná and holds a Master’s in Electrical Engineering from the same institution. He is also an open-source enthusiast, contributing mainly to the QEMU project.
Guido Araujo received a Ph.D. in Electrical Engineering from Princeton University in 1997. He is a Full Professor of Computer Science and Engineering with UNICAMP. His current research interests include code optimization, parallelizing compilers, and compiling for accelerators, which are explored in close cooperation with industry partners. He has published over 120 scientific papers and received five best paper awards. He was awarded two Inventor of the Year and two Zeferino Vaz Research Awards from UNICAMP. His students received two Best Ph.D. Thesis Awards from the Brazilian Computing Society. He was a member of the technical advisory board of Eldorado and SIDI R&D Institutes and the CI-Brasil Council at the Brazilian Ministry of Science and Technology. He co-founded the companies Kryptus and Idea! and is currently a member of the Editorial Board of IEEE MICRO.