Recent work on a RISC-V IOMMU has attracted substantial attention from the RISC-V community. The Xuantie IOMMU from T-Head Semiconductors of Alibaba Group is one of five independent proposals, and its design is in line with the current IOMMU Task Group’s charter.
An Input/Output Memory Management Unit (IOMMU), analogous to the Memory Management Unit (MMU) in a CPU, regulates access to main memory by the peripheral devices in a computer system. To support the variety of businesses within the Alibaba Group, the Xuantie IOMMU is designed for a wide range of system configurations, from small Internet-of-Things (IoT) devices to large cloud servers.
While maintaining compatibility with the existing page table formats defined by the RISC-V Privileged Architecture, the Xuantie IOMMU introduces the additional data structures needed to map peripheral devices to individual address spaces. These data structures are collectively referred to hereafter as the ‘device structure.’ The device structure is designed to be flexible and scalable. At the same time, it allows for careful integration with virtualization-based systems and with data security mechanisms.
To cover systems ranging from IoT devices to cloud servers, the Xuantie IOMMU adopts a flexible definition of device identifiers, coupled with similarly dynamic table structures. The specification allows one-level and two-level tables, and the scheme can easily be expanded to more levels (e.g. BDF and PASID in the PCIe specification). The widths of the indices for the different levels are also customizable. In contrast to a fixed-width approach, this minimizes the memory footprint of the IOMMU design, leaving more room for software processing and page-based protection when needed. System integrators can reach an optimal solution by adjusting these parameters.
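The configurable-width lookup described above can be sketched as follows. This is a minimal illustration, not the layout defined by the Xuantie specification: the names `DeviceIdLayout` and `walk`, the most-significant-level-first ordering, and the use of dictionaries to stand in for in-memory tables are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class DeviceIdLayout:
    """Hypothetical split of a device identifier into per-level index widths."""
    widths: Tuple[int, ...]  # bits per level, most significant level first

    def split(self, device_id: int) -> list:
        total = sum(self.widths)
        if device_id < 0 or device_id >> total:
            raise ValueError("device_id does not fit the configured widths")
        indices = []
        shift = total
        for width in self.widths:
            shift -= width
            indices.append((device_id >> shift) & ((1 << width) - 1))
        return indices


def walk(root: dict, layout: DeviceIdLayout, device_id: int) -> Optional[object]:
    """Follow one index per level; return the leaf entry, or None if unmapped."""
    node = root
    for index in layout.split(device_id):
        if node is None:
            return None
        node = node.get(index)  # tables are modeled as sparse dicts (assumption)
    return node


# A small system might use a single 4-bit level; a server might use two 8-bit levels.
small = DeviceIdLayout((4,))
server = DeviceIdLayout((8, 8))
tables = {0x12: {0x34: "context-for-device-0x1234"}}
print(walk(tables, server, 0x1234))
```

Changing only the `widths` tuple changes how many tables exist and how large each one is, which is the kind of tuning the text leaves to the system integrator.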
Another feature of the Xuantie IOMMU’s device structure is the absence of device-specific descriptors in the leaf tables. Compared with embedding the descriptors in the tables, as existing designs do, this offers three benefits. First, the device structure keeps a uniform table arrangement (i.e. arrays of pointers) at every level. Second, it lifts restrictions on the width of the indices, bearing in mind that the size of a table grows with its index width. This is consistent with the design goal described above. Last but not least, it simplifies software architectures that are based on the concept of partitions.
To expand on the third point, consider a secure domain on a hart that controls the hardware IOMMU. After proper isolation is put in place, the secure domain delegates certain devices to the normal domain for its free use. With an out-of-table descriptor, the normal domain can allocate its own descriptor and submit it to the secure domain. The secure domain then attaches the descriptor to the hardware IOMMU’s device structure. From that point on, the normal domain is free to manage the configuration without invoking the secure domain. An in-table descriptor brings a couple of complications. The “fork out” option is not always readily available in the secure domain, because a table holds multiple descriptors. In addition, a software interface is needed so that subsequent requests are interpreted correctly. The software in the two domains may come from independent vendors, so it is desirable to keep such interaction to a minimum.
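The delegation flow above can be modeled in a few lines. This is only a sketch under stated assumptions: `Descriptor`, `SecureDomain`, `attach` and `lookup` are hypothetical names, the descriptor is reduced to a single field, and Python object references stand in for the pointers held in a leaf table.

```python
class Descriptor:
    """Hypothetical device descriptor, allocated and owned by the normal domain."""

    def __init__(self, page_table_root: int):
        self.page_table_root = page_table_root


class SecureDomain:
    """Hypothetical owner of the hardware IOMMU's device structure."""

    def __init__(self, delegated_devices):
        self._delegated = set(delegated_devices)
        self._leaf_table = {}  # device id -> reference to an out-of-table descriptor
        self.calls = 0         # number of times the secure domain was invoked

    def attach(self, device_id: int, descriptor: Descriptor) -> None:
        self.calls += 1
        if device_id not in self._delegated:
            raise PermissionError(f"device {device_id} is not delegated")
        self._leaf_table[device_id] = descriptor

    def lookup(self, device_id: int) -> Descriptor:
        """What the IOMMU would see when it follows the leaf pointer."""
        return self._leaf_table[device_id]


secure = SecureDomain(delegated_devices={3})
desc = Descriptor(page_table_root=0x1000)  # allocated by the normal domain
secure.attach(3, desc)                     # the one call into the secure domain

# Later reconfiguration happens entirely in the normal domain.
desc.page_table_root = 0x2000
print(hex(secure.lookup(3).page_table_root), secure.calls)
```

Because the leaf table stores only a reference, the update to `desc` is visible through the table without a second call into the secure domain.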
Other aspects of the Xuantie IOMMU specification are also worth highlighting. With the increasing demand for cloud computing, it is crucial for host machines to scale to many heterogeneous virtual machines. The traditional approach performs the PASID lookup after the G-stage translation tables have been determined. It therefore maps a device function entirely to one virtual machine, which significantly limits scalability (i.e. to fewer than 256). In the Xuantie IOMMU specification, PASIDs are included as part of the to-be-expanded device identifier and are taken into account when the G-stage translation is determined. As a result, a virtual machine does not monopolize a device function, more device space is available for virtualization, and more heterogeneous virtual machines can run at the same time.
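The difference can be illustrated with a small sketch. The 16-bit BDF and 20-bit PASID widths come from PCIe; everything else (the `expanded_id` encoding, the `contexts` table, the field names and the placeholder G-stage values) is an assumption made for illustration and is not taken from the specification.

```python
PASID_BITS = 20  # PCIe PASIDs are at most 20 bits wide


def expanded_id(bdf: int, pasid: int) -> int:
    """Hypothetical expanded device identifier: BDF in the high bits, PASID below."""
    if not 0 <= bdf < (1 << 16) or not 0 <= pasid < (1 << PASID_BITS):
        raise ValueError("bdf or pasid out of range")
    return (bdf << PASID_BITS) | pasid


# Each (function, PASID) pair selects its own context, including its G-stage root,
# so one device function can be shared between virtual machines.
contexts = {
    expanded_id(0x0100, 1): {"g_stage_root": "vm-a-tables"},
    expanded_id(0x0100, 2): {"g_stage_root": "vm-b-tables"},
}


def g_stage_for(bdf: int, pasid: int) -> str:
    return contexts[expanded_id(bdf, pasid)]["g_stage_root"]


print(g_stage_for(0x0100, 1), g_stage_for(0x0100, 2))
```

In the traditional ordering, the G-stage root would be chosen from the BDF alone, so both PASIDs of function `0x0100` would necessarily land in the same virtual machine.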
The control interface of the Xuantie IOMMU is carefully separated into two page-aligned MMIO segments: one for ‘data’ registers and the other for command registers. The goal of this design is to make the IOMMU easier to virtualize. Separating the data registers simplifies emulation: the host hypervisor can allocate a page, fill it with the data that the virtual IOMMU presents, and expose it to the guest VM. The command register range, in contrast, is still emulated with the trap-and-emulate approach. This design eliminates traps when the guest accesses the data of the virtual IOMMU, which improves performance.
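A toy model of this split is sketched below. It assumes 4 KiB pages, 32-bit little-endian registers, and that the data page sits immediately before the command page; none of these details, nor the names `VirtualIommu`, `publish`, `guest_read` and `guest_write`, come from the specification.

```python
PAGE_SIZE = 4096  # assumed page size
DATA_PAGE, COMMAND_PAGE = 0, 1  # assumed ordering of the two MMIO segments


class VirtualIommu:
    """Hypothetical hypervisor-side model of a virtual IOMMU's control interface."""

    def __init__(self):
        self.data_page = bytearray(PAGE_SIZE)  # plain memory mapped into the guest
        self.traps = 0
        self.commands = []

    def publish(self, offset: int, value: int) -> None:
        """Hypervisor fills the data page; no guest involvement."""
        self.data_page[offset:offset + 4] = value.to_bytes(4, "little")

    def guest_read(self, addr: int) -> int:
        page, offset = divmod(addr, PAGE_SIZE)
        if page != DATA_PAGE:
            raise ValueError("this sketch only models reads of the data page")
        # Served directly from memory: no trap into the hypervisor.
        return int.from_bytes(self.data_page[offset:offset + 4], "little")

    def guest_write(self, addr: int, value: int) -> None:
        page, offset = divmod(addr, PAGE_SIZE)
        if page != COMMAND_PAGE:
            raise ValueError("this sketch only models writes to the command page")
        # Command registers are unmapped in the guest, so each access traps.
        self.traps += 1
        self.commands.append((offset, value))


viommu = VirtualIommu()
viommu.publish(0x10, 0xABCD)
value = viommu.guest_read(0x10)       # no trap
viommu.guest_write(PAGE_SIZE + 8, 1)  # one trap, emulated by the hypervisor
print(hex(value), viommu.traps)
```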
The Xuantie IOMMU also treats security as an important feature. PCIe ATS allows a device to mark DMA requests as ‘translated’. The IOMMU does not translate such requests, which means it performs no permission checks on them. A malicious PCIe device can therefore mark all of its DMA requests as translated, effectively bypassing all permission checks. This attack has been demonstrated in academic papers, and we feel it is time to fix it. The Xuantie IOMMU specification therefore proposes to work with the ongoing IOPMP effort to build defense in depth while minimizing design and implementation changes. This topic is still under discussion in the community.
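The gap, and one way a physical-address check could close it, can be sketched as follows. The details of the IOPMP integration are still under discussion, so this is purely illustrative: the request format, the per-device region list and the function `allow_dma` are assumptions, not the proposed mechanism.

```python
def allow_dma(request: dict, iommu_check, physical_regions: dict) -> bool:
    """Decide whether a DMA request may proceed.

    request: {'device': int, 'addr': int, 'translated': bool} (hypothetical format)
    iommu_check: callable performing the normal translation and permission check
    physical_regions: device -> list of (start, end) physical ranges it may touch
    """
    if request["translated"]:
        # The IOMMU skips translation here, so without a further check the
        # device's own claim would be trusted. Apply a physical-range check.
        regions = physical_regions.get(request["device"], ())
        return any(start <= request["addr"] < end for start, end in regions)
    return iommu_check(request)


regions = {7: [(0x8000_0000, 0x8010_0000)]}
deny_all = lambda request: False  # stand-in for a failing IOMMU permission check

honest = {"device": 7, "addr": 0x8000_1000, "translated": True}
forged = {"device": 7, "addr": 0x9000_0000, "translated": True}
print(allow_dma(honest, deny_all, regions), allow_dma(forged, deny_all, regions))
```

A device that falsely marks a request as translated is still confined to the physical ranges it was granted.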
So far, we have implemented the IOMMU specification in the popular QEMU emulator. The prototype currently includes a Linux kernel-space driver. We have added support for the VFIO framework, and we are working on virtualization support. We have released the source code of the emulation platform for others who want to use the IOMMU to test their solutions.
What should we expect in the future? The IOMMU Task Group will deliver any future updates. Our specification and the other proposals may converge into a common proposal for the RISC-V community, and many advanced features will be considered and discussed along the way. Our QEMU work will closely track new developments in the community specification.