Author: Chong Ren
Recently, T-Head completed a QEMU-based proof-of-concept of hardware support for virtual IOMMUs in virtual machines. The work follows the specification in the T-Head IOMMU proposal, which was submitted to the RISC-V IOMMU TG as one of the candidate proposals when the group was formed. The virtual IOMMU design in the T-Head IOMMU demonstrates a way to expose to a virtual machine a virtual IOMMU identical to the one used by the host. As a result, the guest can reuse exactly the same kernel driver as the host, and the costly software emulation required by the conventional solution is eliminated.
For performance or security reasons, physical I/O devices can be configured so that a guest virtual machine accesses them directly, a technique widely known as device passthrough. The IOMMU’s translation tables restrict passthrough devices so that they can only perform DMA to memory regions belonging to the virtual machine they are assigned to. From the virtual machine’s perspective, a passthrough device appears as a peripheral that directly accesses the virtual machine’s physical address space. Without an IOMMU of its own, however, the virtual machine suffers the same inconveniences and drawbacks as a host whose devices are not managed by an IOMMU: the passthrough devices cannot be further assigned to the virtual machine’s user space, nor can the guest restrict their DMA for reliability purposes.
The conventional solution for presenting an IOMMU to a virtual machine, shown in Figure 1, is either trap-and-emulate or paravirtualization. Trap-and-emulate presents the guest with the same IOMMU as the hardware IOMMU used by the host, but it is expensive: the guest’s accesses to the virtual IOMMU trigger exceptions that the host must handle. The host not only has to emulate accesses to the virtual IOMMU registers, but also has to combine the two stages of translation whenever the virtual machine modifies its memory-resident translation structures. The latter is necessary because the existing hardware IOMMU supports only one stage of address translation and therefore cannot use the guest’s translation tables directly. Some IOMMUs, e.g., the ARM SMMU v3, can perform nested address translation, and there are kernel patches that use the guest’s translation tables directly; however, the patches remain RFCs, presumably because of the complex software interaction caused by the table structures defined by the hardware architecture.
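To make the cost of combining the two stages concrete, the following Python sketch models trap-and-emulate on top of a single-stage hardware IOMMU. It is a toy model under stated assumptions, not the code of any real hypervisor: page tables are flat dictionaries of page numbers, and the class and method names (`ShadowingIommuEmulator`, `guest_map`, `guest_unmap`, `dma_translate`) are hypothetical. Each guest table update is counted as a trap, after which the host folds the guest’s IOVA-to-GPA entry together with its own GPA-to-HPA entry into the single table the hardware can walk.

```python
PAGE_SHIFT = 12
OFFSET_MASK = (1 << PAGE_SHIFT) - 1


class DmaFault(Exception):
    """Raised when a DMA address has no valid translation."""


class ShadowingIommuEmulator:
    """Toy model of trap-and-emulate with a single-stage hardware IOMMU.

    host_s2 maps guest-physical page numbers to host-physical page numbers
    and is owned by the hypervisor. The hardware understands only one table,
    so the host maintains a shadow table from IOVA pages to host pages.
    (Hypothetical names; not taken from any real hypervisor.)
    """

    def __init__(self, host_s2):
        self.host_s2 = dict(host_s2)
        self.guest_s1 = {}  # the guest's own table: IOVA page -> GPA page
        self.shadow = {}    # what the hardware walks: IOVA page -> HPA page
        self.traps = 0      # guest updates the host had to intercept

    def guest_map(self, iova_pfn, gpa_pfn):
        # Modeled as a trap: the host intercepts the update and combines stages.
        self.traps += 1
        self.guest_s1[iova_pfn] = gpa_pfn
        hpa_pfn = self.host_s2.get(gpa_pfn)
        if hpa_pfn is None:
            # The guest named memory it does not own, so never expose it.
            self.shadow.pop(iova_pfn, None)
        else:
            self.shadow[iova_pfn] = hpa_pfn

    def guest_unmap(self, iova_pfn):
        # Unmapping traps too, so the stale shadow entry can be removed.
        self.traps += 1
        self.guest_s1.pop(iova_pfn, None)
        self.shadow.pop(iova_pfn, None)

    def dma_translate(self, iova):
        """Single-stage hardware walk: one lookup in the shadow table."""
        hpa_pfn = self.shadow.get(iova >> PAGE_SHIFT)
        if hpa_pfn is None:
            raise DmaFault(f"no translation for IOVA {iova:#x}")
        return (hpa_pfn << PAGE_SHIFT) | (iova & OFFSET_MASK)
```

Because the hardware only ever sees the composed shadow table, every guest map or unmap costs a trap into the host, which is the overhead that nested translation is meant to remove.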
Paravirtualization, shown in Figure 2, reduces the emulation effort by requiring the guest virtual machine to explicitly communicate its IOMMU configuration to the host. Its greatest drawback is that both the guest and the host must be modified, so it may not be available in certain environments.
T-Head’s Hardware Support for Virtual IOMMU
T-Head’s IOMMU proposal attempts to address the above drawbacks in the hardware architecture itself. In brief, a memory region, called the state region, is designated to hold the register state of the virtual IOMMU presented to the guest virtual machine, and the host’s table structure includes a pointer to that state region. When a DMA request needs to be translated, the IOMMU locates the state region, from which it fetches the translation tables and the state of the virtual IOMMU configured by the guest virtual machine. The hardware IOMMU then walks the guest’s table structures in the same manner as the host’s, treating all addresses in them as guest physical addresses, i.e., in a nested-translation fashion.
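The translation flow just described can be sketched in Python as follows. This is a simplified model, not the layout in the T-Head specification: the guest table has a single level indexed by IOVA page number, host memory is a dictionary keyed by host physical address, and `DeviceContext`, `state_region_hpa`, `table_root_gpa`, and `enabled` are hypothetical names. It also assumes that the host’s pointer to the state region is a host physical address, since the text does not say which address space that pointer uses.

```python
from dataclasses import dataclass

PAGE_SHIFT = 12
OFFSET_MASK = (1 << PAGE_SHIFT) - 1
PTE_SIZE = 8  # bytes per guest table entry (assumed)


class DmaFault(Exception):
    """Raised when either translation stage has no valid mapping."""


@dataclass
class DeviceContext:
    """Host-owned per-device entry (hypothetical layout, not the T-Head format)."""
    stage2: dict           # GPA page number -> HPA page number
    state_region_hpa: int  # host address of the guest's virtual IOMMU state (assumed)


def stage2(ctx, gpa):
    """Host (second-stage) translation of a guest-physical address."""
    hpa_pfn = ctx.stage2.get(gpa >> PAGE_SHIFT)
    if hpa_pfn is None:
        raise DmaFault(f"stage-2 fault at GPA {gpa:#x}")
    return (hpa_pfn << PAGE_SHIFT) | (gpa & OFFSET_MASK)


def nested_translate(device_table, host_mem, device_id, iova):
    """Translate a DMA address using the guest's state region and tables."""
    ctx = device_table[device_id]

    # 1. Follow the host's pointer to the state region holding the guest's
    #    virtual IOMMU registers (the table root and an enable bit here).
    state = host_mem[ctx.state_region_hpa]
    if not state["enabled"]:
        # The guest left its virtual IOMMU off: the IOVA is already a GPA.
        return stage2(ctx, iova)

    # 2. Walk the guest's table. Its root holds a GPA, so the table read is
    #    itself translated by stage 2 before memory is touched.
    pte_gpa = state["table_root_gpa"] + (iova >> PAGE_SHIFT) * PTE_SIZE
    gpa_pfn = host_mem.get(stage2(ctx, pte_gpa))
    if gpa_pfn is None:
        raise DmaFault(f"guest (stage-1) fault at IOVA {iova:#x}")

    # 3. The guest's result is a GPA; stage 2 turns it into the final HPA.
    return stage2(ctx, (gpa_pfn << PAGE_SHIFT) | (iova & OFFSET_MASK))
```

The guest’s table read in step 2 passes through stage 2 as well. This is what lets the hardware consume the guest’s tables unmodified while still confining the device to the guest’s memory.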
T-Head’s virtual IOMMU, shown in Figure 3, avoids expensive emulation because the hardware uses the guest’s configuration directly; in effect, the guest interacts with a ‘passthrough’ IOMMU supported by hardware. The interface of the virtual IOMMU is exposed directly by the hardware IOMMU and is identical to the host’s, so the exact same driver used by the host can be reused in the guest. Storing the virtual IOMMU’s state in memory also makes the solution scalable, since it is not subject to the limited number of hardware registers.
We have finished the proof-of-concept on QEMU and Linux/KVM. We added support for nested translation, as defined in T-Head’s IOMMU Specification, to the IOMMU emulation code in the native QEMU. Building on our previous device passthrough work, we also added nested IOMMU support to the VFIO layer in the RISC-V QEMU. The IOMMU kernel driver exposes a new API that lets the RISC-V QEMU manage the state region and configure the device ID in the translation descriptor. The API is provided as a device file, /dev/xt_iommu, whose mmap and write handlers we overrode.
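As an illustration of how a VMM might drive an mmap-plus-write control file like /dev/xt_iommu, here is a Python sketch. Only the device path comes from the text; the region size, the command opcode, and the packed command layout (`CMD_FORMAT`) are assumptions invented for this example, and the function names are hypothetical. The mmap call maps a state region into the process, and a write() carries a command that binds a device ID to it.

```python
import mmap
import os
import struct

# Only the device path comes from the post. The ABI below (size, opcode,
# command layout) is hypothetical and invented for illustration.
XT_IOMMU_DEV = "/dev/xt_iommu"
STATE_REGION_SIZE = 4096  # assumed size of one state region
CMD_SET_DEVICE_ID = 1     # assumed opcode
CMD_FORMAT = "<IIQ"       # assumed: opcode, device ID, state-region offset


def encode_set_device_id(device_id: int, region_offset: int = 0) -> bytes:
    """Pack a (hypothetical) command binding device_id to a state region."""
    return struct.pack(CMD_FORMAT, CMD_SET_DEVICE_ID, device_id, region_offset)


def open_state_region(path: str = XT_IOMMU_DEV):
    """Open the control file and map one state region into this process."""
    fd = os.open(path, os.O_RDWR)
    try:
        # Default flags give a shared, read/write mapping of the file.
        region = mmap.mmap(fd, STATE_REGION_SIZE)
    except (OSError, ValueError):
        os.close(fd)
        raise
    return fd, region


def bind_device(fd: int, device_id: int, region_offset: int = 0) -> None:
    """Send the (hypothetical) 'set device ID' command through write()."""
    cmd = encode_set_device_id(device_id, region_offset)
    written = os.write(fd, cmd)
    if written != len(cmd):
        raise OSError(f"short write to IOMMU control file: {written} of {len(cmd)} bytes")
```

Since the real ABI is not described here, the sketch is best read as the shape of the interaction (map the region, then issue commands through write()) rather than as its exact format.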
We will continue to evaluate and improve the current prototype and design, including implementing it in RTL. At the appropriate time, we would like to contribute this solution to the wider RISC-V community.