Run 32-bit applications on 64-bit Linux kernel | LIU Zhiwei, GUO Ren| T-Head Division of Alibaba Cloud

1. Introduction

Many architectures support running 32-bit applications on 64-bit processors. On x86, 16-bit or 32-bit applications run in long mode when the L bit in their code segment descriptors is cleared[1].

This mode is named COMPAT mode on Linux. There are two general reasons for using COMPAT[2]:

To be compatible with existing binaries and distros
To reduce the memory footprint of user space in a memory-constrained environment, either deeply embedded or in a container.

The situation of RISC-V is quite different from that of architectures with a large base of legacy 16-bit or 32-bit software. RISC-V has few, if any, legacy 32-bit applications because 32-bit glibc support was not released until 2020. Still, RISC-V is unusual in that the architecture supports a plethora of applications, ranging from small microcontrollers to large servers. In the near future, more RV32 applications will be developed, so CPUs that support both RV32 and RV64 applications are expected to gain prominence.

Currently, RV64 Linux is better supported than RV32 Linux. Therefore, we can now use RV64 to run RV32 applications.

COMPAT mode adds an ABI emulation layer to the original design, supported by both hardware and the OS. The hardware must be able to execute instructions with 32-bit semantics, while the OS must provide a 32-bit address space, system calls, and other required support for applications. Section 2 describes the related hardware mechanism: dynamic XLEN for RISC-V. Section 3 explains the implementation of dynamic UXLEN in QEMU. Section 4 introduces COMPAT mode on Linux.

2. Dynamic XLEN

Based on the RISC-V specifications, we classify XLEN into different types, i.e. UXLEN, SXLEN, MXLEN, VSXLEN, and HSXLEN. Dynamic XLEN is the mechanism designed to resolve compatibility problems on RISC-V. Every privilege mode has its own XLEN, which can be changed independently.

The following subsections describe XLEN and its influence on CPUs.

2.1 What is XLEN?

XLEN specifies the width of an integer register in bits.

It determines the size of the address space, 2^XLEN bytes;
It sets the width of XLEN-bit CSRs, such as mstatus;
It influences the behavior of instructions. Instructions such as ADD have the same name and encoding but behave differently in RV32 and RV64.

2.2 Specification Rules

2.2.1 Instruction behaviors

When the XLEN configured is smaller than the widest supported XLEN,

All operations ignore the source operand register bits above the configured XLEN;
Sign extension must be performed to extend the result to the widest supported XLEN for the destination register;
PC bits above the XLEN are ignored. When PC is in the process of being written, it is sign-extended to fill the widest supported XLEN.
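
The three rules above can be illustrated with a minimal C sketch. It assumes a hypothetical model in which the widest supported XLEN is 64 and the configured XLEN is 32; the function names are illustrative and do not come from the specification.

```c
#include <assert.h>
#include <stdint.h>

/* Sign-extend the low 32 bits of v to 64 bits.
 * The uint32_t -> int32_t conversion assumes a two's-complement target. */
static uint64_t sext32(uint64_t v)
{
    return (uint64_t)(int64_t)(int32_t)(uint32_t)v;
}

/* ADD with XLEN configured to 32 on a 64-bit register file:
 * source bits above bit 31 are ignored, the result is sign-extended. */
static uint64_t add_xlen32(uint64_t rs1, uint64_t rs2)
{
    uint32_t sum = (uint32_t)rs1 + (uint32_t)rs2; /* wraps modulo 2^32 */
    return sext32(sum);
}

/* Writing the PC with XLEN configured to 32: the upper bits of the
 * target are ignored and the stored value is sign-extended. */
static uint64_t write_pc_xlen32(uint64_t target)
{
    return sext32(target);
}

/* Usage example: garbage in the upper source bits does not matter. */
static void check_xlen32_rules(void)
{
    assert(add_xlen32(0xdeadbeef7fffffffULL, 1) == 0xffffffff80000000ULL);
    assert(write_pc_xlen32(0x0000000180001000ULL) == 0xffffffff80001000ULL);
}
```

Keeping every register sign-extended means that a later switch back to the wider XLEN observes a well-defined value rather than stale upper bits.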

2.2.2 CSR width modulation

In the specification, there is a detailed description of CSR width modulation[3]. Here are some important highlights:

The CSR value is zero extended (from small XLEN to large) or truncated (from large to small);
It is uniquely specified.

2.3 Change XLEN

2.3.1 MXLEN

Encoded in MXL field in misa.
The MXL field is always set to the widest supported ISA variant at reset.

2.3.2 SXLEN/UXLEN

Encoded in SXL/UXL field in mstatus.
Same encoding as MXLEN
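
As a minimal sketch of how these fields can be decoded, the following C helpers assume MXLEN = 64, where misa.MXL is the top two bits of misa, and mstatus.UXL and mstatus.SXL are bits 33:32 and 35:34 of mstatus. The helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* XL encoding shared by misa.MXL, mstatus.SXL and mstatus.UXL. */
enum demo_xl { DEMO_XL_32 = 1, DEMO_XL_64 = 2, DEMO_XL_128 = 3 };

/* misa.MXL occupies bits MXLEN-1:MXLEN-2; MXLEN = 64 is assumed here. */
static unsigned demo_misa64_mxl(uint64_t misa)
{
    return (unsigned)((misa >> 62) & 3);
}

/* mstatus.UXL is bits 33:32 when MXLEN = 64. */
static unsigned demo_mstatus64_uxl(uint64_t mstatus)
{
    return (unsigned)((mstatus >> 32) & 3);
}

/* mstatus.SXL is bits 35:34 when MXLEN = 64. */
static unsigned demo_mstatus64_sxl(uint64_t mstatus)
{
    return (unsigned)((mstatus >> 34) & 3);
}

/* Convert an XL encoding to a width in bits: 1 -> 32, 2 -> 64, 3 -> 128. */
static unsigned demo_xl_to_bits(unsigned xl)
{
    return 16u << xl;
}

/* Usage example: a 64-bit S-mode with a 32-bit U-mode. */
static void check_xl_decode(void)
{
    uint64_t mstatus = ((uint64_t)DEMO_XL_32 << 32) | ((uint64_t)DEMO_XL_64 << 34);

    assert(demo_xl_to_bits(demo_mstatus64_uxl(mstatus)) == 32);
    assert(demo_xl_to_bits(demo_mstatus64_sxl(mstatus)) == 64);
}
```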

2.4 Not clarified rules

2.4.1 XLEN change is not seamless

It is important to understand that the dynamic change rules were not written to facilitate actively changing the effective XLEN for software “on the go”[4]. Rather, the expectation is that whatever software was running before is likely “done”, and something else will now be run, without much carryover in the CSRs between the two, if indeed any.

2.4.2 MXLEN >= SXLEN >= UXLEN

The behavior of a RISC-V program depends on the execution environment on which it runs. When the program calls a service exposed by more privileged mode, its context needs to be correctly switched. For example, the context of a Linux application includes general registers, the PC register, some U-mode available CSRs and address space.

The context must be correctly saved and restored for every system call. Although the S-mode OS can recognize U-mode XLEN for every program that triggers system calls, it does not recognize the whole 64-bit registers if UXLEN > SXLEN (For example, UXLEN=64 and SXLEN=32). Thus, in this situation, S-mode OS cannot save the context of U-mode application correctly.

3. QEMU

QEMU is a fast functional emulator. It uses binary translation to simulate CPUs, devices, memory, and SoCs. Because it supports dynamic UXLEN, it can be used as an experimental platform for COMPAT testing.

3.1 Challenges

QEMU has supported RISC-V since the release of v2.12 in 2018. It supports both RV32 and RV64. However, it did not support switching dynamically between RV32 and RV64: the binary qemu-system-riscv32 only works for RV32, and qemu-system-riscv64 only for RV64.

3.1.1 Macros

Many TARGET_RISCV32 and TARGET_RISCV64 macros are used to isolate code that is built only for a specific XLEN. That makes it impossible for one binary to work with different XLENs. For example, addw is only available for RV64.

#ifdef TARGET_RISCV64
static bool trans_addw(DisasContext *ctx, arg_addw *a)
{
    return gen_arith(ctx, a, &gen_addw);
}
#endif

Macros are also used to control the behaviors of instructions.

static bool trans_fmv_x_w(DisasContext *ctx, arg_fmv_x_w *a)
{
    /* NOTE: This was FMV.X.S in an earlier version of the ISA spec! */
    REQUIRE_FPU;
    REQUIRE_EXT(ctx, RVF);

    TCGv t0 = tcg_temp_new();

    #if defined(TARGET_RISCV64)
    tcg_gen_ext32s_tl(t0, cpu_fpr[a->rs1]);
    #else
    tcg_gen_extrl_i64_i32(t0, cpu_fpr[a->rs1]);
    #endif

    gen_set_gpr(a->rd, t0);
    tcg_temp_free(t0);

    return true;
}

3.1.2 Width of register and TCG IR bind with binary

The RISC-V QEMU uses TARGET_LONG (which is specified by configuration options) to control general register width and the TCG IR.

struct CPURISCVState {
    target_ulong gpr[32];
    <clip>
};

64-bit TCG IR is used for RV64I while 32-bit TCG IR is used for RV32I. QEMU can be built as separate binaries for RV32 and RV64. However, if the same QEMU binary runs both RV32 and RV64 programs, issues will occur.

When the current XLEN is 32, an ADD implemented with the add_i64 TCG IR may overflow into the upper 32 bits. In this case, the low 32 bits of the destination register are still correct. However, if the ADD is followed by an SRA, the overflow bits are shifted down and make the low 32 bits incorrect.
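
The ADD-then-SRA problem can be reproduced with plain 64-bit C arithmetic. The sketch below is only a model of the effect, not QEMU code; the function names are hypothetical, and the right shift of a negative value assumes the arithmetic-shift behavior of mainstream compilers.

```c
#include <assert.h>
#include <stdint.h>

/* Sign-extend the low 32 bits of v (two's-complement target assumed). */
static uint64_t sext32(uint64_t v)
{
    return (uint64_t)(int64_t)(int32_t)(uint32_t)v;
}

/* 64-bit arithmetic shift right, standing in for a 64-bit SRA. */
static uint64_t sra64(uint64_t v, unsigned sh)
{
    return (uint64_t)((int64_t)v >> sh);
}

/* RV32 "add; sra" evaluated with 64-bit operations and no fix-up. */
static uint32_t add_sra_naive(uint64_t a, uint64_t b, unsigned sh)
{
    uint64_t sum = a + b;             /* low 32 bits are correct here */
    return (uint32_t)sra64(sum, sh);  /* but bit 31 is not the 64-bit sign */
}

/* Same sequence, with the ADD result sign-extended before it is reused. */
static uint32_t add_sra_fixed(uint64_t a, uint64_t b, unsigned sh)
{
    uint64_t sum = sext32(a + b);
    return (uint32_t)sra64(sum, sh);
}

/* Usage example: 0x7fffffff + 1 overflows to 0x80000000 in RV32,
 * so an SRA by 4 must produce 0xf8000000. */
static void check_add_sra(void)
{
    assert(add_sra_naive(0x7fffffffULL, 1, 4) == 0x08000000u); /* wrong */
    assert(add_sra_fixed(0x7fffffffULL, 1, 4) == 0xf8000000u); /* right */
}
```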

3.1.3 Memory address exceeds 4 GB

Another problem is how to calculate the address of memory access. For RISC-V, the memory address is calculated based on a base address and an immediate offset.

static bool gen_load(DisasContext *ctx, arg_lb *a, MemOp memop)
{
    TCGv t0 = tcg_temp_new();
    TCGv t1 = tcg_temp_new();
    gen_get_gpr(t0, a->rs1);
    tcg_gen_addi_tl(t0, t0, a->imm);

    tcg_gen_qemu_ld_tl(t1, t0, ctx->mem_idx, memop);
    gen_set_gpr(a->rd, t1);
    tcg_temp_free(t0);
    tcg_temp_free(t1);
    return true;
}

When XLEN is 32 on a 64-bit CPU, the address computed from the base register may fall outside the 2^32-byte addressable space, although the maximum address space of a 32-bit application is 2^32 bytes.
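
One way to keep the effective address inside the 32-bit space is to zero-extend it from 32 bits after the addition. The following C sketch models that idea under the assumption that registers hold sign-extended 32-bit values; the function name and the xl_is_32 parameter are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Compute base + imm as a load/store would, then wrap the address to
 * 32 bits when the current XLEN is 32. */
static uint64_t demo_effective_addr(uint64_t base, int64_t imm, bool xl_is_32)
{
    uint64_t addr = base + (uint64_t)imm;

    if (xl_is_32) {
        addr &= 0xffffffffULL; /* zero-extend from 32 bits */
    }
    return addr;
}

/* Usage example: a sign-extended base of 0x80000000 must address
 * 0x80000008, not 0xffffffff80000008. */
static void check_effective_addr(void)
{
    assert(demo_effective_addr(0xffffffff80000000ULL, 8, true) == 0x80000008ULL);
    assert(demo_effective_addr(0xffffffff80000000ULL, 8, false) ==
           0xffffffff80000008ULL);
}
```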

3.1.4 CSR restoration errors

The width of CSRs is also in TARGET_ULONG bits. For U-mode CSRs, if the positions of some fields are related to XLEN, context switch problems may occur.

For example, the width of VTYPE is XLEN bits, and the VILL field is at bit XLEN-1. If UXLEN is 32 and SXLEN is 64, the VILL field of the U-mode VTYPE is not located at the corresponding bit in S-mode. When the OS saves context, it uses CSRR to read VTYPE into a general register and then stores it to the stack. However, because the location of VILL changes, the saved VTYPE differs from the one the U-mode application sees.
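
The VILL relocation can be sketched in C as follows. The sketch assumes that the defined VTYPE fields other than VILL live in the low 8 bits, and the helper name is hypothetical; it is an illustration of the bit movement, not the QEMU implementation.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: all VTYPE fields except VILL fit in bits 7:0. */
#define DEMO_VTYPE_LOW_MASK 0xffULL

/* Re-express a VTYPE value read at from_xlen (32 or 64) as it would
 * appear at to_xlen (32 or 64): keep the low fields, move VILL from
 * bit from_xlen-1 to bit to_xlen-1. */
static uint64_t demo_vtype_convert(uint64_t vtype, unsigned from_xlen,
                                   unsigned to_xlen)
{
    uint64_t vill = (vtype >> (from_xlen - 1)) & 1;

    return (vtype & DEMO_VTYPE_LOW_MASK) | (vill << (to_xlen - 1));
}

/* Usage example: plain truncation loses VILL, conversion keeps it. */
static void check_vtype_convert(void)
{
    uint64_t vtype64 = 1ULL << 63;            /* VILL set at XLEN = 64 */

    assert((uint32_t)vtype64 == 0);           /* truncation drops VILL */
    assert(demo_vtype_convert(vtype64, 64, 32) == 0x80000000ULL);
}
```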

3.1.5 Other problems

Many other problems may occur during disassembling, debugging, and PC calculation. They should be carefully processed for 32-bit and 64-bit UXLEN.

3.2 Design

3.2.1 Represent XLEN correctly

We have the maximum (processor instantiation time) register size in misa_mxl_max. The current register size is in xl, and the current instruction size is in ol.

3.2.1.1 misa_mxl_max

misa_mxl_max stands for the widest supported ISA variant. As the specification states, “The MXL field is always set to the widest supported ISA variant at reset.” It is used in two scenarios, booting and GDB, to emphasize that the reset value of misa.mxl is intended, not the current CPU state.

3.2.1.2 xl

The target CPUs may be in different states. CPUs in different states evaluate instructions differently. In order to achieve a high speed, state information of virtual CPUs cannot be changed in the translation phase. The state is recorded in the Translation Block (TB). If the state changes (e.g. privilege level), a new TB will be generated. The same idea can be applied to other aspects of the CPU state. For example, on x86, if the SS, DS and ES segments have a zero base, then the translator does not generate an addition for the segment base.[5]

The states recorded in the TB are named TBFlags. As XLEN explicitly changes the way to evaluate instructions, the state information is stored as a part of TBFlags.

flags = FIELD_DP32(flags, TB_FLAGS, XL, cpu_get_xl(env));

The calculation of the current XLEN is complicated, so we should avoid recalculating it repeatedly.

static RISCVMXL cpu_get_xl(CPURISCVState *env)
{
#if defined(TARGET_RISCV32)
    return MXL_RV32;
#elif defined(CONFIG_USER_ONLY)
    return MXL_RV64;
#else
    RISCVMXL xl = riscv_cpu_mxl(env);

    /*
     * When emulating a 32-bit-only cpu, use RV32.
     * When emulating a 64-bit cpu, and MXL has been reduced to RV32,
     * MSTATUSH doesn't have UXL/SXL, therefore XLEN cannot be widened
     * back to RV64 for lower privs.
     */
    if (xl != MXL_RV32) {
        switch (env->priv) {
        case PRV_M:
            break;
        case PRV_U:
            xl = get_field(env->mstatus, MSTATUS64_UXL);
            break;
        default: /* PRV_S | PRV_H */
            xl = get_field(env->mstatus, MSTATUS64_SXL);
            break;
        }
    }
    return xl;
#endif
}

However, the current XLEN in TBFlags may be calculated many times, as TBFlags are calculated before each TB executes (except for directly chained blocks). XLEN seldom changes: it changes only during exceptions, misa writes, mstatus writes, CPU reset, and migration load. We can recalculate XLEN in these cases and cache it in CPURISCVState.

3.2.1.3 ol

‘ol’ is associated with the instruction being translated. It differs from xl for instructions suffixed with ‘w’ in RV64, and for those suffixed with ‘w’ or ‘d’ in RV128. For each ol variant, we provide code to implement the variant.

For example, consider translating srli. If ol is 32-bit, gen_srliw is used; if ol is 64-bit, tcg_gen_shri_tl is used; and if ol is 128-bit, gen_srli_i128 is used.

static bool trans_srli(DisasContext *ctx, arg_srli *a)
 {
     return gen_shift_imm_fn_per_ol(ctx, a, EXT_NONE,
                                    tcg_gen_shri_tl, gen_srliw, gen_srli_i128);
 }

When you translate the srliw and xl is 64-bit, you can reuse the gen_srliw by setting ol to MXL_RV32.

static bool trans_srliw(DisasContext *ctx, arg_srliw *a)
 {
     REQUIRE_64_OR_128BIT(ctx);
     ctx->ol = MXL_RV32;
     return gen_shift_imm_fn(ctx, a, EXT_NONE, gen_srliw, NULL);
 }

3.2.2 Use 64-bit TCG IR to represent the 32-bit calculation

In general, there are two ways to evaluate the 32-bit instructions.

Truncate the 64-bit register to 32-bit and use 32-bit TCG IR.
Extend the 32-bit register to 64-bit and use 64-bit TCG IR.

This is the first discussion on the implementation of 32-bit calculation. The conclusion is that the 64-bit TCG IR can implement the 32-bit calculation by performing some extra operations. It reuses the core calculation code for most instructions. For some other instructions, the truncate method is also used.

Sign- or zero-extension extends the 32-bit register values to 64-bit values, and the 64-bit TCG IR then calculates the result.

static TCGv get_gpr(DisasContext *ctx, int reg_num, DisasExtend ext)
 {
     TCGv t;
     switch (get_ol(ctx)) {
     case MXL_RV32:
         switch (ext) {
         case EXT_NONE:
             break;
         case EXT_SIGN:
             t = temp_new(ctx);
             tcg_gen_ext32s_tl(t, cpu_gpr[reg_num]);
             return t;
         case EXT_ZERO:
             t = temp_new(ctx);
             tcg_gen_ext32u_tl(t, cpu_gpr[reg_num]);
             return t;
         default:
             g_assert_not_reached();
         }
         break;
     case MXL_RV64:
     case MXL_RV128:
         break;
     default:
         g_assert_not_reached();
     }
     return cpu_gpr[reg_num];
 }

Sign-extension is used to extend 32-bit results back to the register.

static void gen_set_gpr(DisasContext *ctx, int reg_num, TCGv t)
 {
     if (reg_num != 0) {
         switch (get_ol(ctx)) {
         case MXL_RV32:
             tcg_gen_ext32s_tl(cpu_gpr[reg_num], t);
             break;
         case MXL_RV64:
         case MXL_RV128:
             tcg_gen_mov_tl(cpu_gpr[reg_num], t);
             break;
         default:
             g_assert_not_reached();
         }
     }
 }

For many instructions, we can get the right results in this way.

Some instructions, such as ADD, need no special extension of their source registers.
Some instructions, such as SLT and SLTI, need their source registers sign-extended.
Some instructions, such as SRL, need their source registers zero-extended.
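
Why SLT needs sign-extension and SRL needs zero-extension can be shown with a small C model that evaluates both with 64-bit operations. This is a sketch with hypothetical function names, not the TCG code; it assumes a two's-complement target.

```c
#include <assert.h>
#include <stdint.h>

static uint64_t sext32(uint64_t v)
{
    return (uint64_t)(int64_t)(int32_t)(uint32_t)v;
}

static uint64_t zext32(uint64_t v)
{
    return v & 0xffffffffULL;
}

/* SLT with ol = 32: a 64-bit signed compare gives the 32-bit answer
 * once both sources are sign-extended from bit 31. */
static uint64_t demo_slt_ol32(uint64_t rs1, uint64_t rs2)
{
    return (int64_t)sext32(rs1) < (int64_t)sext32(rs2);
}

/* SRL with ol = 32: a 64-bit logical shift gives the 32-bit answer
 * once the source is zero-extended; the result is then sign-extended
 * when it is written back. */
static uint64_t demo_srl_ol32(uint64_t rs1, unsigned shamt)
{
    return sext32(zext32(rs1) >> (shamt & 31));
}

/* Usage example. */
static void check_extension_choice(void)
{
    /* 0xffffffff is -1 as a 32-bit value, so it is less than 0. */
    assert(demo_slt_ol32(0x00000000ffffffffULL, 0) == 1);
    /* The sign-extended upper bits must not be shifted into the result. */
    assert(demo_srl_ol32(0xffffffff80000000ULL, 4) == 0x08000000ULL);
}
```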

Other instructions, such as rori, cannot be translated in this way. For them, we truncate the values to 32 bits and use 32-bit IR.

 static void gen_roriw(TCGv ret, TCGv arg1, target_long shamt)
 {
     TCGv_i32 t1 = tcg_temp_new_i32();
 
     tcg_gen_trunc_tl_i32(t1, arg1);
     tcg_gen_rotri_i32(t1, t1, shamt);
     tcg_gen_ext_i32_tl(ret, t1);
 
     tcg_temp_free_i32(t1);
 }

You may be concerned about the precision of the results, as we are calculating a 32-bit target instruction by using 64-bit IR. The problem does not occur because integer operations have no precision problems as long as no value changes when extension is performed.

3.2.3 Calculation according to the specification

There are many issues according to the specification. The following figure shows the changes.

3.3 Status

3.3.1 The upstream process

3.3.2 Current status

The dynamic UXLEN support patch set v8 (23 patches) has been merged into QEMU v7.0.0.
Both Richard’s MXL patch set and Frédéric Pétrot’s RV128 patch set have been merged.
Dynamic UXLEN is ready.
Neither dynamic SXLEN nor dynamic MXLEN is supported. If they were supported, the separate RV32 QEMU could be removed.

4. Linux Support

Compatibility layer support has a long history. Currently, x86, parisc, powerpc, arm64, s390, mips, and sparc all support COMPAT in 64-bit Linux. RISC-V has become the 8th architecture to support this feature, and it is the first architecture that implements COMPAT based on asm-generic/unistd.h.

4.1 Implement compat syscall table

The core purpose of the compatibility layer is to construct a compat syscall table for 32-bit applications on 64-bit OS.

4.1.1 Principle from the kernel documentation

For most system calls, the same 64-bit implementation method can be invoked even when user space program is 32-bit; even if parameters that are specified for system calls include an explicit pointer, the problem is handled transparently[6].

However, there are a couple of situations where a compatibility layer is needed to cope with size differences between 32-bit and 64-bit.

The first is when the 64-bit kernel supports 32-bit user space programs and therefore needs to parse areas of (__user) memory that could hold either 32-bit or 64-bit values. In particular, a compatibility layer is needed whenever a system call argument is:

a pointer to a pointer
a pointer to a struct containing a pointer (e.g. struct iovec __user *)
a pointer to a varying sized integral type (time_t, off_t, long, …)
a pointer to a struct containing a varying sized integral type.

The second is when one of the system call's arguments has a type that is explicitly 64-bit even on a 32-bit architecture, for example loff_t or __u64. In this case, a value that arrives at a 64-bit kernel from a 32-bit application will be split into two 32-bit values, which then need to be re-assembled in the compatibility layer.
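
The re-assembly step can be sketched in C as below. The helper name is hypothetical and is not a kernel API; the sketch assumes the two halves arrive as separate 32-bit arguments and that the caller knows which one is the low half.

```c
#include <assert.h>
#include <stdint.h>

/* Rebuild a 64-bit value (for example a file offset) from the two
 * 32-bit halves passed by a 32-bit application. */
static int64_t demo_merge_64(uint32_t lo, uint32_t hi)
{
    return (int64_t)(((uint64_t)hi << 32) | (uint64_t)lo);
}

/* Usage example: an offset just above 4 GiB. */
static void check_merge_64(void)
{
    assert(demo_merge_64(0x00000010u, 0x00000001u) == 0x0000000100000010LL);
}
```

Which argument register carries the low half depends on the 32-bit ABI and its endianness, so a real compatibility layer has to follow that convention.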

4.1.2 RISC-V implementation

4.1.2.1 Implement double TASK_SIZE

Change TASK_SIZE from a constant into an expression that checks the TIF_32BIT flag dynamically. Following arm64, implement DEFAULT_MAP_WINDOW_64 for efi-stub. To ease software design, limit 32-bit compat processes to the 0 GB – 2 GB virtual address range (which is enough for real scenarios); this avoids address sign-extension problems when 32-bit values are converted to 64-bit values. The standard 32-bit TASK_SIZE is 0x9dc00000, which is the value of FIXADDR_START. The virtual address space in standard 32-bit mode is therefore about 476 MB larger than that in compat 32-bit mode.

#ifdef CONFIG_COMPAT
#define TASK_SIZE_32 (_AC(0x80000000, UL) - PAGE_SIZE)
#define TASK_SIZE (test_thread_flag(TIF_32BIT) ? \
TASK_SIZE_32 : TASK_SIZE_64)
#else
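
The 476 MB figure and the sign-extension argument can be checked with a short C sketch. The DEMO_ names are hypothetical, and a 4 KiB page size is an assumption made only for this illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define DEMO_PAGE_SIZE        0x1000ULL                      /* assumption */
#define DEMO_TASK_SIZE_COMPAT (0x80000000ULL - DEMO_PAGE_SIZE)
#define DEMO_TASK_SIZE_RV32   0x9dc00000ULL                  /* native rv32 */

/* Space above the 2 GB boundary that native rv32 has and compat lacks,
 * in MB (the one-page gap below 2 GB is ignored). */
static uint64_t demo_extra_mb(void)
{
    return (DEMO_TASK_SIZE_RV32 - 0x80000000ULL) >> 20;
}

/* True if a 32-bit address has the same 64-bit value under sign- and
 * zero-extension, i.e. if bit 31 is clear. */
static bool demo_extension_agnostic(uint32_t addr)
{
    uint64_t z = (uint64_t)addr;
    uint64_t s = (uint64_t)(int64_t)(int32_t)addr;

    return z == s;
}

/* Usage example. */
static void check_task_size(void)
{
    assert(demo_extra_mb() == 476);
    assert(demo_extension_agnostic((uint32_t)(DEMO_TASK_SIZE_COMPAT - 1)));
    assert(!demo_extension_agnostic(0x80000000u));
}
```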

4.1.2.2 Implement compat_sys_call_table

Implement compat sys_call_table and some system call functions: truncate64, ftruncate64, fallocate, pread64, pwrite64, sync_file_range, readahead, and fadvise64_64 which need argument translation.

#ifdef CONFIG_COMPAT
#define __ARCH_WANT_COMPAT_TRUNCATE64
#define __ARCH_WANT_COMPAT_FTRUNCATE64
#define __ARCH_WANT_COMPAT_FALLOCATE
#define __ARCH_WANT_COMPAT_PREAD64
#define __ARCH_WANT_COMPAT_PWRITE64
#define __ARCH_WANT_COMPAT_SYNC_FILE_RANGE
#define __ARCH_WANT_COMPAT_READAHEAD
#define __ARCH_WANT_COMPAT_FADVISE64_64
#endif

void * const compat_sys_call_table[__NR_syscalls] = {
[0 ... __NR_syscalls - 1] = sys_ni_syscall,
#include <asm/unistd.h>
};

4.1.2.3 Implement compat_syscall exception handler in entry.S

Implement the entry to compat_sys_call_table[] in assembly. For more information, see Section 4.1.1, Supervisor Status Register (sstatus), of the RISC-V privileged specification:

Bits [33:32] = UXL[1:0]:

– 1: 32

– 2: 64

– 3: 128

#ifdef CONFIG_COMPAT
REG_L s0, PT_STATUS(sp)
srli s0, s0, SR_UXL_SHIFT
andi s0, s0, (SR_UXL >> SR_UXL_SHIFT)
li t0, (SR_UXL_32 >> SR_UXL_SHIFT)
sub t0, s0, t0
bnez t0, 1f



/* Call compat_syscall */
la s0, compat_sys_call_table
j 2f
1:
#endif
/* Call syscall */
 la s0, sys_call_table
2:
 slli t0, a7, RISCV_LGPTR
 add s0, s0, t0
 REG_L s0, 0(s0)
3:
 jalr s0
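
The table selection performed by the assembly above can be read as the following C sketch. The SR_UXL constants follow the bit layout given earlier; the demo_ tables and handlers are hypothetical stand-ins for the real syscall tables.

```c
#include <assert.h>
#include <stdint.h>

#define SR_UXL_SHIFT 32
#define SR_UXL       (3ULL << SR_UXL_SHIFT)  /* sstatus bits 33:32 */
#define SR_UXL_32    (1ULL << SR_UXL_SHIFT)  /* UXL = 1: 32-bit U-mode */

typedef long (*demo_syscall_fn_t)(void);

/* Stand-in handlers that only report which table was used. */
static long demo_native_handler(void) { return 64; }
static long demo_compat_handler(void) { return 32; }

static const demo_syscall_fn_t demo_sys_call_table[1] = { demo_native_handler };
static const demo_syscall_fn_t demo_compat_sys_call_table[1] = { demo_compat_handler };

/* Pick the table from the saved sstatus.UXL, then index it by the
 * syscall number (a7 in the assembly). */
static long demo_dispatch(uint64_t saved_sstatus, unsigned nr)
{
    const demo_syscall_fn_t *table = demo_sys_call_table;

    if (((saved_sstatus & SR_UXL) >> SR_UXL_SHIFT) == (SR_UXL_32 >> SR_UXL_SHIFT)) {
        table = demo_compat_sys_call_table;
    }
    return table[nr]();
}

/* Usage example. */
static void check_dispatch(void)
{
    assert(demo_dispatch(1ULL << SR_UXL_SHIFT, 0) == 32);
    assert(demo_dispatch(2ULL << SR_UXL_SHIFT, 0) == 64);
}
```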

4.1.2.4 Implement compat_signal

Implement compat_setup_rt_frame to save and restore sigcontext. The main process is the same as the process of signal, but the rv32 pt_regs’ size is different from rv64’s, so we need to convert them.

struct compat_sigcontext {
    struct compat_user_regs_struct sc_regs;
    union __riscv_fp_state sc_fpregs;
};

struct compat_ucontext {
    compat_ulong_t  uc_flags;
    struct compat_ucontext *uc_link;
    compat_stack_t  uc_stack;
    sigset_t  uc_sigmask;
    __u8    __unused[1024 / 8 - sizeof(sigset_t)];
    struct compat_sigcontext uc_mcontext;
};

struct compat_rt_sigframe {
    struct compat_siginfo info;
    struct compat_ucontext uc;
};
COMPAT_SYSCALL_DEFINE0(rt_sigreturn);

4.1.2.5 Implement compat_ptrace

Native GDB depends on the ptrace system call. Currently, RISC-V has no special handling for the ptrace system call because Linux does not support the debug trigger mechanisms defined in the specifications.

ptrace should return 32-bit registers for applications.

static int compat_riscv_gpr_get(struct task_struct *target,
                                const struct user_regset *regset,
                                struct membuf to)
{
    struct compat_user_regs_struct cregs;

    regs_to_cregs(&cregs, task_pt_regs(target));
    return membuf_write(&to, &cregs,
                        sizeof(struct compat_user_regs_struct));
}
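
The conversion done by regs_to_cregs can be pictured with the sketch below. The demo_ structures hold only three registers and are hypothetical; the zero-extension used on the way back is an assumption of this sketch, chosen because compat addresses stay below 2 GB.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, shortened register sets. */
struct demo_regs  { uint64_t pc, ra, sp; };  /* 64-bit kernel view */
struct demo_cregs { uint32_t pc, ra, sp; };  /* 32-bit compat view */

/* Narrow each 64-bit register to the 32 bits the application sees. */
static void demo_regs_to_cregs(struct demo_cregs *c, const struct demo_regs *r)
{
    c->pc = (uint32_t)r->pc;
    c->ra = (uint32_t)r->ra;
    c->sp = (uint32_t)r->sp;
}

/* Widen the 32-bit values again (zero-extension assumed here). */
static void demo_cregs_to_regs(struct demo_regs *r, const struct demo_cregs *c)
{
    r->pc = (uint64_t)c->pc;
    r->ra = (uint64_t)c->ra;
    r->sp = (uint64_t)c->sp;
}

/* Usage example: a round trip preserves addresses below 2 GB. */
static void check_regs_round_trip(void)
{
    struct demo_regs in = { 0x10000, 0x10004, 0x7fffe000 }, out;
    struct demo_cregs c;

    demo_regs_to_cregs(&c, &in);
    demo_cregs_to_regs(&out, &c);
    assert(out.pc == in.pc && out.ra == in.ra && out.sp == in.sp);
}
```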

4.1.2.6 Other works

Clean up the common SYSCALL code with community maintainers (thanks to Arnd and Christoph)
Create the base COMPAT data structure
Implement dual ELF detection
Implement compat_vdso

4.2 Run rv64-compat on QEMU

 - Prepare rv32 rootfs & fw_jump.bin by buildroot.org
   $ git clone git://git.busybox.net/buildroot
   $ cd buildroot
   $ make qemu_riscv32_virt_defconfig O=qemu_riscv32_virt_defconfig
   $ make -C qemu_riscv32_virt_defconfig
   (Got fw_jump.bin & rootfs.ext2 in qemu_riscv32_virt_defconfig/images)

 - Prepare Linux rv64 Image
   $ git clone git@github.com:c-sky/csky-linux.git -b riscv_compat_v12 linux
   $ cd linux
   $ echo "CONFIG_STRICT_KERNEL_RWX=n" >> arch/riscv/configs/defconfig
   $ echo "CONFIG_STRICT_MODULE_RWX=n" >> arch/riscv/configs/defconfig
   $ make ARCH=riscv CROSS_COMPILE=riscv64-buildroot-linux-gnu- O=../build-rv64/ defconfig
   $ make ARCH=riscv CROSS_COMPILE=riscv64-buildroot-linux-gnu- O=../build-rv64/ Image

 - Prepare QEMU:
   $ git clone https://gitlab.com/qemu-project/qemu.git -b master qemu
   $ cd qemu
   $ ./configure --target-list="riscv64-softmmu"
   $ make

 - Run rv64 with rv32 rootfs in compat mode:
   $ ./build/qemu-system-riscv64 -cpu rv64 -M virt -m 64m -nographic -bios qemu_riscv64_virt_defconfig/images/fw_jump.bin -kernel build-rv64/Image -drive file=qemu_riscv32_virt_defconfig/images/rootfs.ext2,format=raw,id=hd0 -device virtio-blk-device,drive=hd0 -append "rootwait root=/dev/vda ro console=ttyS0 earlycon=sbi" -netdev user,id=net0 -device virtio-net-device,netdev=net0

5. References

[1] Advanced Micro Devices, Inc. x86-64™ Technology White Paper.
[2] Arnd Bergmann. Linux kernel mail list comments, https://patchwork.kernel.org/project/linux-riscv/cover/20211221163532.2636028-1-guoren@kernel.org/#24681423.
[3] Andrew Waterman & Krste (2021). RISC-V ISA specification, Volume 2, Privileged Spec v. 20211203, (pp. 13–14).
[4] John Hauser. RISC-V tech mail list, https://lists.riscv.org/g/tech/message/252.
[5] QEMU TCG documentation, https://qemu.readthedocs.io/en/latest/devel/tcg.html
[6] Linux kernel documentation, https://www.kernel.org/doc/html/latest/process/adding-syscalls.html?highlight=syscall