Conventional wisdom says you should normally apply small microcontrollers to dedicated applications with constrained resources. 8-bit microcontrollers with a few kilobytes of memory are still plentiful today. 32-bit microcontrollers with a couple of dozen kilobytes of memory are also very popular. In the latter case, it is typical to rely on a small RTOS to provide basic software interfaces and services. The Zephyr Project provides such an RTOS. Many ARM-based microcontrollers are supported, but other architectures including ARC, XTENSA, RISC-V (32-bit) and X86 (32-bit) are also supported. Yet some people are designing products with computing needs that are simple enough to be fulfilled by a small RTOS like Zephyr, but with memory addressing needs that cannot be described by kilobytes or megabytes, but that actually require gigabytes! So it was quite a surprise when BayLibre was asked to port Zephyr to the 64-bit RISC-V architecture.
Where to start
The 64-bit port required a lot of cleanups. Initially, we were far from concerned by the actual RISCV64 support. Zephyr supports a virtual “board” configuration faking basic hardware on one side and interfacing with a POSIX environment on the other side which allows for compiling a Zephyr application into a standard Linux process. This has enormous benefits such as the ability to use native Linux development tools. For example, it allows you to use gdb to look at core dumps without fiddling with a remote debugging setup or emulators such as QEMU. Until this point, this “POSIX” architecture only created 32-bit executables. We started by only testing the generic Zephyr code in 64-bit mode. It was only a matter of flipping some compiler arguments to attempt a 64-bit build. But unsurprisingly, it failed.The 32-bit legacy
Since its inception, the Zephyr RTOS targeted 32-bit architectures. The assumption that everything can be represented by an int32_t variable was everywhere. Code patterns like the following were ubiquitous:``` static inline void mbox_async_free(struct k_mbox_async *async) { k_stack_push(&async_msg_free, (u32_t)async); } ```Here the async pointer gets truncated on a 64-bit build. Fortunately, the compiler does flag those occurrences:
``` In function ‘mbox_async_free’: warning: cast from pointer to integer of different size [-Wpointer-to-int-cast] k_stack_push(&async_msg_free, (u32_t)async); ^ ```Therefore the actual work started with a simple task: converting all u32_t variables and parameters that may carry pointers into uintptr_t. After several days of work, the Hello_world demo application could finally be built successfully. Yay! But attempting to execute it resulted in a segmentation fault. The investigation phase began.
Chasing bugs
While the compiler could identify bad u32_t usage when a cast to or from a pointer was involved, some other cases could be found only by manual code inspection. Still, Zephyr is a significant body of code to review and catching all issues, especially the subtle ones, couldn’t happen without some code execution tracing in gdb. A much more complicated issue involved linked list pointers that ended up referring to non-existent list nodes for no obvious reason, and the bug only occurred after another item was removed from the list. This issue was only noticeable with a subsequent list search that followed the rogue pointer into Lalaland. And it didn’t trigger every time. The header file for list operations starts with this:``` #ifdef __LP64__ typedef u64_t unative_t; #else typedef u32_t unative_t; #endif ```So one would quickly presume that the code is already 64-bit ready. From a quick glance, it does use unative_t everywhere. What is easily missed is this:
``` #define SYS_SFLIST_FLAGS_MASK 0x3U static inline sys_sfnode_t *z_sfnode_next_peek(sys_sfnode_t *node) { return (sys_sfnode_t *)(node->next_and_flags & ~SYS_SFLIST_FLAGS_MASK); } ```Here we return the next pointer after masking out the bottom 2 flag bits. But 0x3U is interpreted by the compiler as an unsigned int and therefore a 32-bit value, meaning that ~0x3U is equal to 0xFFFFFFFC. Because node->next_and_flags is an u64_t, our (unsigned) 0xFFFFFFFC is promoted to 0x00000000FFFFFFFC, effectively truncating the returned pointer to its 32 bottom bits. So everything worked when the next node in the list was allocated in heap memory which is typically below the 4GB mark, but not for nodes allocated on the stack which is typically located towards the top of the address space on Linux. The fix? Turning 0x3U into 0x3UL. The addition of that single character required many hours of debugging, and this is only one example. Other equally difficult bugs were also found.