About our Dhrystone Benchmarking Methodology

This is an extended version of Krste’s comment on the RISC-V EE Times article about our Dhrystone benchmarking methodology.

We have reported a Dhrystone score of 1.72 DMIPS/MHz for the Rocket core here. We pulled the Dhrystone comparison together quickly, as we kept getting asked about how we compared to ARM cores and these were the only publicly available numbers we could easily compare against. We didn’t spend a lot of time on it, as we’re not particularly interested in “Dhrystone Drag Racing” with minimal stripped-down cores. Basically, we sized the caches to match ARM’s configuration publicly available on their website and removed our vector floating-point unit to make a fairer comparison with the A5, which was also configured without an FPU or vector unit. We didn’t strip out a lot of other stuff that we could have.
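
For readers unfamiliar with the unit, here is a minimal sketch of how a DMIPS/MHz figure is conventionally derived from a raw Dhrystone run. The function name and the input numbers are hypothetical and chosen only to make the arithmetic easy to follow; they are not our measurements. The one fixed quantity is the usual reference of 1757 Dhrystones per second, the VAX 11/780 score that defines 1 DMIPS.

```python
# Conventional reference: a VAX 11/780 scores 1757 Dhrystones/s, defined as 1 DMIPS.
VAX_DHRYSTONES_PER_SECOND = 1757


def dmips_per_mhz(dhrystones_per_second: float, clock_mhz: float) -> float:
    """Normalize a raw Dhrystone rate to DMIPS, then to DMIPS per MHz of clock."""
    if clock_mhz <= 0:
        raise ValueError("clock_mhz must be positive")
    dmips = dhrystones_per_second / VAX_DHRYSTONES_PER_SECOND
    return dmips / clock_mhz


# Hypothetical inputs (not measured data): 1,757,000 Dhrystones/s at 1000 MHz.
example = dmips_per_mhz(1_757_000, 1000.0)
print(f"{example:.2f} DMIPS/MHz")  # prints "1.00 DMIPS/MHz"
```

Dividing by the clock frequency is what lets cores running at different speeds, or cores that were never fabricated, be compared on the same scale.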

Here’s a more detailed specification of the Rocket core we’ve used:

  • The Rocket core implements RV64IMA, i.e., base integer, integer multiply/divide, and atomic operations (which are quite extensive in RISC-V and go unused in Dhrystone). Our registers are twice as wide (64 versus 32 bits) and we have twice as many user registers (32 versus 16) as ARM. The 64-bit width does help Dhrystone, but it also helps lots of other code, and the wider, more numerous registers are obviously included in our area number.
  • The instruction cache was 16KB 2-way set-associative with 64-byte lines, blocking on misses.
  • The data cache was also 16KB 2-way set-associative with 64-byte lines, but because it was designed to work with our high-performance vector unit, it’s non-blocking with 2 MSHRs, 16 replay-queue entries, and 17 store-data queue entries. None of these help Dhrystone, which never misses in the caches.
  • When we compared numbers with and without caches, we weren’t sure what ARM left out, so we only removed the SRAM tag and data arrays and left all of the above cache control logic in our core area.
  • The MMU has 8 ITLB and 8 DTLB entries, fully associative, and a hardware page-table walker. Obviously, the hardware page-table walker doesn’t help Dhrystone.
  • The branch prediction hardware is a BTB with 64 entries, a BHT with 128 entries, and a RAS with 2 entries. This amount of branch prediction helps Dhrystone, but helps a lot of other code, too.
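
As a quick sanity check on the cache parameters listed above, the sketch below works out the geometry implied by a 16KB, 2-way set-associative cache with 64-byte lines. The helper name is hypothetical and the code is not taken from the Rocket generator; it is just the standard arithmetic for a set-associative cache with power-of-two parameters.

```python
def cache_geometry(size_bytes: int, ways: int, line_bytes: int) -> tuple[int, int, int]:
    """Return (number of sets, line-offset bits, set-index bits) for a cache."""
    sets, remainder = divmod(size_bytes, ways * line_bytes)
    if remainder != 0:
        raise ValueError("size must be a multiple of ways * line size")
    for value in (line_bytes, sets):
        # Power-of-two check: exactly one bit set.
        if value <= 0 or value & (value - 1):
            raise ValueError("line size and set count must be powers of two")
    offset_bits = line_bytes.bit_length() - 1  # log2(line_bytes)
    index_bits = sets.bit_length() - 1         # log2(sets)
    return sets, offset_bits, index_bits


# 16KB, 2-way, 64-byte lines, as in the instruction and data caches above.
print(cache_geometry(16 * 1024, 2, 64))  # prints "(128, 6, 7)"
```

So each cache has 128 sets, with 6 address bits selecting a byte within a line and 7 bits selecting the set. The tag width is left out because it depends on the physical address width, which is not stated here.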

We made sure to adhere to the guidelines in ARM’s Dhrystone benchmarking methodology when compiling the code.  More specifically, we use the following lines to invoke the compiler:

$ riscv-gcc -c -O2 -fno-inline dhry_1.c
$ riscv-gcc -c -O2 -fno-inline dhry_2.c
$ riscv-gcc -o dhrystone dhry_1.o dhry_2.o

The disassembled benchmark code can be found here.

Our standard C library does include hand-optimized assembly, and does make use of all 64 bits (of course!), but we also did this for functions not used by Dhrystone, as an optimized standard library helps all code.

Like ARM, we didn’t actually fabricate this version of the RTL, but we have fabricated and measured enough variants in different processes to be confident in our layout results.

Overall, we’re pretty sure it’s a reasonable comparison, though we don’t know enough of the details behind ARM’s result to be certain we’re being fair. But you don’t need to take our word for it: you can check out our Rocket core generator and replicate the same experiment on your end!
