



# Post-Silicon Debug, Analytics and Trace in a Holistic SoC

DAC-2018 San Francisco, RISC-V Foundation Booth





- Overview
- Holistic SoC
- Architectural Overview
- Example Scenarios
- RISC-V Processor Branch Trace
- Summary





- Founded 2009
- VC-funded start-up
  - \$7M funding round in 2017
  - New Chairman from October 2017: Alberto Sangiovanni-Vincentelli
- Headquarters in Cambridge, UK
- 44 patents
- ~35 employees
- Seasoned management team
- Key partners & ecosystem
- Proven technology and product-market fit

14 May 2018









### A coherent architecture to debug, develop, optimize & secure

- Full SoC visibility, hardware and software, at system level
- Supports "all" processor architectures: freedom of IP selection
- Real-time & non-intrusive
- Advanced analytics & forensics
- Power/performance optimization
- "In-life" analytics & SLA compliance
- Supports Functional Safety
- Supports Bare Metal Security™
- High-speed debug over USB or SerDes

**On-chip Analytics and Debug via RTL IP and SW** 














# ho·lis·tic /hōˈlistik/

adjective PHILOSOPHY

characterized by comprehension of the parts of something as intimately interconnected and explicable only by reference to the whole.




"The first true SOC appeared in a Microma watch in 1974 when Peter Stoll integrated the LCD driver transistors as well as the timing functions onto a single Intel 5810 CMOS chip."

http://www.computerhistory.org





## Modern SoC - Nvidia Carmel





05/07/2018

256-bit LPDDR4, 137 GB/s










### SoC Observability has Always Been an Issue























### Software Tools for Data-Driven Analysis





#### Third-Party Tool Vendor Partnerships

UltraDevelop interfaces with almost all common validation and verification solutions.












## RISC-V Processor Branch Trace







- Detailed presentation at: https://riscv.org/2018/05/risc-v-workshop-in-barcelona-proceedings/
- In complex systems, understanding program behaviour under real-time conditions is not easy
- Software and firmware rarely behave as planned, because of interactions with other cores' software, peripherals, real-time events, poor implementation, ...
- Using a debugger is not always possible, since halting or stepping the processor can affect real-time behaviour
- Providing real-time visibility of program execution is important
- One method of achieving this is via Processor Branch Trace
  - Encoder
  - Filtering and triggering schemes





- Track execution from a known address and send messages about the deltas taken by the program
- Deltas result from *jump, call, return* and *branch* type instructions, interrupts and exceptions
- RISC-V instructions are either executed unconditionally or their execution can be determined from the program; instructions between deltas are assumed to execute sequentially
- There is therefore no need to report those instructions in the trace, only whether each branch was taken or not, and the target address of taken indirect branches or jumps
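The encoding idea above can be sketched in Python as a minimal illustration under simplifying assumptions; it is not the UltraSoC encoder and does not follow the packet format of any trace specification. The program image is a hypothetical dictionary mapping each PC to a coarse instruction class (`seq`, conditional `branch`, direct `jal`, indirect `jalr`), and the packet kinds `start`, `addr` and `stop`, like the names `Instr`, `Packet`, `encode` and `decode`, are made up for illustration. The encoder records one taken/not-taken bit per conditional branch and emits an address only when an indirect jump makes the next PC unpredictable; the decoder replays the program image to recover every executed instruction.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Instr:
    kind: str                     # "seq", "branch" (conditional), "jal" (direct), "jalr" (indirect)
    size: int = 4                 # instruction length in bytes
    target: Optional[int] = None  # static target of a "branch" or "jal"


@dataclass
class Packet:
    kind: str                     # hypothetical packet types: "start", "addr", "stop"
    branches: List[bool] = field(default_factory=list)  # taken bits since the last packet
    addr: int = 0


def encode(program: Dict[int, Instr], retired: List[int]) -> List[Packet]:
    """Compress a sequence of retired PCs into branch bits plus a few addresses."""
    packets = [Packet("start", addr=retired[0])]
    bits: List[bool] = []
    for pc, next_pc in zip(retired, retired[1:]):
        ins = program[pc]
        if ins.kind == "branch":
            bits.append(next_pc == ins.target)  # taken / not taken
        elif ins.kind == "jalr":
            # The target of an indirect jump cannot be inferred from the binary.
            packets.append(Packet("addr", bits, next_pc))
            bits = []
        # "seq" and "jal" are fully predictable from the program image.
    packets.append(Packet("stop", bits, retired[-1]))  # last instruction executed
    return packets


def decode(program: Dict[int, Instr], packets: List[Packet]) -> List[int]:
    """Replay the program image, consuming branch bits and reported addresses."""
    pc = packets[0].addr
    path = [pc]
    for pkt in packets[1:]:
        i = 0  # next unused branch bit in this packet
        while not (pkt.kind == "stop" and pc == pkt.addr and i == len(pkt.branches)):
            ins = program[pc]
            if ins.kind == "branch":
                pc = ins.target if pkt.branches[i] else pc + ins.size
                i += 1
            elif ins.kind == "jal":
                pc = ins.target
            elif ins.kind == "jalr":
                pc = pkt.addr  # indirect target comes from the packet
                path.append(pc)
                break
            else:
                pc += ins.size
            path.append(pc)
    return path


# Hypothetical program: a loop that runs three times, then a call and a return.
program = {
    0x100: Instr("seq"),                   # loop body
    0x104: Instr("branch", target=0x100),  # loop back while taken
    0x108: Instr("jal", target=0x200),     # call
    0x10C: Instr("seq"),
    0x200: Instr("seq"),                   # callee
    0x204: Instr("jalr"),                  # return (indirect jump)
}
retired = [0x100, 0x104, 0x100, 0x104, 0x100, 0x104, 0x108, 0x200, 0x204, 0x10C]
packets = encode(program, retired)
assert decode(program, packets) == retired
print(len(packets), packets[1].branches)  # 3 [True, True, False]
```

Ten retired instructions collapse into three packets carrying three branch bits and two addresses, which is the effect the efficiency figures later in this section quantify. The decoder's stop condition assumes the final instruction is not re-reached through a branch-free loop; a real format would need to resolve that case explicitly.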





- Interrupts can occur asynchronously to a program's execution
- Exceptions can be thought of in the same way
- The decoder generally cannot tell where in the instruction sequence an interrupt occurred, so the trace encoder reports that address together with an indication of the asynchronous destination
- When an interrupt or exception occurs, or the processor is halted, the final instruction executed must be traced
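The reporting rule can be sketched as follows, assuming a hypothetical per-instruction record that marks whether a trap was taken right after that instruction. `Retired`, `TrapPacket`, `trap_packets` and the packet kinds `trap` and `halt` are illustrative names only, not part of any specification. Without these packets, a decoder that only has the program image could not tell at which instruction the sequential run was cut short.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Retired:
    pc: int
    trap_dest: Optional[int] = None  # set if a trap was taken right after this instruction
    interrupt: bool = False          # True for an asynchronous interrupt, False for an exception


@dataclass(frozen=True)
class TrapPacket:
    kind: str                        # hypothetical packet types: "trap" or "halt"
    last_pc: int                     # final instruction executed before the event
    dest: Optional[int] = None       # where execution resumed (trap vector)
    interrupt: bool = False


def trap_packets(stream: List[Retired], halted: bool) -> List[TrapPacket]:
    """Report every trap, plus the last instruction executed before a halt."""
    out: List[TrapPacket] = []
    for r in stream:
        if r.trap_dest is not None:
            # The decoder cannot predict where an asynchronous event lands in the
            # instruction sequence, so report the last retired PC and the destination.
            out.append(TrapPacket("trap", r.pc, r.trap_dest, r.interrupt))
    if halted and stream:
        out.append(TrapPacket("halt", stream[-1].pc))
    return out


stream = [
    Retired(0x100),
    Retired(0x104, trap_dest=0x8000, interrupt=True),  # e.g. a timer interrupt
    Retired(0x8000),
    Retired(0x8004),
]
for pkt in trap_packets(stream, halted=True):
    print(pkt)
```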





- Controlling when trace is generated is critical for reducing the volume of trace data
- Filters are required
- Filters enable trace based on: address range, privilege level, and interrupt service routines
- Other examples
  - Trace for a fixed period of time
  - Start trace when an external (to the encoder) event is detected
  - Stop trace when an external (to the encoder) event is detected
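The filter and trigger options listed above could be combined roughly as in the following hypothetical software model. The class `TraceFilter`, its fields and the `"start"`/`"stop"` event names are assumptions for illustration; a real encoder would evaluate such conditions in hardware on every retired instruction. Privilege levels use the RISC-V mode letters U, S and M.

```python
from dataclasses import dataclass, field
from typing import FrozenSet, Optional, Tuple


@dataclass
class TraceFilter:
    addr_range: Optional[Tuple[int, int]] = None             # half-open [lo, hi) PC window
    privileges: FrozenSet[str] = frozenset({"U", "S", "M"})  # RISC-V modes to trace
    trace_isr: bool = True                                   # trace inside interrupt handlers?
    max_cycles: Optional[int] = None                         # fixed trace period after a start trigger
    active: bool = field(default=False, init=False)
    start_cycle: int = field(default=0, init=False)

    def external_event(self, event: str, cycle: int) -> None:
        """Start/stop triggers raised outside the encoder (event names are hypothetical)."""
        if event == "start":
            self.active, self.start_cycle = True, cycle
        elif event == "stop":
            self.active = False

    def enabled(self, pc: int, priv: str, in_isr: bool, cycle: int) -> bool:
        """Decide whether the instruction retired at `cycle` should be traced."""
        if not self.active:
            return False
        if self.max_cycles is not None and cycle - self.start_cycle >= self.max_cycles:
            self.active = False  # the fixed trace window has expired
            return False
        if self.addr_range is not None:
            lo, hi = self.addr_range
            if not (lo <= pc < hi):
                return False
        if priv not in self.privileges:
            return False
        return self.trace_isr or not in_isr


# Example: trace only user-mode code in [0x1000, 0x2000), skip ISRs, for 100 cycles.
f = TraceFilter(addr_range=(0x1000, 0x2000), privileges=frozenset({"U"}),
                trace_isr=False, max_cycles=100)
f.external_event("start", cycle=10)
print(f.enabled(pc=0x1000, priv="U", in_isr=False, cycle=20))   # True
print(f.enabled(pc=0x1000, priv="U", in_isr=False, cycle=110))  # False: window expired
```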





| Benchmark   | Instructions | Packets | Payload (bytes) | Bits per instruction |
|-------------|-------------:|--------:|----------------:|---------------------:|
| dhrystone   | 215015       | 1308    | 5628            | 0.209                |
| hello_world | 325246       | 2789    | 10642           | 0.262                |
| median      | 15015        | 207     | 810             | 0.432                |
| mm          | 297038       | 644     | 2011            | 0.054                |
| mt-matmul   | 41454        | 344     | 953             | 0.184                |
| mt-vvadd    | 61072        | 759     | 2049            | 0.268                |
| multiply    | 55016        | 546     | 1837            | 0.267                |
| pmp         | 425          | 7       | 39              | 0.734                |
| qsort       | 235015       | 2052    | 8951            | 0.305                |
| rsort       | 375016       | 683     | 2077            | 0.044                |
| spmv        | 70015        | 254     | 1154            | 0.132                |
| towers      | 15016        | 72      | 237             | 0.126                |
| vvadd       | 10016        | 111     | 316             | 0.252                |
| **Mean**    |              |         |                 | **0.252**            |

- Validated in FPGA, soon in silicon
- The table shows the encoding efficiency of the algorithm
- Figures exclude any overhead for encapsulating the trace into messages or routing it
- Different program types will have different overheads
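The last column of the table follows directly from the two before it: bits per instruction = payload bytes × 8 / instructions. The short script below recomputes it from the table's figures and confirms that the Mean row is the unweighted average of the per-benchmark values; the names `RESULTS` and `bits_per_instruction` are introduced here for illustration only.

```python
# (benchmark, instructions, payload bytes), copied from the table above
RESULTS = [
    ("dhrystone", 215015, 5628),
    ("hello_world", 325246, 10642),
    ("median", 15015, 810),
    ("mm", 297038, 2011),
    ("mt-matmul", 41454, 953),
    ("mt-vvadd", 61072, 2049),
    ("multiply", 55016, 1837),
    ("pmp", 425, 39),
    ("qsort", 235015, 8951),
    ("rsort", 375016, 2077),
    ("spmv", 70015, 1154),
    ("towers", 15016, 237),
    ("vvadd", 10016, 316),
]


def bits_per_instruction(instructions: int, payload_bytes: int) -> float:
    """Payload size in bits divided by the number of instructions traced."""
    return payload_bytes * 8 / instructions


per_bench = {name: bits_per_instruction(n, b) for name, n, b in RESULTS}
mean = sum(per_bench.values()) / len(per_bench)  # unweighted, as in the table
aggregate = (sum(b for _, _, b in RESULTS) * 8
             / sum(n for _, n, _ in RESULTS))    # weighted by instruction count

for name, bpi in per_bench.items():
    print(f"{name:<12} {bpi:.3f}")
print(f"{'mean':<12} {mean:.3f}")            # 0.252
print(f"{'aggregate':<12} {aggregate:.3f}")  # 0.171
```

Weighting by instruction count instead (total payload bits divided by total instructions) gives about 0.171 bits per instruction, because the unweighted mean gives short runs such as pmp (0.734 bits per instruction over only 425 instructions) the same weight as long ones.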





- Overview
- Holistic SoC
- Architectural Overview
- Example Scenarios
- Processor Branch Trace
- Trace Encoder Interface
- Summary







- UltraSoC has been a member of the RISC-V Foundation since 2016
  - Leadership role in Debug Working Group
  - Trace specification, announced @ DAC 54, offered as open standard spec
- Industry's first and only commercial RISC-V debug & monitoring products
  - Including trace & run control
  - Scalable from lightweight IoT to heterogeneous multicore designs
- Publicly endorsed by major processor and tools vendors
  - Andes, Baysand, Codasip, Esperanto, Imperas, Lauterbach, Microsemi, Roa Logic, SiFive, Syntacore
- FPGA demonstrator available, IDE integrated
- Silicon proven




- Zynq FPGA platform
  - RV32 RISC-V
  - Custom logic
  - 2× ARM Cortex-A9
  - UltraSoC architecture
- Demo shows:
  - Trace Encoder configuration
  - RISC-V run control
  - Capture of instruction trace
  - Disassembly of instruction trace
  - Bus state
  - Performance histogram







# Contact Details

### Randy Fish

randy.fish@ultrasoc.com www.ultrasoc.com @UltraSoC







IF YOU ASK ME A QUESTION I DON'T KNOW, I'M NOT GOING TO ANSWER IT.



Yogi Berra, baseball manager (born 1925)





# Backup



### **Commercial in Confidence**







#### LEAKED LIST OF MAJOR 2018 SECURITY VULNERABILITIES

- CVE-2018-?????: APPLE PRODUCTS CRASH WHEN DISPLAYING CERTAIN TELUGU OR BENGALI LETTER COMBINATIONS.
- CVE-2018-?????: AN ATTACKER CAN USE A TIMING ATTACK TO EXPLOIT A RACE CONDITION IN GARBAGE COLLECTION TO EXTRACT A LIMITED NUMBER OF BITS FROM THE WIKIPEDIA ARTICLE ON CLAUDE SHANNON.
- CVE-2018-????: AT THE CAFE ON THIRD STREET, THE POST-IT NOTE WITH THE WIFI PASSWORD IS VISIBLE FROM THE SIDEWALK.
- CVE-2018-?????: A REMOTE ATTACKER CAN INJECT ARBITRARY TEXT INTO PUBLIC-FACING PAGES VIA THE COMMENTS BOX.
- CVE-2018-?????: MYSQL SERVER 5.5.45 SECRETLY RUNS TWO PARALLEL DATABASES FOR PEOPLE WHO SAY "S-Q-L" AND "SEQUEL".
- CVE-2018-?????: A FLAW IN SOME X86 CPUS COULD ALLOW A ROOT USER TO DE-ESCALATE TO NORMAL ACCOUNT PRIVILEGES.
- CVE-2018-????: APPLE PRODUCTS CATCH FIRE WHEN DISPLAYING EMOJI WITH DIACRITICS.
- CVE-2018-????: AN OVERSIGHT IN THE RULES ALLOWS A DOG TO JOIN A BASKETBALL TEAM.
- CVE-2018-????: HASKELL ISN'T SIDE-EFFECT-FREE AFTER ALL; THE EFFECTS ARE ALL JUST CONCENTRATED IN THIS ONE COMPUTER IN MISSOURI THAT NO ONE'S CHECKED ON IN A WHILE.
- CVE-2018-????: NOBODY REALLY KNOWS HOW HYPERVISORS WORK.
- CVE-2018-?????: CRITICAL: UNDER LINUX 3.14.8 ON SYSTEM/390 IN A UTC+14 TIME ZONE, A LOCAL USER COULD POTENTIALLY USE A BUFFER OVERFLOW TO CHANGE ANOTHER USER'S DEFAULT SYSTEM CLOCK FROM 12-HOUR TO 24-HOUR.
- CVE-2018-????: X86 HAS WAY TOO MANY INSTRUCTIONS.
- CVE-2018-?????: NUMPY 1.8.0 CAN FACTOR PRIMES IN O(LOG N) TIME AND MUST BE QUIETLY DEPRECATED BEFORE ANYONE NOTICES.
- CVE-2018-?????: APPLE PRODUCTS GRANT REMOTE ACCESS IF YOU SEND THEM WORDS THAT BREAK THE "I BEFORE E" RULE.
- CVE-2018-????: SKYLAKE X86 CHIPS CAN BE PRIED FROM THEIR SOCKETS USING CERTAIN FLATHEAD SCREWDRIVERS.
- CVE-2018-?????: APPARENTLY LINUS TORVALDS CAN BE BRIBED PRETTY EASILY.
- CVE-2018-?????: AN ATTACKER CAN EXECUTE MALICIOUS CODE ON THEIR OWN MACHINE AND NO ONE CAN STOP THEM.
- CVE-2018-?????: APPLE PRODUCTS EXECUTE ANY CODE PRINTED OVER A PHOTO OF A DOG WITH A SADDLE AND A BABY RIDING IT.
- CVE-2018-????: UNDER RARE CIRCUMSTANCES, A FLAW IN SOME VERSIONS OF WINDOWS COULD ALLOW FLASH TO BE INSTALLED.
- CVE-2018-?????: TURNS OUT THE CLOUD IS JUST OTHER PEOPLE'S COMPUTERS.
- CVE-2018-?????: A FLAW IN MITRE'S CVE DATABASE ALLOWS ARBITRARY CODE INSERTION.











#### Safety, Security, Performance

Figure: hardware "stuck pixel" detection (safety), hardware-based attack detection (security) and run-time server optimization (performance), with a plot of latency anomalies on cpu0, cpu1 and cpu2 over time (µs).

- Non-intrusive: No performance impact or "warning"
- Hardware: Fast, reacting at hardware timescales; invisible to software
- Visibility: Analyze software and the system anywhere in the SoC, and see any problem



UltraSoC IDE

Screenshot: decoded trace showing source code and assembly, alongside the RISC-V debug session (gdb and OpenOCD), AXI bus monitor traffic and the raw trace packets (full_delta packets with branch maps and addresses, and full_addr_only packets).







- Clearly stated messages on each slide
- NetSpeed integration: show block diagram before/after
- Long tail: which tail?
- The challenge
  - Heterogeneous, multicore, DSP, ... DSA, buses, data rates and widths
  - Debug of bits → transactions → ???
  - Show a simple slide of the history and future of analytics
  - Maybe a hierarchy where analytics consists of debug?
  - The challenge is debugging FW and SW in a complex SoC, or even a multi-die, multi-chip solution
  - Tomorrow's workloads are unknown and cannot be predicted, so systems need to be optimized with real data, i.e. tuned with in-silicon analytics
- What we offer
  - Show a blow-up of a few critical IP blocks and highlight configurability at compile time and run time
  - Cross triggering is reconfigurable at run time (need slides for cross triggering)
- Highlight key words graphically: non-intrusive, heterogeneous, ...






- Challenges change over time
  - Single core
  - Multicore
  - Heterogeneous
  - Custom accelerators
  - Memory throughput
  - You cannot do all the analysis off-chip
    - Trace got there a long time ago
- SoCs and FPGAs
- Sim, Em, Pro, Post-Si, In-Field






- Visibility of the whole SoC is critical, maybe across multiple SoCs
- Build time vs. run time: what can be done at each
- Analytics is debug++: system level, not just a processor
- Analytics are critical for post-silicon optimization
- Workloads are unknown, but systems are tuned to a workload during design
- Demo slide





# IF THE WORLD WAS PERFECT, IT WOULDN'T BE.


Yogi Berra, baseball manager (born 1925)






# Intel

### EM-51 8051 EMULATION BOARD

- Emulates 8051/8751/8031 functions on a 2.75" × 5.25" board assembly
- Replaces the 8051/8751/8031 in prototype systems

- Plugs directly into 8051/8751/8031 sockets
- Includes a 2732A EPROM device for program memory

The EM-51 emulation board is a small, ready-to-use microcomputer assembly that replaces an Intel 8051 family single-chip microcomputer in a prototype system. EM-51 includes sockets for 2716 or 2732 EPROMs, which substitute for the 8051/8751 on-chip program memory during prototype development. An Intel 2732A 4K x 8 EPROM is included with the board. With the memory in place, an EM-51 board becomes a full functional and electrical equivalent of the 8051 or 8751 microcomputer.



