

Shining a Light on Full FPGA and ASIC Performance

Adam Sherer, Account Technical Executive, Cadence Design Systems

STAC Fall 2022



## Agenda

- Performance in low-latency trading systems
- Simulation-based performance analysis
- Using assertions for performance analysis
- System-level performance analysis
- Call to action



## Shining a Light on Full FPGA and ASIC Performance

- FPGA and ASIC HFT accelerators are architected and coded for functionality and performance
- It works!!! ... but not fast enough in all conditions
- Performance anomalies often found in lab or live
  - Occur due to corner cases and/or system saturation
  - Visibility inside FPGAs is limited for debug
  - For ASIC, analysis must be done pre-silicon
- Verification shines for visibility and analysis
  - Block order of algorithms and data paths
  - Interface data transport efficiency
  - Full-system memory and on-chip bus latency



Source: "The Architecture of HFT System", Tejasvi Shiv, June 7, 2021, FPGAs for Stock Market Trading: https://medium.com/fpgas-for-stock-market-trading/the-architecture-of-hft-system-713e64604a61



## Designing Hardware Blocks for Performance

- Know the order of your algorithms in hardware
  - Individual blocks may run fast enough but become bottlenecks when the system scales
- Maximize combinatorial logic
  - Unroll loops to reduce clocks for computation
  - Pre-calculate invariants to reduce computation in loops and/or data path
  - Create testbench monitors to observe combinatorial performance
  - Add clocking domains to reduce clock skew issues
- Perform all price and time processing in fixed point (integer) notation (assumed to be common practice today)
  - Utilize incremental vs full value calculations to reduce computational area
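The fixed-point and incremental-calculation bullets can be illustrated with a small behavioral model. This is only a sketch in Python, not RTL and not taken from the talk: the 4-decimal tick scale, the "running notional" quantity, and the names `TICK_SCALE`, `to_ticks`, `full_notional`, and `RunningNotional` are all assumptions chosen for illustration.

```python
from decimal import Decimal

TICK_SCALE = 10_000  # assumed scale: 4 implied decimal places


def to_ticks(price_text: str) -> int:
    """Convert a decimal price string to an integer tick count (fixed point)."""
    scaled = Decimal(price_text) * TICK_SCALE
    if scaled != scaled.to_integral_value():
        raise ValueError(f"{price_text} needs more than 4 decimal places")
    return int(scaled)


def full_notional(levels):
    """Full-value calculation: walk every (price_ticks, quantity) level."""
    return sum(price * qty for price, qty in levels)


class RunningNotional:
    """Incremental calculation: adjust the stored total by the delta only."""

    def __init__(self, levels):
        self.levels = list(levels)
        self.total = full_notional(self.levels)

    def update_quantity(self, index: int, new_qty: int) -> None:
        price, old_qty = self.levels[index]
        # One multiply-add per update instead of a walk over all levels.
        self.total += price * (new_qty - old_qty)
        self.levels[index] = (price, new_qty)


levels = [(to_ticks("101.2500"), 300), (to_ticks("101.2600"), 150)]
book = RunningNotional(levels)
book.update_quantity(1, 400)
# The incremental total must agree with a full recalculation.
assert book.total == full_notional(book.levels)
```

In hardware terms, the incremental update corresponds to a single multiply-add on the changed level, while the full recalculation corresponds to a reduction over every level; the integer ticks avoid floating-point units entirely.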



## Analyze Regression for Performance Issues and Opportunities

Modules (blocks): total time per module at line in file

Most Active Modules (behavioral)

| %hits | #hits | #inst | name |
| --- | --- | --- | --- |
| 6.8 | 96670 | 16049 | `worklib.dsp_delay:v (file: /tmp/myproject/dsp/rtl/dsp_delay.v line: 29)` |
| 3.9 | 55715 | 703 | `worklib.gen_mux_ohot:v (file: /tmp/myproject/gen/rtl/gen_mux_ohot.v line: 30)` |
| 3.4 | 49232 | 1163 | `worklib.inv_tim_adj:sv (file: /tmp/myproject/bench/models/inv_tim_adj.sv line: 5)` |
| 3.3 | 47636 | 16952 | `worklib.dsp_convert:v (file: /tmp/myproject/dsp/rtl/dsp_convert.v line: 56)` |
| 3.3 | 47408 | 3704 | `worklib.dsp_add:v (file: /tmp/myproject/dsp/rtl/dsp_add.v line: 64)` |
| 1.9 | 27181 | 1680 | `worklib.gen_counter_ripple_v2:v (file: /tmp/myproject/gen/rtl/gen_counter_ripple_v2` |

Streams (e.g., always statements): total hits per always at line in file

Stream Counts (1428894 hits total)

| %hits | #hits | #inst | name |
| --- | --- | --- | --- |
| 4.0 | 57416 | 16049 | `Always stmt (file: /tmp/myproject/dsp/rtl/dsp_delay.v, line: 65 in worklib.dsp_delay [module` |
| 3.8 | 53586 | 703 | `Continuous Assignment (file: /tmp/myproject/gen/rtl/gen_mux_ohot.v, line: 53 in worklib.ger` |
| 2.0 | 29182 | 16952 | `Always stmt (file: /tmp/myproject/dsp/rtl/dsp_convert.v, line: 326 in worklib.dsp_convert [m` |
| 1.9 | 27688 | 16049 | `Continuous Assignment (file: /tmp/myproject/dsp/rtl/dsp_delay.v, line: 76 in worklib.dsp_de` |
| 1.8 | 25784 | 3704 | `Continuous Assignment (file: /tmp/myproject/dsp/rtl/dsp_add.v, line: 184 in worklib.dsp_add.` |

- Performance issues may only manifest in regression
- Most active code (hit rate) may be different for different tests
- Target most active code to optimize performance
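Merging per-test profiles to find the most active code across a regression can be sketched as follows. This is a minimal Python illustration that assumes each report contains lines in the `%hits #hits [#inst] name` layout of the profile excerpt above; the helper names `parse_hits` and `merge_profiles` and the second test's numbers are hypothetical, and a real profiler may offer its own merge facility.

```python
import re
from collections import Counter

# Matches lines laid out as: %hits  #hits  [#inst]  name
PROFILE_LINE = re.compile(r"^\s*(\d+(?:\.\d+)?)\s+(\d+)\s+\[\s*(\d+)\]\s+(\S+)")


def parse_hits(report_text: str) -> Counter:
    """Return {name: hit count} for every line that matches the layout."""
    hits = Counter()
    for line in report_text.splitlines():
        match = PROFILE_LINE.match(line)
        if match:
            hits[match.group(4)] += int(match.group(2))
    return hits


def merge_profiles(reports) -> Counter:
    """Sum hit counts across the reports of several tests."""
    merged = Counter()
    for text in reports:
        merged.update(parse_hits(text))
    return merged


# Hypothetical two-test regression; test_a reuses counts from the excerpt,
# test_b is invented to show a test with a different hot spot.
test_a = """
6.8 96670 [16049] worklib.dsp_delay:v
3.9 55715 [  703] worklib.gen_mux_ohot:v
"""
test_b = """
5.0 1000 [16049] worklib.dsp_delay:v
9.0 1800 [ 3704] worklib.dsp_add:v
"""
merged = merge_profiles([test_a, test_b])
for name, count in merged.most_common(3):
    print(f"{count:>8} {name}")
```

Ranking by merged hit count rather than per-test percentage keeps a short test with an unusual hot spot from hiding the code that dominates the regression as a whole.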





Merged profiles



## Dig Deeper with Single-Run Profiling

- Rerun tests for instance-level performance details
- Determine if the issue is in all or individual instances
- Modify code, reprofile, rerun regression



Examine instances vs. modules



## Add Assertions to Monitor and Prove Performance

- Shine the light on critical paths that must complete within a max clock count
  - Critical datapaths
  - Resource request/grant combinations
  - Asynchronous requests that delay the critical path
- Shine the light on streaming invariants
  - Resource must never livelock/deadlock
- Simulation is good, formal is better
  - Simulation depends on stimulus for all cases
  - Formal proves all cases without stimulus

`$rose(req) |-> ##[M:N] $rose(ack)`

The rising edge of "ack" must occur between M and N clocks after the rising edge of "req".
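The req/ack latency requirement can also be checked offline against recorded traces, for example when replaying a suspected lab sequence. The sketch below is a simplified Python model, not SystemVerilog assertion semantics: it assumes one sample per clock for each signal, and the function names and the failed/pending split are illustrative choices.

```python
def rising_edges(signal):
    """Cycles where the sampled value goes 0 -> 1 (cycle 0 has no previous sample)."""
    return {t for t in range(1, len(signal)) if signal[t - 1] == 0 and signal[t] == 1}


def check_latency(req, ack, min_clocks, max_clocks):
    """Check each req rise for an ack rise min_clocks..max_clocks cycles later.

    Returns (failed, pending): req cycles whose window closed without an ack
    rise, and req cycles whose window runs past the end of the trace.
    """
    if not 0 <= min_clocks <= max_clocks:
        raise ValueError("expected 0 <= min_clocks <= max_clocks")
    ack_rises = rising_edges(ack)
    failed, pending = [], []
    for t in sorted(rising_edges(req)):
        window = range(t + min_clocks, t + max_clocks + 1)
        if any(cycle in ack_rises for cycle in window):
            continue
        if t + max_clocks >= len(ack):
            pending.append(t)  # trace ended before the window closed
        else:
            failed.append(t)
    return failed, pending


# req rises at cycles 1 and 6; ack rises at cycles 3 and 11.
req = [0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0]
ack = [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1]
print(check_latency(req, ack, 1, 3))  # second request misses its 1..3 window
print(check_latency(req, ack, 1, 6))  # widening the window lets both pass
```

A checker like this only covers the stimulus that was recorded, which is the limitation the "formal is better" bullet points to.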

Best done with advanced lint checking



## Examine Data Transport Efficiency Between System Layers

- Layering the system distributes tasks among multiple designers
- Interfaces assure clear communication among teams
- Inefficiency can occur if sequential layers recreate intermediate calculations
- Refactor interfaces if intermediate calculations are shared



## Memory Subsystem Performance Challenges: Degradation Causes

#### **System**

- Inefficient ID reuse resulting in ID collisions
- Traffic causes read-modify-write memory access
- Load balancing
- Quality of Service (QoS)
- Cacheable and Bufferable transaction attributes (posted / non-posted traffic)



## Interconnect Performance Challenges: Degradation Causes



## System Performance Analysis Approach (1 of 3): Memory DDR utilization

- DDR utilization is shown with page hit/miss indication
- The max utilization line indicates the system potential at the current clock frequency
- The table of DDR commands lists all commands in the current time window



## System Performance Analysis Approach (2 of 3): On-chip bus over-time analysis

- Quickly understand the relationship between:
  - Bandwidth over time
  - Latency over time
  - Outstanding transactions over time
- Allows bottlenecks to be identified and investigated
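The three over-time views can be derived from a transaction log. The following Python sketch assumes each logged transaction carries a start cycle, an end cycle, and a byte count; the `Txn` record, the fixed window size, and the definition of "in flight" (any overlap with the window) are assumptions for illustration rather than the behavior of any particular analysis tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Txn:
    start: int   # request issue time, in clock cycles
    end: int     # response completion time, in clock cycles
    nbytes: int  # payload size in bytes


def window_stats(txns, window, horizon):
    """Per-window bytes completed, mean latency, and transactions in flight."""
    rows = []
    for w0 in range(0, horizon, window):
        w1 = w0 + window
        done = [t for t in txns if w0 <= t.end < w1]
        in_flight = sum(1 for t in txns if t.start < w1 and t.end > w0)
        mean_latency = sum(t.end - t.start for t in done) / len(done) if done else None
        rows.append({
            "window_start": w0,
            "bytes": sum(t.nbytes for t in done),   # bandwidth = bytes / window
            "mean_latency": mean_latency,
            "in_flight": in_flight,
        })
    return rows


# Hypothetical log: the third transaction is slow and stays outstanding.
log = [Txn(0, 8, 64), Txn(2, 9, 64), Txn(5, 31, 64), Txn(12, 18, 32)]
stats = window_stats(log, window=10, horizon=40)
for row in stats:
    print(row)
```

Reading the three columns side by side is the point of the view: a window with low bytes but a nonzero in-flight count marks a stalled transaction worth investigating.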



## System Performance Analysis Approach (3 of 3): On-chip bus latency analysis

- Quickly identify outlier transactions with high latency and investigate the time period when they occur

In all three analysis examples, UVM or similar testbench messaging and/or assertions/checkers should be used to identify suspicious tests for analysis.
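Outlier identification can be sketched with a simple threshold on logged latencies. In this Python example the rule "slower than a multiple of the median" and the default factor of 3 are assumptions chosen for illustration; a real flow would pick a threshold from the design's latency budget.

```python
from statistics import median


def latency_outliers(txns, factor=3.0):
    """Return (start, end, latency) for transactions slower than factor x median."""
    latencies = [end - start for start, end in txns]
    if not latencies:
        return []
    limit = factor * median(latencies)
    return [(start, end, end - start) for start, end in txns if end - start > limit]


def period_to_investigate(outliers):
    """Smallest time span covering every flagged transaction, or None."""
    if not outliers:
        return None
    return min(o[0] for o in outliers), max(o[1] for o in outliers)


# Hypothetical (start, end) pairs in clock cycles.
log = [(0, 8), (2, 9), (5, 31), (12, 18), (20, 27)]
slow = latency_outliers(log)
print(slow, period_to_investigate(slow))
```

The returned time span is the window to open in the waveform or bus-analysis view when looking for the cause.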



## Cadence: Your ASIC/FPGA Partner

- Verification solution: Apply objective analysis to improve FPGA and prepare for ASIC
- ASIC solution: Broadly adopted in high-speed comms and mission-critical applications
- Tensilica® IP: Proven processor technology used in autonomous drive and other high-reliability apps
- High-performance IP: Proven in leading comms systems
- Services: Expert RTL to GDS design services

Digital Design to Implementation





#### Cadence IP Solutions

Silicon-proven in advanced nodes



Cadence offers extensive Design IP, Verification IP (VIP), Tensilica® IP, and memory models to ensure complex SoC designs work correctly on first pass

## Call to Action: Shine A Strong Light on Performance

- Data, data, data!
  - Performance analysis requires simulation cycles to generate data...
  - ... except where formal analysis can be applied
- Profile, profile, profile!
  - Functionality and performance go together for HFT
  - Functionality without performance can be a competitive disadvantage
  - Performance without functionality can be an even higher risk!

- Start simple and build up
  - Add assertions: these will help debug as you rerun suspected lab sequences in simulation
  - Make profiling a design review task: ask designers to comment on performance analysis
  - Build up to formal methods and system performance analysis, especially for ASIC!




© 2022 Cadence Design Systems, Inc. All rights reserved worldwide. Cadence, the Cadence logo, and the other Cadence marks found at https://www.cadence.com/go/trademarks are trademarks or registered trademarks of Cadence Design Systems, Inc. All Arm products are registered trademarks or trademarks of Arm Limited (or its subsidiaries) in the US and/or elsewhere. All MIPI specifications are registered trademarks or trademarks or service marks owned by MIPI Alliance. All PCI-SIG specifications are registered trademarks or trademarks of PCI-SIG. All other trademarks are the property of their respective owners.