Financial firms use neural network models to analyze time series of market data. These models are then deployed to run on live, real-time data feeds—often within co-located datacenters near exchanges where microseconds matter. Use cases for this type of co-located inference include high-frequency market making, short- and medium-term price prediction, setting dynamic price bands for pre- and post-trade risk checks, and automated hedging and position management. In these environments, firms need to know how well a technology stack can handle machine learning inference workloads under tight latency, throughput, and efficiency constraints. That is the goal of the STAC-ML Markets (Inference) benchmark.
About the Benchmark
Designed by quantitative researchers and technologists from leading global financial firms, the benchmarks test solutions that perform inference using one or more instances of the same model, each operating on a different input stream (for example, applying the same equities prediction model to multiple portfolios). The objective is to measure the upper bound of inference-only performance, isolating it from other parts of a typical production pipeline such as data ingestion, parsing, or feature generation. This approach gives a clear view of the theoretical limits a given configuration could reach when integrated into larger trading systems.
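To make that setup concrete, below is a minimal sketch in PyTorch of one model applied identically to several independent input streams. All sizes and names are illustrative assumptions, not values from the benchmark specification, and batching the streams into a single call is just one way a solution might organize its model instances.

```python
import torch

torch.manual_seed(0)
N_STREAMS, WINDOW, FEATURES, HIDDEN = 4, 64, 32, 128  # illustrative sizes, not the spec's

# One prediction model, applied identically to several input streams
# (e.g., the same equities model over multiple portfolios).
model = torch.nn.LSTM(input_size=FEATURES, hidden_size=HIDDEN, batch_first=True).eval()
streams = torch.randn(N_STREAMS, WINDOW, FEATURES)  # one row per independent stream

with torch.no_grad():
    out, _ = model(streams)        # inference only: no ingestion, parsing, or feature generation
    predictions = out[:, -1, :]    # last-step output for each stream
print(predictions.shape)           # torch.Size([4, 128])
```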
Each benchmark run measures several performance dimensions (a computational sketch follows the list):
- Latency: end-to-end time to produce an inference result
- Throughput: number of inferences per second
- Energy and space efficiency: inferences per joule or per rack unit
- Error: deviation of model outputs from a reference implementation
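As a rough illustration of how these dimensions can be computed from raw measurements, here is a sketch using NumPy. The field names, synthetic inputs, and aggregation details are assumptions for illustration; the official STAC-ML harness defines the actual measurement methodology.

```python
import numpy as np

def summarize_run(latencies_us, outputs, reference, wall_clock_s):
    """Summarize one run. Field names are illustrative, not the STAC-ML report schema."""
    return {
        "latency_99p_us": float(np.percentile(latencies_us, 99)),
        "latency_max_over_median": float(latencies_us.max() / np.median(latencies_us)),
        "throughput_inf_per_s": len(latencies_us) / wall_clock_s,
        # Error: deviation of the solution's outputs from a reference implementation.
        "error_99p": float(np.percentile(np.abs(outputs - reference), 99)),
    }

# Example with synthetic numbers:
rng = np.random.default_rng(0)
lat = rng.lognormal(mean=1.5, sigma=0.2, size=100_000)   # per-inference latency in microseconds
out = rng.normal(size=100_000)                           # model outputs
ref = out + rng.normal(scale=1e-3, size=100_000)         # reference outputs (small deviation)
print(summarize_run(lat, out, ref, wall_clock_s=30.0))
```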
The STAC-ML Markets (Inference) specification defines two distinct but related suites: Sumaco and Tacana.
Sumaco models a use case where inference is triggered by external events. Each full window of data is transmitted as a contiguous block, with no reuse of prior computations. This reflects workloads typically supported by general-purpose inference engines such as ONNX Runtime or TensorFlow.
Tacana models continuous inference on a sliding window, such as inference on every tick or bar of market data. This suite allows reuse of computations and reduced data-transfer overhead, reflecting the custom or optimized inference engines often used in production trading systems.
Both suites use identical data generation, LSTM model definitions, and performance metrics, differing primarily in how data is fed and how state is managed between inference calls.
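The two feeding styles can be contrasted in a simplified PyTorch sketch with hypothetical sizes: the Sumaco-style path recomputes a full window from a fresh state on each event, while the Tacana-style path steps tick by tick with carried LSTM state, one illustrative form of computation reuse (the reuse actually permitted is defined by the benchmark specification, not by this sketch).

```python
import torch

torch.manual_seed(0)
WINDOW, FEATURES, HIDDEN = 64, 32, 128   # illustrative sizes
lstm = torch.nn.LSTM(FEATURES, HIDDEN, batch_first=True).eval()
stream = torch.randn(1, 1_000, FEATURES)  # one synthetic tick stream

with torch.no_grad():
    # Sumaco-style: an external event delivers the full window as one
    # contiguous block; everything is recomputed from a fresh (zero) state.
    window = stream[:, 500:500 + WINDOW, :]
    out_full, _ = lstm(window)
    y_event = out_full[:, -1, :]

    # Tacana-style: inference on every tick of a sliding window; carrying
    # hidden state forward lets each call process only the newest tick.
    state = None
    for t in range(stream.shape[1]):
        y_tick, state = lstm(stream[:, t:t + 1, :], state)
```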
Scope and Metrics
Each system under test (SUT) runs inference across three standard LSTM model sizes—LSTM_A, LSTM_B, and LSTM_C—and with different numbers of model instances (NMIs).
The benchmark focuses on the “inferring period,” a fixed duration following system warm-up, during which latency, throughput, and efficiency metrics are recorded.
For non-cloud systems, efficiency is measured in energy and space terms; for cloud-based systems, efficiency may be expressed as cost per inference.
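For illustration only, the efficiency arithmetic might look like the sketch below. Every number is made up; only the normalizations (per joule, per rack unit, per dollar) come from the benchmark's definitions.

```python
# Illustrative efficiency arithmetic with made-up numbers (not audited results).
inferences = 3_900 * 60          # inferences completed in a 60-second window
energy_joules = 28_000           # energy drawn by the system over the same window
rack_units = 1                   # server height in rack units

energy_eff = inferences / energy_joules   # inferences per joule
space_eff = inferences / rack_units       # inferences per rack unit over the window

# Cloud variant: cost per inference from a hypothetical hourly instance price.
hourly_cost_usd = 5.50
cost_per_inference = hourly_cost_usd / (3_900 * 3600)
print(energy_eff, space_eff, cost_per_inference)
```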
Results Summary
In this audit, STAC tested a configuration featuring an NVIDIA GH200 Grace Hopper Superchip in a Supermicro ARS-111GL-NHR server. The full stack was:
- NVIDIA CUDA Toolkit 12.9
- NVIDIA CUDA Deep Neural Network library (cuDNN) 9.10.2
- Supermicro ARS-111GL-NHR
- NVIDIA GH200 Grace Hopper Superchip
- NVIDIA Grace CPU and NVIDIA Hopper GPU connected via the NVIDIA NVLink Chip-to-Chip (C2C) interconnect at 900 GB/s
- 480GiB of ECC LPDDR5X memory @ 6400MHz
- 96GiB of ECC HBM3 memory
- Ubuntu 22.04.5 LTS Server with HWE kernel
Compared to a previously tested FPGA-based system, the system demonstrated the following:
- For LSTM_A (the smallest model) the 99p latency was between 7% and 20% lower
- 7% lower with 1 NMI (4.70μs vs. 5.07μs)
- 20% lower with 8 NMI (4.67μs vs. 5.97μs)
- The 99p error benchmark was 8 times lower (0.00111 vs. 0.00889)
- For LSTM_B (the medium model) the 99p latency was between 3% higher and 8% lower
- 3% higher with 1 NMI (7.10μs vs. 6.89μs)
- 8% lower with 4 NMI (7.10μs vs. 7.73μs)
- The 99p error benchmark was 12 times lower (0.00102 vs. 0.0127)
- For LSTM_C (the largest model) with 1 NMI:
- The 99p latency was 49% lower (15.8μs vs. 31.0μs)
- The throughput was 15% higher (3,910 vs. 3,387 inferences per second)
- The 99p error benchmark was 13 times lower (0.00172 vs. 0.0237)
- The energy efficiency was 44% higher (8,312 vs. 5,785)
- The largest ratio of maximum to median latency was 9.65 (38.3μs / 3.97μs), occurring at LSTM_A with NMI=8; the smallest was 2.16 (32.2μs / 14.9μs), at LSTM_C with NMI=1
“Supermicro is excited to continue collaborating with STAC to showcase how our optimized servers perform on specific benchmarks. Alongside NVIDIA, we can meet the demanding needs of the Financial Services Industry. Our compact server with the NVIDIA Grace Hopper™ Superchip completed the STAC ML benchmark at record speed with very low latency, outperforming a previous run using an FPGA card. Supermicro remains committed to working closely with the FSI community to enable a new, faster class of applications with lower energy costs and higher performance.” - Vik Malyala, President & Managing Director EMEA, SVP Technology & AI at Supermicro
Accessing the Results
The full benchmark reports are available here.
NVIDIA's coverage of the results from GTC can be found here.

