STAC Reports: STAC-ML on an NVIDIA A100 in a Supermicro server
First GPU-based STAC-ML project assesses multiple optimizations.
2 February 2023
STAC performed the first STAC-ML™ Markets (Inference) benchmark tests on a stack using a GPU as an accelerator: an NVIDIA A100 GPU in a Supermicro server. The same stack was configured in multiple ways to showcase different optimizations:
- Latency-optimized for the sliding-window benchmark suite (code-named Tacana).
- Throughput-optimized for the fixed-window benchmark suite (code-named Sumaco).
- Throughput-optimized for the Tacana suite (STAC Vault report available to premium subscribers).
STAC-ML Markets (Inference) is the technology benchmark standard for solutions that can be used to run inference on realtime market data. Designed by quants and technologists from some of the world's leading financial firms, the benchmarks test the latency, throughput, realized precision, energy efficiency, and space efficiency of a technology stack across three model sizes and different numbers of model instances (NMI).
The stack submitted by NVIDIA consisted of the STAC-ML Pack for CUDA and cuDNN running on an NVIDIA A100 80GB PCIe GPU in a Supermicro Ultra SuperServer SYS-620U-TNR.
NVIDIA wished to highlight the following results from the configuration that was latency-optimized for the Tacana suite:
- For LSTM_A (the smallest model), the 99th percentile latency was:
  - 35.2 μsec for 1 NMI (STAC-ML.Markets.Inf.T.LSTM_A.1.LAT.v1)
  - 58.8 μsec for 32 NMI (STAC-ML.Markets.Inf.T.LSTM_A.32.LAT.v1)
- For LSTM_B, the 99th percentile latency was:
  - 68.5 μsec for 1 NMI (STAC-ML.Markets.Inf.T.LSTM_B.1.LAT.v1)
  - 149 μsec for 32 NMI (STAC-ML.Markets.Inf.T.LSTM_B.32.LAT.v1)
- For LSTM_C (the largest model), the 99th percentile latency was:
  - 640 μsec for 1 NMI (STAC-ML.Markets.Inf.T.LSTM_C.1.LAT.v1)
  - 748 μsec for 16 NMI (STAC-ML.Markets.Inf.T.LSTM_C.16.LAT.v1)
- Across all LSTM models and NMI tested, the largest outlier was 2.3x the median latency:
  - Median latency 35 μsec, max latency 81.3 μsec
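The 2.3x outlier figure follows directly from the two latencies quoted in the last bullet. As a minimal sketch of that arithmetic (the function name `outlier_ratio` is an illustrative assumption and is not part of the STAC-ML Pack or the report):

```python
def outlier_ratio(max_latency_us: float, median_latency_us: float) -> float:
    """Return the maximum latency expressed as a multiple of the median latency."""
    return max_latency_us / median_latency_us


# Values quoted above: median 35 usec, max 81.3 usec.
ratio = outlier_ratio(81.3, 35.0)
print(f"{ratio:.1f}x")  # prints "2.3x"
```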
NVIDIA also wished to highlight several results from the configuration that was throughput-optimized for the Sumaco suite:
- For LSTM_A (the smallest model), across all NMI tested:
  - Total throughput was between 1.629M and 1.707M inf/sec
  - Energy efficiency was between 1.724M and 1.798M inf/sec/kW
- For LSTM_B, across all NMI tested:
  - Total throughput exceeded 190K inf/sec
  - Energy efficiency ranged from 205,948 to 206,072 inf/sec/kW
- For LSTM_C (the largest model), across all NMI tested:
  - Total throughput was 12,800 inf/sec
  - Energy efficiency ranged from 17,726 to 17,747 inf/sec/kW
- Across all LSTM models and NMI tested, the greatest deviation in instance throughput from the median instance throughput was 1.3% (1.618M inf/sec min, 1.639M inf/sec median).
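The 1.3% deviation can be reproduced from the minimum and median throughput quoted in the last bullet. The sketch below assumes the deviation is measured relative to the median; the helper name `pct_deviation_from_median` is hypothetical and not taken from the report:

```python
def pct_deviation_from_median(value: float, median: float) -> float:
    """Return the absolute deviation of value from median, as a percentage of the median."""
    return abs(value - median) / median * 100.0


# Values quoted above: 1.618M inf/sec minimum, 1.639M inf/sec median.
deviation = pct_deviation_from_median(1.618e6, 1.639e6)
print(f"{deviation:.1f}%")  # prints "1.3%"
```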
For details, please see the reports at the links above. Premium subscribers have access to the STAC Vault Report, extensive visualizations of all test results, the micro-detailed configuration information for the solutions tested, the code used in this project, and the ability to run these same benchmarks in the privacy of their own labs. To learn about subscription options, please contact us.