Posted April 23, 2026
STAC Report

STAC-AI™ LANG6 on Supermicro SuperServer with NVIDIA RTX PRO 6000 Blackwell Series GPUs

Reports

Detailed configuration information is now available to eligible members at the link above. To learn more about subscription options, please contact us.

STAC recently completed a STAC-AI™ LANG6 (Inference-only) benchmark audit on a Supermicro SYS-222C-TN server hosting 2x NVIDIA RTX PRO 6000 GPUs managed by Red Hat OpenShift.

 

Stack Under Test (SUT):

  • STAC-AI™ LANG6 (Inference-Only) Pack for NVIDIA TensorRT-LLM (Rev D)
  • NVIDIA TensorRT-LLM 1.2.0rc2 with PyTorch backend
  • NVIDIA TensorRT 10.13.3.9
  • NVIDIA Model Optimizer (nvidia-modelopt) 0.37.0 for NVFP4 quantization
  • PyTorch 2.9.0a0 (NVIDIA PyTorch container 25.10)
  • Red Hat Enterprise Linux CoreOS 9.6
  • Red Hat OpenShift Container Platform 4.20
  • Supermicro SuperServer SYS-222C-TN (2U CloudDC with DC-MHS)
    • 32 x 64GiB DDR5 DIMMs @ 5200 MT/s (2TiB total)
    • 2 x Intel® Xeon® 6730P CPUs
  • 2x NVIDIA RTX PRO 6000 Blackwell Series GPUs, each with 96GiB of memory

STAC-AI is the industry benchmark for LLM inference on capital markets data, measuring latency, throughput, energy efficiency, and space efficiency across multiple model sizes.

 

Highlights from the audit:

EDGAR4a Batch mode
• The system achieved 32.9 inferences/s and 5,549 words/s on Llama-3.1-8B EDGAR4a

EDGAR4a Interactive mode
• The system achieved a 4.00x increase in arrival rate, from 7.50 to 30.0 inferences/s
• At 30.0 inferences/s, the system still operated at about 91% of the 32.9 inferences/s batch-mode rate

EDGAR5a Batch mode
• The system achieved 0.345 inferences/s and 139 words/s on Llama-3.1-8B EDGAR5a

EDGAR5a Interactive mode
• The system achieved a 4.00x increase in arrival rate, from 0.0800 to 0.320 inferences/s
• At 0.320 inferences/s, the system still operated at about 93% of the 0.345 inferences/s batch-mode rate

EDGAR4b Batch mode
• The system achieved 5.28 inferences/s and 834 words/s on Llama-3.1-70B EDGAR4b

EDGAR4b Interactive mode
• The system achieved a 4.00x increase in arrival rate, from 1.25 to 5.00 inferences/s
• At 5.00 inferences/s, the system still operated at about 95% of the 5.28 inferences/s batch-mode rate
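
The ratios quoted in the highlights can be reproduced from the published figures. The following is a minimal sketch, not part of the STAC tooling: the names HIGHLIGHTS, derive, and SUMMARY are illustrative, and the words-per-inference figure is a derived quantity that does not appear in the report. It assumes words/s and inferences/s were measured over the same batch-mode run.

```python
# Figures copied from the highlights above, per workload:
# (batch inferences/s, batch words/s, lowest arrival rate, highest arrival rate).
HIGHLIGHTS = {
    "EDGAR4a (Llama-3.1-8B)": (32.9, 5549.0, 7.50, 30.0),
    "EDGAR5a (Llama-3.1-8B)": (0.345, 139.0, 0.0800, 0.320),
    "EDGAR4b (Llama-3.1-70B)": (5.28, 834.0, 1.25, 5.00),
}


def derive(batch_rate, batch_words, low_rate, high_rate):
    """Return the ratios implied by one workload's published figures."""
    return {
        # Ratio of the highest to the lowest interactive arrival rate.
        "arrival_rate_increase": high_rate / low_rate,
        # Highest interactive arrival rate as a percentage of batch throughput.
        "pct_of_batch_rate": 100.0 * high_rate / batch_rate,
        # Derived, not reported: average words per inference in batch mode.
        "words_per_inference": batch_words / batch_rate,
    }


SUMMARY = {name: derive(*values) for name, values in HIGHLIGHTS.items()}

for name, row in SUMMARY.items():
    print(
        f"{name}: {row['arrival_rate_increase']:.2f}x arrival-rate increase, "
        f"{row['pct_of_batch_rate']:.0f}% of batch rate, "
        f"{row['words_per_inference']:.0f} words/inference (derived)"
    )
```

Rounded to whole percentages, the computed values match the 91%, 93%, and 95% figures stated above.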

The full report is now available to download from the link on the left.

STAC Insights subscribers gain access to detailed visualizations, configuration data, benchmark code, and the ability to run these tests in their own labs. Please log in to access the reports. For subscription options, contact us.
