Reports
Detailed configuration information is now available to eligible members at the link above. To learn more about subscription options, please contact us.
STAC has completed a STAC-AI™ LANG6 (Inference-only) benchmark audit on the HPE ProLiant DL380a Gen12, powered by NVIDIA H200 GPUs.
Large language models have taken on complex research and analytical tasks in finance — from digesting unstructured filings to assisting with client communication. But delivering those capabilities at the responsiveness and cost efficiency demanded by production environments is not straightforward. Hardware, software, and models are changing rapidly, and building the right stack has become a continuous engineering challenge.
That challenge led the STAC-AI Working Group — a collaboration of quants and technologists from leading global financial firms — to create STAC-AI LANG6. This benchmark suite provides a vendor-neutral framework for measuring the infrastructure performance of LLM inference workloads common in enterprise and financial applications.
Unlike benchmarks that focus on linguistic accuracy or reasoning quality, STAC-AI is designed for the stage after a model has been chosen — when infrastructure teams must host it efficiently and reliably. The first pattern addressed by the Working Group is retrieval-augmented generation (RAG), where a model generates answers grounded in retrieved data. Within RAG systems, inference is often the most latency-sensitive and resource-intensive step, so the LANG6 (Inference-only) suite isolates and measures that phase.
The benchmark evaluates a complete stack under test — hardware, software, and configuration — and records latency, throughput, energy efficiency, space efficiency, and fidelity across multiple model sizes and deployment scales. The result is a consistent, audited view of how different infrastructure choices perform under realistic workloads.
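As a rough illustration of how metrics like these can be derived from per-request timings, the sketch below aggregates a handful of made-up records. It is not the STAC-AI methodology: the `RequestTiming` record, its field names, the `summarize` helper, and the sample values are all hypothetical, and it assumes "reaction time" means the delay between submitting a request and receiving the first streamed output.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class RequestTiming:
    """Hypothetical per-request record; field names are illustrative only."""
    submitted_s: float     # time the request was sent (seconds)
    first_output_s: float  # time the first streamed output arrived (seconds)
    completed_s: float     # time the last output arrived (seconds)
    words_out: int         # number of words in the response


def summarize(timings: list[RequestTiming]) -> dict[str, float]:
    """Compute simple aggregate metrics over one measurement window."""
    # The window runs from the earliest submission to the latest completion.
    window_s = max(t.completed_s for t in timings) - min(t.submitted_s for t in timings)
    return {
        "inferences_per_s": len(timings) / window_s,
        # Assumed definition: reaction time = first output minus submission.
        "median_reaction_ms": median(
            (t.first_output_s - t.submitted_s) * 1000 for t in timings
        ),
        "words_per_s": sum(t.words_out for t in timings) / window_s,
    }


# Made-up sample data, not taken from the audited report.
sample = [
    RequestTiming(0.0, 0.15, 2.0, 60),
    RequestTiming(0.5, 0.68, 3.0, 90),
    RequestTiming(1.0, 1.21, 4.0, 150),
]
metrics = summarize(sample)
print(metrics)
```

A real harness would also need to handle concurrency levels, warm-up periods, and energy and fidelity measurements, none of which this sketch attempts.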
The audited HPE ProLiant DL380a Gen12 system (8× NVIDIA H200 NVL) delivered the following results:
- Up to 165 inferences per second with sub-200 ms median reaction times
- Smooth token streaming at 2.9 – 40 words per second
- Throughput up to 23,600 words per second for 8B models
- >90% fidelity, even for 70B models
- Efficient scaling for demanding inference and RAG workloads
These results establish a clear reference point for LLM infrastructure performance in financial workloads such as document summarization and question-answering over regulatory filings. For the first time, firms can compare inference performance using standardized, independently audited metrics that reflect realistic enterprise conditions.
The report is now available to download from the link on the left.

