Reports
The reports and detailed configuration information are now available to eligible subscribers at the links above. To learn more about subscription options, please contact us.
Effectively deploying Large Language Model (LLM) systems requires navigating the intrinsic trade-offs between performance, operational cost, and model quality. As enterprises in demanding sectors like finance integrate these models into critical solutions, the need for standardized benchmarks has become increasingly important. Correspondingly, benchmarking needs have evolved from simple throughput assessments to comprehensive evaluations that consider latency under various loads, quality preservation under optimization, and cost efficiency at scale.
This paper presents our design philosophy for benchmark development from the viewpoint of key stakeholders. We decompose benchmarks into their components, of which the model, workload, and system are core, and we examine the importance of trustworthiness, with the aim of equipping practitioners with the knowledge needed to select appropriate benchmarks for their specific contexts and to interpret the resulting data meaningfully.
We then use this design philosophy to deconstruct major performance benchmarks, including Artificial Analysis, Hugging Face's LLM-Perf, and MLPerf, comparing them to the STAC-AI™ LANG6 benchmark by evaluating their distinct approaches to workload construction, metric formulation, and the establishment of trustworthiness. Our comparative analysis highlights differences in benchmark design, such as the use of synthetic versus domain-specific workloads and the assessment of API services versus self-hosted systems. We also discuss how misinterpreting benchmark results can lead to wrong conclusions, and how each benchmark addresses this challenge to trustworthiness.
The STAC-AI™ LANG6 (Inference-Only) benchmark distinguishes itself through several key design choices aimed at maximizing relevance and trustworthiness for financial institutions: domain-specific workloads derived from real financial filings, user-centric metric formulation, integrated efficiency and quality metrics, and a mandatory audit process for all published results.
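To make the idea of user-centric metrics concrete, the sketch below measures quantities a user actually experiences when a model streams a response, such as time to first token and perceived generation throughput, rather than raw system throughput. This is a minimal, hypothetical Python illustration, not the actual LANG6 metric definitions; the `fake_stream` generator is a stand-in for a real streaming inference client.

```python
import time

def measure_user_centric_metrics(stream_tokens):
    """Compute illustrative user-centric latency metrics for one streamed
    LLM response. `stream_tokens` is any iterable yielding tokens as they
    arrive (e.g., from a streaming inference client)."""
    start = time.perf_counter()
    ttft = None      # time to first token: the delay the user actually feels
    n_tokens = 0
    for _ in stream_tokens:
        n_tokens += 1
        if ttft is None:
            ttft = time.perf_counter() - start
    total = time.perf_counter() - start
    return {
        "time_to_first_token_s": ttft,
        "end_to_end_latency_s": total,
        # generation rate after the first token, as perceived by the user
        "tokens_per_second": (n_tokens - 1) / (total - ttft) if n_tokens > 1 else 0.0,
    }

# Usage with a stand-in generator simulating a streaming response
def fake_stream():
    for tok in ["The", " filing", " reports", " revenue", " growth", "."]:
        time.sleep(0.05)  # simulated inter-token delay
        yield tok

print(measure_user_centric_metrics(fake_stream()))
```

Metrics of this kind shift the focus from what the system can sustain in aggregate to what an individual user observes, which is one reason benchmarks differ so sharply depending on how their metrics are formulated.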

