LLM Model Serving Platform Comparison

Type: Vault Report

Specs: STAC-AI™ LANG6

This study evaluates two model-serving platforms, vLLM and Hugging Face’s text-generation-inference (TGI), for large language model (LLM) inference using the STAC-AI™ LANG6 (Inference-Only) Test Harness. The STAC-AI™ benchmark provides industry-standard testing to assess the performance, efficiency, and reliability of LLM inference infrastructure under real-world conditions. We analyze Inference Rate, Throughput, and Fidelity across four workloads spanning short- and long-context datasets and both 8B- and 70B-parameter models. Key findings highlight differences in Inference Rate between the platforms, the impact of platform versions on serving efficiency, variations in response consistency, and patterns of non-determinism in generated outputs. These insights offer practical guidance for firms optimizing their LLM inference infrastructure.
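To illustrate how serving metrics of this kind are commonly derived, the sketch below computes throughput (output tokens per second) and an inference rate (completed responses per second) from per-request timing records. These simplified definitions and the timing data are assumptions for illustration only, not the official STAC-AI™ LANG6 metric specifications.

```python
from dataclasses import dataclass


@dataclass
class RequestRecord:
    """Hypothetical per-request record; not part of the STAC-AI harness."""
    output_tokens: int   # tokens generated in the response
    latency_s: float     # wall-clock time from request send to final token


def throughput_tokens_per_s(records, wall_clock_s):
    """Total output tokens generated per second of wall-clock time."""
    return sum(r.output_tokens for r in records) / wall_clock_s


def inference_rate_per_s(records, wall_clock_s):
    """Completed responses per second of wall-clock time."""
    return len(records) / wall_clock_s


# Hypothetical run: 3 requests completed within a 10-second measurement window.
records = [
    RequestRecord(output_tokens=128, latency_s=2.1),
    RequestRecord(output_tokens=256, latency_s=4.0),
    RequestRecord(output_tokens=64, latency_s=1.2),
]
print(throughput_tokens_per_s(records, wall_clock_s=10.0))  # 44.8 tokens/s
print(inference_rate_per_s(records, wall_clock_s=10.0))     # 0.3 responses/s
```

In practice a benchmark harness would also track latency percentiles and output fidelity per request; the two aggregates above only show why the same platform can rank differently on throughput versus inference rate when response lengths vary.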


The STAC-AI Working Group focuses on benchmarking artificial intelligence (AI) technologies in finance. This includes deep learning, large language models (LLMs), and other AI-driven approaches that help firms unlock new efficiencies and insights.