Posted February 13, 2025
Research Note

LLM Model Serving Platform Comparison

We recently conducted a study comparing two model-serving platforms: vLLM and Hugging Face's TGI.

The report is now available to STAC Insights subscribers.

Reports

The reports and detailed configuration information are now available to eligible subscribers at the links above. To learn more about subscription options, please contact us.

The STAC-AI™ LANG6 (Inference-Only) benchmark is a new technology benchmark standard, developed in collaboration with AI experts and technologists from leading financial institutions, for evaluating the infrastructure performance of LLM applications. While the word “benchmark” in the context of LLMs often refers to how well an LLM can answer a class of questions or create human-quality responses, the initial goal for STAC-AI is to be an infrastructure performance benchmark, not a data science challenge.

Our report details a comparative analysis of two of the most popular open-source LLM-serving platforms: vLLM and Hugging Face's Text Generation Inference (TGI). The study aims not only to determine a suitable reference platform for the STAC-AI™ LANG6 benchmark but also to provide insights relevant to financial firms seeking to deploy LLM workloads.
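For readers unfamiliar with the two platforms: both expose simple HTTP inference endpoints. The sketch below is illustrative only and is not part of the study's methodology; it assumes a local vLLM server running its OpenAI-compatible API on port 8000 and a local TGI server on port 8080, and sends the same prompt to each. The model name, ports, and prompt are placeholders.

```python
# Illustrative only: send the same prompt to a local vLLM server (OpenAI-compatible
# API, assumed on port 8000) and a local TGI server (assumed on port 8080).
import requests

PROMPT = "Summarize the credit risk section of this filing."

# vLLM: OpenAI-compatible completions endpoint.
vllm_resp = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "meta-llama/Llama-3.1-8B-Instruct",  # hypothetical model choice
        "prompt": PROMPT,
        "max_tokens": 256,
        "temperature": 0.0,
    },
    timeout=120,
)
print("vLLM:", vllm_resp.json()["choices"][0]["text"])

# TGI: native /generate endpoint (greedy decoding by default).
tgi_resp = requests.post(
    "http://localhost:8080/generate",
    json={"inputs": PROMPT, "parameters": {"max_new_tokens": 256}},
    timeout=120,
)
print("TGI:", tgi_resp.json()["generated_text"])
```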

Using the STAC-AI™ LANG6 (Inference-Only) Test Harness, we evaluated both platforms across a series of workloads designed to simulate common financial sector use cases. The workloads combine popular open-source LLMs with a pair of long- and short-context data sets. Key performance metrics such as Inference Rate (inferences per second) and Throughput (generated words per second) were measured using the Test Harness. In addition to performance, we investigated the impact the model-serving platforms have on the consistency of generated responses, and its potential causes, using our quality metric, Fidelity.
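As a rough illustration of what the two performance metrics capture (a sketch, not the STAC-AI Test Harness code), the snippet below computes an inference rate and a generated-words-per-second throughput from a set of completed requests and the wall-clock duration of a run; the data structures and numbers are made up.

```python
# Illustrative metric arithmetic, assuming we have, for a batch of completed
# requests, the wall-clock span of the run and each request's generated text.
from dataclasses import dataclass

@dataclass
class CompletedRequest:
    generated_text: str

def inference_rate(requests_done: int, wall_clock_seconds: float) -> float:
    """Inferences per second over the measured run."""
    return requests_done / wall_clock_seconds

def throughput_words_per_second(results: list[CompletedRequest],
                                wall_clock_seconds: float) -> float:
    """Generated words per second over the measured run."""
    total_words = sum(len(r.generated_text.split()) for r in results)
    return total_words / wall_clock_seconds

# Example with made-up numbers: 120 requests finished in 60 s of wall-clock time,
# each producing 100 words of output.
results = [CompletedRequest("example output " * 50) for _ in range(120)]
print(inference_rate(len(results), 60.0))           # -> 2.0 inferences/s
print(throughput_words_per_second(results, 60.0))   # -> 200.0 words/s
```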

Our experiments show that the performance of the two platforms was comparable but varied across versions. Likewise, the consistency of generated responses was equal across both platforms. However, a supplementary experiment into the causes of inconsistency showed that these non-deterministic variations occur more frequently at certain locations within the LLM response.
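To illustrate the kind of analysis behind that supplementary experiment (a sketch of one possible approach, not the report's Fidelity methodology), the snippet below locates the first word position at which repeated responses to the same prompt diverge, so that divergence locations can be tallied across runs; the example responses are invented.

```python
# Illustrative sketch: find where two responses to the same prompt first diverge,
# at word granularity, so divergence positions can be aggregated across reruns.
from collections import Counter

def first_divergence_index(a: str, b: str) -> int | None:
    """Return the index of the first differing word, or None if identical."""
    words_a, words_b = a.split(), b.split()
    for i, (wa, wb) in enumerate(zip(words_a, words_b)):
        if wa != wb:
            return i
    if len(words_a) != len(words_b):
        return min(len(words_a), len(words_b))
    return None

# Tally divergence positions over repeated runs of the same prompt (made-up data).
reference = "The filing reports stable revenue and rising credit losses"
reruns = [
    "The filing reports stable revenue and rising credit losses",
    "The filing reports stable revenue and increasing credit losses",
    "The filing reports stable revenue and rising loan losses",
]
positions = Counter(
    idx for r in reruns if (idx := first_divergence_index(reference, r)) is not None
)
print(positions)  # e.g. Counter({6: 1, 7: 1}) -> divergences late in the response
```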
