STAC Research Note: Performance And Efficiency Comparison Between Self-Hosted LLMs And API Services

21 April 2025
We recently conducted a study comparing self-hosted LLMs and equivalent API models using the STAC-AI™ LANG6 (Inference-Only) Test Harness.
The STAC-AI™ benchmark provides rigorous, industry-standard testing for LLM inference infrastructure, helping firms assess performance, efficiency, and reliability in real-world conditions.
This research note examines the inference performance and efficiency of self-hosted models and compares them to 1) the same models available through API services (same-model comparison) and 2) a different closed-source model of similar capability (cross-model comparison).
We made three comparisons: two same-model and one cross-model. The models in each comparison were given the same Interactive workload. Key areas of analysis include:
- Differences in latency (Reaction Time, Response Time, and Output Rate) at various arrival rates.
- Differences in inference efficiency at various arrival rates.
We also examined the performance variation of models available through API services. We ran our benchmarks on API models periodically over an extended period to analyze intraday and intraweek performance trends for two LLM API services.
STAC subscribers can access the full report here. Subscribers can also run STAC-AI benchmarks in the privacy of their own labs to test their systems. To learn about subscription options, please contact us.