Posted April 20, 2025
Research Note

Performance And Efficiency Comparison Between Self-Hosted LLMs And API Services

We recently conducted a study comparing self-hosted LLMs and equivalent API models using the STAC-AI™ LANG6 (Inference-Only) benchmark.

The report is now available to STAC Insights subscribers.

Reports

The reports and detailed configuration information are now available to eligible subscribers at the links above. To learn more about subscription options, please contact us.

Organizations deploying Large Language Model (LLM) applications face a critical infrastructure decision: self-host open-source models or leverage third-party API services. Self-hosting offers control but requires significant investment in compute resources and operational management. Conversely, API services provide immediate access with a pay-as-you-go model but may lack performance guarantees.

This choice involves complex trade-offs between performance, cost-efficiency, and consistency, yet direct, data-driven comparisons are scarce. To address this, we performed a study that evaluated these two primary deployment methods. Our analysis moves beyond theoretical discussions to present a quantitative assessment of how real-world infrastructure choices impact key operational metrics under typical financial workloads.

Our analysis is structured around two distinct experimental designs: a same-model comparison and a cross-model comparison.

  • Same-Model Comparison: This evaluates the same open-source models both in a self-hosted configuration and as an API service from Lambda Labs.
  • Cross-Model Comparison: This compares a self-hosted model against a capability-equivalent proprietary model.

We utilized the STAC-AI™ LANG6 (Inference-Only) benchmark, which measures latency (Reaction Time, Response Time), throughput (Output Rate), and efficiency (Inferences per Dollar) under different loads, to ensure a robust comparison. All self-hosted workloads were executed on a cloud VM equipped with 8 GPUs. For consistency, the API workloads were run from a separate cloud VM that queried the respective API provider endpoints. Furthermore, to address the lack of performance guarantees from API providers, a supplementary experiment measured API performance over an extended period to highlight potential cyclical variations.
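
For illustration only, the snippet below shows roughly how per-request metrics of this kind can be captured against an OpenAI-compatible streaming endpoint. It is a minimal sketch, not the STAC-AI LANG6 harness: the base URL, model name, and price figure are placeholder assumptions, and chunk counts are only a rough proxy for output tokens.

```python
# Minimal sketch of timing one streaming request against an OpenAI-compatible
# endpoint. The base_url, model name, and price are illustrative assumptions,
# and this is not the STAC-AI LANG6 harness itself.
import time
from openai import OpenAI  # pip install openai

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # e.g. a local vLLM server

ASSUMED_PRICE_PER_1M_OUTPUT_TOKENS = 0.60  # USD, placeholder

def time_one_request(prompt: str, model: str = "example-model"):
    t_start = time.perf_counter()
    first_token_at = None
    n_chunks = 0  # rough proxy for output tokens

    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    for chunk in stream:
        delta = chunk.choices[0].delta.content
        if delta:
            if first_token_at is None:
                first_token_at = time.perf_counter()  # first output arrives
            n_chunks += 1
    t_end = time.perf_counter()

    reaction_time = first_token_at - t_start                    # time to first output
    response_time = t_end - t_start                             # time to full completion
    output_rate = n_chunks / max(t_end - first_token_at, 1e-9)  # chunks per second
    cost = n_chunks * ASSUMED_PRICE_PER_1M_OUTPUT_TOKENS / 1_000_000
    inferences_per_dollar = 1.0 / cost if cost else float("inf")
    return reaction_time, response_time, output_rate, inferences_per_dollar
```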

Our findings reveal a nuanced relationship between system optimization, performance, and cost-efficiency, offering a quantitative framework to guide infrastructure decisions between self-hosting and API services. The report details how the performance ratio between self-hosted and API models varies significantly depending on the model size, workload characteristics, and the degree of optimization of the self-hosted environment. Our supplementary analysis of API services uncovers important patterns, including intraday performance cycles and idiosyncratic events, providing critical insights for practitioners who require predictable and guaranteed performance levels for their applications.
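
As a rough illustration of the extended-duration idea, the sketch below periodically times a fixed probe request and buckets the latencies by hour of day to look for intraday variation. The probe body, sampling interval, and file path are placeholders; the methodology used in the report may differ.

```python
# Minimal sketch of the extended-duration probe idea: periodically time the
# same request against an API endpoint, log it, and group latencies by hour
# of day to look for intraday cycles. probe_latency() is a placeholder.
import csv
import time
from collections import defaultdict
from datetime import datetime, timezone

def probe_latency() -> float:
    """Placeholder: send one fixed inference request and return seconds taken."""
    t0 = time.perf_counter()
    # ... call the API provider with a fixed prompt here ...
    return time.perf_counter() - t0

def run_probes(hours: float, interval_s: float = 300, path: str = "latency_log.csv") -> None:
    deadline = time.time() + hours * 3600
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["utc_timestamp", "latency_s"])
        while time.time() < deadline:
            writer.writerow([datetime.now(timezone.utc).isoformat(), probe_latency()])
            f.flush()
            time.sleep(interval_s)

def hourly_profile(path: str = "latency_log.csv") -> dict:
    """Mean latency per UTC hour: a first look at cyclical variation."""
    buckets = defaultdict(list)
    with open(path) as f:
        for row in csv.DictReader(f):
            hour = datetime.fromisoformat(row["utc_timestamp"]).hour
            buckets[hour].append(float(row["latency_s"]))
    return {h: sum(v) / len(v) for h, v in sorted(buckets.items())}
```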
