Large Language Models (LLMs) are increasingly used to evaluate the quality of text generated by other models, particularly in the context of Retrieval-Augmented Generation (RAG) systems. While these LLM-based metrics promise a more nuanced, semantic understanding than traditional metrics like ROUGE, their reliability is not fully understood. This research note presents several key findings regarding the behavior of LLMs as evaluators, proposing best practices based on these insights.
This work originated from our efforts to design a suitable LLM-based quality metric for the STAC-AI™ LANG6 (Inference-Only) benchmark. When assessing the performance (latency and throughput) of LLM inference, a robust quality measure is essential to ensure that performance results remain comparable. Our research identified several areas of importance, but we were especially interested in the choice of evaluator model and its potential impact on evaluation scores. Specifically, we were concerned with the "relatedness" between the model generating a response and the model evaluating it, and the potential for this relationship to introduce bias.
To investigate this, we designed experiments to understand the relationship between generator-evaluator model pairings and their effect on scoring-based evaluations. Scoring-based evaluations leverage a model's deep understanding of language to assess the quality of generated text, typically by ranking responses or assigning scores based on predefined criteria. Our study centers on "Faithfulness," a common RAG metric that assesses the degree to which a generated response can be factually inferred from its retrieved source context. Our methodology involved:
- Response Generation: We generated a baseline set of 3,500 RAG responses using our STAC-AI™ Test Harness.
- Multi-Model Evaluation: These responses were then scored by a panel of four distinct but related evaluator LLMs, including the generator model itself (a self-evaluation scenario) and other models of different sizes and generations within the same family.
- Consistency Analysis: To measure the reliability of the evaluators, we conducted a deeper analysis on a subset of the data, repeating the evaluation process for 10 consecutive runs to quantify the variation and consistency of the scores assigned by each LLM evaluator.
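
To make the methodology concrete, the sketch below illustrates one way a rubric-based Faithfulness score and its run-to-run spread could be collected. The prompt wording, the 1-5 scale, and the function names are illustrative assumptions only; they are not the actual prompts or code used in the STAC-AI™ Test Harness.

```python
from statistics import mean, stdev
from typing import Callable

# Hypothetical rubric-style prompt for a Faithfulness verdict; the real
# prompt and scoring scale used in the study are not reproduced here.
FAITHFULNESS_PROMPT = """\
You are grading a RAG response for Faithfulness.

Context:
{context}

Response:
{response}

On a scale of 1-5, how well can every claim in the response be factually
inferred from the context alone? Reply with a single integer."""


def score_faithfulness(context: str, response: str,
                       evaluator: Callable[[str], str]) -> int:
    """Ask an evaluator LLM for a rubric score and parse the integer verdict."""
    verdict = evaluator(FAITHFULNESS_PROMPT.format(context=context, response=response))
    return int(verdict.strip())


def consistency_stats(context: str, response: str,
                      evaluator: Callable[[str], str],
                      runs: int = 10) -> tuple[float, float]:
    """Repeat the evaluation and summarize the spread of the assigned scores."""
    scores = [score_faithfulness(context, response, evaluator) for _ in range(runs)]
    return mean(scores), stdev(scores)


if __name__ == "__main__":
    # Stand-in evaluator: a deterministic stub used only to make the sketch runnable.
    def stub_evaluator(prompt: str) -> str:
        return "4"

    avg, spread = consistency_stats("...context...", "...response...", stub_evaluator)
    print(f"mean score = {avg:.2f}, run-to-run std dev = {spread:.2f}")
```

In practice, the evaluator callable would wrap a real model endpoint, and the scores and spreads would be aggregated across all 3,500 responses rather than a single context-response pair.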
The key findings challenge common assumptions about LLM-based evaluation and uncover nuanced evaluator behaviors.
- Evaluator Capability Outweighs Model Relatedness: Contrary to our initial hypothesis, the results indicate that score differences are more strongly determined by the inter- and intra-generational reasoning capabilities of the evaluator models than by their "relatedness" to the generator.
- A Link Between Variation and "Accuracy": Our analysis revealed a strong inverse relationship between scoring consistency and verdict "accuracy". This calls into question the usefulness of Faithfulness as a feedback mechanism during the development of RAG systems. Furthermore, this phenomenon is unlikely to be isolated to Faithfulness and probably extends to other LLM-based RAG evaluation metrics that rely on scoring rubrics.
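
One way such a relationship can be checked (shown purely as a hedged illustration, with placeholder numbers rather than our measured results) is to correlate each evaluator's run-to-run score spread with its agreement against reference verdicts:

```python
from statistics import correlation  # Pearson correlation, Python 3.10+

# One entry per evaluator model. These values are illustrative placeholders,
# not measurements from this study.
score_std_dev = [0.15, 0.30, 0.45, 0.60]     # run-to-run spread (lower = more consistent)
verdict_accuracy = [0.62, 0.71, 0.78, 0.85]  # agreement with reference verdicts

# A positive correlation between spread and accuracy corresponds to an
# inverse relationship between consistency and accuracy.
print(f"Pearson r = {correlation(score_std_dev, verdict_accuracy):.2f}")
```

On real data, it is the sign and magnitude of such a correlation across the evaluator panel, not the placeholder values above, that would support or refute the finding.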

