
LLM-as-a-Judge


LLM-as-a-Judge or LLM-based evaluation is a conceptual framework in natural language processing (NLP) that employs large language models (LLMs) as evaluators to assess the performance of other language-based systems or outputs. Instead of relying solely on human annotators, the approach leverages the general language capabilities of advanced language models to serve as automated judges.

In the vision–language domain, similar approaches extend to vision–language models (VLMs), which can act as evaluators (“VLM-as-a-Judge”) by assessing multimodal outputs that combine text with images or video.[1]

Compared with human annotation, LLM-as-a-Judge may be more cost-effective and can be integrated into automated evaluation pipelines. Unlike traditional automatic evaluation metrics such as ROUGE and BLEU, which rely on transparent, rule-based comparisons of surface-level n-grams, LLM-as-a-Judge relies on the opaque internal reasoning of large language models. LLM-based evaluations are therefore likely to incorporate deeper semantic understanding, but at the cost of interpretability. Beyond interpretability, LLM evaluators raise further concerns.[2] For instance, if an LLM evaluates an output that it generated itself, the resulting assessment may be distorted, a phenomenon described as "LLM narcissism".[2][3]
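The difference can be illustrated with a short sketch. The Python example below is illustrative only and is not drawn from the cited works: a simple unigram-overlap score stands in for surface-level metrics such as BLEU and ROUGE, while the judge function delegates scoring to a language model through a hypothetical call_llm helper with an assumed prompt wording.

```python
# Illustrative sketch contrasting surface n-gram overlap with an LLM judge.
# `call_llm` is a hypothetical helper standing in for any chat-completion API.
from collections import Counter


def unigram_overlap(candidate: str, reference: str) -> float:
    """BLEU/ROUGE-style surface comparison: fraction of candidate unigrams
    that also appear in the reference (clipped counts, no semantics)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum(min(count, ref[word]) for word, count in cand.items())
    return overlap / max(sum(cand.values()), 1)


JUDGE_PROMPT = (
    "You are an impartial judge. Rate the response below for helpfulness and "
    "factual accuracy on a scale of 1 to 10. Reply with only the number.\n\n"
    "Question: {question}\nResponse: {response}"
)


def llm_judge_score(question: str, response: str, call_llm) -> int:
    """LLM-as-a-Judge: scoring is delegated to a language model, whose
    reasoning is opaque but can reward semantic quality rather than overlap."""
    reply = call_llm(JUDGE_PROMPT.format(question=question, response=response))
    return int(reply.strip())
```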

Typically, a more powerful LLM is employed to evaluate the outputs of smaller or less capable language models—for example, using GPT-4 to assess the performance of a 13-billion-parameter LLaMA model.[4] Recent research has also explored leveraging multiple LLM evaluators to improve fairness and scalability,[5] and the idea of “LLM juries” has been proposed as a practical mechanism to mitigate bias.[6]
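The idea of an LLM jury can be sketched in the same illustrative way. In the hypothetical Python example below, several judge models score the same response and the median of their votes is returned; the judges mapping, the author parameter, and the option to exclude the response's author are assumptions made for illustration rather than an interface from the cited sources.

```python
# Illustrative sketch of an "LLM jury" aggregating several judges' verdicts.
# `judges` maps hypothetical model names to scoring callables, for example the
# llm_judge_score sketch above bound to a particular backend.
from statistics import median


def jury_score(question, response, judges, author=None):
    """Return the median score given by a panel of LLM judges.
    Excluding the model that authored the response is a simple guard
    against self-preference ("LLM narcissism")."""
    votes = [
        judge(question, response)
        for name, judge in judges.items()
        if name != author  # skip the judge that generated the output
    ]
    return median(votes)
```

Aggregating with the median rather than trusting a single judge is one simple way to dilute the idiosyncratic biases of any individual evaluator, and excluding the authoring model is a minimal guard against self-preference.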

References

  1. ^ Hendriksen, Mariya; Rashid, Tabish; Bignell, David; Georgescu, Raluca; Lemkhenter, Abdelhak; Hofmann, Katja; Devlin, Sam; Parisot, Sarah (2025). "Adapting Vision-Language Models for Evaluating World Models". arXiv:2506.17967.
  2. ^ a b Dietz, Laura; Zendel, Oleg; Bailey, Peter; et al. (18 July 2025). "Principles and Guidelines for the Use of LLM Judges". pp. 218–229. doi:10.1145/3731120.3744588. ISBN 979-8-4007-1861-8. Wikidata Q135734265.
  3. ^ Liu, Yiqi; Moosavi, Nafise; Lin, Chenghua (2024). "LLMs as Narcissistic Evaluators: When Ego Inflates Evaluation Scores". pp. 12688–12701. doi:10.18653/v1/2024.findings-acl.753. Wikidata Q135734850.
  4. ^ Zheng, Lianmin; Chiang, Wei-Lin; Sheng, Ying; et al. (9 June 2023). "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena". arXiv:2306.05685. Wikidata Q123527686.
  5. ^ "Leveraging Multiple LLM Evaluators for Scalable and Fair Language Model Assessments". ResearchGate. 2025. Retrieved 18 September 2025.
  6. ^ "LLM Juries for Evaluation". Comet. 2025. Retrieved 18 September 2025.