
Evaluating Large Language Models


Welcome to the first article in our three-part series titled “Evaluating Large Language Models”.

Figure: Major categories and subcategories of LLM evaluation
Source: https://arxiv.org/pdf/2310.19736.pdf

The table below summarizes widely used LLM evaluation benchmarks, along with their focus, the domain they cover, and their evaluation criteria.

Benchmark | Focus | Domain | Evaluation Criteria
CUAD | Legal contract review | Specific downstream task | Legal contract understanding
MMLU | Text models | General language task | Multitask accuracy
TRUSTGPT | Ethics | Specific downstream task | Toxicity, bias, and value alignment
OpenLLM | Chatbots | General language task | Leaderboard ranking
Chatbot Arena | Chat assistants | General language task | Crowdsourcing and Elo rating system
AlpacaEval | Automated evaluation | General language task | Metrics, robustness, and diversity
ToolBench | Software tools | Specific downstream task | Execution success rate
FreshQA | Dynamic QA | Specific downstream task | Correctness and hallucination
PromptBench | Adversarial prompt resilience | General language task | Adversarial robustness
MT-Bench | Multi-turn conversation | General language task | Win rate judged by GPT-4
LLMEval | LLM evaluator | General language task | Accuracy, macro-F1, and kappa correlation coefficient
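
Two of the evaluation criteria above are worth a quick sketch. Chatbot Arena ranks chat assistants by collecting crowdsourced pairwise votes and updating Elo ratings. Below is a minimal Python sketch of a standard Elo update; the K-factor of 32 and the starting rating of 1000 are illustrative assumptions, not Chatbot Arena's exact parameters.

# A minimal sketch of the Elo update behind pairwise "arena" style
# leaderboards. The K-factor (32) and starting rating (1000) are
# illustrative assumptions, not Chatbot Arena's exact parameters.

def expected_score(r_a: float, r_b: float) -> float:
    # Probability that model A beats model B under the Elo model.
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32.0):
    # score_a is 1.0 if A wins, 0.0 if B wins, 0.5 for a tie.
    e_a = expected_score(r_a, r_b)
    new_r_a = r_a + k * (score_a - e_a)
    new_r_b = r_b + k * ((1.0 - score_a) - (1.0 - e_a))
    return new_r_a, new_r_b

# Example: both models start at 1000; model A wins one crowdsourced vote.
ratings = {"model_a": 1000.0, "model_b": 1000.0}
ratings["model_a"], ratings["model_b"] = elo_update(
    ratings["model_a"], ratings["model_b"], score_a=1.0
)
print(ratings)  # model A gains exactly the points model B loses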
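
LLMEval's criteria of accuracy, macro-F1, and the kappa correlation coefficient measure how closely an LLM evaluator's judgments agree with human labels. Here is a minimal sketch, assuming scikit-learn is installed; the labels are made up for illustration and are not from any real evaluation run.

# A minimal sketch, assuming scikit-learn is installed; the labels are
# invented for illustration, not taken from any real evaluation run.
from sklearn.metrics import accuracy_score, cohen_kappa_score, f1_score

human_judgments = ["good", "bad", "good", "neutral", "bad", "good"]
llm_judgments = ["good", "bad", "neutral", "neutral", "bad", "bad"]

acc = accuracy_score(human_judgments, llm_judgments)
macro_f1 = f1_score(human_judgments, llm_judgments, average="macro")
kappa = cohen_kappa_score(human_judgments, llm_judgments)  # chance-corrected agreement

print(f"accuracy={acc:.2f}  macro-F1={macro_f1:.2f}  kappa={kappa:.2f}")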

About Preeth P

Machine Learning Engineer