Evaluating Large Language Models – Evaluation Metrics

Current major applications of LLMs – https://arxiv.org/pdf/2308.05374.pdf
Automatic Evaluation Metrics:
  1. Exact Match (EM): In the LLM setting, EM measures the percentage of generated texts that exactly match the reference texts.
    It's calculated as:
    EM = (Number of exactly matching generated texts) / (Total number of generated texts)
    For example, if an LLM generates 100 sentences and 20 of them exactly match the corresponding reference sentences, the EM score would be 20/100 = 0.2, or 20%.
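A minimal sketch of this computation (assuming a simple whitespace-trimmed string comparison; real benchmarks such as SQuAD typically also lowercase and strip punctuation before comparing):

```python
def exact_match(generated: list[str], references: list[str]) -> float:
    # Fraction of generations identical to their reference
    # (after trimming surrounding whitespace).
    matches = sum(g.strip() == r.strip() for g, r in zip(generated, references))
    return matches / len(generated)

# One of the two generations matches its reference exactly -> EM = 0.5.
score = exact_match(["Paris", "Berlin"], ["Paris", "Munich"])
```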

  2. F1 Score: To compute the F1 Score for LLM evaluation, we need to define precision and recall at the token level. Precision measures the proportion of generated tokens that match the reference tokens, while recall measures the proportion of reference tokens that are captured by the generated tokens.
    Precision = (Number of matching tokens in generated text) / (Total number of tokens in generated text)
    Recall = (Number of matching tokens in generated text) / (Total number of tokens in reference text)
    F1 = 2 * (Precision * Recall) / (Precision + Recall)
    For example, let's say the LLM generates the sentence "The quick brown fox jumps over the lazy dog" and the reference sentence is "A quick brown fox jumps over the lazy dog". Eight of the nine generated tokens appear in the reference, so the precision is 8/9, and the recall is 8/9 (8 matching tokens out of 9 reference tokens). The resulting F1 score is 2 * (8/9 * 8/9) / (8/9 + 8/9) = 8/9 ≈ 0.889.
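A sketch of this token-level F1, assuming lowercased whitespace tokenization and a multiset (bag-of-tokens) overlap, as in SQuAD-style scoring:

```python
from collections import Counter

def token_f1(generated: str, reference: str) -> float:
    # Token-level F1: multiset overlap of lowercased whitespace tokens.
    gen, ref = generated.lower().split(), reference.lower().split()
    overlap = sum((Counter(gen) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(gen), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

score = token_f1("The quick brown fox jumps over the lazy dog",
                 "A quick brown fox jumps over the lazy dog")  # 8/9 ≈ 0.889
```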

  3. BLEU (Bilingual Evaluation Understudy): BLEU is a widely used metric for evaluating machine translation and text generation systems. It calculates the geometric mean of n-gram precision scores (usually up to 4-grams) and applies a brevity penalty to penalize short generated texts. The BLEU score ranges from 0 to 1, with higher values indicating better performance.
  4. ROUGE (Recall-Oriented Understudy for Gisting Evaluation): ROUGE is a set of metrics commonly used for evaluating automatic summarization. It measures the overlap of n-grams (usually unigrams and bigrams) between the generated summary and reference summaries. The main variants are ROUGE-N (n-gram recall), ROUGE-L (longest common subsequence), and ROUGE-S (skip-bigram co-occurrence).
  5. METEOR (Metric for Evaluation of Translation with Explicit Ordering): METEOR is another metric used for evaluating machine translation and text generation. It considers not only exact word matches but also stemming, synonyms, and paraphrases. METEOR computes a weighted harmonic mean of precision and recall, giving more importance to recall. It also includes a fragmentation penalty to favour longer consecutive matches.
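To illustrate the BLEU mechanics, here is a simplified single-reference, sentence-level sketch (real evaluations should use an established implementation such as sacreBLEU, which also handles smoothing and corpus-level aggregation):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate: str, reference: str, max_n: int = 4) -> float:
    cand, ref = candidate.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(cand, n))
        ref_counts = Counter(ngrams(ref, n))
        # Clipped n-gram precision: each n-gram counts at most as often
        # as it appears in the reference.
        clipped = sum((cand_counts & ref_counts).values())
        total = max(sum(cand_counts.values()), 1)
        precisions.append(clipped / total)
    if min(precisions) == 0:
        return 0.0
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    # Brevity penalty discourages overly short candidates.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * geo_mean
```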
Confidence Level Metrics:
  1. Expected Calibration Error (ECE): ECE measures the difference between a model’s confidence and its actual accuracy. It’s calculated by partitioning predictions into bins based on confidence and computing the weighted average of the difference between average confidence and accuracy in each bin.
  2. Area Under the Curve (AUC): AUC evaluates the model’s ability to discriminate between classes. It plots the True Positive Rate (TPR) against the False Positive Rate (FPR) at various threshold settings. A higher AUC indicates better performance.
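The ECE computation above can be sketched as follows, assuming equal-width confidence bins, `confidences` in [0, 1], and `correct` as 0/1 outcomes:

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    # Assign each prediction to a confidence bin, then take the
    # sample-weighted average of |avg confidence - accuracy| per bin.
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if bucket:
            avg_conf = sum(c for c, _ in bucket) / len(bucket)
            accuracy = sum(o for _, o in bucket) / len(bucket)
            ece += (len(bucket) / total) * abs(avg_conf - accuracy)
    return ece
```

For instance, four predictions at 95% confidence of which only two are correct give an ECE of |0.95 - 0.5| = 0.45, signalling overconfidence.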
Fairness and Robustness Metrics:
  1. Fairness: Ensuring the model treats all demographics equally is crucial. Techniques like counterfactual fairness and equalized odds help assess and reduce bias.
  2. Robustness: Models should maintain performance under distribution shifts and adversarial attacks. Metrics like accuracy on perturbed inputs and adversarial accuracy help quantify robustness.
Human Evaluation:
  1. Likert Scale Ratings: Annotators rate the model’s outputs on a scale (e.g., 1-5) for qualities like fluency, coherence, and relevance.
  2. Comparative Evaluation: Judges compare outputs from different models, choosing the better one.
  3. A/B Testing: Users interact with the model in real-world scenarios, providing feedback on their experience.
GPT-4 as a Judge:

A strong LLM such as GPT-4 can itself serve as an evaluator, scoring another model's outputs against a written rubric. For example, the following prompt asks GPT-4 to rate the coherence of a news summary:

Evaluate Coherence in the Summarization Task 
You will be given one summary written for a news article.
Your task is to rate the summary on one metric. Please make sure you read and understand these instructions carefully. Please keep this document open while reviewing, and refer to it as needed.

Evaluation Criteria:
Coherence (1-5) - the collective quality of all sentences. We align this dimension with the DUC quality question of structure and coherence whereby "the summary should be well-structured and well-organized. The summary should not just be a heap of related information, but should build from sentence to sentence to a coherent body of information about a topic."

Evaluation Steps:
Read the news article carefully and identify the main topic and key points.
Read the summary and compare it to the news article. Check if the summary covers the main topic and key points of the news article, and if it presents them in a clear and logical order.
Assign a score for coherence on a scale of 1 to 5, where 1 is the lowest and 5 is the highest based on the Evaluation Criteria.

Source Text: {{Document}}
Summary: {{Summary}}
Evaluation Form (scores ONLY):
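A small sketch of how this template might be used programmatically. The abbreviated `COHERENCE_PROMPT` below stands in for the full prompt text above, and the call to the judge model is left out (it would go through whatever API client you use):

```python
COHERENCE_PROMPT = (
    "Evaluate Coherence in the Summarization Task\n"
    "...\n"  # instructions, criteria, and steps as written above
    "Source Text: {{Document}}\n"
    "Summary: {{Summary}}\n"
    "Evaluation Form (scores ONLY):"
)

def fill_prompt(template: str, document: str, summary: str) -> str:
    # Substitute the template placeholders; the filled prompt is then
    # sent to the judge model and the numeric score parsed from the reply.
    return (template.replace("{{Document}}", document)
                    .replace("{{Summary}}", summary))
```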

Statistical Significance Tests:
  1. McNemar’s Test: This test is used for comparing two models on a binary classification task. It considers the discordant pairs (cases where the models disagree) and calculates a test statistic based on the chi-squared distribution.
  2. Wilcoxon Signed-Rank Test: For evaluating models on continuous or ordinal metrics (e.g., perplexity, BLEU), the Wilcoxon signed-rank test is appropriate. It compares the ranks of the differences between paired observations.
  3. Bootstrap Resampling: Bootstrapping involves repeatedly sampling from the test set with replacement to create multiple subsamples. The models are evaluated on each subsample, and the distribution of the evaluation metric is analyzed to estimate confidence intervals and assess significance.
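Of these, bootstrap resampling is the easiest to sketch. A percentile bootstrap confidence interval for the mean of a per-example metric might look like this (a fixed seed is used for reproducibility):

```python
import random

def bootstrap_ci(scores, n_resamples=1000, alpha=0.05, seed=0):
    # Percentile bootstrap: resample the test set with replacement,
    # recompute the mean each time, and read off the (alpha/2, 1-alpha/2)
    # quantiles of the resulting distribution.
    rng = random.Random(seed)
    means = []
    for _ in range(n_resamples):
        sample = [rng.choice(scores) for _ in scores]
        means.append(sum(sample) / len(sample))
    means.sort()
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi
```

If the interval for the difference between two models' scores excludes zero, the difference is unlikely to be a sampling artifact.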
Text Quality Metrics:
  1. Grammaticality: Tools like the Corpus of Linguistic Acceptability or the Grammaticality Judgment Dataset can be used to assess the grammatical correctness of generated sentences.
  2. Coherence: Metrics such as the Entity Grid or the Discourse Coherence Model evaluate the coherence and logical flow of generated text by analyzing entity transitions and discourse relations.
  3. Diversity: Measuring the diversity of generated text helps ensure that the model is not simply memorizing and reproducing training data. Metrics like Self-BLEU or Distinct-N quantify the uniqueness of generated tokens or n-grams.
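For example, Distinct-N can be sketched as the ratio of unique n-grams to total n-grams across a set of generations (assuming whitespace tokenization):

```python
def distinct_n(texts, n=2):
    # Ratio of unique n-grams to total n-grams across all generations;
    # values near 1.0 indicate diverse output, near 0.0 heavy repetition.
    total, unique = 0, set()
    for text in texts:
        tokens = text.split()
        grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        total += len(grams)
        unique.update(grams)
    return len(unique) / total if total else 0.0
```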
Continuous Monitoring and Improvement:
  1. Online Learning: Incorporating user feedback and interactions into the model’s training process allows for continuous improvement. Techniques like active learning and reinforcement learning can be employed to update the model based on real-world data.
  2. Concept Drift Detection: Monitoring the model’s performance for concept drift helps identify when the data distribution has shifted, and the model’s predictions become less accurate. Techniques like adaptive windowing and ensemble learning can help detect and mitigate concept drift.
  3. Explainable AI: Providing explanations for the model’s predictions enhances transparency and trust. Techniques like attention visualization and feature importance analysis can help users understand the factors influencing the model’s outputs.
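As an illustrative sketch of drift detection, here is a rolling-accuracy monitor, deliberately much simpler than adaptive-windowing methods such as ADWIN:

```python
from collections import deque

def make_drift_monitor(window=100, baseline=0.9, tolerance=0.1):
    # Returns an observer that flags drift once rolling accuracy over the
    # last `window` predictions falls more than `tolerance` below baseline.
    recent = deque(maxlen=window)

    def observe(correct: bool) -> bool:
        recent.append(1 if correct else 0)
        accuracy = sum(recent) / len(recent)
        return len(recent) == window and accuracy < baseline - tolerance

    return observe
```

In production the flag would trigger an alert or a retraining job rather than just returning a boolean.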


Preeth P

Machine Learning Engineer
