
Evaluating Large Language Models


Welcome to the first article in our three-part series titled “Evaluating Large Language Models”.

Figure: Major categories and subcategories of LLM evaluation. Source: https://arxiv.org/pdf/2310.19736.pdf

| Benchmark | Focus | Domain | Evaluation Criteria |
|---|---|---|---|
| CUAD | Legal contract review | Specific downstream task | Legal contract understanding |
| MMLU | Text models | General language task | Multitask accuracy |
| TrustGPT | Ethics | Specific downstream task | Toxicity, bias, and value alignment |
| OpenLLM | Chatbots | General language task | Leaderboard ranking |
| Chatbot Arena | Chat assistants | General language task | Crowdsourcing and Elo rating system (see the sketch below) |
| AlpacaEval | Automated evaluation | General language task | Metrics, robustness, and diversity |
| ToolBench | Software tools | Specific downstream task | Execution success rate |
| FreshQA | Dynamic QA | Specific downstream task | Correctness and hallucination |
| PromptBench | Adversarial prompt resilience | General language task | Adversarial robustness |
| MT-Bench | Multi-turn conversation | General language task | Win rate judged by GPT-4 |
| LLMEval | LLM evaluator | General language task | Accuracy, macro-F1, and kappa correlation coefficient |
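
Several of the criteria above are quantitative rating schemes. As an illustration, here is a minimal sketch of the Elo update that underlies Chatbot Arena-style crowdsourced rankings: each head-to-head vote between two models shifts their ratings in proportion to how surprising the outcome was. The K-factor, starting ratings, and vote data below are illustrative assumptions, not Chatbot Arena's actual configuration.

```python
# Hypothetical sketch of an Elo update for pairwise model comparisons,
# in the spirit of Chatbot Arena's rating scheme. The K-factor, base
# rating, and votes are illustrative assumptions only.

def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def update_elo(rating_a: float, rating_b: float, score_a: float, k: float = 32):
    """Return updated ratings after one comparison.
    score_a is 1.0 if A wins, 0.0 if B wins, 0.5 for a tie."""
    e_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - e_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - e_a))
    return new_a, new_b

# Toy example: three crowdsourced votes between two models,
# both starting from an assumed base rating of 1000.
ratings = {"model_a": 1000.0, "model_b": 1000.0}
votes = [
    ("model_a", "model_b", 1.0),  # A wins
    ("model_a", "model_b", 0.5),  # tie
    ("model_a", "model_b", 0.0),  # B wins
]

for a, b, score in votes:
    ratings[a], ratings[b] = update_elo(ratings[a], ratings[b], score)

print(ratings)
```

In practice, leaderboards of this kind aggregate many thousands of such pairwise votes, so the noise in any individual human judgment averages out of the final ranking.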

About Preeth P

Machine Learning Engineer