
Evaluating Large Language Models


Welcome to the first article in our three-part series titled “Evaluating Large Language Models”.

Figure: Major categories and subcategories of LLM evaluation
Source: https://arxiv.org/pdf/2310.19736.pdf

The table below summarizes widely used LLM evaluation benchmarks, along with their focus, the domain they cover, and their evaluation criteria.

Benchmark | Focus | Domain | Evaluation Criteria
CUAD | Legal contract review | Specific downstream task | Legal contract understanding
MMLU | Text models | General language task | Multitask accuracy
TRUSTGPT | Ethics | Specific downstream task | Toxicity, bias, and value alignment
OpenLLM | Chatbots | General language task | Leaderboard ranking
Chatbot Arena | Chat assistants | General language task | Crowdsourcing and Elo rating system
AlpacaEval | Automated evaluation | General language task | Metrics, robustness, and diversity
ToolBench | Software tools | Specific downstream task | Execution success rate
FreshQA | Dynamic QA | Specific downstream task | Correctness and hallucination
PromptBench | Adversarial prompt resilience | General language task | Adversarial robustness
MT-Bench | Multi-turn conversation | General language task | Win rate judged by GPT-4
LLMEval | LLM evaluator | General language task | Accuracy, macro-F1, and kappa correlation coefficient
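
Two of the evaluation criteria above are worth a quick sketch. Chatbot Arena ranks chat assistants by collecting crowdsourced pairwise votes and updating Elo ratings. Below is a minimal Python sketch of a standard Elo update; the K-factor of 32 and the starting rating of 1000 are illustrative assumptions, not Chatbot Arena's exact parameters.

# A minimal sketch of the Elo update behind pairwise "arena" style
# leaderboards. The K-factor (32) and starting rating (1000) are
# illustrative assumptions, not Chatbot Arena's exact parameters.

def expected_score(r_a: float, r_b: float) -> float:
    # Probability that model A beats model B under the Elo model.
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32.0):
    # score_a is 1.0 if A wins, 0.0 if B wins, 0.5 for a tie.
    e_a = expected_score(r_a, r_b)
    new_r_a = r_a + k * (score_a - e_a)
    new_r_b = r_b + k * ((1.0 - score_a) - (1.0 - e_a))
    return new_r_a, new_r_b

# Example: both models start at 1000; model A wins one crowdsourced vote.
ratings = {"model_a": 1000.0, "model_b": 1000.0}
ratings["model_a"], ratings["model_b"] = elo_update(
    ratings["model_a"], ratings["model_b"], score_a=1.0
)
print(ratings)  # model A gains exactly the points model B loses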
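
LLMEval's criteria of accuracy, macro-F1, and the kappa correlation coefficient measure how closely an LLM evaluator's judgments agree with human labels. Here is a minimal sketch, assuming scikit-learn is installed; the labels are made up for illustration and are not from any real evaluation run.

# A minimal sketch, assuming scikit-learn is installed; the labels are
# invented for illustration, not taken from any real evaluation run.
from sklearn.metrics import accuracy_score, cohen_kappa_score, f1_score

human_judgments = ["good", "bad", "good", "neutral", "bad", "good"]
llm_judgments = ["good", "bad", "neutral", "neutral", "bad", "bad"]

acc = accuracy_score(human_judgments, llm_judgments)
macro_f1 = f1_score(human_judgments, llm_judgments, average="macro")
kappa = cohen_kappa_score(human_judgments, llm_judgments)  # chance-corrected agreement

print(f"accuracy={acc:.2f}  macro-F1={macro_f1:.2f}  kappa={kappa:.2f}")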

About Preeth P

Machine Learning Engineer