
Evaluating Large Language Models

Welcome to the first article in our three-part series titled “Evaluating Large Language Models”.

Picture title: Major categories and subcategories of LLM evaluation
Source: https://arxiv.org/pdf/2310.19736.pdf

Benchmark | Focus | Domain | Evaluation Criteria
CUAD | Legal contract review | General language task | Legal contract understanding
MMLU | Text models | General language task | Multitask accuracy
TRUSTGPT | Ethics | Specific downstream task | Toxicity, bias and value alignment
OpenLLM | Chatbots | General language task | Leaderboard ranking
Chatbot Arena | Chat assistants | General language task | Crowdsourcing and Elo rating system
AlpacaEval | Automated evaluation | General language task | Metrics, robustness and diversity
ToolBench | Software tools | Specific downstream task | Execution success rate
FreshQA | Dynamic QA | Specific downstream task | Correctness and hallucination
PromptBench | Adversarial prompt resilience | General language task | Adversarial robustness
MT-Bench | Multi-turn conversation | General language task | Win rate judged by GPT-4
LLMEval | LLM evaluator | General language task | Accuracy, macro-F1 and kappa correlation coefficient
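Several entries in the table above, such as Chatbot Arena, rank models with an Elo rating system computed from crowdsourced pairwise comparisons. As a minimal sketch of how that works (the K-factor and starting rating here are illustrative assumptions, not Chatbot Arena's actual configuration):

```python
# Minimal sketch of an Elo rating update for pairwise model comparisons,
# as used by Chatbot Arena-style leaderboards.
# K-factor and starting ratings below are illustrative assumptions.

def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def elo_update(rating_a: float, rating_b: float,
               score_a: float, k: float = 32.0) -> tuple[float, float]:
    """Update both ratings after one comparison.

    score_a is 1.0 if A wins, 0.0 if B wins, 0.5 for a tie.
    """
    e_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - e_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - e_a))
    return new_a, new_b

# Two models start at 1000; model A wins one crowdsourced comparison.
a, b = elo_update(1000.0, 1000.0, 1.0)
print(a, b)  # 1016.0 984.0 — A gains 16 points, B loses 16
```

Because each update is small and symmetric, a model's rating converges toward a value reflecting its win rate against the whole pool, which is what the leaderboard ultimately ranks.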

Author

Preeth P

Machine Learning Engineer
