home/categories/debugging/orchestra-research-ai-research-skills-11-evaluation-lm-evaluation-harness-skill-md
debuggingtools
evaluating-llms-harness
Evaluates LLMs across 60+ academic benchmarks (MMLU, HumanEval, GSM8K, TruthfulQA, HellaSwag). Use when benchmarking model quality, comparing models, reporting academic results, or tracking training progress. Industry standard used by EleutherAI, HuggingFace, and major labs. Supports HuggingFace, vLLM, APIs.
maintainer
Orchestra-Research
آخر تحديث 11/20/2025
النجوم
6563
التفرعات
515
quick start
Installation and usage
Evaluates LLMs across 60+ academic benchmarks (MMLU, HumanEval, GSM8K, TruthfulQA, HellaSwag). Use when benchmarking model quality, comparing models, reporting academic results, or tracking training progress. Industry standard used by EleutherAI, HuggingFace, and major labs. Supports HuggingFace, vLLM, APIs.
التثبيت
$ install --globalskills.sh
الاستخدام
بعد التثبيت، يمكنك استخدام هذه المهارة بتشغيل الأمر التالي في الطرفية:
skills use evaluating-llms-harness