home/categories/academic/davila7-claude-code-templates-cli-tool-components-skills-ai-research-evaluation-lm-evaluation-harness-skill-md
academicresearch
evaluating-llms-harness
Evaluates LLMs across 60+ academic benchmarks (MMLU, HumanEval, GSM8K, TruthfulQA, HellaSwag). Use when benchmarking model quality, comparing models, reporting academic results, or tracking training progress. Industry standard used by EleutherAI, HuggingFace, and major labs. Supports HuggingFace, vLLM, APIs.
maintainer
davila7
Updated 1/20/2026
Stars
17577
Forks
1576
quick start
Installation and usage
Evaluates LLMs across 60+ academic benchmarks (MMLU, HumanEval, GSM8K, TruthfulQA, HellaSwag). Use when benchmarking model quality, comparing models, reporting academic results, or tracking training progress. Industry standard used by EleutherAI, HuggingFace, and major labs. Supports HuggingFace, vLLM, APIs.
Installation
$ install --globalskills.sh
Usage
Once installed, you can use this skill by running the following command in your terminal:
skills use evaluating-llms-harness