# huggingface-community-evals

Installs: 219
Rank: #9676

## Install

```shell
npx skills add https://github.com/huggingface/skills --skill huggingface-community-evals
```
## Overview

This skill is for **running evaluations against models on the Hugging Face Hub on local hardware**.

It covers:

- `inspect-ai` with local inference
- `lighteval` with local inference
- choosing between `vllm`, Hugging Face Transformers, and `accelerate`
- smoke tests, task selection, and backend fallback strategy

It does **not** cover:

- Hugging Face Jobs orchestration
- model-card or `model-index` edits
- README table extraction
- Artificial Analysis imports
- `.eval_results` generation or publishing
- PR creation or community-evals automation

If the user wants to **run the same eval remotely on Hugging Face Jobs**, hand off to the `hugging-face-jobs` skill and pass it one of the local scripts in this skill.

If the user wants to **publish results into the community evals workflow**, stop after generating the evaluation run and hand off that publishing step to `~/code/community-evals`.

All paths below are relative to the directory containing this `SKILL.md`.
## When To Use Which Script

| Use case | Script |
| --- | --- |
| Local `inspect-ai` eval on a Hub model via inference providers | `scripts/inspect_eval_uv.py` |
| Local GPU eval with `inspect-ai` using `vllm` or Transformers | `scripts/inspect_vllm_uv.py` |
| Local GPU eval with `lighteval` using `vllm` or `accelerate` | `scripts/lighteval_vllm_uv.py` |
| Extra command patterns | `examples/USAGE_EXAMPLES.md` |
## Prerequisites

- Prefer `uv run` for local execution.
- Set `HF_TOKEN` for gated/private models.
- For local GPU runs, verify GPU access before starting:

```shell
uv --version
printenv HF_TOKEN > /dev/null
nvidia-smi
```

If `nvidia-smi` is unavailable, either:

- use `scripts/inspect_eval_uv.py` for lighter provider-backed evaluation, or
- hand off to the `hugging-face-jobs` skill if the user wants remote compute.
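The same three checks can be run from Python before kicking off an eval. A minimal stdlib-only sketch (the `preflight` helper is hypothetical, not part of this skill's scripts):

```python
import os
import shutil

def preflight() -> dict:
    """Mirror the manual checks above: uv on PATH, HF_TOKEN set, GPU visible."""
    return {
        "uv": shutil.which("uv") is not None,
        "hf_token": bool(os.environ.get("HF_TOKEN")),
        "nvidia_smi": shutil.which("nvidia-smi") is not None,
    }

checks = preflight()
if not checks["nvidia_smi"]:
    # No local GPU: prefer scripts/inspect_eval_uv.py, or hand off to
    # the hugging-face-jobs skill for remote compute.
    print("no GPU detected; use the provider-backed script or remote jobs")
```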
## Core Workflow

1. **Choose the evaluation framework.**
   - Use `inspect-ai` when you want explicit task control and inspect-native flows.
   - Use `lighteval` when the benchmark is naturally expressed as a lighteval task string, especially leaderboard-style tasks.
2. **Choose the inference backend.**
   - Prefer `vllm` for throughput on supported architectures.
   - Use Hugging Face Transformers (`--backend hf`) or `accelerate` as compatibility fallbacks.
3. **Start with a smoke test.**
   - `inspect-ai`: add `--limit 10` or similar.
   - `lighteval`: add `--max-samples 10`.
4. Scale up only after the smoke test passes.
5. If the user wants remote execution, hand off to `hugging-face-jobs` with the same script + args.

## Quick Start

### Option A: inspect-ai with Local Inference Providers

Best when the model is already supported by Hugging Face Inference Providers and you want the lowest local setup overhead.

```shell
uv run scripts/inspect_eval_uv.py \
  --model meta-llama/Llama-3.2-1B \
  --task mmlu \
  --limit 20
```

Use this path when:

- you want a quick local smoke test
- you do not need direct GPU control
- the task already exists in `inspect-evals`

### Option B: inspect-ai on Local GPU

Best when you need to load the Hub model directly, use `vllm`, or fall back to Transformers for unsupported architectures.

Local GPU:

```shell
uv run scripts/inspect_vllm_uv.py \
  --model meta-llama/Llama-3.2-1B \
  --task gsm8k \
  --limit 20
```

Transformers fallback:

```shell
uv run scripts/inspect_vllm_uv.py \
  --model microsoft/phi-2 \
  --task mmlu \
  --backend hf \
  --trust-remote-code \
  --limit 20
```

### Option C: lighteval on Local GPU

Best when the task is naturally expressed as a lighteval task string, especially Open LLM Leaderboard style benchmarks.

Local GPU:

```shell
uv run scripts/lighteval_vllm_uv.py \
  --model meta-llama/Llama-3.2-3B-Instruct \
  --tasks "leaderboard|mmlu|5,leaderboard|gsm8k|5" \
  --max-samples 20 \
  --use-chat-template
```

accelerate fallback:

```shell
uv run scripts/lighteval_vllm_uv.py \
  --model microsoft/phi-2 \
  --tasks "leaderboard|mmlu|5" \
  --backend accelerate \
  --trust-remote-code \
  --max-samples 20
```

## Remote Execution Boundary

This skill intentionally stops at **local execution and backend selection**. If the user wants to:

- run these scripts on Hugging Face Jobs
- pick remote hardware
- pass secrets to remote jobs
- schedule recurring runs
- inspect / cancel / monitor jobs

then switch to the `hugging-face-jobs` skill and pass it one of these scripts plus the chosen arguments.
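The backend fallback order described above can be summarized as a tiny decision helper. This is a hypothetical sketch for clarity only; both scripts implement their own `--backend` handling:

```python
def pick_backend(framework: str, vllm_supported: bool) -> str:
    """Return the preferred --backend value given the fallback order above.

    framework is "inspect-ai" or "lighteval"; vllm_supported says whether
    vllm can load the model's architecture.
    """
    if vllm_supported:
        return "vllm"  # preferred for throughput on supported architectures
    # Compatibility fallbacks: Transformers for inspect-ai, accelerate for lighteval.
    return "hf" if framework == "inspect-ai" else "accelerate"
```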
## Task Selection

`inspect-ai` examples:

- `mmlu`
- `gsm8k`
- `hellaswag`
- `arc_challenge`
- `truthfulqa`
- `winogrande`
- `humaneval`

`lighteval` task strings use `suite|task|num_fewshot`:

- `leaderboard|mmlu|5`
- `leaderboard|gsm8k|5`
- `leaderboard|arc_challenge|25`
- `lighteval|hellaswag|0`

Multiple lighteval tasks can be comma-separated in `--tasks`.

## Backend Selection

- Prefer `inspect_vllm_uv.py --backend vllm` for fast GPU inference on supported architectures.
- Use `inspect_vllm_uv.py --backend hf` when `vllm` does not support the model.
- Prefer `lighteval_vllm_uv.py --backend vllm` for throughput on supported models.
- Use `lighteval_vllm_uv.py --backend accelerate` as the compatibility fallback.
- Use `inspect_eval_uv.py` when Inference Providers already cover the model and you do not need direct GPU control.

## Hardware Guidance

| Model size | Suggested local hardware |
| --- | --- |
| < 3B | consumer GPU / Apple Silicon / small dev GPU |
| 3B - 13B | stronger local GPU |
| 13B+ | high-memory local GPU, or hand off to `hugging-face-jobs` |

For smoke tests, prefer cheaper local runs plus `--limit` or `--max-samples`.

## Troubleshooting

CUDA or vLLM OOM:

- reduce `--batch-size`
- reduce `--gpu-memory-utilization`
- switch to a smaller model for the smoke test
- if necessary, hand off to `hugging-face-jobs`

Model unsupported by `vllm`:

- switch to `--backend hf` for `inspect-ai`
- switch to `--backend accelerate` for `lighteval`

Gated/private repo access fails: verify `HF_TOKEN`.

Custom model code required: add `--trust-remote-code`.

## Examples

See:

- `examples/USAGE_EXAMPLES.md` for local command patterns
- `scripts/inspect_eval_uv.py`
- `scripts/inspect_vllm_uv.py`
- `scripts/lighteval_vllm_uv.py`
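The `suite|task|num_fewshot` strings shown under Task Selection decompose mechanically. A small illustrative parser (lighteval handles this parsing itself; the helper below is only a sketch of the convention):

```python
def parse_lighteval_tasks(tasks: str) -> list:
    """Split a comma-separated --tasks value into (suite, task, num_fewshot)."""
    parsed = []
    for spec in tasks.split(","):
        suite, task, fewshot = spec.strip().split("|")
        parsed.append((suite, task, int(fewshot)))
    return parsed

# e.g. the Option C task string from Quick Start:
parse_lighteval_tasks("leaderboard|mmlu|5,leaderboard|gsm8k|5")
# → [("leaderboard", "mmlu", 5), ("leaderboard", "gsm8k", 5)]
```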