Overview

This skill is for running evaluations against models on the Hugging Face Hub on local hardware.

It covers:

- `inspect-ai` with local inference
- `lighteval` with local inference
- choosing between `vllm`, Hugging Face Transformers, and `accelerate`
- smoke tests, task selection, and backend fallback strategy

It does not cover:

- Hugging Face Jobs orchestration
- model-card or `model-index` edits
- README table extraction
- Artificial Analysis imports
- `.eval_results` generation or publishing
- PR creation or community-evals automation

If the user wants to run the same eval remotely on Hugging Face Jobs, hand off to the `hugging-face-jobs` skill and pass it one of the local scripts in this skill.

If the user wants to publish results into the community evals workflow, stop after generating the evaluation run and hand off that publishing step to `~/code/community-evals`.

All paths below are relative to the directory containing this `SKILL.md`.

When To Use Which Script

| Use case | Script |
|---|---|
| Local `inspect-ai` eval on a Hub model via inference providers | `scripts/inspect_eval_uv.py` |
| Local GPU eval with `inspect-ai` using `vllm` or Transformers | `scripts/inspect_vllm_uv.py` |
| Local GPU eval with `lighteval` using `vllm` or `accelerate` | `scripts/lighteval_vllm_uv.py` |
| Extra command patterns | `examples/USAGE_EXAMPLES.md` |

Prerequisites

- Prefer `uv run` for local execution.
- Set `HF_TOKEN` for gated/private models.

For local GPU runs, verify GPU access before starting:

    uv --version
    printenv HF_TOKEN >/dev/null
    nvidia-smi

If `nvidia-smi` is unavailable, either:

- use `scripts/inspect_eval_uv.py` for lighter provider-backed evaluation, or
- hand off to the `hugging-face-jobs` skill if the user wants remote compute.
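The GPU check and the fallback rule above can be folded into a small helper. This is only a sketch: the function names `detect_gpu` and `choose_script` and the `inspect` / `lighteval` argument values are hypothetical and not part of the skill's scripts. It assumes that a missing or failing `nvidia-smi` means no usable local GPU, and that the provider-backed `inspect-ai` script is an acceptable fallback even when `lighteval` was requested.

```shell
# Hypothetical helper: report whether a local NVIDIA GPU is usable.
# Assumption: nvidia-smi must both exist on PATH and exit successfully.
detect_gpu() {
    if command -v nvidia-smi >/dev/null 2>&1 && nvidia-smi >/dev/null 2>&1; then
        echo "yes"
    else
        echo "no"
    fi
}

# Hypothetical helper: pick one of this skill's scripts.
#   $1 = framework ("inspect" or "lighteval")
#   $2 = GPU availability ("yes" or "no")
choose_script() {
    framework="$1"
    has_gpu="$2"
    if [ "$has_gpu" != "yes" ]; then
        # No local GPU: use the lighter provider-backed inspect-ai script.
        echo "scripts/inspect_eval_uv.py"
    elif [ "$framework" = "lighteval" ]; then
        echo "scripts/lighteval_vllm_uv.py"
    else
        echo "scripts/inspect_vllm_uv.py"
    fi
}
```

A typical call would be `choose_script lighteval "$(detect_gpu)"`. If the user prefers remote compute over the provider-backed path, the hand-off to `hugging-face-jobs` still applies and this helper should not be used to decide that.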

Core Workflow

1. Choose the evaluation framework.
   - Use `inspect-ai` when you want explicit task control and inspect-native flows.
   - Use `lighteval` when the benchmark is naturally expressed as a lighteval task string, especially leaderboard-style tasks.
2. Choose the inference backend.
   - Prefer `vllm` for throughput on supported architectures.
   - Use Hugging Face Transformers (`--backend hf`) or `accelerate` as compatibility fallbacks.

3. Start with a smoke test.
   - `inspect-ai`: add `--limit 10` or similar.
   - `lighteval`: add `--max-samples 10`.
4. Scale up only after the smoke test passes.
5. If the user wants remote execution, hand off to `hugging-face-jobs` with the same script + args.

Quick Start

Option A: inspect-ai with local inference providers path

Best when the model is already supported by Hugging Face Inference Providers and you want the lowest local setup overhead.

    uv run scripts/inspect_eval_uv.py \
      --model meta-llama/Llama-3.2-1B \
      --task mmlu \
      --limit 20

Use this path when:

- you want a quick local smoke test
- you do not need direct GPU control
- the task already exists in `inspect-evals`

Option B: inspect-ai on Local GPU

Best when you need to load the Hub model directly, use `vllm`, or fall back to Transformers for unsupported architectures.

Local GPU:

    uv run scripts/inspect_vllm_uv.py \
      --model meta-llama/Llama-3.2-1B \
      --task gsm8k \
      --limit 20

Transformers fallback:

    uv run scripts/inspect_vllm_uv.py \
      --model microsoft/phi-2 \
      --task mmlu \
      --backend hf \
      --trust-remote-code \
      --limit 20

Option C: lighteval on Local GPU

Best when the task is naturally expressed as a `lighteval` task string, especially Open LLM Leaderboard style benchmarks.

Local GPU:

    uv run scripts/lighteval_vllm_uv.py \
      --model meta-llama/Llama-3.2-3B-Instruct \
      --tasks "leaderboard|mmlu|5,leaderboard|gsm8k|5" \
      --max-samples 20 \
      --use-chat-template

`accelerate` fallback:

    uv run scripts/lighteval_vllm_uv.py \
      --model microsoft/phi-2 \
      --tasks "leaderboard|mmlu|5" \
      --backend accelerate \
      --trust-remote-code \
      --max-samples 20

Remote Execution Boundary

This skill intentionally stops at local execution and backend selection. If the user wants to:

- run these scripts on Hugging Face Jobs
- pick remote hardware
- pass secrets to remote jobs
- schedule recurring runs
- inspect / cancel / monitor jobs

then switch to the `hugging-face-jobs` skill and pass it one of these scripts plus the chosen arguments.
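The smoke-test step and the `vllm`-first fallback used in Options B and C can be combined into one wrapper. The sketch below is illustrative only: `fallback_backend`, `run_smoke`, `smoke_with_fallback`, and the `DRY_RUN` variable are hypothetical names, not part of the skill. It uses only flags that appear in this document (`--model`, `--task`, `--tasks`, `--backend`, `--limit`, `--max-samples`) and assumes that a non-zero exit from the `vllm` run is reason enough to retry with the compatibility backend.

```shell
# Hypothetical helper: the compatibility backend for each framework.
fallback_backend() {
    case "$1" in
        inspect)   echo "hf" ;;
        lighteval) echo "accelerate" ;;
        *)         return 1 ;;
    esac
}

# Hypothetical helper: run (or, with DRY_RUN set, just print) a 10-sample smoke test.
#   $1 = framework, $2 = model id, $3 = task or task string, $4 = backend
run_smoke() {
    framework="$1"; model="$2"; task="$3"; backend="$4"
    case "$framework" in
        inspect)
            set -- scripts/inspect_vllm_uv.py --model "$model" --task "$task" \
                --backend "$backend" --limit 10 ;;
        lighteval)
            set -- scripts/lighteval_vllm_uv.py --model "$model" --tasks "$task" \
                --backend "$backend" --max-samples 10 ;;
        *)
            echo "unknown framework: $framework" >&2
            return 1 ;;
    esac
    if [ -n "${DRY_RUN:-}" ]; then
        echo "uv run $*"      # print only; nothing is executed
    else
        uv run "$@"           # arguments stay quoted, so "|" in task strings is safe
    fi
}

# Try vllm first, then retry once with the compatibility backend.
smoke_with_fallback() {
    framework="$1"; model="$2"; task="$3"
    run_smoke "$framework" "$model" "$task" vllm && return 0
    fb="$(fallback_backend "$framework")" || return 1
    echo "vllm smoke test failed; retrying with --backend $fb" >&2
    run_smoke "$framework" "$model" "$task" "$fb"
}
```

For example, `smoke_with_fallback lighteval microsoft/phi-2 "leaderboard|mmlu|5"` would attempt `vllm` and then `accelerate`. Models that need custom code would still require `--trust-remote-code`, which this sketch leaves out for brevity.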
Task Selection

`inspect-ai` examples:

- `mmlu`
- `gsm8k`
- `hellaswag`
- `arc_challenge`
- `truthfulqa`
- `winogrande`
- `humaneval`

`lighteval` task strings use `suite|task|num_fewshot`:

- `leaderboard|mmlu|5`
- `leaderboard|gsm8k|5`
- `leaderboard|arc_challenge|25`
- `lighteval|hellaswag|0`

Multiple `lighteval` tasks can be comma-separated in `--tasks`.

Backend Selection

- Prefer `inspect_vllm_uv.py --backend vllm` for fast GPU inference on supported architectures.
- Use `inspect_vllm_uv.py --backend hf` when `vllm` does not support the model.
- Prefer `lighteval_vllm_uv.py --backend vllm` for throughput on supported models.
- Use `lighteval_vllm_uv.py --backend accelerate` as the compatibility fallback.
- Use `inspect_eval_uv.py` when Inference Providers already cover the model and you do not need direct GPU control.

Hardware Guidance

| Model size | Suggested local hardware |
|---|---|
| < 3B | consumer GPU / Apple Silicon / small dev GPU |
| 3B - 13B | stronger local GPU |
| 13B+ | high-memory local GPU or hand off to `hugging-face-jobs` |

For smoke tests, prefer cheaper local runs plus `--limit` or `--max-samples`.

Troubleshooting

- CUDA or vLLM OOM:
  - reduce `--batch-size`
  - reduce `--gpu-memory-utilization`
  - switch to a smaller model for the smoke test
  - if necessary, hand off to `hugging-face-jobs`
- Model unsupported by `vllm`:
  - switch to `--backend hf` for `inspect-ai`
  - switch to `--backend accelerate` for `lighteval`
- Gated/private repo access fails:
  - verify `HF_TOKEN`
- Custom model code required:
  - add `--trust-remote-code`

Examples

See:

- `examples/USAGE_EXAMPLES.md` for local command patterns
- `scripts/inspect_eval_uv.py`
- `scripts/inspect_vllm_uv.py`
- `scripts/lighteval_vllm_uv.py`
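The `suite|task|num_fewshot` format from Task Selection is easy to get wrong in a shell because `|` is the pipe operator. As a minimal sketch, the hypothetical helpers `task_string` and `join_tasks` (illustrative names, not part of the skill) assemble task strings and the comma-separated value for `--tasks`. The check that `num_fewshot` is a non-negative integer is an assumption drawn from the examples above, not a documented rule.

```shell
# Hypothetical helper: build one lighteval task string as suite|task|num_fewshot.
task_string() {
    suite="$1"; task="$2"; fewshot="$3"
    case "$fewshot" in
        ''|*[!0-9]*)
            echo "num_fewshot must be a non-negative integer: $fewshot" >&2
            return 1 ;;
    esac
    printf '%s|%s|%s\n' "$suite" "$task" "$fewshot"
}

# Hypothetical helper: join any number of task strings with commas for --tasks.
join_tasks() {
    joined=""
    for t in "$@"; do
        joined="${joined:+$joined,}$t"
    done
    printf '%s\n' "$joined"
}
```

The result should always be passed in double quotes, for example `--tasks "$(join_tasks "$(task_string leaderboard mmlu 5)" "$(task_string leaderboard gsm8k 5)")"`, so the shell never interprets the `|` characters.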
