QUICKSTART

Evaluate Your Agent

TerminalWorld provides the task dataset. Evaluation runs through Harbor, an open framework for benchmarking agents on terminal tasks in isolated Docker environments.

How it fits together

TerminalWorld (1,530 tasks) → Harbor framework (eval engine) → Pass rate (results)

01 Set up Harbor

Harbor handles Docker environment management, agent scaffolding, and automated grading. Follow the official setup guide to get it running:

Harbor setup guide ↗
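Because Harbor depends on Docker, it can help to confirm both tools are reachable before starting a long run. The following is a minimal pre-flight sketch, not part of Harbor itself: the function names `have_cmd` and `preflight` are illustrative, and it assumes the Harbor CLI is installed on your PATH as `harbor`, as in the run command in step 02.

```shell
#!/bin/sh
# Illustrative pre-flight check; have_cmd and preflight are hypothetical helpers.

# Succeed if the named command is on PATH.
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

# Print one status line per prerequisite; return non-zero if any is missing.
preflight() {
    status=0
    for tool in docker harbor; do
        if have_cmd "$tool"; then
            echo "ok: $tool found"
        else
            echo "missing: $tool is not on PATH"
            status=1
        fi
    done
    # `docker info` fails when the Docker daemon cannot be reached.
    if have_cmd docker && ! docker info >/dev/null 2>&1; then
        echo "missing: Docker daemon is not reachable"
        status=1
    fi
    return "$status"
}

preflight || echo "Fix the items above before running harbor."
```

The script only reports problems and never installs anything. The official setup guide remains the reference for installation.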

02 Run your agent

Harbor pulls the TerminalWorld dataset directly from Hugging Face. Pass your agent with -a, your model with -m, and set the number of parallel tasks with -n. Each task executes in an isolated Docker container with automated pass/fail grading. See the Harbor docs for all available flags and agent interface details.


$ harbor run \
    -d EuniAI/TerminalWorld \
    -m your_model \
    -a your_agent \
    -n 32
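If you rerun the evaluation with different models or parallelism levels, a small wrapper can keep the invocation consistent. This is a sketch under stated assumptions: the environment variables `MODEL`, `AGENT`, `N_PARALLEL`, and `RUN`, the helper `harbor_cmd`, and the log file name `harbor_run.log` are all hypothetical, and it uses only the flags shown in the command above.

```shell
#!/bin/sh
# Hypothetical wrapper around the run command; variable names are illustrative.
MODEL="${MODEL:-your_model}"
AGENT="${AGENT:-your_agent}"
N_PARALLEL="${N_PARALLEL:-32}"

# Print the command line that would be executed.
harbor_cmd() {
    printf 'harbor run -d EuniAI/TerminalWorld -m %s -a %s -n %s\n' \
        "$MODEL" "$AGENT" "$N_PARALLEL"
}

# Dry run by default; set RUN=1 to execute and keep a copy of the output.
if [ "${RUN:-0}" = "1" ]; then
    harbor run -d EuniAI/TerminalWorld -m "$MODEL" -a "$AGENT" -n "$N_PARALLEL" 2>&1 \
        | tee harbor_run.log
else
    harbor_cmd
fi
```

Run without `RUN=1`, the script only prints the command, which makes it easy to review the settings first. A lower `N_PARALLEL` may suit machines with limited CPU or memory, since each task runs in its own container.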

03 Submit to the leaderboard

Got results? Open a pull request that updates data/leaderboard.json in the GitHub repo, and include your results.json and model details. Results are verified and published within a week.

Open a PR on GitHub ↗

More resources

Harbor Docs ↗ (evaluation framework)

Dataset ↗ (all 1,530 tasks on Hugging Face)

Leaderboard → (current results)