QUICKSTART

Evaluate Your Agent

TerminalWorld provides the task dataset. Evaluation runs through Harbor, an open framework for benchmarking agents on terminal tasks in isolated Docker environments.

How it fits together

TerminalWorld (1,530 tasks) → Harbor framework (eval engine) → pass rate (results)

01 Set up Harbor

Harbor handles Docker environment management, agent scaffolding, and automated grading. Follow the official setup guide to get it running:

Harbor setup guide ↗
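Before starting a long run, it can help to confirm that the tools this quickstart relies on are actually installed. The following is a minimal preflight sketch, not part of Harbor itself. The helper name `require_cmds` is hypothetical, and the `harbor` command name is taken from the example in step 03, so adjust it if your install exposes a different entry point.

```shell
#!/bin/sh
# Preflight sketch (hypothetical helper, not a Harbor feature).

# Print each missing command to stderr; return 0 only if all are on PATH.
require_cmds() {
    missing=0
    for cmd in "$@"; do
        if ! command -v "$cmd" >/dev/null 2>&1; then
            echo "missing: $cmd" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Usage (uncomment to run). `docker info` fails if the Docker daemon
# is not reachable, which catches a stopped daemon early.
# require_cmds docker huggingface-cli harbor && docker info >/dev/null
```

If the check fails, the names printed to stderr show which tool to install or add to your PATH.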

02 Load the TerminalWorld dataset

The task dataset is hosted on HuggingFace. Download it to a local directory, then point Harbor at that directory when configuring your benchmark run.

terminal
$ huggingface-cli download EuniAI/TerminalWorld \
    --repo-type dataset \
    --local-dir ./terminalworld-tasks
Browse the dataset on HuggingFace ↗
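After the download finishes, a quick sanity check can confirm that files actually landed in the target directory. The sketch below only counts files. It assumes nothing about the dataset's internal layout, and the helper name `count_files` is hypothetical.

```shell
#!/bin/sh
# Count regular files under a directory (prints 0 if it does not exist).
# Any directory named .cache is skipped, since the Hugging Face client
# may create one inside the download directory for its own metadata.
count_files() {
    if [ -d "$1" ]; then
        find "$1" -name .cache -prune -o -type f -print | wc -l | tr -d ' '
    else
        echo 0
    fi
}

# Usage (uncomment to run):
# [ "$(count_files ./terminalworld-tasks)" -gt 0 ] || echo "download looks empty" >&2
```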

03 Run your agent

Use Harbor to run your agent against the TerminalWorld tasks. Each task executes in an isolated Docker container with automated pass/fail grading. See the Harbor docs for agent interface details.

terminal
# Illustrative example; refer to the Harbor docs for the exact flags
$ harbor eval \
    --benchmark ./terminalworld-tasks \
    --agent your_agent.py \
    --output results.json

Running 1,530 tasks across 18 categories...
✓ Complete  —  pass rate: 62.5%
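The reported pass rate is the number of passed tasks divided by the total, expressed as a percentage. The sketch below shows that arithmetic on its own. It takes the two counts as arguments rather than reading results.json, because the file's schema is not described here. The helper name `pass_rate` is hypothetical, and 956 is a made-up count chosen only because it rounds to the 62.5% shown in the example output.

```shell
#!/bin/sh
# pass_rate PASSED TOTAL -> percentage with one decimal place.
pass_rate() {
    if [ "$2" -le 0 ]; then
        echo "total must be positive" >&2
        return 1
    fi
    awk -v p="$1" -v t="$2" 'BEGIN { printf "%.1f\n", 100 * p / t }'
}

# Hypothetical counts: 956 passed out of the 1,530 tasks.
pass_rate 956 1530   # prints 62.5
```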

04 Submit to the leaderboard

Got results? Open a pull request that updates data/leaderboard.json in the GitHub repo. Include your results.json and model details. Results are verified and published within 48 hours.
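Before opening the pull request, it may be worth checking that the JSON files you touched still parse, since a stray comma is easy to introduce when editing by hand. This is a minimal sketch using Python's standard `json.tool` module. The helper name `is_valid_json` is hypothetical, and the check says nothing about whether the content matches the schema the leaderboard expects.

```shell
#!/bin/sh
# Return 0 if FILE exists and parses as JSON, non-zero otherwise.
is_valid_json() {
    [ -f "$1" ] && python3 -m json.tool "$1" >/dev/null 2>&1
}

# Usage (uncomment to run from the repo root):
# is_valid_json results.json && is_valid_json data/leaderboard.json \
#     || echo "fix JSON syntax before opening the PR" >&2
```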

Open a PR on GitHub ↗