Evaluate Your Agent
TerminalWorld provides the task dataset. Evaluation runs through Harbor, an open framework for benchmarking agents on terminal tasks in isolated Docker environments.
How it fits together
01 Set up Harbor
Harbor handles Docker environment management, agent scaffolding, and automated grading. Follow the official setup guide to get it running:
Harbor setup guide ↗
02 Load the TerminalWorld dataset
The task dataset is hosted on Hugging Face. Download it to a local directory, then point Harbor at that directory when configuring your benchmark run.
$ huggingface-cli download EuniAI/TerminalWorld \
--repo-type dataset \
--local-dir ./terminalworld-tasks
03 Run your agent
Use Harbor to run your agent against the TerminalWorld tasks. Each task executes in an isolated Docker container with automated pass/fail grading. See the Harbor docs for agent interface details.
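Before starting a run, it can help to confirm that the dataset download actually populated the local directory. The following is a minimal sketch and not part of Harbor or TerminalWorld: `count_dataset_files` is a hypothetical helper name, and it assumes nothing about the dataset's internal layout beyond regular files living somewhere under the download directory.

```python
from pathlib import Path


def count_dataset_files(root: str) -> int:
    """Count regular files anywhere under the downloaded dataset directory.

    Raises FileNotFoundError when the directory is missing, so a wrong path
    fails fast instead of surfacing partway through a benchmark run.
    """
    root_path = Path(root)
    if not root_path.is_dir():
        raise FileNotFoundError(f"dataset directory not found: {root_path}")
    # Walk the whole tree; no particular task layout is assumed.
    return sum(1 for path in root_path.rglob("*") if path.is_file())
```

Calling `count_dataset_files("./terminalworld-tasks")` from the directory where the download ran should return a nonzero count; a result of zero suggests the download did not complete.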
# Example — refer to Harbor docs for exact flags
$ harbor eval \
--benchmark ./terminalworld-tasks \
--agent your_agent.py \
--output results.json
Running 1,530 tasks across 18 categories...
✓ Complete — pass rate: 62.5%
04 Submit to the leaderboard
Got results? Open a pull request that updates data/leaderboard.json in the GitHub repo. Include your results.json and model details. Results are verified and published within 48 hours.
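Before opening the pull request, a quick local summary of results.json can catch an empty or malformed file. This is a sketch under a loudly stated assumption: the record layout used here (a JSON list of objects with `category` and `passed` keys) is hypothetical, and Harbor's actual output schema may differ, so adjust the key names to match your file. `load_results` and `summarize` are illustrative helper names, not Harbor functions.

```python
import json
from collections import defaultdict


def load_results(path):
    """Load results from a JSON file; json.load raises on malformed input."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def summarize(results):
    """Return (overall_pass_rate, per_category_pass_rate) as percentages.

    ASSUMPTION: `results` is a list of dicts with hypothetical keys
    "category" (str) and "passed" (bool). Harbor's real schema may differ.
    """
    if not results:
        raise ValueError("results is empty")
    totals = defaultdict(int)
    passes = defaultdict(int)
    for record in results:
        totals[record["category"]] += 1
        passes[record["category"]] += bool(record["passed"])
    overall = 100.0 * sum(passes.values()) / len(results)
    by_category = {c: 100.0 * passes[c] / totals[c] for c in totals}
    return overall, by_category
```

For example, `summarize(load_results("results.json"))` returns the overall pass rate plus a per-category breakdown, which can be compared against the figure Harbor printed at the end of the run.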