Leaderboard

Performance of frontier models on TerminalWorld, evaluated using the Terminus-2 agent framework. Updated May 21, 2026.

Verified Subset: 200 tasks, manually verified

Agent Framework: Terminus-2, the standardized scaffold from Terminal-Bench

Eval Harness: Harbor, the Terminal-Bench evaluation harness

Verified Split Results

200 tasks

Rank	Model	Provider	Agent	Score	Solved / Total	Date	Total Cost	Cost / Pass
1	Claude Opus 4.7	Anthropic	Terminus-2	62.5%	125/200	May 21, 2026	$63.47	$0.51
2	Kimi K2.6	Moonshot AI	Terminus-2	57.5%	115/200	May 21, 2026	$17.68	$0.15
3	GLM 5.1	Z.ai	Terminus-2	57.0%	114/200	May 21, 2026	$18.24	$0.16
4	Gemini 3.1 Pro	Google	Terminus-2	55.0%	110/200	May 21, 2026	$56.82	$0.52
5	Qwen3.6-Max-Preview	Alibaba	Terminus-2	54.0%	108/200	May 21, 2026	$21.44	$0.20
6	GPT-5.5	OpenAI	Terminus-2	53.5%	107/200	May 21, 2026	$100.28	$0.94
7	DeepSeek-V4-Pro	DeepSeek	Terminus-2	50.0%	100/200	May 21, 2026	$17.35	$0.17
8	MiniMax M2.7	MiniMax	Terminus-2	49.0%	98/200	May 21, 2026	$10.95	$0.11

All models evaluated at default thinking/reasoning effort.
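
The Score and Cost / Pass columns can be recomputed from Solved / Total and Total Cost. The sketch below assumes that Cost / Pass is the total cost divided by the number of solved tasks, rounded to cents. The page does not state the formula, but this assumption reproduces the published figures. The function names are illustrative.

```python
def pass_rate(solved: int, total: int) -> float:
    """Percentage of tasks solved, rounded to one decimal place."""
    return round(100 * solved / total, 1)


def cost_per_pass(total_cost: float, solved: int) -> float:
    """Dollars spent per solved task, rounded to cents (assumed formula)."""
    return round(total_cost / solved, 2)


# Rows copied from the results table: (model, solved, total, total cost in USD).
rows = [
    ("Claude Opus 4.7", 125, 200, 63.47),
    ("Kimi K2.6", 115, 200, 17.68),
    ("GPT-5.5", 107, 200, 100.28),
    ("MiniMax M2.7", 98, 200, 10.95),
]

for model, solved, total, cost in rows:
    rate = pass_rate(solved, total)
    per_pass = cost_per_pass(cost, solved)
    print(f"{model}: {rate}% at ${per_pass:.2f} per pass")
```

Read this way, the cheapest model per solved task (MiniMax M2.7) is also the lowest ranked, so the two columns answer different questions.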

Submit Your Model

Evaluated your model on TerminalWorld? Open a pull request to add your results. See the submission guide for required fields and evaluation instructions.


FAQ

How does TerminalWorld differ from Terminal-Bench?

Both benchmarks evaluate terminal agents, but they measure different things. Terminal-Bench tasks are hand-crafted by researchers, so they reflect what researchers think developers do rather than what developers actually do. They also cannot keep up with the pace at which real-world tooling evolves. TerminalWorld tasks are derived from 80,870 real developer recordings on asciinema, capturing authentic workflows by construction. The Pearson correlation between the two benchmarks is only 0.20, indicating that they assess distinct capabilities.
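
For readers unfamiliar with the statistic, here is a minimal sketch of how a Pearson correlation is computed. It assumes the figure is taken over paired scores on the two benchmarks, which the text does not spell out. The input lists are placeholders chosen to make the arithmetic easy to follow, not benchmark results.

```python
import math


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient of two equal-length samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length, at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Covariance and variances, without the shared 1/n factor (it cancels out).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(var_x * var_y)


# Placeholder values only: a perfectly linear relationship gives a value near 1.
scores_a = [10.0, 20.0, 30.0, 40.0]
scores_b = [12.0, 24.0, 36.0, 48.0]
print(pearson(scores_a, scores_b))
```

A coefficient near 1 would mean the two benchmarks rank systems almost identically. A value of 0.20 indicates only a weak linear relationship.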

Are the tasks actually solvable?

Yes. Every task originates from a real developer who completed it successfully, so the tasks are solvable by construction. We also validate each task before inclusion by running the reference solution through three rounds of automated testing to confirm the test suite is correct. The best current result is 62.5% (Claude Opus 4.7). No model has reached 100%, which reflects genuine difficulty rather than broken evaluation.

Why are pass rates relatively low?

TerminalWorld covers 1,280 unique commands across 18 categories, reflecting the authentic diversity of real developer workflows. Many of these tools and patterns are absent from curated benchmarks, so models encounter genuinely unfamiliar territory. Performance drops sharply in demanding domains like performance optimization (28.1%) and scripting (39.1%). The tasks were not designed to be easy; they were designed to be real.

How is the leaderboard sorted?

The leaderboard is sorted by pass rate on the 200 human-verified tasks (TerminalWorld-Verified). This subset was manually reviewed and cross-verified by four authors, making it the highest-fidelity core of the evaluation. All models are evaluated using the Terminus-2 agent framework at default thinking/reasoning effort settings.
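
The ordering can be sketched as a sort on pass rate. The `Result` type and `rank` function are illustrative names, and breaking ties by model name is an assumption, since the page does not say how ties are ordered.

```python
from typing import NamedTuple


class Result(NamedTuple):
    model: str
    solved: int
    total: int


def rank(results: list[Result]) -> list[tuple[int, Result]]:
    """Order results by pass rate, highest first, and number them from 1."""
    # Tie-breaking by model name is an assumption, not documented behavior.
    ordered = sorted(results, key=lambda r: (-r.solved / r.total, r.model))
    return list(enumerate(ordered, start=1))


# Three rows from the leaderboard, deliberately out of order.
results = [
    Result("GLM 5.1", 114, 200),
    Result("Claude Opus 4.7", 125, 200),
    Result("Kimi K2.6", 115, 200),
]

for position, r in rank(results):
    print(position, r.model, f"{100 * r.solved / r.total:.1f}%")
```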

Can agents cheat?

Several safeguards make cheating difficult. Each task runs inside a sandboxed Docker container, and agents cannot browse the web or consult external resources to look up solutions. The evaluation tests what the model actually knows, not its ability to search for answers. Reference solutions are not publicly released, and tasks are graded on final system state rather than specific commands, so agents are free to reach the goal by any valid path. Each task page also embeds a harbor-canary identifier to help detect whether task content has been scraped for training.
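
To illustrate grading on final state rather than on commands, here is a hypothetical sketch. It is not the Harbor grader, and the file path and expected content are invented for the example. Two different routes produce the same end state, and the check accepts both.

```python
import os
import tempfile
from pathlib import Path


def check_final_state(workdir: Path) -> bool:
    """Hypothetical check: pass if out/result.txt exists and contains 'done'."""
    target = workdir / "out" / "result.txt"
    return target.is_file() and target.read_text().strip() == "done"


def solve_with_pathlib(workdir: Path) -> None:
    """One route to the goal state."""
    (workdir / "out").mkdir(parents=True, exist_ok=True)
    (workdir / "out" / "result.txt").write_text("done\n")


def solve_with_os(workdir: Path) -> None:
    """A different route that ends in the same state."""
    os.makedirs(workdir / "out", exist_ok=True)
    with open(workdir / "out" / "result.txt", "w") as fh:
        fh.write("done")


for solver in (solve_with_pathlib, solve_with_os):
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        before = check_final_state(root)  # nothing has been created yet
        solver(root)
        after = check_final_state(root)
        print(solver.__name__, before, after)
```

Because the check never inspects which commands ran, an agent is not penalized for choosing an unexpected but valid approach.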

How do I submit my model?

Run your model using the Harbor evaluation harness (see the Quickstart page for step-by-step instructions), then open a pull request to the GitHub repository with your results. Results are verified and published within a week.