Leaderboard
Performance of frontier models on TerminalWorld. Evaluated using the Terminus-2 agent framework. Updated May 21, 2026.
Verified Subset: 200 tasks (manually verified)
Agent Framework: Terminus-2 (standardized scaffold from Terminal-Bench)
Eval Harness: Harbor (Terminal-Bench evaluation harness)
Verified Split Results
200 tasks

| Rank | Model | Pass Rate | Solved / Total |
|---|---|---|---|
| 🥇 | Claude Opus 4.7 | 62.5% | 125/200 |
| 🥈 | Kimi K2.6 | 57.5% | 115/200 |
| 🥉 | GLM 5.1 | 57.0% | 114/200 |
| 4 | Gemini 3.1 Pro | 55.0% | 110/200 |
| 5 | Qwen3.6-Max-Preview | 54.0% | 108/200 |
| 6 | GPT-5.5 | 53.5% | 107/200 |
| 7 | DeepSeek-V4-Pro | 50.0% | 100/200 |
| 8 | MiniMax M2.7 | 49.0% | 98/200 |
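
The Pass Rate column appears to be the solved count divided by the 200 verified tasks, expressed as a percentage. The sketch below recomputes it and the ranking from the Solved / Total column. It is an illustration only, not part of the Harbor harness: the names `pass_rate` and `rank` are hypothetical, and rounding to one decimal place is an assumption based on how the table is displayed.

```python
# Solved counts copied from the table above; each model is scored on 200 tasks.
TOTAL_TASKS = 200

solved = {
    "Claude Opus 4.7": 125,
    "Kimi K2.6": 115,
    "GLM 5.1": 114,
    "Gemini 3.1 Pro": 110,
    "Qwen3.6-Max-Preview": 108,
    "GPT-5.5": 107,
    "DeepSeek-V4-Pro": 100,
    "MiniMax M2.7": 98,
}


def pass_rate(solved_count: int, total: int = TOTAL_TASKS) -> float:
    """Return the pass rate as a percentage, rounded to one decimal place."""
    return round(100 * solved_count / total, 1)


def rank(results: dict[str, int]) -> list[tuple[str, int]]:
    """Order models by solved count, highest first (hypothetical helper)."""
    return sorted(results.items(), key=lambda item: item[1], reverse=True)


for position, (model, count) in enumerate(rank(solved), start=1):
    print(f"{position}. {model}: {pass_rate(count)}% ({count}/{TOTAL_TASKS})")
```

Recomputing the rate from the raw counts this way is a quick consistency check before submitting results, since the percentage and the Solved / Total value should always agree.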
Submit Your Model
Evaluated your model on TerminalWorld? Open a pull request to add your results. See the submission guide for required fields and evaluation instructions.