Leaderboard

Performance of frontier models on TerminalWorld, evaluated using the Terminus-2 agent framework. Last updated May 21, 2026.

Verified Subset: 200 tasks (manually verified)

Agent Framework: Terminus-2 (standardized scaffold from Terminal-Bench)

Eval Harness: Harbor (Terminal-Bench evaluation harness)

Verified Split Results

200 tasks
Rank  Model                 Pass Rate  Solved / Total
🥇    Claude Opus 4.7       62.5%      125/200
🥈    Kimi K2.6             57.5%      115/200
🥉    GLM 5.1               57.0%      114/200
4     Gemini 3.1 Pro        55.0%      110/200
5     Qwen3.6-Max-Preview   54.0%      108/200
6     GPT-5.5               53.5%      107/200
7     DeepSeek-V4-Pro       50.0%      100/200
8     MiniMax M2.7          49.0%      98/200
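
The Pass Rate column appears to be the solved count divided by the 200 tasks in the verified split, shown as a percentage with one decimal place. The sketch below recomputes it from the solved counts listed above and sorts models by tasks solved. It only illustrates the arithmetic and is not part of Harbor or Terminus-2. The names `SOLVED`, `pass_rate`, and `ranked` are hypothetical.

```python
# Solved counts copied from the results listed above.
# TOTAL_TASKS is the size of the verified split (200 tasks).
TOTAL_TASKS = 200

SOLVED = {
    "Claude Opus 4.7": 125,
    "Kimi K2.6": 115,
    "GLM 5.1": 114,
    "Gemini 3.1 Pro": 110,
    "Qwen3.6-Max-Preview": 108,
    "GPT-5.5": 107,
    "DeepSeek-V4-Pro": 100,
    "MiniMax M2.7": 98,
}


def pass_rate(solved: int, total: int = TOTAL_TASKS) -> float:
    """Return the pass rate as a percentage, rounded to one decimal place."""
    if total <= 0:
        raise ValueError("total must be positive")
    return round(100 * solved / total, 1)


def ranked(results: dict[str, int]) -> list[tuple[str, int]]:
    """Return (model, solved) pairs sorted by tasks solved, highest first."""
    return sorted(results.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for rank, (model, solved) in enumerate(ranked(SOLVED), start=1):
        print(f"{rank}  {model}  {pass_rate(solved):.1f}%  {solved}/{TOTAL_TASKS}")
```

Sorting on the raw solved count rather than the rounded percentage avoids ties introduced by rounding. With a fixed total of 200, both orderings agree.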

Submit Your Model

Evaluated your model on TerminalWorld? Open a pull request to add your results. See the submission guide for required fields and evaluation instructions.