TerminalWorld
Real-World Tasks. Real Impact.
Benchmark AI agents on the terminal workflows developers run every day. The benchmark stays live as practices evolve.
1,530
Full Benchmark
200
Human Verified
18
Task categories
1,280
Unique commands
Leaderboard
Updated May 21, 2026
Claude Opus 4.7 · Anthropic
Kimi K2.6 · Moonshot AI
GLM 5.1 · Z.ai
Gemini 3.1 Pro
Qwen3.6-Max-Preview · Alibaba
GPT-5.5 · OpenAI
DeepSeek-V4-Pro · DeepSeek
MiniMax M2.7 · MiniMax
Terminus-2 agent · 200 verified tasks
Task Coverage
18 categories spanning the full breadth of real terminal work.
Scripting & Automation
Software Build & Test
System Administration
Containers & Orchestration
Security
Environment Setup
Version Control
Database Operations
Data Analysis
Debugging & Testing
Networking
Scientific Computing
File & Storage
Cloud & Infrastructure
ML Training & Experiments
Deployment & CI/CD
Performance Optimization
Media Processing
How We Built It
A reverse-engineering pipeline from 80,870 raw recordings to 1,530 automatically validated tasks and 200 human-verified tasks.
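As a rough illustration of how steeply this pipeline narrows, the stage counts reported on this page can be turned into retention rates. The sketch below is plain arithmetic on those published numbers, not the authors' tooling; the names `STAGES` and `funnel` are hypothetical. Note that the final 200 human-verified tasks may be a selected subset rather than the survivors of a filter, so its "retention" figure should be read loosely.

```python
# Stage counts copied from this page; STAGES and funnel are illustrative names.
STAGES = [
    ("Recording Collection", 80_870),
    ("Recording Filtering", 9_492),
    ("Task Synthesis", 9_492),
    ("Environment Reproduction", 5_035),
    ("Test Generation", 1_530),
    ("Human Verification", 200),
]


def funnel(stages):
    """Return (name, count, share_of_previous_stage, share_of_first_stage)."""
    first = stages[0][1]
    prev = first
    rows = []
    for name, count in stages:
        rows.append((name, count, count / prev, count / first))
        prev = count
    return rows


if __name__ == "__main__":
    for name, count, step, overall in funnel(STAGES):
        print(f"{name:<26}{count:>7,}  step {step:7.2%}  overall {overall:7.2%}")
```

Running the script prints one row per stage, with the share kept relative to the previous stage and relative to the original recordings.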
01 Recording Collection
80,870
Asciinema recordings
Downloaded via public asciinema links
02 Recording Filtering
9,492
Filtered recordings
Excluded recordings containing PII, inaccessible resources, TUI/GUI sessions, or low-quality content
03 Task Synthesis
9,492
Tasks with synthesized instructions and reference solutions
LLM-distilled from recording transcripts
04 Environment Reproduction
5,035
Tasks with reproduced environments
Validated by replaying recordings
05 Test Generation
1,530
Tasks with test suites
Validated via three execution trials
Full Benchmark
06 Human Verification
200
Human-verified tasks
Manually executed and cross-reviewed by authors
Verified Subset
TerminalWorld: Benchmarking Agents on Real-World Terminal Tasks
arXiv 2605.22535 · 2026