NEW · arXiv 2605.22535
TerminalWorld

Real-World Tasks.  Real Impact.

Benchmark AI agents on the terminal workflows developers run every day. The benchmark stays live as practices evolve.

Full Benchmark: 1,530 tasks
Human Verified: 200 tasks
Task categories: 18
Unique commands: 1,280

Leaderboard

Updated May 21, 2026

Rank  Model                Organization  Score
1     Claude Opus 4.7      Anthropic     62.5%
2     Kimi K2.6            Moonshot AI   57.5%
3     GLM 5.1              Z.ai          57.0%
4     Gemini 3.1 Pro       Google        55.0%
5     Qwen3.6-Max-Preview  Alibaba       54.0%
6     GPT-5.5              OpenAI        53.5%
7     DeepSeek-V4-Pro      DeepSeek      50.0%
8     MiniMax M2.7         MiniMax       49.0%

Terminus-2 agent · 200 verified tasks
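To make the leaderboard percentages concrete, the sketch below converts each score into a task count. It assumes, without confirmation from the page, that a score is the share of the 200 verified tasks a model solved. Every listed score is a multiple of 0.5%, which is consistent with that reading. The names `VERIFIED_TASKS`, `scores`, and `tasks_solved` are illustrative only.

```python
# Assumption: each leaderboard score is (tasks solved) / 200 verified tasks.
VERIFIED_TASKS = 200

# Scores copied from the leaderboard, in percent.
scores = {
    "Claude Opus 4.7": 62.5,
    "Kimi K2.6": 57.5,
    "GLM 5.1": 57.0,
    "Gemini 3.1 Pro": 55.0,
    "Qwen3.6-Max-Preview": 54.0,
    "GPT-5.5": 53.5,
    "DeepSeek-V4-Pro": 50.0,
    "MiniMax M2.7": 49.0,
}


def tasks_solved(percent, total=VERIFIED_TASKS):
    """Convert a percentage score into a whole number of solved tasks."""
    return round(percent / 100 * total)


solved = {model: tasks_solved(pct) for model, pct in scores.items()}

for model, count in solved.items():
    print(f"{model}: {count}/{VERIFIED_TASKS}")
```

Under this assumption, one task is worth 0.5 percentage points, so the 5-point gap between the top two entries corresponds to 10 tasks.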

How We Built It

A reverse-engineering pipeline from 80,870 raw recordings to 1,530 automatically validated tasks and 200 human-verified tasks.

01 Recording Collection: 80,870 asciinema recordings
Downloaded via public asciinema links

02 Recording Filtering: 9,492 filtered recordings
Excludes recordings with PII, inaccessible resources, TUI/GUI usage, or low quality

03 Task Synthesis: 9,492 tasks with synthesized instructions and reference solutions
LLM-distilled from recording transcripts

04 Environment Reproduction: 5,035 tasks with reproduced environments
Validated by replaying recordings

05 Test Generation: 1,530 tasks with test suites (Full Benchmark)
Validated via three execution trials

06 Human Verification: 200 human-verified tasks (Verified Subset)
Manually executed and cross-reviewed by authors
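The stage counts above form a funnel, and a short calculation shows how much each stage keeps. The sketch below computes step-by-step and cumulative retention from the published counts. The `stages` list and the `retention` helper are illustrative names, not part of any TerminalWorld tooling. The last stage is described as a verified subset, so its ratio may reflect a sampling fraction and not a pass rate.

```python
# Stage counts copied from the pipeline description.
stages = [
    ("Recording Collection", 80870),
    ("Recording Filtering", 9492),
    ("Task Synthesis", 9492),
    ("Environment Reproduction", 5035),
    ("Test Generation", 1530),
    ("Human Verification", 200),
]


def retention(stages):
    """Return (name, count, share of previous stage, share of first stage)."""
    rows = []
    first = stages[0][1]
    prev = first
    for name, count in stages:
        rows.append((name, count, count / prev, count / first))
        prev = count
    return rows


rows = retention(stages)

print(f"{'Stage':<26}{'Count':>8}{'Step':>9}{'Overall':>10}")
for name, count, step, overall in rows:
    print(f"{name:<26}{count:>8,}{step:>9.1%}{overall:>10.2%}")
```

By this calculation, filtering keeps roughly 12% of the raw recordings, and the 1,530-task full benchmark is a little under 2% of the original 80,870.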

TerminalWorld: Benchmarking Agents on Real-World Terminal Tasks

arXiv 2605.22535 · 2026