
TerminalWorld: Benchmarking Agents on Real-World Terminal Tasks

Zhaoyang Chu¹, Jiarui Hu²*, Xingyu Jiang²*, Pengyu Zou²*, Han Li², Chao Peng³, Peter O'Hearn¹, Earl T. Barr¹,
Mark Harman¹, Federica Sarro¹, He Ye¹†

¹ University College London · ² Nanjing University · ³ Tencent

*These authors contributed equally as co-second authors · †Corresponding author: he.ye@ucl.ac.uk

Abstract

We introduce TerminalWorld, a scalable data engine that automatically reverse-engineers high-fidelity evaluation tasks from "in-the-wild" terminal recordings. Processing 80,870 terminal recordings, the engine yields a full benchmark of 1,530 validated tasks, spanning 18 real-world categories, ranging from short everyday operations to workflows exceeding 50 steps, and covering 1,280 unique commands. From these, we curate a Verified subset of 200 representative, manually reviewed tasks. Comprehensive benchmarking on TerminalWorld-Verified across eight frontier models and six agents reveals that current systems still struggle with authentic terminal workflows, achieving a maximum pass rate of only 62.5%. Moreover, TerminalWorld captures real-world terminal capabilities distinct from existing expert-curated benchmarks (e.g., Terminal-Bench), with only a weak correlation to their scores (Pearson r=0.20). The automated engine makes TerminalWorld authentic and scalable by construction, enabling it to evaluate agents in real-world terminal environments as developer practices evolve.

Cite

BibTeX

@article{chu2026terminalworld,
  title={TerminalWorld: Benchmarking Agents on Real-World Terminal Tasks},
  author={Zhaoyang Chu and Jiarui Hu and Xingyu Jiang and Pengyu Zou and Han Li and Chao Peng and 
          Peter O'Hearn and Earl T. Barr and Mark Harman and Federica Sarro and He Ye},
  journal={arXiv preprint arXiv:2605.22535},
  year={2026}
}
