About TerminalWorld

A research project benchmarking AI agents on real-world terminal tasks.

Mission

TerminalWorld asks a simple question: how capable are today's AI agents at performing the kinds of terminal tasks that real developers do every day?

Unlike benchmarks constructed from hand-written problems, TerminalWorld derives its tasks directly from real developer recordings, capturing the authentic distribution of terminal work, including its messiness, diversity, and depth.

Our goal is to provide a rigorous, reproducible, and live benchmark for the community to track frontier progress in terminal agent capabilities.

Team

Zhaoyang Chu

University College London

Jiarui Hu

Nanjing University

Xingyu Jiang

Nanjing University

Pengyu Zou

Nanjing University

Han Li

Nanjing University

Chao Peng

Tencent

Peter O'Hearn

University College London

Earl T. Barr

University College London

Mark Harman

University College London

Federica Sarro

University College London

He Ye

University College London

Get in Touch

For questions about the benchmark, dataset, or collaborations, feel free to reach out.

Zhaoyang Chu

First Author

zhaoyang.chu.25@ucl.ac.uk

He Ye

Corresponding Author

he.ye@ucl.ac.uk

Acknowledgements

We are grateful to the many developers who have shared their terminal recordings publicly on asciinema, and to the asciinema platform for making such recordings accessible at scale. TerminalWorld would not exist without this community.

We thank the Terminal-Bench team for their pioneering work in terminal agent evaluation, whose framework and standardized agent scaffold provided a foundation that TerminalWorld builds upon and extends.

We gratefully acknowledge Amazon Web Services (AWS) for supporting this research through AWS credits and Bedrock access. These resources were used for large-scale task synthesis, benchmark construction, and model evaluation. We also thank the AWS team supporting UCL for their guidance.

Contact & Links

GitHub

Code, dataset tools, and leaderboard submissions

HuggingFace Dataset

Download and explore the full task dataset

arXiv Paper

Read the full technical paper

Leaderboard Submissions

Submit your model results via GitHub PR

License

The TerminalWorld dataset and benchmark are released under CC BY 4.0. Please cite our paper if you use this benchmark in your research.