About TerminalWorld

A research project benchmarking AI agents on real-world terminal tasks.

Mission

TerminalWorld asks a simple question: how capable are today's AI agents at performing the kinds of terminal tasks that real developers do every day?

Unlike benchmarks constructed from hand-written problems, TerminalWorld derives its tasks directly from real developer recordings, capturing the authentic distribution of terminal work, including its messiness, diversity, and depth.

Our goal is to provide a rigorous, reproducible, and live benchmark for the community to track frontier progress in terminal agent capabilities.

Team

Get in Touch

For questions about the benchmark, dataset, or collaborations, feel free to reach out.

Acknowledgements

We are grateful to the many developers who have shared their terminal recordings publicly on asciinema, and to the asciinema platform for making such recordings accessible at scale. TerminalWorld would not exist without this community.

We thank the Terminal-Bench team for their pioneering work in terminal agent evaluation, whose framework and standardized agent scaffold provided a foundation that TerminalWorld builds upon and extends.

We gratefully acknowledge Amazon Web Services (AWS) for supporting this research through AWS credits and Bedrock access. These resources were used for large-scale task synthesis, benchmark construction, and model evaluation. We also thank the AWS team supporting UCL for their guidance.

Contact & Links

License

The TerminalWorld dataset and benchmark are released under CC BY-NC 4.0. Non-commercial use only. Please cite our paper if you use this benchmark in your research.