TerminalWorld
Real-World Tasks. Real Impact.
Benchmark AI agents on the terminal workflows developers run every day. Kept current as practices evolve.
1,530
Full Benchmark
200
Human Verified
18
Task categories
1,280
Unique commands
Leaderboard
Updated May 21, 2026
Claude Opus 4.7
Anthropic
Kimi K2.6
Moonshot AI
GLM 5.1
Z.ai
Gemini 3.1 Pro
Google
Qwen3.6-Max-Preview
Alibaba
GPT-5.5
OpenAI
DeepSeek-V4-Pro
DeepSeek
MiniMax M2.7
MiniMax
Terminus-2 agent · 200 verified tasks
Task Coverage & Examples
18 categories spanning the full breadth of real terminal work.
Scripting & Automation
#739272
Demonstrate that multiple programming languages are installed and functional on the system by running a "Hello World" program in each language and collecting…
Software Build & Test
#359207
Compile and install the Janus WebRTC gateway along with all required dependencies from source. The build environment is a Debian 10 system with the necessary…
System Administration
#473888
Install and configure an OpenSSH server so that it listens on port 3000 instead of the default port 22. SELinux must be disabled (set to `disabled` in…
Containers & Orchestration
#139853
Deploy an event stream analytics pipeline in an OpenShift cluster. When complete, the following resources must exist: 1. A `ConfigMap` named…
Security
#448247
Verify the authenticity and integrity of the Asus KGPE-D16 Dasharo Release v0.1.0 firmware image. This involves obtaining the appropriate GPG keys from the…
Environment Setup
#366394
Set up the `aafm` (Automated Analysis of Feature Models) framework from the `diverso-lab` project. Clone the `core`, `fm_metamodel`, and `pysat_metamodel`…
Version Control
#694892
The repository at `/app` has `git nomad` installed and configured to simulate two hosts: `desktop` and `laptop`. Each invocation of `git nomad sync`…
Database Operations
#542219
Set up a CockroachDB role hierarchy and produce a verification file at `/app/result.txt`. A CockroachDB cluster is accessible at…
Data Analysis
#241711
Download James Joyce's *Ulysses* from Project Gutenberg (`http://www.gutenberg.org/files/4300/4300-0.txt`) and perform an n-gram frequency analysis on its…
Debugging & Testing
#224933
Investigate why a Docker Swarm service fails to deploy and document your findings. A service named `alertmanager` has been created using the…
Networking
#583258
Configure a network namespace named `runc` and connect it to a host bridge using a virtual ethernet (veth) pair. The host must have an active bridge named…
Scientific Computing
#347571
Use TACT (Taxonomic Addition for Complete Trees) via its Docker image to process phylogenetic data for the Carangaria fish group. Example input files…
File & Storage
#299387
Using Kopia, back up the directory `~/Projects/Kopia` to a Google Cloud Storage bucket named `kopia-demo-1`. Create a Kopia repository in that bucket,…
Cloud & Infrastructure
#417298
Provision the Google Cloud infrastructure required to bootstrap a Kubernetes cluster from scratch. This involves creating a custom VPC network named…
ML Training & Experiments
#668460
Clone `https://github.com/saforem2/ezpz` into `/app/ezpz` and `https://github.com/saforem2/wordplay` into `/app/wordplay`. Set up a Python virtual environment…
Deployment & CI/CD
#169462
The environment at `/app` is configured with a YourBase (`yb`) project for the `hamilton` service. Use the `yb` CLI to capture service logs and performance…
Performance Optimization
#692136
Measure end-to-end Kafka message latency in two configurations and save the results to output files in `/app/`. The environment in `/app/` includes a Docker…
Media Processing
#104869
Convert a set of stereo WAV files to mono using FFmpeg. The audio files are located in an archive at `/tmp/original_wav.tar.gz`. Extract the archive and…
How We Built It
A reverse-engineering pipeline from 80,870 raw recordings to 1,530 automatically validated tasks and 200 human-verified tasks.
01 Recording Collection
80,870
Asciinema recordings
Downloaded via public asciinema links
02 Recording Filtering
9,492
Filtered recordings
Excludes recordings with PII, inaccessible resources, TUI/GUI sessions, or low quality
03 Task Synthesis
9,492
Tasks with synthesized instructions and reference solutions
LLM-distilled from recording transcripts
04 Environment Reproduction
5,035
Tasks with reproduced environments
Validated by replaying recordings
05 Test Generation
1,530
Tasks with test suites
Validated via three execution trials
Full Benchmark
06 Human Verification
200
Human-verified tasks
Manually executed and cross-reviewed by authors
Verified Subset
TerminalWorld: Benchmarking Agents on Real-World Terminal Tasks
arXiv 2605.22535 · 2026
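To make the funnel in "How We Built It" easier to read, the sketch below computes how much of each stage survives into the next, using only the six counts listed on this page. The `STAGES` list and the `retention` helper are illustrative names, not part of any TerminalWorld tooling. Note that the last step yields a verified subset, so its ratio describes the subset's size and should not be read as a pass rate.

```python
# Stage counts copied from the pipeline description on this page.
# The names STAGES and retention are hypothetical, chosen for illustration.
STAGES = [
    ("Recording Collection", 80_870),
    ("Recording Filtering", 9_492),
    ("Task Synthesis", 9_492),
    ("Environment Reproduction", 5_035),
    ("Test Generation", 1_530),
    ("Human Verification", 200),
]


def retention(stages):
    """Return (name, count, share_of_previous, share_of_first) per stage."""
    first = stages[0][1]
    previous = first
    rows = []
    for name, count in stages:
        rows.append((name, count, count / previous, count / first))
        previous = count
    return rows


if __name__ == "__main__":
    # Columns: stage name, count, share of previous stage, share of stage 01.
    for name, count, step, overall in retention(STAGES):
        print(f"{name:<26}{count:>8,}{step:>9.1%}{overall:>9.2%}")
```

Read this way, the largest cut is the recording filter: fewer than one in eight collected recordings pass it. About 1.9% of the original recordings end up in the full benchmark.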