TerminalWorld

Benchmark AI agents on the real-world terminal workflows developers run every day. The benchmark stays live as practices evolve.




Leaderboard

Updated May 21, 2026


[Figure: Performance vs. Cost · verified subset · cost per solved task · upper left is most efficient]

Claude Opus 4.7 · Anthropic · 62.5%
Kimi K2.6 · Moonshot AI · 57.5%
GLM 5.1 · Z.ai · 57.0%
Gemini 3.1 Pro · Google · 55.0%
Qwen3.6-Max-Preview · Alibaba · 54.0%
GPT-5.5 · OpenAI · 53.5%
DeepSeek-V4-Pro · DeepSeek · 50.0%
MiniMax M2.7 · MiniMax · 49.0%

Terminus-2 agent · 200 verified tasks · default thinking/reasoning effort
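As a rough reading aid, the percentages can be turned back into task counts. The sketch below assumes each score is the share of the 200 verified tasks that the model solved; the page does not state the metric explicitly, so treat that interpretation, and the helper name `solved_count`, as assumptions.

```python
# Scores are copied from the leaderboard listing.
# ASSUMPTION: each score is the percentage of the 200 verified tasks solved.
VERIFIED_TASKS = 200

scores = {
    "Claude Opus 4.7": 62.5,
    "Kimi K2.6": 57.5,
    "GLM 5.1": 57.0,
    "Gemini 3.1 Pro": 55.0,
    "Qwen3.6-Max-Preview": 54.0,
    "GPT-5.5": 53.5,
    "DeepSeek-V4-Pro": 50.0,
    "MiniMax M2.7": 49.0,
}


def solved_count(score_percent: float, total: int = VERIFIED_TASKS) -> int:
    """Convert a percentage score into a whole number of tasks (hypothetical helper)."""
    return round(score_percent * total / 100)


for model, pct in scores.items():
    print(f"{model}: {pct}% of {VERIFIED_TASKS} = {solved_count(pct)} tasks")
```

Under that assumption every listed score maps to a whole number of tasks, which is consistent with a 200-task denominator.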

Task Coverage & Examples

18 categories spanning the full breadth of real terminal work.


Scripting & Automation

350

Example Very Hard

asciinema ↗

#299830

A running Vault instance is available and the Transit secrets engine is already enabled. Create a named key ring called `demo-keys` and rotate it to produce a…

vault · grep · write · secret · base64

Software Build & Test

212

Example Very Hard

asciinema ↗

#359207

Compile and install the Janus WebRTC gateway along with all required dependencies from source. The build environment is a Debian 10 system with the necessary…

git · cmake · make · meson · ninja

System Administration

191

Example Easy Verified

asciinema ↗

#473888

Install and configure an OpenSSH server so that it listens on port 3000 instead of the default port 22. SELinux must be disabled (set to `disabled` in…

sudo · grep

Containers & Orchestration

184

Example Very Hard Verified

asciinema ↗

#139853

Deploy an event stream analytics pipeline in an OpenShift cluster. When complete, the following resources must exist: 1. A `ConfigMap` named…

Security

126

Example Medium Verified

asciinema ↗

#448247

Verify the authenticity and integrity of the Asus KGPE-D16 Dasharo Release v0.1.0 firmware image. This involves obtaining the appropriate GPG keys from the…

gpg · wget · sha256sum

Environment Setup

100

Example Hard

asciinema ↗

#366394

Set up the `aafm` (Automated Analysis of Feature Models) framework from the `diverso-lab` project. Clone the `core`, `fm_metamodel`, and `pysat_metamodel`…

git · python3 · pip · cp

Version Control

Example Easy Verified

asciinema ↗

#694892

The repository at `/app` has `git nomad` installed and configured to simulate two hosts: `desktop` and `laptop`. Each invocation of `git nomad sync`…

git

Database Operations

Example Medium

asciinema ↗

#542219

Set up a CockroachDB role hierarchy and produce a verification file at `/app/result.txt`. A CockroachDB cluster is accessible at…

cockroach

Data Analysis

Example Medium Verified

asciinema ↗

#241711

Download James Joyce's *Ulysses* from Project Gutenberg (`http://www.gutenberg.org/files/4300/4300-0.txt`) and perform an n-gram frequency analysis on its…

wget · rm · tr · ngram · sort
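To make the idea of an n-gram frequency analysis concrete, here is a minimal Python sketch. It is not the task's reference solution: the recorded workflow uses shell tools, and the task's exact value of n and its output format are elided above. The function name `ngram_counts`, the tokenization rule, and the sample string are all illustrative assumptions.

```python
import re
from collections import Counter


def ngram_counts(text: str, n: int = 2) -> Counter:
    """Count word n-grams in text (hypothetical helper, simple tokenization)."""
    # ASSUMPTION: a word is a run of lowercase letters or apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    # Zip n staggered views of the word list to form sliding windows.
    grams = zip(*(words[i:] for i in range(n)))
    return Counter(" ".join(gram) for gram in grams)


# A short sample string stands in for the downloaded book text.
sample = "yes I said yes I will Yes"
counts = ngram_counts(sample, n=2)

for gram, freq in counts.most_common(3):
    print(freq, gram)
```

For a full book, the same function could be applied to the file contents after reading them into a string; sorting by frequency is what `most_common` provides.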

Debugging & Testing

Example Hard

asciinema ↗

#224933

Investigate why a Docker Swarm service fails to deploy and document your findings. A service named `alertmanager` has been created using the…

docker · alertmanager · service · head · inspect

Networking

Example Medium

asciinema ↗

#583258

Configure a network namespace named `runc` and connect it to a host bridge using a virtual ethernet (veth) pair. The host must have an active bridge named…

ip · brctl

Scientific Computing

Example Easy Verified

asciinema ↗

#347571

Use TACT (Taxonomic Addition for Complete Trees) via its Docker image to process phylogenetic data for the Carangaria fish group. Example input files…

curl · docker · create · tact_build_taxonomic_tree · tact_add_taxa

File & Storage

Example Easy Verified

asciinema ↗

#299387

Using Kopia, back up the directory `~/Projects/Kopia` to a Google Cloud Storage bucket named `kopia-demo-1`. Create a Kopia repository in that bucket,…

gsutil · kopia · uuidgen · rm

Cloud & Infrastructure

Example Medium

asciinema ↗

#417298

Provision the Google Cloud infrastructure required to bootstrap a Kubernetes cluster from scratch. This involves creating a custom VPC network named…

gcloud

ML Training & Experiments

Example Hard

asciinema ↗

#668460

Clone `https://github.com/saforem2/ezpz` into `/app/ezpz` and `https://github.com/saforem2/wordplay` into `/app/wordplay`. Set up a Python virtual environment…

git · python3 · pip · mpirun

Deployment & CI/CD

Example Hard

asciinema ↗

#169462

The environment at `/app` is configured with a YourBase (`yb`) project for the `hamilton` service. Use the `yb` CLI to capture service logs and performance…

yb · curl · git

Performance Optimization

Example Medium

asciinema ↗

#692136

Measure end-to-end Kafka message latency in two configurations and save the results to output files in `/app/`. The environment in `/app/` includes a Docker…

docker · kafka-topics · kafka-run-class · physical-kafka · via-gateway

Media Processing

Example Medium

asciinema ↗

#104869

Convert a set of stereo WAV files to mono using FFmpeg. The audio files are located in an archive at `/tmp/original_wav.tar.gz`. Extract the archive and…

tar · find · ffmpeg · mv · sort

How We Built It

A reverse-engineering pipeline from 80,870 raw recordings to 1,530 automatically validated tasks and 200 human-verified tasks.

01 Recording Collection · 80,870 Asciinema recordings · Downloaded via public asciinema links

02 Recording Filtering · 9,492 filtered recordings · Filtered by excluding PII, inaccessible resources, TUI/GUI, and low quality

03 Task Synthesis · 9,492 tasks with synthesized instructions and reference solutions · LLM-distilled from recording transcripts

04 Environment Reproduction · 5,035 tasks with reproduced environments · Validated by replaying recordings

05 Test Generation · 1,530 tasks with test suites · Validated via three execution trials · Full Benchmark

06 Human Verification · 200 human-verified tasks · Manually executed and cross-reviewed by authors · Verified Subset
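The stage counts above imply how much each step narrows the pool. The following Python sketch computes step-by-step and cumulative retention from those counts. The `retention` helper is hypothetical, and treating the 200 human-verified tasks as a ratio of the 1,530 is an assumption, since that subset may be a curated sample rather than the output of a filter.

```python
# Stage names and counts are copied from the pipeline description.
stages = [
    ("Recording Collection", 80_870),
    ("Recording Filtering", 9_492),
    ("Task Synthesis", 9_492),
    ("Environment Reproduction", 5_035),
    ("Test Generation", 1_530),
    ("Human Verification", 200),
]


def retention(stages):
    """Return (name, count, share of previous stage, share of first stage)."""
    rows = []
    first = stages[0][1]
    prev = first
    for name, count in stages:
        rows.append((name, count, count / prev, count / first))
        prev = count
    return rows


for name, count, step, overall in retention(stages):
    print(f"{name:<26}{count:>7,}  step {step:6.1%}  overall {overall:7.2%}")
```

By these figures, roughly 11.7% of the raw recordings survive filtering, and roughly 1.9% of them end up as tasks in the full benchmark.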

arXiv

TerminalWorld: Benchmarking Agents on Real-World Terminal Tasks

arXiv 2605.22535 · 2026


Questions or collaboration? zhaoyang.chu.25@ucl.ac.uk