TripleBench

Methodology · v1 · 2026-05-29

Same task, three tools, one acceptance gate

TripleBench compares Codex, Claude Code, and Gemini CLI by giving each one the same concrete task and measuring what happens. This page is the rulebook. If something we publish later contradicts it, this page takes precedence over the post; open an issue so we can correct the post.

1. How we pick tasks

A task qualifies for the bench only if all four of these are true.

2. How we run each agent

  1. Pin the agent version. We record the exact CLI version or release tag of Codex, Claude Code, and Gemini CLI at the moment of each run.
  2. Reset the working directory to the pinned baseline commit. No leftover state from a prior run.
  3. Paste the same prompt. Word for word. No per-tool re-phrasing to make one happier.
  4. Let the agent work. We do not intervene to "rescue" a stuck run. We do not provide additional context unless the agent explicitly asks for it.
  5. Stop conditions: the agent declares done, the agent gets stuck and asks a question we won't answer mid-run, or the run hits a 60-minute wall-clock cap.
  6. Save the full transcript, the final diff, and the elapsed time. Run the acceptance gate. Record pass / fail.
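
The six steps above can be sketched as a small harness. This is a minimal illustration, not the harness TripleBench actually uses: the `command` argument is a placeholder for whatever invocation a given tool needs, feeding the prompt on stdin is an assumption, and `FAKE_AGENT` is a stand-in process so the sketch runs without any of the three CLIs installed. The "agent asks a question we won't answer" stop condition needs a human reading the transcript, so it is not modeled here.

```python
import subprocess
import sys
import time
from dataclasses import dataclass

WALL_CLOCK_CAP_SECONDS = 60 * 60  # the 60-minute cap from step 5

# Stand-in "agent" (hypothetical): echoes the prompt in upper case and exits.
FAKE_AGENT = [sys.executable, "-c", "import sys; print(sys.stdin.read().upper())"]


@dataclass
class RunRecord:
    agent_version: str      # step 1: pinned version string, recorded by the operator
    transcript: str         # step 6: everything the agent printed
    elapsed_seconds: float  # step 6: wall-clock time
    stop_reason: str        # "agent_exited" or "wall_clock_cap"


def _as_text(data) -> str:
    """Normalize captured output, which may be None, bytes, or str."""
    if data is None:
        return ""
    if isinstance(data, bytes):
        return data.decode(errors="replace")
    return data


def reset_to_baseline(repo_dir: str, baseline_commit: str) -> None:
    """Step 2: discard tracked changes and untracked files from a prior run."""
    subprocess.run(["git", "reset", "--hard", baseline_commit], cwd=repo_dir, check=True)
    subprocess.run(["git", "clean", "-fdx"], cwd=repo_dir, check=True)


def final_diff(repo_dir: str, baseline_commit: str) -> str:
    """Step 6: the agent's changes relative to the pinned baseline."""
    result = subprocess.run(
        ["git", "diff", baseline_commit],
        cwd=repo_dir, capture_output=True, text=True, check=True,
    )
    return result.stdout


def run_agent(command, prompt, agent_version, cwd=None,
              cap_seconds=WALL_CLOCK_CAP_SECONDS) -> RunRecord:
    """Steps 3-5: send the identical prompt, do not intervene, enforce the cap."""
    start = time.monotonic()
    try:
        proc = subprocess.run(
            command, input=prompt, capture_output=True, text=True,
            cwd=cwd, timeout=cap_seconds,
        )
        transcript = proc.stdout + proc.stderr
        stop_reason = "agent_exited"
    except subprocess.TimeoutExpired as exc:
        # The child is killed when the cap is hit; keep whatever it printed.
        transcript = _as_text(exc.stdout) + _as_text(exc.stderr)
        stop_reason = "wall_clock_cap"
    elapsed = time.monotonic() - start
    return RunRecord(agent_version, transcript, elapsed, stop_reason)
```

Running the acceptance gate and recording pass or fail would follow `run_agent`; that part depends on each task's held-out checks and is left out.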

3. What we score

Each run produces a row in our benchmark table. The columns are:

| Column | What it measures | How we score it |
| --- | --- | --- |
| Coding execution | Did the agent correctly modify the code and verify its own changes? | Pass/fail on the held-out gate, with a partial-credit note if the change is close but fails one specific check. |
| Research quality | If the task required understanding context (issue threads, docs, related code), did the agent find and use the right material? | Editorial 0-100 grade against an explicit checklist published per task. We show the checklist. |
| Workflow continuity | Did the agent preserve state, recover from blockers, and hand off cleanly if interrupted? | Editorial grade based on the transcript: did it summarize where it was, did it leave the repo in a working state, did it explain blockers. |
| Cost clarity | Can an operator predict the dollar or token cost of the next run from this run's signal? | Recorded as "transparent / partial / opaque." Transparent means the tool reported tokens or session minutes during the run. |
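
One way to picture a row of the benchmark table is as a small record type. The sketch below is illustrative only: the class and field names are hypothetical, representing research quality as `None` when a task needs no context is an assumption, and the workflow-continuity grade is kept as free text because this page does not fix a scale for it.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class CostClarity(Enum):
    TRANSPARENT = "transparent"  # tool reported tokens or session minutes during the run
    PARTIAL = "partial"
    OPAQUE = "opaque"


@dataclass(frozen=True)
class BenchmarkRow:
    tool: str
    gate_passed: bool                   # coding execution: held-out gate result
    partial_credit_note: Optional[str]  # only meaningful for a near-miss failure
    research_quality: Optional[int]     # editorial 0-100 grade; None is an assumed "not applicable"
    workflow_continuity: str            # editorial grade, free text in this sketch
    cost_clarity: CostClarity

    def __post_init__(self) -> None:
        if self.research_quality is not None and not 0 <= self.research_quality <= 100:
            raise ValueError("research_quality must be between 0 and 100")
        if self.gate_passed and self.partial_credit_note:
            raise ValueError("a partial-credit note only applies to a failed gate")
```

Validating at construction time means a malformed row fails loudly before it reaches a published table.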

4. The "best for" line

Every tool gets a one-sentence "best for" verdict. This is editorial, not a metric. It captures the gestalt of how the tool felt during the run, not just whether it passed the gate. We will be wrong sometimes; we will revise these as more runs accumulate.

5. What we publish per task

6. What we do not publish

7. Conflicts of interest

We benchmark all three tools independently. We do not accept money, free credit, free seats, or pre-publication review from any of the vendors we benchmark. If that ever changes, we'll disclose it on this page before publishing the affected report. The site is monetized through small digital downloads (raw prompts, CSV exports, operator notes) and, possibly, future paid subscriptions — never through affiliate links to the tools we test.

8. Why we're upfront about AI assistance

TripleBench is run by zgy. AI tools help organize the data and draft the writeups, but a human runs the actual benchmarks, reviews every published score, and signs off before anything goes live. We say so openly because we think the alternative — pretending an AI-curated site is human-written — degrades the open web. AI is good at organizing benchmark data; it is bad at deciding which benchmark is worth running. The human-in-the-loop part is the part that makes this site worth reading.

9. Versioning this methodology

This page is methodology v1. When we change the rubric, the weights, or the publication rules, we'll cut a new version, archive the old one, and note in each report which version of the methodology it was scored under. Old scores will not be silently re-graded.
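
As a rough sketch of what "not silently re-graded" could look like in data, each score can carry the methodology version it was graded under, and a re-grade can append a new record rather than overwrite the old one. The names here are hypothetical, and refusing a re-grade under the same version is an assumption of this sketch, not a rule stated on this page.

```python
from dataclasses import dataclass, replace
from typing import List


@dataclass(frozen=True)
class ScoredReport:
    task_id: str
    methodology_version: str  # e.g. "v1"
    gate_passed: bool


def regrade(history: List[ScoredReport], new_version: str, gate_passed: bool) -> List[ScoredReport]:
    """Return a new history with the re-graded score appended; old entries stay intact."""
    latest = history[-1]
    if new_version == latest.methodology_version:
        # Assumed rule: a score may only change under a new methodology version.
        raise ValueError("re-grading under the same methodology version is not allowed")
    updated = replace(latest, methodology_version=new_version, gate_passed=gate_passed)
    return history + [updated]
```

Because the records are frozen and the history is only ever extended, the original v1 score remains visible next to any later one.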

Last updated: 2026-05-29 · Open an issue if you disagree with any of this.