Methodology · v1 · 2026-05-29
Same task, three tools, one acceptance gate
TripleBench compares Codex, Claude Code, and Gemini CLI by giving each one the same concrete task and measuring what happened. This page is the rulebook. If something we publish later contradicts it, the page wins, not the post — open an issue.
1. How we pick tasks
A task qualifies for the bench only if all four of these are true.
- It's real. Pulled from a public open-source repository: a real issue, a real PR review request, a real refactor. No synthetic toy problems. No LeetCode-style puzzles the agents almost certainly saw during training.
- It's reproducible. The starting commit is pinned. The repo is public. Anyone can clone it and re-run the agent against the same baseline.
- It has an objective success gate. Either a test suite the agents never saw passes after their changes, or a documented manual check (such as "the build is green and the previously failing case now passes") can verify the result.
- It's representative. The task should look like work some real engineer somewhere is doing this week: a refactor, a bug fix, wiring up a feature, a migration. Not a contrived showcase.
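
The four criteria above can be read as a simple checklist. The sketch below is one hedged way to encode it. `TaskCandidate`, `failed_criteria`, and every field name are hypothetical, not part of any published TripleBench tooling. Treating "reproducible" as "has a repo URL and a pinned commit" is a simplifying assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskCandidate:
    """Hypothetical record for a candidate task; all field names are illustrative."""
    repo_url: str             # public repository the task comes from
    baseline_commit: str      # pinned starting commit hash
    is_real: bool             # real issue, PR review request, or refactor
    has_objective_gate: bool  # held-out test suite or documented manual check
    is_representative: bool   # looks like everyday engineering work


def failed_criteria(task: TaskCandidate) -> list[str]:
    """Return the criteria the candidate fails; an empty list means it qualifies."""
    checks = {
        "real": task.is_real,
        # Assumption: "reproducible" reduces to having a repo URL and a pinned commit.
        "reproducible": bool(task.repo_url) and bool(task.baseline_commit),
        "objective success gate": task.has_objective_gate,
        "representative": task.is_representative,
    }
    return [name for name, ok in checks.items() if not ok]
```

Returning the list of failed criteria, rather than a single boolean, keeps the reason for a rejection visible.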
2. How we run each agent
- Pin the agent version. We record the exact CLI version or release tag of Codex, Claude Code, and Gemini CLI at the moment of each run.
- Reset the working directory to the pinned baseline commit. No leftover state from a prior run.
- Paste the same prompt. Word for word. No per-tool re-phrasing to make one happier.
- Let the agent work. We do not intervene to "rescue" a stuck run. We do not provide additional context unless the agent explicitly asks for it.
- Stop on one of three conditions: the agent declares the task done, the agent gets stuck and asks a question we won't answer mid-run, or the run hits the 60-minute wall-clock cap.
- Save the full transcript, the final diff, and the elapsed time. Run the acceptance gate. Record pass / fail.
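
The run procedure above could be scripted roughly as follows. This is a minimal sketch under stated assumptions, not the harness TripleBench actually uses. It assumes the agent can be started as a non-interactive command that reads the prompt from standard input, which may not hold for every tool. `agent_cmd` is supplied by the caller and no real CLI flags are implied. `RunRecord`, `reset_to_baseline`, `final_diff`, and `run_agent` are hypothetical names. The "asks a question we won't answer" stop condition needs a human reading the transcript and is not detected here.

```python
import subprocess
import time
from dataclasses import dataclass

WALL_CLOCK_CAP_SECONDS = 60 * 60  # the 60-minute wall-clock cap


@dataclass(frozen=True)
class RunRecord:
    agent_version: str      # exact CLI version or release tag, recorded by the operator
    stop_reason: str        # "exited" or "wall_clock_cap"
    elapsed_seconds: float
    transcript: str


def reset_to_baseline(repo_dir: str, commit: str) -> None:
    """Discard leftover state so every run starts from the pinned commit."""
    subprocess.run(["git", "reset", "--hard", commit], cwd=repo_dir, check=True)
    # -x also removes ignored files such as build output from a prior run.
    subprocess.run(["git", "clean", "-fdx"], cwd=repo_dir, check=True)


def final_diff(repo_dir: str, commit: str) -> str:
    """Diff the agent's changes, including newly created files, against the baseline."""
    subprocess.run(["git", "add", "-A"], cwd=repo_dir, check=True)
    result = subprocess.run(
        ["git", "diff", "--cached", commit],
        cwd=repo_dir, capture_output=True, text=True, check=True,
    )
    return result.stdout


def run_agent(agent_cmd: list[str], agent_version: str, prompt: str,
              repo_dir: str, cap_seconds: float = WALL_CLOCK_CAP_SECONDS) -> RunRecord:
    """Run the agent once with the shared prompt and enforce the wall-clock cap."""
    start = time.monotonic()
    try:
        proc = subprocess.run(
            agent_cmd, input=prompt, cwd=repo_dir,
            capture_output=True, text=True, timeout=cap_seconds,
        )
        # Assumption: a process that exits on its own has declared itself done.
        transcript, stop_reason = proc.stdout + proc.stderr, "exited"
    except subprocess.TimeoutExpired as exc:
        partial = exc.stdout or ""
        if isinstance(partial, bytes):  # partial output can arrive undecoded
            partial = partial.decode(errors="replace")
        transcript, stop_reason = partial, "wall_clock_cap"
    return RunRecord(agent_version, stop_reason, time.monotonic() - start, transcript)
```

In this sketch, an operator would call `reset_to_baseline`, then `run_agent`, then `final_diff`, and only afterwards run the acceptance gate, so the gate never sees state from an earlier run.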
3. What we score
Each run produces a row in our benchmark table. The columns are:
| Column | What it measures | How we score it |
|---|---|---|
| Coding execution | Did the agent correctly modify the code and verify its own changes? | Pass/fail on the held-out gate, with a partial-credit note if the change is close but fails one specific check. |
| Research quality | If the task required understanding context (issue threads, docs, related code), did the agent find and use the right material? | Editorial 0-100 grade against an explicit checklist published per task. We show the checklist. |
| Workflow continuity | Did the agent preserve state, recover from blockers, and hand off cleanly if interrupted? | Editorial grade based on the transcript: did it summarize where it was, did it leave the repo in a working state, did it explain blockers. |
| Cost clarity | Can an operator predict the dollar or token cost of the next run from this run's signal? | Recorded as "transparent / partial / opaque." Transparent means the tool reported tokens or session minutes during the run. |
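
One row of the table above might be represented as a small validated record. The sketch below is illustrative only. `ScoreRow` and its field names are hypothetical. Because this page does not specify a scale for the workflow continuity grade, the sketch stores it as free text rather than guessing one.

```python
from dataclasses import dataclass
from typing import Optional

COST_CLARITY_LEVELS = ("transparent", "partial", "opaque")


@dataclass(frozen=True)
class ScoreRow:
    """Hypothetical shape for one benchmark-table row; field names are illustrative."""
    agent: str
    gate_passed: bool                          # Coding execution: pass/fail on the held-out gate
    research_quality: int                      # editorial grade, 0-100
    workflow_continuity: str                   # editorial grade; free text (scale is an open assumption)
    cost_clarity: str                          # one of COST_CLARITY_LEVELS
    partial_credit_note: Optional[str] = None  # set when a change is close but fails one check

    def __post_init__(self) -> None:
        if not 0 <= self.research_quality <= 100:
            raise ValueError("research_quality must be between 0 and 100")
        if self.cost_clarity not in COST_CLARITY_LEVELS:
            raise ValueError(f"cost_clarity must be one of {COST_CLARITY_LEVELS}")
```

Validating at construction time means a malformed row fails when it is recorded, not when the table is rendered.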
4. The "best for" line
Every tool gets a one-sentence "best for" verdict. This is editorial, not a metric. It captures the gestalt of how the tool felt during the run, not just whether it passed the gate. We will be wrong sometimes; we will revise these as more runs accumulate.
5. What we publish per task
- The exact prompt.
- The pinned commit hash and the link to the public repo.
- Each agent's version pin.
- Full session transcripts (or links to them, when long).
- The acceptance gate, including the test suite if one was used.
- The diff each agent produced.
- Elapsed time and any reported cost signal.
- The score for each rubric column.
- A short editorial verdict and a "best for" line per agent.
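
The publication list above lends itself to a pre-publish completeness check. The following is a hedged sketch: the key names in `REQUIRED_ARTIFACTS` and the `missing_artifacts` helper are hypothetical, and the report is assumed to be a plain dictionary. A cost signal is left out of the required keys because it is only published when the tool reported one.

```python
REQUIRED_ARTIFACTS = (
    "prompt",           # the exact prompt
    "baseline_commit",  # pinned commit hash
    "repo_url",         # link to the public repo
    "agent_versions",   # each agent's version pin
    "transcripts",      # full session transcripts or links to them
    "acceptance_gate",  # the gate, including the test suite if one was used
    "diffs",            # the diff each agent produced
    "elapsed_time",     # elapsed time per run
    "scores",           # the score for each rubric column
    "verdicts",         # editorial verdict and "best for" line per agent
)


def missing_artifacts(report: dict) -> list[str]:
    """List required keys that are absent or empty in a per-task report."""
    return [key for key in REQUIRED_ARTIFACTS if not report.get(key)]
```

A report would be publishable under this sketch only when `missing_artifacts(report)` returns an empty list.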
6. What we don't do
- We don't redact failed runs. Failures are published with the same level of detail as successes.
- We don't fold in "qualitative impressions" that aren't anchored in a specific transcript moment.
- We don't average across unrelated tasks to produce a single overall ranking. We report each task on its own.
7. Conflicts of interest
We benchmark all three tools independently. We do not accept money, free credit, free seats, or pre-publication review from any of the vendors we benchmark. If that ever changes, we'll disclose it on this page before publishing the affected report. The site is monetized through small digital downloads (raw prompts, CSV exports, operator notes) and, possibly, future paid subscriptions — never through affiliate links to the tools we test.
8. Why we're upfront about AI assistance
TripleBench is run by zgy. AI tools help organize the data and draft the writeups, but a human runs the actual benchmarks, reviews every published score, and signs off before anything goes live. We say so openly because we think the alternative — pretending an AI-curated site is human-written — degrades the open web. AI is good at organizing benchmark data; it is bad at deciding which benchmark is worth running. The human-in-the-loop part is the part that makes this site worth reading.
9. Versioning this methodology
This page is methodology v1. When we change the rubric, the weights, or the publication rules, we'll cut a new version, archive the old one, and note in each report which version of the methodology it was scored under. Old scores will not be silently re-graded.
Last updated: 2026-05-29 · Open an issue if you disagree with any of this.