Editorial · May 29, 2026 · 5 min read

Why TripleBench exists

Most "Tool A vs Tool B vs Tool C" articles about AI coding agents are written by someone who has only really used one of them. We benchmark all three. Here's what we're going to do with that, and why we think it matters.

If you spend an evening searching for "Codex vs Claude Code vs Gemini CLI" — or any other comparison of paid AI coding agents — the picture you get is a strange one. Almost every comparison article on the open web has one of three shapes.

First, the vendor blog post. Tool A's team writes about why Tool A beats B and C on benchmarks Tool A's team picked. The other two tools always come out looking worse. You learn very little.

Second, the single-user opinion piece. A developer who has paid for one of the three writes about why their pick is great. The other two are described from secondhand impressions or a five-minute trial. The post reads as honest, and may even be honest, but it can't tell you what you actually want to know: what these tools look like when you push them past the easy demos.

Third, the AI-generated SEO mill. A site you've never heard of publishes a "definitive guide" written entirely by another AI. Every paragraph hedges. The "verdict" is that they're all good and you should try each one. The page exists to rank on Google, not to help you decide.

None of these are useful if you're an engineer about to spend a hundred dollars a month — or a thousand, for a small team — on the wrong subscription.

The cost of guessing wrong

Each of these tools, at the paid tier where they actually become useful, costs in the range of $20 to $200 per month per seat. Multiply that by a team of five or ten engineers, and the wrong pick is a real budget item. More importantly: the wrong pick means your team works around the agent instead of with it. The productivity tax of fighting your tools is harder to measure than the subscription fee, but it's usually larger.
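
To put rough numbers on that, here is a back-of-the-envelope sketch that uses only the ranges mentioned above ($20 to $200 per seat per month, teams of five or ten). The function name and the twelve-month horizon are illustrative assumptions, not vendor pricing.

```python
def annual_subscription_cost(seats: int, price_per_seat_month: float, months: int = 12) -> float:
    """Total subscription spend for a team over `months` months."""
    return seats * price_per_seat_month * months


# Team sizes and price points are the ends of the ranges quoted in the text;
# real vendor pricing varies by plan.
for seats in (5, 10):
    for price in (20, 200):
        total = annual_subscription_cost(seats, price)
        print(f"{seats} seats at ${price}/seat/month: ${total:,.0f} per year")
```

At the top of the range, a ten-person team spends $24,000 a year on subscriptions alone, before counting any time lost to working around the tool.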

And the right pick depends on what you actually do. If most of your week is React component work, the tool that crushes a Python migration benchmark is irrelevant to you. If you live in a Rails monorepo, the tool that wins on a tiny standalone Go service is irrelevant to you. The interesting question is never "which is best in general." It's "which is best for the kind of thing I actually do."

That question can only be answered by running each tool against work that looks like your work. Which is exactly what nobody publishes.

What's different here

TripleBench runs all three tools at the same time, head to head, on the same task. That's it. That's the whole pitch.

Every benchmark we publish has three things you can verify:

The prompt is identical. Copy-pasteable. Word-for-word the same to all three agents.
The starting repo is identical. Frozen at a specific commit, public, you can clone it and reproduce.
The success criterion is identical. A held-out test suite the agents never saw, run on the resulting code with no human intervention.

What changes between runs is only the agent. Not the prompt style, not the working environment, not the human prodding it through the hard parts. If Claude Code asks a clarifying question, we let it. If Codex doesn't ask one, we don't nudge it to. We're benchmarking the agent including its UX, not a sanitized version that only exists in a marketing video.
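
One way to picture that setup is as a single frozen task description shared by every run, with the agent name as the only field that varies. The sketch below is an illustration and not TripleBench's actual harness: the class names, field names, and placeholder values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BenchmarkTask:
    """Everything that must be identical across agents (hypothetical schema)."""
    prompt: str        # sent word-for-word to every agent
    repo_url: str      # public starting repository
    commit: str        # frozen commit the agents start from
    test_command: str  # runs the held-out test suite, no human intervention


@dataclass(frozen=True)
class Run:
    agent: str
    task: BenchmarkTask


# Agent names as they appear in the article.
AGENTS = ("Codex", "Claude Code", "Gemini CLI")


def plan_runs(task: BenchmarkTask, agents=AGENTS) -> list[Run]:
    """Create one run per agent, all pointing at the same frozen task."""
    return [Run(agent=name, task=task) for name in agents]


# Placeholder values only; a real task would carry the actual prompt and commit.
task = BenchmarkTask(
    prompt="<the exact prompt text>",
    repo_url="https://example.com/placeholder/repo.git",
    commit="<frozen commit hash>",
    test_command="<command that runs the held-out tests>",
)
runs = plan_runs(task)

# The only thing that differs between runs is the agent.
assert all(run.task == task for run in runs)
assert len({run.agent for run in runs}) == len(runs)
```

Marking the task as frozen means nobody can quietly tweak the prompt or the commit for one agent after the runs are planned.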

What we won't do

We won't run synthetic benchmarks designed to make one tool look good. The tasks we pick come from real bug reports, real PRs, real refactors — pulled from open-source projects so they're verifiable. If a task is "solve this LeetCode problem," the agents have probably seen it during training and the result is meaningless. If a task is "fix this specific issue filed against this real project last week," it's a much harder benchmark to game.

We won't hide failed runs. If one tool burns 40 minutes and produces broken code, we publish that. If one tool refuses to attempt the task, we publish that too. The shape of the failures is often more interesting than the shape of the successes.

We won't sign affiliate deals with the vendors we benchmark. We accept no money or free credits from them, and we give them no pre-publication review. The site makes money in ways that don't depend on which tool wins this week: small digital downloads with the raw prompts and notes and, in future, paid subscriptions. It will never make money by telling you to buy the tool that paid us the most.

Who's running this

TripleBench is run by zgy (Guiyan Zhou). AI tools help draft the writeups and keep the benchmark library organized — but zgy runs every CLI agent on a real workstation, captures the logs, and reviews everything before it goes live.

We're telling you that up front because we think the alternative — AI-curated content that pretends to be human-written — is corrosive to the open web. AI agents are very good at organizing benchmark results into readable prose. They are not good at deciding which benchmark is worth running, or at noticing when a result is suspicious. The human-in-the-loop part is what makes this site useful instead of being another mill.

What's next

The first real benchmark is already in progress: a Python refactor on a roughly two-thousand-line legacy module from an open-source repository. We're running it on all three agents this week. The writeup will land on this site within a few days, with the exact prompts, the logs, the diffs, and a verdict.

Until those evidence-backed runs are published, the scores you'll see on this site are seed scores: directional estimates used to test the layout and the rubric. They are clearly labeled. Don't make purchase decisions on them.

If you have a benchmark task you'd like to see — a real one, from your real work — open an issue on the GitHub repo. We won't promise to run all of them, but we'll prioritize the ones that look like work other people do.

— zgy, a tired engineer in Guangzhou (with a lot of help from AI).