TB TripleBench

AI tool benchmarks, built for operators

TripleBench

A practical comparison journal for Codex, Claude Code, Gemini CLI, and the next wave of AI work tools. We test the same task across three tools and publish the workflow, friction, and score.

Heads-up: the bars on the right are seed scores — directional placeholders used to calibrate the layout while we run the first real benchmark. Don't make purchase decisions on them. The first evidence-backed report lands within a few days.

Latest comparison

The first article is a launch scaffold based on a defined rubric. Replace seed scores with repeated live test evidence before marking it as final.

Codex

verified local fixes with a transparent cost signal

89/100 seed average across coding, research, workflow, and cost clarity.

Claude Code

the leanest, most idiomatic patch — it matched the maintainers' own fix

78/100 seed average across coding, research, workflow, and cost clarity.

Gemini CLI

the deepest read of a bug, if you can spare the speed

87/100 seed average across coding, research, workflow, and cost clarity.

What gets measured

TripleBench favors repeatable work over vibes: same prompt, same acceptance criteria, separate notes for tool limits and human setup.

Metric Question Weight
Coding execution Can the tool make correct local changes and verify them? 35%
Research quality Does it find current, relevant, source-backed information? 20%
Workflow continuity Can it preserve state, recover from blockers, and hand off cleanly? 30%
Cost clarity Can an operator predict the subscription, token, or usage cost? 15%