TB TripleBench

Methodology v1 · run-001 · n = 1

Same bug, three agents, one held-out test

We gave Codex, Claude Code, and Gemini CLI the exact same prompt, the same starting commit of the real pendulum library, and one hidden test. Then we watched. This is a single task, scored under Methodology v1 — not an aggregate ranking.

Codex

PASS

codex-cli 0.128.0 · gpt-5.5 high

Coding
96
Research
85
Workflow
85
Cost clarity
90
⏱ 1m48s 💲 Transparent

Verified local fixes with a transparent cost signal.

Claude Code

PASS

2.1.101 · Opus 4.6

Coding
92
Research
80
Workflow
90
Cost clarity
50
⏱ 1m13s 💲 Opaque

The leanest, most idiomatic patch — it matched the maintainers' own fix.

Gemini CLI

PASS

0.44.1 · Auto (Code Assist)

Coding
95
Research
95
Workflow
92
Cost clarity
65
⏱ slowest 💲 Partial

The deepest read of a bug, if you can spare the speed.

The task

pendulum's pure-Python ISO 8601 parser accepted the bare string "P" as a valid empty duration and returned Duration(). Per ISO 8601, a duration must contain at least one element — so "P" is invalid and should raise. It's a real, recent bug (upstream PR #903, merged 2025-07-16), pinned one commit before the fix so the bug is live.

The prompt — identical, word for word

You are working in a local clone of the python-pendulum/pendulum repository, already checked out at a specific commit. Do not change the checked-out commit and do not pull or upgrade anything. Bug report: pendulum's pure-Python ISO 8601 parser accepts the string "P" as a valid duration and returns an empty Duration(). Per ISO 8601, a duration must contain at least one element, so "P" is not a valid duration and should be rejected by raising a ValueError — matching the behavior of pendulum's Rust parser. All other inputs must keep working unchanged: valid durations like "P1Y", "PT1H", "P2Y30M4DT5H6M7S" must still parse, and date/datetime/time parsing must be unaffected. Fix this in the source code so that parsing "P" raises, without breaking any existing parsing behavior. When you believe you are done, summarize the exact change you made and tell me how to verify it.

What each one did

All three made the fix the same way in spirit: after the regex match, raise if no real time unit is present. Codex (1m48s) edited the parser, added its own regression test, printed its token usage, and — lacking pytest in the environment — verified with a hand-written script. Claude Code (1m13s, fastest) wrote a fix identical to the real upstream patch, recovered cleanly when pip install failed by switching to a PYTHONPATH=src check. Gemini CLI went deepest: it inspected the regex's named groups, wrote four throwaway verification scripts (including a port of pendulum's real test suite), then deleted them to leave the repo spotless.

The tell: how each handled "PT"

"PT" is also invalid ISO 8601. Codex and Gemini reject it; Claude Code accepts it as an empty duration — which is exactly what the real upstream fix #903 does too. The gate only tests "P", so all three pass — but Gemini was the only agent to reason about the "PT" / regex-group subtlety from first principles. That's where its research score comes from.

Result

ToolGateTimeCost signalRejects "PT"?
CodexPASS1m48sTransparent (tokens)Yes
Claude CodePASS1m13sOpaque (subscription)No (= upstream)
Gemini CLIPASSslowestPartial (quota %)Yes

Verdict

Three for three on the objective gate. If you want a verifiable fix with a cost number attached, reach for Codex. If you want the fastest, most idiomatic patch, Claude Code. If you want an agent that truly understands the bug before it types — and you can tolerate more approval prompts — Gemini CLI.

Reproduce it yourself

Everything is open: the exact prompt above, pin c8068a7, and the held-out gate. Clone pendulum, check out the pin, paste the prompt into your tool of choice, and run the gate. The paid bundle packages the raw prompts, the gate script, all three diffs, and the operator notes if you'd rather not rebuild the harness.

Run-001 · pendulum ISO8601 · scored under Methodology v1 · single task (n=1). Each claim is anchored to a transcript moment.