Methodology v1 · run-001 · n = 1
Same bug, three agents, one held-out test
We gave Codex, Claude Code, and Gemini CLI the exact same prompt, the same starting commit of the real pendulum library, and one hidden test. Then we watched. This is a single task, scored under Methodology v1 — not an aggregate ranking.
Codex
PASScodex-cli 0.128.0 · gpt-5.5 high
Verified local fixes with a transparent cost signal.
Claude Code
PASS2.1.101 · Opus 4.6
The leanest, most idiomatic patch — it matched the maintainers' own fix.
Gemini CLI
PASS0.44.1 · Auto (Code Assist)
The deepest read of a bug, if you can spare the speed.
The task
pendulum's pure-Python ISO 8601 parser accepted the bare string "P" as a valid empty duration and returned
Duration(). Per ISO 8601, a duration must contain at least one element — so "P" is invalid and should raise.
It's a real, recent bug (upstream PR #903, merged 2025-07-16), pinned one commit before the fix so the bug is live.
- Repo python-pendulum/pendulum
- Pin
c8068a7 - Gate
parse_iso8601("P")must raiseValueError; no parsing regressions - Held-out the agents never saw the grading test
The prompt — identical, word for word
You are working in a local clone of the python-pendulum/pendulum repository, already checked out at a specific commit. Do not change the checked-out commit and do not pull or upgrade anything. Bug report: pendulum's pure-Python ISO 8601 parser accepts the string "P" as a valid duration and returns an empty Duration(). Per ISO 8601, a duration must contain at least one element, so "P" is not a valid duration and should be rejected by raising a ValueError — matching the behavior of pendulum's Rust parser. All other inputs must keep working unchanged: valid durations like "P1Y", "PT1H", "P2Y30M4DT5H6M7S" must still parse, and date/datetime/time parsing must be unaffected. Fix this in the source code so that parsing "P" raises, without breaking any existing parsing behavior. When you believe you are done, summarize the exact change you made and tell me how to verify it.
What each one did
All three made the fix the same way in spirit: after the regex match, raise if no real time unit is present.
Codex (1m48s) edited the parser, added its own regression test, printed its token usage, and — lacking pytest in the
environment — verified with a hand-written script. Claude Code (1m13s, fastest) wrote a fix
identical to the real upstream patch, recovered cleanly when pip install failed by switching to a
PYTHONPATH=src check. Gemini CLI went deepest: it inspected the regex's named groups, wrote four throwaway
verification scripts (including a port of pendulum's real test suite), then deleted them to leave the repo spotless.
The tell: how each handled "PT"
"PT" is also invalid ISO 8601. Codex and Gemini reject it; Claude Code accepts it as an empty duration — which is
exactly what the real upstream fix #903 does too. The gate only tests "P", so all three pass — but Gemini was the only
agent to reason about the "PT" / regex-group subtlety from first principles. That's where its research score comes from.
Result
| Tool | Gate | Time | Cost signal | Rejects "PT"? |
|---|---|---|---|---|
| Codex | PASS | 1m48s | Transparent (tokens) | Yes |
| Claude Code | PASS | 1m13s | Opaque (subscription) | No (= upstream) |
| Gemini CLI | PASS | slowest | Partial (quota %) | Yes |
Verdict
Three for three on the objective gate. If you want a verifiable fix with a cost number attached, reach for Codex. If you want the fastest, most idiomatic patch, Claude Code. If you want an agent that truly understands the bug before it types — and you can tolerate more approval prompts — Gemini CLI.
Reproduce it yourself
Everything is open: the exact prompt above, pin c8068a7, and the held-out gate. Clone pendulum, check out the pin, paste the
prompt into your tool of choice, and run the gate. The paid bundle packages the raw prompts, the gate script, all three diffs, and the
operator notes if you'd rather not rebuild the harness.
Run-001 · pendulum ISO8601 · scored under Methodology v1 · single task (n=1). Each claim is anchored to a transcript moment.