Benchmark table
Seed scores before the first repeated run
This table is intentionally marked as a seed dataset. It is useful for launch layout and scoring calibration, but it should not be treated as a final buyer recommendation until the live runs are repeated and linked.
| Tool | Coding | Research | Workflow | Cost clarity | Best fit |
|---|---|---|---|---|---|
| Codex | 96 | 85 | 85 | 90 | verified local fixes with a transparent cost signal |
| Claude Code | 92 | 80 | 90 | 50 | the leanest, most idiomatic patch; it matched the maintainers' own fix |
| Gemini CLI | 95 | 95 | 92 | 65 | the deepest read of a bug, if you can give up some speed |