Opus 4.6 vs GPT-5.4
Both models are the best from their respective labs. Their aggregate scores are 22.6% and 17.7%. You'd expect the stronger model to be a superset of the weaker one — solving everything GPT solves, plus more. That's not what happens.
35 tasks that Opus solves and GPT doesn't. 21 tasks that GPT solves and Opus doesn't. On the 26 tasks they both pass, their quality is virtually identical — Opus averages 0.816 and GPT averages 0.819. They're equally good when they both see the insight. They just see different insights.
This is not a story about one model being "better." It's two different patterns of domain recognition, with only a third of their successes overlapping. GPT-5.4 passes 21 tasks that the overall top-scoring model misses entirely.
No Model Is a Superset
Across the top 8 models, 113 of 223 tasks are passed by at least one model. But the overlap is remarkably thin.
44 tasks are solved by exactly one model. Not by the best model, or the biggest model — by a specific model that happened to have the right reasoning pattern for that specific problem. Opus uniquely solves 14 of those. GPT uniquely solves 10. Gemini 3.1 Pro uniquely solves 7. Even Kimi K2.5 and MiniMax M2.7 each uniquely contribute tasks no other model passes.
Task difficulty distribution
Counting how many of the 8 models pass each task, the distribution is bottom-heavy. The most common outcome is a task solved by exactly one model. Full consensus, with all 8 models passing the same task, happens twice out of 113.
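A minimal sketch of that tally, assuming the results are available as a mapping from model name to the set of task ids it passed (the names and toy data below are placeholders, not the real benchmark output):

```python
from collections import Counter

def pass_count_distribution(passes: dict[str, set[str]]) -> Counter:
    """passes maps model name -> set of task ids that model passed.

    Returns a Counter: number of tasks passed by exactly k models.
    """
    every_passed_task = set().union(*passes.values())
    distribution = Counter()
    for task in every_passed_task:
        k = sum(task in solved for solved in passes.values())
        distribution[k] += 1
    return distribution

# Toy example (not real results): two tasks passed by one model each,
# two tasks passed by two models each.
toy = {
    "model_a": {"kw_001", "kw_002", "kw_003"},
    "model_b": {"kw_002", "kw_004"},
    "model_c": {"kw_003"},
}
print(pass_count_distribution(toy))
```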
Covering the Benchmark
If you wanted to maximize the number of tasks passed, how many models would you need? Start with the best, then greedily add whichever model covers the most remaining tasks.
The best single model covers just over half. You need three to get to 84%. You need all eight to reach 100% — every model in the stack contributes tasks that no other model solves.
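The greedy pass itself is a few lines. A sketch under the same assumption as above, that each model's passed task ids are available as a set; the names are illustrative:

```python
def greedy_cover(passes: dict[str, set[str]]) -> list[tuple[str, int]]:
    """Greedily pick models to maximize the number of distinct tasks covered.

    passes maps model name -> set of task ids that model passed.
    Returns the pick order as (model, cumulative tasks covered) pairs.
    """
    covered: set[str] = set()
    remaining = dict(passes)
    order: list[tuple[str, int]] = []
    while remaining:
        # The next pick is whichever model adds the most not-yet-covered tasks.
        best = max(remaining, key=lambda m: len(remaining[m] - covered))
        if not remaining[best] - covered:  # every further model is redundant
            break
        covered |= remaining.pop(best)
        order.append((best, len(covered)))
    return order
```

Applied to the top-8 pass sets, this is the procedure behind the coverage figures above.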
Across all 17 models tested, 116 of 223 tasks are passed by at least one model. The remaining 9 models beyond the top 8 add just 3 more tasks. 107 tasks remain unsolved.
This is the practical implication. If you're building a system that needs reliable domain expertise across diverse problems, no single model covers the space. The models aren't converging on the same capability. They're developing different, partially overlapping competencies. Ensembling or routing between models is more than a performance trick, because each model recognizes patterns the others are blind to.1
1 I see the same pattern in my own work on nanoevolve — using evolutionary optimization to optimize the optimizer in nanochat. Progress is remarkably slow when using only GPT-5.4 or only Opus 4.6; each model misses things the other catches. Real momentum comes from having one generate and the other review.
What Remains Unsolved
110 of 223 tasks are not passed by any of the top 8 models. These aren't edge cases or trick questions or gotchas. They span the same domains as the solvable tasks — acquisitions, operations, strategy, finance. They just require a level of problem recognition that current models haven't reached.
The 113 solvable tasks aren't the easy half. Many of them are solved by a single model that happened to activate the right reasoning pattern while seven others missed it. The boundary between "solvable" and "unsolvable" is whether a given model's training happened to produce the specific recognition capability that task demands.
Domain expertise isn't something you measure along a single axis. It's a surface with peaks and valleys, and every model has a different topography.
What This Says About the Benchmark
A natural question: does this divergence mean the benchmark is noisy, or the mandatory gate too strict? Three patterns in the data argue it's neither.
Models pass different tasks, not random subsets. If the gate were introducing noise — randomly failing models on tasks they should pass — you'd expect high overlap between top models. They'd all be "rolling the same dice." Instead, the Jaccard similarity between the top two models is 31.7%. Each model has a consistent, distinct pattern of what it recognizes. That's signal, not noise.
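That 31.7% follows directly from the overlap counts earlier: 26 shared passes out of 35 + 21 + 26 = 82 tasks passed by either model, and 26 / 82 ≈ 0.317.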
44 tasks are solved by exactly one model. If the gate were arbitrary, a task passable by one model should be passable by others with similar capability. Instead, specific models have specific recognition capabilities that others lack entirely. Opus solves kw_030 (the La Liga tiebreaker insight) while GPT doesn't. GPT solves kw_042 (a DCF task) while Opus doesn't. These aren't random coin flips — they reflect genuinely different reasoning patterns.
When two models both pass, their quality is identical. On the 26 tasks both Opus and GPT pass, their average scores are 0.816 and 0.819. Nearly indistinguishable. This means the gate is measuring a real binary — either you recognized the problem or you didn't. Once you do, you produce good work. If the gate were too strict, you'd see quality variance among passing results. You don't.
The gate is strict because the thing it tests is genuinely binary. A doctor either catches the contraindication or doesn't. An analyst either decodes the offer as a signal or evaluates it at face value. There's no "partial credit" version of problem recognition. The divergence between models is not explained by harsh scoring. It's early evidence that domain expertise is a jagged capability, and current models are developing it in genuinely different directions.
How Close They Get
The zero scores obscure how close models often come. On many gated-out tasks, the model misses a single mandatory criterion while nailing everything else.
Opus 4.6 has 55 tasks where exactly one mandatory criterion failed. On those tasks, it averages 67% on good-to-have and 61% on ideal criteria. Some examples:
| Task | Mandatory | Good-to-have | Ideal | Score |
|---|---|---|---|---|
| kw_009 | 5/6 | 6/6 | 7/7 | 0 |
| kw_175 | 4/5 | 5/5 | 5/5 | 0 |
| kw_064 | 4/5 | 4/4 | 4/4 | 0 |
| kw_200 | 4/5 | 5/5 | 4/5 | 0 |
| kw_031 | 6/7 | 5/6 | 5/6 | 0 |
kw_009: five out of six mandatory criteria, perfect on every good-to-have and ideal criterion. Score: zero. The model did outstanding work on everything except the one thing that defines whether the analysis is correct.
GPT-5.4 shows the same pattern: 77 near-miss tasks, averaging 62% on good-to-have when gated out by a single mandatory failure.
This is the sharpest illustration of jagged intelligence. The model isn't producing bad work. It's producing excellent work on the wrong problem. It extracts the right data, structures a thorough analysis, surfaces non-obvious details — and misses the one framing decision that an experienced practitioner would catch before reading past the first paragraph.
Honest Questions
"The benchmark is too adversarial?"
Not every task in KWBench is adversarial. kw_029 is a straightforward commodities trading crisis — no hidden agendas, no counterparty games. The data is in the prompt, the answer follows from the facts. kw_099 is a writing task that follows a 12-step process. These aren't trick questions. They test whether a model can do the job described, nothing more. The benchmark includes these deliberately, because domain competence means getting the straightforward things right and catching the subtle ones.
"Scores are low arbitrarily"
Every model scores low. The best model scores 22.6%. The interesting question isn't whether a model scores well — none do — it's which specific tasks it passes and what that reveals about its reasoning patterns. A model that passes 30 tasks different from the top model's is more interesting than one that passes the same 30. The benchmark's value is in the disaggregated signal, not the aggregate number.
"You're just testing for a specific kind of thinking"
Yes. We're testing for the kind of thinking that distinguishes a senior practitioner from a competent junior. The kind where you read the situation before you do the analysis. Where you ask "what kind of problem is this?" before you start solving it. Models are getting remarkably good at the solving part. The recognizing part is where they fall short, and that's what this benchmark measures.
I will publish a deeper analysis of why models default to cooperative framing when the situation calls for adversarial reasoning. The pattern is consistent enough to be structural, not incidental. More on that soon.
"Have you tested against a harness?"
Not yet. All evaluations so far use the same setup — identical prompts, identical tool access, same judge. I haven't yet run systematic harness-level testing (varying system prompts, temperatures, retry strategies) to check how much scores shift with infrastructure choices. That's coming. I expect some variance, and I expect the core pattern — models solving different tasks, low overlap, jagged expertise — to hold.
Implementation
Each rubric criterion is judged as a separate binary call — pass or fail, independently, one at a time. No holistic scoring, no sliding scales. The judge is Gemini 3 Flash with access to a code interpreter, so it can verify computations against ground truth rather than guessing whether an answer "sounds right"; it is instructed to grant the pass when it considers an answer partial. Opus as a judge was too harsh in my testing.
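The gate itself reduces to a few lines. A minimal sketch of the scoring logic, with made-up weights for the good-to-have and ideal tiers, since the exact aggregation isn't spelled out here:

```python
def task_score(mandatory: list[bool], good_to_have: list[bool], ideal: list[bool],
               w_good: float = 0.6, w_ideal: float = 0.4) -> float:
    """Per-task score under a hard mandatory gate.

    Each criterion is an independent binary judgment. Any mandatory failure
    zeroes the task; otherwise the score mixes the good-to-have and ideal
    pass rates. The weights are placeholders, not the real rubric weights.
    """
    def frac(xs: list[bool]) -> float:
        return sum(xs) / len(xs) if xs else 1.0

    if not all(mandatory):
        return 0.0
    return w_good * frac(good_to_have) + w_ideal * frac(ideal)

# kw_009 from the table above: 5/6 mandatory, 6/6 good-to-have, 7/7 ideal.
print(task_score([True] * 5 + [False], [True] * 6, [True] * 7))  # 0.0
```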
The eval harness is fasteval. It handles concurrent model calls, tool execution (search, code, bash), and parallel judging. A full run — 223 tasks with tool use and judging — takes 20–30 minutes end to end depending on the model's response latency and rate limits.
All models get the same system prompt, the same tool configuration per task, and the same rubric. No model-specific tuning.
Models are evaluated via the direct APIs for OpenAI, Anthropic, and Gemini; the open-source models run through Nebius, Openrouter, Prime Intellect, and Alibaba Cloud.