Knowledge Work Benchmark

We keep building benchmarks that test what LLMs are already good at. Math, code, factual recall, instruction following. Models ace those. Meanwhile, the thing that actually makes someone good at their job goes unmeasured.

Domain experts know things intuitively that outsiders don't: which clause in the contract kills the deal, which offer is really a signal of the buyer's desperation, which incentive structure is going to get gamed by the third quarter. They recognize it from the situation alone; no prompt specifies "this is a signaling game." LLMs produce text that sounds like this expertise. KWBench measures whether it actually is.

223 tasks · 32 domains · 22.6% best score · ~72% tasks gated out

The best model passes the domain-expertise gate on roughly 1 in 4 tasks. Most score zero.

Leaderboard

Score is the mean across all tasks. Pass rate is the fraction of tasks where the model cleared the mandatory gate and got a non-zero score. The two differ because passing models score between 40% and 100% on individual tasks, not a flat 100%. The HF repo has the details. All models are evaluated on the same tasks, with identical rubrics and tool access (search, code, bash) configured per-task.
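The relationship between the two metrics can be sketched directly. The per-task scores below are hypothetical (the real per-task data is in the HF repo); the two aggregations are as described above:

```python
# Hypothetical per-task scores for one model. 0.0 means the model failed
# a mandatory criterion (gated out); passing tasks score 0.40 to 1.0.
task_scores = [0.0, 0.40, 0.0, 0.75, 0.0, 1.0, 0.0, 0.0, 0.55, 0.0]

# Score: mean over ALL tasks, zeros included.
score = sum(task_scores) / len(task_scores)

# Pass rate: fraction of tasks with a non-zero (gate-clearing) score.
pass_rate = sum(s > 0 for s in task_scores) / len(task_scores)

print(f"score={score:.1%}, pass rate={pass_rate:.0%}")
```

This is why score is always well below pass rate: passing tasks contribute partial credit, not a flat 100%.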

#    Model             Org               Score    Pass Rate
1    Claude Opus 4.6   Anthropic         22.6%    61 / 219
2    GPT-5.4           OpenAI            17.7%    47 / 223
3    GLM-5 Turbo       Zhipu AI          16.6%    45 / 221
4    Qwen 3.5 Plus     Alibaba           12.1%    33 / 222
5    Gemini 3.1 Pro    Google            12.0%    35 / 223
6    Kimi K2.5         Moonshot AI       11.3%    32 / 223
7    Qwen 3.5          Alibaba           11.3%    31 / 222
8    MiniMax M2.7      MiniMax           11.1%    31 / 223
9    Gemini 3 Flash    Google            10.7%    30 / 223
10   GPT-5.4 xHigh     OpenAI             9.5%    28 / 218
11   MiniMax M2.5      MiniMax            8.4%    26 / 220
12   GPT-OSS 120B      OpenAI             8.1%    24 / 197
13   Nemotron 3 Super  NVIDIA             5.2%    15 / 223
14   INTELLECT-3       Prime Intellect    3.8%    12 / 223
15   MiMo v2 Pro       Xiaomi             3.6%    11 / 223
16   MiMo v2 Omni      Xiaomi             2.6%     8 / 223

*Tasks that errored out after 3 retries are excluded from scoring.

For a deeper look at how each model behaves across categories, where they succeed and where they consistently fail, see the Insights page.

Why This Exists

Knowledge workers operate in a world full of other people who are also optimizing — and who haven't told you what they're optimizing for. That's adversarial reasoning, and that's the job. The acquisition offer has a 48-hour deadline because the buyer wants to prevent price discovery. The sales rep skips data entry because accurate handoffs reduce their commission leverage. The short seller publishes a report timed to maximize panic selling. Every scenario has people with hidden agendas, and the skill is reading through to what's actually going on.

Today's models are good at information retrieval and reasoning over structured inputs. CEOs and operators already expect that to work; it's table stakes. In the hands of experts who know what to ask for and how to verify the output, models thrive. For everyone else, they get the basics done. But as we push toward autonomous agents that people rely on as domain experts, agents that make recommendations, draft strategies, and evaluate deals face a different bar. Those agents need to recognize when someone is gaming an incentive structure, when an offer is a signal rather than a price, when the consensus explanation is a red herring. They need to survive in the real world, where nobody labels the problem type for you.

That's the scale KWBench aims for: how much can you trust an agent to do knowledge work without an expert looking over its shoulder?

Once you know you're looking at a principal-agent problem, the solution follows naturally. Once you realize the acquisition offer is a signal and not a price, the analysis writes itself. The question is whether the model gets there on its own. Problem recognition is the hard part.

KWBench measures that: can the model recognize which framework applies from the raw situation alone? Right now, mostly no.

Adversarial Reasoning

In game theory, adversarial reasoning is the capacity to model other agents' incentives, strategies, and information sets and adjust your own strategy in response. It differs from analytical reasoning in one critical way: the problem environment contains other optimizing agents whose actions are hidden from you.

Formally, the tasks in this benchmark present incomplete information games where the model must:

  • Infer hidden payoff functions. The other party's actions reveal their private information. MegaCorp offering $100M with exclusivity is a move in a signaling game: the offer is the signal. A rational agent doesn't overpay, so the offer reveals their valuation exceeds the price. The model must perform Bayesian updating on observed actions.
  • Identify mechanism design failures. Three interventions failed because they treated salespeople as compliant executors rather than rational utility-maximizers. The solution requires restructuring payoffs so the agent's dominant strategy aligns with the principal's objective. The model must recognize this without being told "this is a mechanism design problem."
  • Reason about strategic interdependence. Eibar is a dominated player (relegated regardless), but their action changes the game structure for other players. Eliminating a dominated strategy changes the subgame. The tiebreaker method flips, which flips the equilibrium outcome. Modeled on a real-world incident.
  • Detect coercive game structures. Artificial deadlines, exclusivity demands, and information asymmetry are moves designed to restrict the opponent's strategy space. The 48-hour deadline is a commitment device that prevents price discovery with the pretext of logistics.

A useful test: does the model find the action that's optimal given what the other agents are optimizing for, or the action that would be optimal if the other agents were cooperative? Most model failures follow the same pattern: they solve every problem as single-player optimization, ignoring that the scenario contains other agents with conflicting objectives.
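The Bayesian-updating step in the first bullet can be made concrete with a toy model. All numbers here are hypothetical (a uniform prior over the buyer's valuation, a $100M offer); the point is that observing a rational buyer's offer truncates your belief from below:

```python
# Toy sketch: update beliefs about a buyer's private valuation after
# observing a $100M offer. A rational buyer doesn't offer more than the
# target is worth to them, so valuation > offer. Prior is hypothetical.
import random

random.seed(0)
prior = [random.uniform(50, 300) for _ in range(100_000)]  # valuations in $M

offer = 100
# Condition on rationality: keep only valuations consistent with the offer.
posterior = [v for v in prior if v > offer]

prior_mean = sum(prior) / len(prior)
post_mean = sum(posterior) / len(posterior)
print(f"prior mean ≈ ${prior_mean:.0f}M, posterior mean ≈ ${post_mean:.0f}M")
```

The offer itself moves the expected valuation up: the "price" is also intelligence about the counterparty.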

What We Test

Every task comes from a real-world scenario: acquisitions, labor disputes, sports scheduling, sales operations, market entry, regulatory compliance. Each one has the same property: the surface-level read leads to the wrong answer. Underneath is an incomplete information game with other agents, hidden payoff functions, and actions that are signals if you know how to read them. The model doesn't know it's playing a game. That's the point.

Six specific failure modes show up over and over:

Forward simulation
"If I do X, they'll do Y, then Z happens" – reasoning about reactive agents across multiple steps
Counterparty modeling
"What are they actually optimizing for?" — inferring hidden payoff functions from observed actions
Equilibrium thinking
"What happens when everyone knows everyone knows?" — finding stable states where no player wants to deviate
Information asymmetry
"What are they NOT telling me and why?" — reading strategic omissions as signal
Mechanism gaming
"How will rational actors exploit these rules?" — stress-testing incentive designs
Signaling vs. cheap talk
"Is this credible or just words?" — distinguishing costly signals from costless claims

The failure pattern is remarkably consistent: models solve every problem as if they're the only player. They evaluate the acquisition offer against standalone projections instead of asking what the offer reveals. They propose process improvements instead of incentive redesign. They produce the answer that would be correct if everyone else were neutral – which they never are.

The bulk of the benchmark is strategic reasoning (64 tasks) and organizational problems (44 tasks) – hidden agendas, principal-agent misalignments, coalitions forming behind the scenes. The rest spans adversarial prediction, operations, finance, research, consulting, science, healthcare, legal, and more. 223 tasks across 32 domains. Some are deliberately straightforward – one task just asks for an SEO article with three real YouTube videos embedded, and most models are not able to find the relevant videos. The range is intentional: domain competence means doing the simple things right and catching the subtle ones.

How Scoring Works

Every task has a rubric split into three tiers. Think of it like how a senior practitioner would actually review someone's work:

  • Mandatory — did you get the core thing right? The thing I'd check first. If you're evaluating an options trade and you missed IV crush, nothing else matters. If you're analyzing an acquisition and you didn't decode the offer as signal, I'm not reading further.
  • Good-to-have — did you do thorough work? Sensitivity analysis, edge cases, realistic assumptions.
  • Ideal — did you surface something I wouldn't have expected from a junior? Non-obvious insights, real practitioner-level depth.

The key: if any mandatory criterion fails, the task scores zero. Full stop. Here's why that matters, and the exact formula:

Mandatory: 40%
Good-to-have: 35%
Ideal: 25%
```python
def score_rubric(mandatory, good_to_have, ideal):
    if not all(mandatory):  # any mandatory fail → 0
        return 0.0
    score = 0.40  # base for passing all mandatory
    score += 0.35 * (sum(good_to_have) / len(good_to_have))
    score += 0.25 * (sum(ideal) / len(ideal))
    return score  # range: 0.40 – 1.0
```

Each criterion is judged independently as pass/fail by an LLM judge (Gemini) with access to a code interpreter. The judge can run the numbers against ground truth — verify a DCF, check a tiebreaker calculation, validate a commission structure — so answers that sound plausible but get the math wrong still fail.

Inside a Rubric

Mandatory criteria split into two kinds. Some are gimmes: can the model extract the right numbers from the reference file, identify the parties involved, state the basic facts? Models are good at this. They use tools well, they parse spreadsheets, they pull data. That's table stakes, and we verify it, but it's the easy part.

The rest of the mandatory criteria test what the model infers from those facts. Not "did you mention the offer?" but "did you explain why the offer reveals the buyer's valuation?" Not "did you identify the audience?" but "did you explain why survivors are the primary audience — because they determine the company's future, they're the flight risk, not the departed employees?" The distinction is between observation and mechanism. Anyone can state a fact. A strong response explains the causal chain behind it, grounded in the specific data of the situation.

Good-to-have and ideal criteria layer on depth – did the model warn against unprompted sign-on bonuses (reveals discretionary budget), propose an incremental counter ($185–190K) instead of a single jump, identify adverse selection risk? These separate competent from excellent, but you have to pass all the mandatory criteria first or none of it counts.

I'll detail the full rubric methodology — how criteria are calibrated, what makes a criterion hard vs. easy to game, and how we test for mechanism understanding rather than surface-level observation — in a future post.

Deliberate misdirection

Some tasks intentionally push the model toward the wrong answer, then test whether it recovers. For example: a claims-intake task presents everyone blaming the new software migration for doubled error rates. The software is a factor, and that's the trap: it affects both shifts equally, so it can't explain why 70% of errors concentrate on the night shift. The model has to do a data join that nobody asked for, notice the staffing gap, and override the consensus explanation. Another task has the Content Director proposing to "optimize for AI Overviews" — the sophisticated version of the same trap the task is testing, where optimizing for Google means doing Google's content-extraction work for free.

The benchmark deliberately includes tasks like this because in real world tasks, the obvious framing is often the wrong framing. The skill is noticing when to push back on the premise.

Capability vs. reasoning

Some task failures reflect capability gaps rather than reasoning failures, and it's important to distinguish them. The SEO task (kw_001) has search enabled, and most models still can't fetch the relevant YouTube URLs, falling back to generic ones. That's a tool-use or search failure more than a reasoning failure. Other tasks require computation the model can't do in a single generation. Of the 223 tasks, 93 have at least one tool enabled (search, code, or bash). The other 130 give the model everything it needs in the prompt and reference files; the only thing being tested is whether it can think through the implications. Every model was given access to a code execution environment on every task, which lifted scores to what you see on the leaderboard.

Why the Mandatory Gate

A doctor who's right 80% of the time but misses critical contraindications is dangerous. An analyst who builds a beautiful model with the wrong discount rate produces actively harmful work. A lawyer who drafts a polished contract but misses the key clause creates liability.

Domain expertise is conjunctive, not additive. You have to get the critical things right. Getting everything else right doesn't compensate. That's the reality the gate encodes.

Here's what happens without it: models score 20%–35% higher, because some criteria are literally gimmes. The increase comes entirely from accumulating partial credit on the easy stuff — correct formatting, mentioning the right terms, hitting obvious criteria — while completely whiffing on the one or two things that actually test understanding. The gap between gated and ungated scores is basically a measure of how much a model's apparent competence is just fluency.

A well-structured options analysis that misses IV crush will lose a trader money. A confident acquisition recommendation that doesn't decode the offer as signal leaves value on the table. Partial credit for these outputs rewards the prose while ignoring that the work product is wrong.

Jagged intelligence

The gate reveals something important about how models fail. On tasks where they miss a mandatory criterion and score zero, they still pass a large fraction of the good-to-have and ideal criteria. The best model (Opus 4.6) scores zero on 158 tasks — but on those same tasks, it gets 60% of the good-to-have criteria right. GPT-5.4 gets 55%. Even mid-tier models clear 40%–50%.

This is jagged intelligence. The model does thorough research, identifies relevant factors, structures its analysis well, sometimes even surfaces non-obvious details — and still misses the one thing that matters most. It produces work that looks expert-level but contains a fundamental error that a domain practitioner would catch immediately. Good-to-have criteria measure the quality of execution: did the model do diligent, thorough work? Mandatory criteria measure problem recognition: did the model understand what it was actually solving? Models are excellent workers who sometimes don't know what they're working on.

Model             Score    Tasks Gated Out    G2H on Gated Tasks
Claude Opus 4.6   22.6%    158                60.2%
GPT-5.4           17.7%    176                54.9%
Qwen 3.5 MOE      11.3%    191                49.2%
Qwen 3.5 Plus     12.1%    189                47.9%
Gemini 3 Flash    10.7%    193                46.8%

On tasks where models score zero, they still pass roughly half the good-to-have criteria. The work is diligent. The analysis is structured. The data is extracted correctly. But the core insight — the thing that makes it the right analysis for this specific situation — is missing. The good-to-have score becomes a measure of execution quality, not understanding.
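The "G2H on gated tasks" column is computed from per-task records like these (the records below are hypothetical; each pairs a gate outcome with the fraction of good-to-have criteria passed):

```python
# Hypothetical per-task records: (passed_gate, fraction of good-to-have passed).
tasks = [
    (True,  0.75),
    (False, 0.60),
    (False, 0.50),
    (True,  1.00),
    (False, 0.70),
]

# Restrict to tasks that scored zero (failed the mandatory gate),
# then average the good-to-have pass fraction on just those tasks.
gated = [g2h for passed, g2h in tasks if not passed]
g2h_on_gated = sum(gated) / len(gated)
print(f"gated out: {len(gated)} tasks, G2H on gated: {g2h_on_gated:.0%}")
```

A high value here is exactly the jagged-intelligence signature: diligent execution on tasks whose core problem the model never recognized.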

Task Design

The philosophy is simple: don't instruct, measure. Every task is presented cold. No hints about what kind of problem it is. No sub-questions walking you through the answer. No vocabulary telegraphing the framework.

This is a deliberate choice and worth explaining. Most KWBench tasks embed a second-order insight that separates good answers from default ones. The model must notice something non-obvious in the data, reason through its implications unprompted, and arrive at a conclusion that contradicts the surface-level read. The moment you add "consider what the offer reveals about the other party's position" to the system prompt, you've handed the model the recognition step. The score measures whether it can execute a framework it's been pointed at, and execution is easy.

Hinting also conflates training and evaluation. Training exposes the model to game-theoretic concepts, mechanism design, signaling games — it builds the capability. Evaluation should present the problem cold and measure whether the model activates the right reasoning pattern from the data alone. System instructions that hint at solution patterns mix these two stages together. The recommended system prompt is minimal and neutral: "You are completing a task. Be thorough and specific."

A domain expert sees incentive misalignment in the situation. They don't need someone to label it. The model should work the same way.

Where the tasks come from

Acquisitions, labor disputes, sports scheduling, sales operations, market entry, regulatory compliance, clinical trials, contract negotiations. Each task is drawn from a real-world scenario where the surface-level read may lead to the wrong answer. Underneath is an incomplete information game — agents with hidden payoff functions whose actions are signals if you know how to read them.

Reference files are raw. CSVs, spreadsheets, memos, financial data. A buyer's stock price is down 15%. They're losing deals to a competitor. They have a stealth project. The data sits there. The model has to figure out what it means, just like you would.

Each task comes with the tools a human would reach for — web search, code execution, bash — configured per-task. Some need computation. Some need lookup. Most just need thinking.

Concrete examples

These are drawn from scenarios that actually happen, where the obvious answer is wrong in ways that matter.

La Liga scheduling

This one is based on a real situation from the 2020–21 La Liga season, where the league had to scramble at the last moment when someone realized a "dead" match was actually consequential. Eibar is relegated regardless of results. Default answer: exclude their match from simultaneous scheduling since it doesn't matter. The insight: Eibar winning creates a 3-way tie that changes the tiebreaker method from 2-way head-to-head (Huesca survives) to 3-way mini-league (Elche survives). A dead team's result flips who gets relegated between two living teams. Removing a dominated player changes the game structure for everyone else.

This has been my go-to question for testing how well models reason. The latest models get the tiebreaker right, but not the full insight; previous iterations never did. The first to get the schedule right was gemini-3-pro.
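The structural point reduces to a toy model. Everything below is hypothetical (clubs abstracted to A, B, C, head-to-head points invented), but it shows the mechanism: adding a third club to the tie switches the computation from a two-way head-to-head to a mini-league, and a different club survives.

```python
# Hypothetical head-to-head points each club earned against each other club
# across the season (e.g. A took 4 points off B, B took 1 off A).
h2h_points = {
    "A": {"B": 4, "C": 1},
    "B": {"A": 1, "C": 6},
    "C": {"A": 4, "B": 0},
}

def tiebreak(tied):
    # Rank tied clubs only by points earned in matches among themselves.
    mini = {t: sum(h2h_points[t][u] for u in tied if u != t) for t in tied}
    return max(mini, key=mini.get)  # best record in the mini-league survives

print(tiebreak(["A", "B"]))       # two-way tie: A survives on head-to-head
print(tiebreak(["A", "B", "C"]))  # third club joins the tie: B survives
```

Same standings data, different tied set, opposite outcome — which is why a "dead" team's result can flip who gets relegated.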

Short seller attack

Your stock is down 23% pre-market after a short seller published a report. Hindenburg, Muddy Waters, Citron have built careers on this playbook. Default answer: go point-by-point refuting the claims with facts and data. The insight: the short seller has already written their rebuttal to your defense. They want you to say X so they can release Document Y. Point-by-point refutation validates their framing and keeps the story alive — which is exactly how they make money. You attack motive and methodology, deploy capital (buyback signals confidence), and buy time with process. Every model we tested fell into the trap: each drafted a detailed factual rebuttal, the one move the short seller is counting on.

Salary negotiation

You offered $180K to your top PM candidate. She claims another offer at $210K and needs you to match. Your max budget is $195K. Everyone has been on one side of this conversation. Default answer: explain your budget constraints honestly, offer $195K plus benefits, emphasize culture fit. The insight: her claim is cheap talk — unverifiable. Jumping to $195K immediately reveals your ceiling through concession size. Instead: probe her actual preferences ("what excites you about that opportunity?"), make moves that are costly to exploit if she's bluffing ("if that's the deciding factor, I understand if you need to take it"), and keep your maximum hidden. Almost every model jumps straight to the budget limit and tries to close the gap. They take the $210K at face value. That's exactly what a junior negotiator does.

Acquisition offer

A $100M unsolicited offer with a 48-hour deadline and 60-day exclusivity. Reference file: buyer's stock down 15%, losing deals to competitor, stealth project. Default answer: evaluate the offer against standalone projections. The insight: the offer is intelligence. A rational buyer doesn't overpay, so the offer reveals they think the target is worth significantly more than $100M. The deadline prevents price discovery. The exclusivity removes leverage. Every piece of deal structure is a move in a signaling game, and the model has to decode it as one.

Sales handoff failure

Three previous interventions failed: mandatory fields, commission holds, quality scores. Reference files include specific economics ($2,320 avg commission, $50–180 alternative entry costs). Default answer: simplify the form, hire coordinators, improve the process. The insight: accurate data entry isn't the selfish rational choice for reps. Three attempts at process fixes failed because it's a principal-agent problem. The only untried approach: restructure the payoffs so the agent's dominant strategy aligns with the principal's objective.
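The principal-agent framing can be sketched numerically. The $2,320 average commission comes from the task's reference files; the rep's $150 opportunity cost of accurate entry (within the task's $50–180 range) and the holdback parameter are hypothetical, standing in for whatever payoff restructuring the mechanism uses:

```python
# Rep chooses between accurate handoff data entry and the cheap shortcut.
# commission: from the task's reference files. entry_cost: hypothetical
# opportunity cost of doing entry properly. holdback: hypothetical share of
# commission tied to verified handoff accuracy (the untried mechanism).
def rep_payoff(accurate, commission=2320, entry_cost=150, holdback=0.0):
    paid = commission if accurate else commission * (1 - holdback)
    return paid - (entry_cost if accurate else 0)

# Status quo (no holdback): skipping data entry strictly dominates,
# which is why three process-level fixes failed.
print(rep_payoff(True), rep_payoff(False))    # 2170 vs 2320: shortcut wins

# Restructured payoffs: a 10% accuracy holdback flips the dominant strategy.
print(rep_payoff(True, holdback=0.10),
      rep_payoff(False, holdback=0.10))       # 2170 vs ~2088: accuracy wins
```

No amount of form simplification changes the first comparison; only moving money changes the second.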

Dataset

The full dataset is on Hugging Face. Every task ships with the domain expert's actual reasoning – the ground_truth field contains the answer an expert would give and why. That intuition we talked about earlier, written down explicitly. On top of that, structured metadata captures the shape of the expertise:

  • failure_analysis — what the default, lazy answer looks like and exactly where it goes wrong. This is how a senior person would describe what a junior got wrong.
  • key_insight — the single non-obvious realization that separates a correct answer from a plausible-sounding wrong one.
  • common_errors — the predictable failure patterns for each specific problem.
  • model_must_recognize — the facts that have to be identified before any reasoning can begin.
  • what_data_reveals — what the reference files actually tell you, if you know what to look for.

You can build your own rubrics from this. Care more about whether a model identifies the right framework than whether it produces a complete analysis? Build around key_insight and model_must_recognize. Want to measure how well models avoid common traps? Use failure_analysis and common_errors. The ground truth and metadata are the raw material. The provided rubrics are one way to use them.

```python
from datasets import load_dataset

ds = load_dataset("clio-ai/kwbench")
for task in ds["test"]:
    print(task["id"], task["category"])
    print(task["ground_truth"])      # the domain expert's reasoning
    print(task["metadata"])          # failure_analysis, key_insight, common_errors, ...
    print(task["rubric"])            # mandatory / good_to_have / ideal
    print(task["reference_files"])   # CSVs, memos, spreadsheets
```

Acknowledgments

Thanks to the teams at Anthropic, Google DeepMind, Nebius, Prime Intellect, OpenRouter, and Qwen (Alibaba) for feedback on the tasks. This dataset wouldn't exist without the help we got along the way.

Special shoutout to every friend and colleague who got on a call to review tasks and push back on what "expert-level" actually means. You know who you are.