For several years, artificial intelligence has been evaluated through a series of benchmarks that have become familiar in the ecosystem: MMLU, BIG-Bench, GSM8K, ARC. These test batteries have played a decisive role in accelerating model performance. They have contributed to structuring global technological competition. They have also shaped a very specific representation of what constitutes “a good model”: a system capable of solving a large number of standardized problems alone, from static prompts, in a context devoid of interaction.

Yet this representation is increasingly disconnected from real-world usage. In companies, government agencies, consulting firms, research laboratories, or newsrooms, AI almost never acts in isolation. It is solicited, guided, corrected, and prompted again. It has emerged as a cognitive partner, rather than simply an autonomous problem-solving engine.

This is precisely the gap addressed by the study from Christoph Riedl (Northeastern University) and Ben Weidmann (University College London), titled Quantifying Human–AI Synergy (September 2025).

Their proposition is simple in principle, but radical in its implications.
It consists of shifting AI evaluation from autonomous performance to collaborative performance. They therefore measure what a model actually enables when integrated into human interaction.

The question is no longer solely “what can AI do?” but “what does a human become when working with this AI?”

This shift in perspective profoundly transforms the very notion of performance.

The authors start from an observation now widely shared in cognitive and social sciences: intelligence, both human and artificial, is interactive, contextual, and distributed. Complex reasoning does not unfold in isolation. It emerges through exchange, reformulation, confrontation of viewpoints, and iteration. Large language models naturally fit into this cognitive regime. Yet the dominant evaluation instruments continue to treat them as solitary entities.

To fill this gap, Riedl and Weidmann rely on ChatBench, an interactive adaptation of the MMLU benchmark. The experimental protocol distinguishes three situations: humans answering alone, models answering alone, and human-AI pairs solving the same types of questions. The domains covered are mathematics, physics, and moral reasoning. The sample includes 667 human participants, confronted with 396 questions of varying difficulty, working either with GPT-4o or with Llama-3.1-8B.

This setup allows direct comparison between individual and collaborative performance. But the study’s essential contribution does not lie in the simple juxtaposition of averages. It resides in the statistical architecture employed.

The authors use an advanced statistical method that cleanly separates three elements: users’ actual ability, question difficulty, and the added value provided by AI. Concretely, they estimate on one side a person’s ability to solve a problem alone, and on the other their ability when working with AI, while accounting for the fact that some questions are harder than others. This enables calculation of a key indicator, the “AI boost”: the performance gain obtained through AI for the same user. In other words, we no longer just ask whether a model performs well; we measure how much it actually improves a human’s work.
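One way to picture this separation is a simple logistic model in which the probability of a correct answer depends on the gap between a person's ability and the question's difficulty. The sketch below is an illustrative assumption, not the authors' implementation: the function names, the logit scale, and the numeric values are all hypothetical.

```python
import math


def p_correct(ability: float, difficulty: float) -> float:
    """Probability of a correct answer under a simple logistic model.

    Assumption: ability and difficulty live on the same (logit) scale.
    """
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


def ai_boost(solo_ability: float, collab_ability: float, difficulty: float) -> float:
    """Gain in success probability for the same user on the same question."""
    return p_correct(collab_ability, difficulty) - p_correct(solo_ability, difficulty)


# Hypothetical values chosen only for illustration, not taken from the study.
solo, collab = 0.0, 1.0
for difficulty in (-1.0, 0.0, 1.0):
    print(difficulty, round(ai_boost(solo, collab, difficulty), 3))
```

In this toy model, the same improvement in ability yields a different probability gain depending on how hard the question is, which is why difficulty has to be modeled explicitly rather than averaged away.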

We thus move from a logic of raw scores to a logic of marginal added value.

The initial results are unambiguous. On average, humans alone achieve about 55.5% correct answers, GPT-4o alone reaches 71%, and Llama-3.1-8B alone plateaus around 39%. But when humans work with these models, performance changes in nature. Even the weakest model, Llama-3.1-8B, enables human–AI pairs to far exceed the performance of humans alone.

Moreover, human–GPT-4o pairs achieve scores higher than GPT-4o working alone. AI becomes better when inserted into human interaction. Answer quality depends not only on neural network weights; it emerges from the coupling between the model’s capabilities and those of the user.

The study goes further by comparing the models’ inherent collaborative capacity. The authors rigorously control for differences in task difficulty, individual aptitude, and users’ collaborative aptitude. They estimate that Llama-3.1-8B provides an average boost of 23 percentage points. GPT-4o, for its part, provides an average boost of 29 points.

The credible intervals do not overlap. GPT-4o therefore possesses superior human amplification capacity. This notion of a model’s collaborative capacity constitutes a major conceptual shift. Until now, models were ranked according to their autonomous performance. From now on, another dimension becomes measurable: the ability to make humans better.

A second fundamental result concerns the very nature of the skills mobilized. The authors explicitly test whether individual performance and collaborative performance rest on a single latent aptitude or on two distinct aptitudes.

The answer is two: working alone and working with AI mobilize different skills. This observation invalidates a widely held implicit assumption, namely that the best experts would automatically be the best AI users. In reality, some people possess strong collaborative potential independently of their solo performance level. They know how to delegate, formulate, evaluate, correct, and orchestrate. These skills constitute a specific form of capital.

The next question then becomes central: who benefits most from AI?

The analysis reveals an interesting triptych. First, the more difficult a task is for a human alone, the more value AI brings. AI acts as a cognitive amplifier in areas of high mental load. Second, the most competent users remain, in absolute terms, the best performers when working with AI. Hierarchies do not reverse. Finally, lower-level users obtain a larger relative boost. AI partially reduces gaps without eliminating them.

These three effects coexist. They explain why some studies conclude that AI primarily favors experts, while others observe an equalizing effect. Both phenomena are real, but they operate on different metrics.
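A small worked example shows how both readings can hold at once. The scores below are hypothetical accuracies in percent, invented for illustration and not drawn from the study; the helper name `gains` is likewise an assumption.

```python
def gains(solo: float, with_ai: float) -> tuple[float, float]:
    """Return (absolute gain in points, relative gain as a fraction of the solo score)."""
    absolute = with_ai - solo
    relative = absolute / solo
    return absolute, relative


# Hypothetical accuracy scores in percent, not figures from the study.
novice_solo, novice_with_ai = 40.0, 70.0
expert_solo, expert_with_ai = 70.0, 90.0

novice = gains(novice_solo, novice_with_ai)
expert = gains(expert_solo, expert_with_ai)

print("novice:", novice)
print("expert:", expert)
print("expert still ahead with AI:", expert_with_ai > novice_with_ai)
print("gap before vs after:", expert_solo - novice_solo, expert_with_ai - novice_with_ai)
```

Here the expert keeps the higher score when working with AI, while the novice gains more both in points and relative to their starting level, so the gap narrows without closing. Which conclusion a study reports depends on which of these quantities it measures.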

One specific cognitive factor largely explains synergy differences: Theory of Mind (ToM), that is, the ability to reason about others’ mental states. In the human–AI context, this amounts to estimating what the model knows, what it doesn’t know, what it is likely to interpret, and how it will react to a given formulation.

The authors measure ToM from dialogues between users and AI, using linguistic analysis tools validated by human annotation. They show that ToM does not significantly predict individual performance. However, it strongly predicts collaborative performance. In other words, understanding AI as an informational agent is crucial for leveraging it.

Moreover, ToM operates on two levels. There is a relatively stable trait: some individuals systematically display behaviors of clarification, contextualization, and progressive adjustment. These individuals obtain better quality AI responses. But there is also a dynamic dimension: for the same user, when ToM expression is higher on a given question, AI response quality increases. Synergy is therefore not only a property of people. It is also an activatable cognitive state.
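The distinction between a stable trait and a question-level state can be illustrated with person-mean centering, a standard way of splitting repeated measures into a between-person and a within-person part. This is a minimal sketch under assumptions: the record layout, the ToM score scale, and the function name are hypothetical, and the study's own decomposition may differ.

```python
from collections import defaultdict
from statistics import mean


def split_trait_state(records):
    """Split each ToM score into a trait part and a state part.

    records: list of (user_id, question_id, tom_score) tuples.
    Returns a list of (user_id, question_id, trait, state), where trait is
    the user's mean score and state is the deviation from that mean.
    """
    by_user = defaultdict(list)
    for user, _question, score in records:
        by_user[user].append(score)
    trait = {user: mean(scores) for user, scores in by_user.items()}
    return [(u, q, trait[u], s - trait[u]) for u, q, s in records]


# Hypothetical scores, invented for illustration only.
records = [
    ("u1", "q1", 2.0),
    ("u1", "q2", 4.0),
    ("u2", "q1", 0.0),
    ("u2", "q2", 1.0),
]
for row in split_trait_state(records):
    print(row)
```

The trait column captures differences between users, while the state column captures how much a given user departs from their own usual level on a particular question. Relating each column separately to response quality is one way to ask whether both levels matter.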

This point has profound implications. It suggests that AI output quality depends on the user’s mental state as well as on the model’s architecture. Performance becomes a relational phenomenon.

The consequences for model design are major. Optimizing only autonomous reasoning capacity appears insufficient. We must now also optimize sensitivity to implicit intentions, management of conversational context, the ability to adapt to imperfect formulations, and robustness in the face of successive corrections. In other words, we must train models to collaborate.

For businesses, this research also shifts priorities.

Choosing a model can no longer be based solely on benchmark rankings. It must integrate human amplification criteria. Employee training must move beyond learning prompting recipes to introduce metacognitive skills: making objectives explicit, structuring questions, evaluating critically, and iterating.

Talent is being redefined. An employee of average technical level, but with strong collaborative capacity, can outperform a solitary expert uncomfortable with interaction. Traditional competency frameworks become incomplete.

More broadly, this study proposes a paradigm reversal. For decades, AI has been evaluated as a potential substitute for humans. The implicit question was: when will the machine do better than us? The framework proposed by Riedl and Weidmann reformulates the question: under what conditions does the human-machine assembly produce intelligence superior to each of its components?

This shift is decisive. It paves the way for AI designed not as an autonomous entity aiming for solitary excellence, but as a cognitive infrastructure serving collective intelligence.

The challenge is no longer just to build brilliant models, but to create models that make humans better.