OpenAI’s o1-preview and o1 reasoning models have matched or surpassed physicians on certain experimental clinical reasoning tasks. The breakthrough lies not in replacing practitioners, but in the emergence of assistants capable of intervening at the heart of medical decision-making.

By Pascale Caron

The Nature of the Medical AI Debate Is Changing

For decades, clinical reasoning has been considered one of the most difficult domains to automate. An algorithm could identify an anomaly in an image, calculate a risk score, or compare a laboratory result to a reference value. But establishing a diagnosis from incomplete information, prioritizing multiple hypotheses, and choosing the next examination still fell within the realm of human expertise.

The study published on April 30, 2026, in Science by Peter G. Brodeur and colleagues shifts this boundary. The researchers evaluated OpenAI’s o1-preview and o1 models on several dimensions of medical reasoning, comparing them to earlier generations of models and to hundreds of healthcare professionals. Their work does not demonstrate that artificial intelligence can practice medicine autonomously. It shows, more precisely, that a reasoning model can match or surpass physicians on certain diagnostic and decision-making tasks under defined experimental conditions.

This nuance is essential. Headlines announcing an AI “better than doctors” oversimplify a much more interesting reality. The publication signals less the disappearance of the practitioner than the arrival of a new category of cognitive tools. For entrepreneurs, MedTech companies, and healthcare institutions, the question therefore becomes strategic: how can laboratory performance be turned into measurable improvements in care?

A More Demanding Methodology Than Classic Medical Benchmarks

Most medical evaluations of large language models have long relied on multiple-choice questionnaires or professional exams. These tests primarily measure knowledge recall. Yet real medicine requires something else: formulating a differential diagnosis, recognizing serious conditions not to be missed, updating a probability after an examination, and proposing a course of action.
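Of the tasks listed above, updating a probability after an examination is the one that reduces to a formula: Bayes' rule applied through likelihood ratios. The sketch below is a minimal illustration, not the protocol used in the study; the function name and the example values for pretest probability, sensitivity, and specificity are assumptions chosen only to show the arithmetic.

```python
def posttest_probability(pretest: float, sensitivity: float,
                         specificity: float, positive: bool = True) -> float:
    """Update a disease probability after a test result (Bayes' rule, odds form)."""
    for name, value in (("pretest", pretest), ("sensitivity", sensitivity),
                        ("specificity", specificity)):
        if not 0.0 < value < 1.0:
            raise ValueError(f"{name} must be strictly between 0 and 1")

    # Likelihood ratio of the observed result (positive or negative test).
    if positive:
        likelihood_ratio = sensitivity / (1.0 - specificity)
    else:
        likelihood_ratio = (1.0 - sensitivity) / specificity

    # Convert probability to odds, apply the ratio, convert back.
    pretest_odds = pretest / (1.0 - pretest)
    posttest_odds = pretest_odds * likelihood_ratio
    return posttest_odds / (1.0 + posttest_odds)


# Illustrative values only (assumed, not taken from the study).
after_positive = posttest_probability(0.10, sensitivity=0.90, specificity=0.80)
after_negative = posttest_probability(0.10, sensitivity=0.90, specificity=0.80,
                                      positive=False)
print(f"after a positive result: {after_positive:.3f}")
print(f"after a negative result: {after_negative:.3f}")
```

With these assumed values, a positive result raises a 10% pretest probability to about 33%, which is the kind of revision that a knowledge-recall test does not measure.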

The authors therefore drew on several complementary protocols. They first used 143 clinicopathological conferences from the New England Journal of Medicine, complex cases that have served for decades as references for testing diagnostic support systems. They then evaluated how reasoning was presented on 20 cases from the NEJM Healer program, followed by the quality of management decisions on clinical vignettes developed with experts. Another experiment focused on probabilistic reasoning. Finally, the team tested o1, GPT-4o, and two experienced physicians on 79 real cases from the Beth Israel Deaconess Medical Center emergency department, at three successive moments in the patient’s journey.

This last experiment is particularly important. Responses were anonymized and evaluated by two physicians who did not know whether they came from a human or a machine. The model did not directly care for patients. It produced a second opinion based on data available in the record. This is therefore a proof of concept, but a proof of concept anchored in unstructured, real clinical data.

Results That Require Looking Beyond the Headline Effect

On the 143 complex cases from the New England Journal of Medicine, o1-preview included the correct diagnosis in its list in 78.3% of cases. On the 70 cases already used to test GPT-4, it proposed an exact or very close diagnosis in 88.6% of cases, compared to 72.9% for GPT-4. When it had to choose the next diagnostic examination, its recommendation was judged exactly correct in 87.5% of situations and useful in an additional 11%.

The gaps are even more pronounced on the presentation of reasoning. In the NEJM Healer cases, o1-preview achieved a perfect score on 78 out of 80 evaluations, compared to 47 out of 80 for GPT-4, 28 out of 80 for senior physicians, and 16 out of 80 for residents. On five particularly complex management cases, its median score reached 86%, compared to 42% for GPT-4, 41% for physicians assisted by GPT-4, and 34% for physicians using conventional resources.
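The NEJM Healer counts are easier to compare as rates. The short sketch below converts the four counts quoted above into percentages and into a gap relative to o1-preview; the variable and function names are illustrative, and only the counts come from the article.

```python
# Perfect scores reported above, each out of 80 evaluations.
TOTAL_EVALUATIONS = 80
perfect_scores = {
    "o1-preview": 78,
    "GPT-4": 47,
    "senior physicians": 28,
    "residents": 16,
}


def as_percent(count: int, total: int) -> float:
    """Share of evaluations with a perfect score, in percent."""
    return 100.0 * count / total


rates = {group: as_percent(n, TOTAL_EVALUATIONS)
         for group, n in perfect_scores.items()}

for group, rate in rates.items():
    gap = rates["o1-preview"] - rate  # in percentage points
    print(f"{group}: {rate:.2f}% (gap to o1-preview: {gap:.2f} points)")
```

In rates, that is 97.5% for o1-preview against 58.75% for GPT-4, 35% for senior physicians, and 20% for residents.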

The emergency room records, however, provide the most directly readable result. At the triage stage, o1 proposed an exact or very close diagnosis in 65.8% of cases, compared to 54.4% and 48.1% for the two physicians. After medical evaluation, the rates reached 69.6% for o1, compared to 60.8% and 50.6%. At the time of admission, o1 reached 79.7%, compared to 75.9% and 68.4%.

These figures are impressive, but they must not be isolated from their context. The study does not conclude that AI is universally superior. On probabilistic reasoning, o1-preview showed no general improvement over GPT-4. Moreover, several comparisons rely on historical controls and a limited number of cases. The most solid conclusion is therefore as follows: reasoning models are progressing strongly on certain dimensions of diagnosis and management, but this progress remains heterogeneous.
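To give a sense of what a limited number of cases means in practice, the sketch below computes 95% Wilson score intervals for the triage-stage emergency room rates. Two assumptions should be stated plainly: the counts (52, 43, and 38 out of 79) are back-calculated from the published percentages, and the Wilson interval is an illustrative choice, not the statistical analysis used by the authors.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a proportion (z = 1.96 gives roughly 95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


N_CASES = 79
# Assumed counts, inferred from 65.8%, 54.4%, and 48.1% of 79 cases.
triage_counts = {"o1": 52, "physician 1": 43, "physician 2": 38}

intervals = {who: wilson_interval(k, N_CASES) for who, k in triage_counts.items()}
for who, (low, high) in intervals.items():
    print(f"{who}: {100 * low:.1f}% to {100 * high:.1f}%")
```

With 79 cases, each interval spans roughly twenty percentage points, and the interval for o1 overlaps that of the first physician. Because all three raters saw the same cases, overlapping intervals are not a formal test of the difference; the sketch only shows how much uncertainty a sample of this size carries.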

What the Study Demonstrates, and What It Does Not Demonstrate

The study demonstrates that a model can produce very high-level textual reasoning when it receives a sufficiently structured clinical record. In the analyzed sample, the gap in favor of the model appears from the first evaluation point, when available information is still limited. This result, however, concerns the production of a differential diagnosis from the record, not the entire triage process. This capability could help broaden hypotheses, recall a rare pathology, or detect an inconsistency in the initial reasoning.

It does not demonstrate that a consumer chatbot constitutes a medical device. It does not prove that a model can replace physical examination, interview a patient properly, understand their preferences, deliver bad news, or arbitrate a conflict between several therapeutic objectives. Nor does it measure impact on mortality, complications, care delays, or costs.

The authors acknowledge other limitations. The five experiments evaluate several important components of clinical reasoning without covering the entire complexity of medical practice. The cases mainly concern internal medicine and emergency care. Performance may vary according to specialty, patient profile, language, country, or care organization. The emergency room experiment evaluates a second opinion at predefined moments, not the entire set of triage, referral, and treatment decisions.

The question of data contamination also remains open. The researchers did not observe a statistically significant difference between cases published before and after the model’s presumed training cutoff date. This control reduces doubt without eliminating it entirely. Scientific caution therefore calls for treating these results as a robust signal of capability, not as definitive clinical validation.

What This Study Reveals for Businesses

Healthcare often serves as a laboratory for major technological transformations. Its quality, safety, and accountability requirements are among the highest. When an innovation begins to produce results in such a demanding environment, it is legitimate to ask what it implies for other sectors.

For businesses, this study reminds us that artificial intelligence is no longer just a productivity tool. It is progressively becoming a reasoning assistant capable of contributing to complex decisions, under human supervision.

This evolution already concerns many professions.

A lawyer analyzes case law before building an argument. An auditor identifies anomalies in financial data. A consultant weighs several hypotheses against one another before recommending a strategy. An engineer compares different technical options before selecting the most relevant one.

In each of these cases, value does not come solely from knowledge. It results from the ability to reason from sometimes incomplete information.

The strategic challenge therefore no longer consists of asking whether AI will replace these professionals. It consists of determining which steps of their reasoning can be assisted, accelerated, or enriched, while maintaining human responsibility for final decisions.

For Entrepreneurs, Value Shifts Toward Integration

The most important lesson for entrepreneurs is not that a model achieved a higher score. It lies in a shift in the value chain. When generalist models become capable of reasoning about complex medical problems, the competitive advantage no longer lies solely in access to the model. It is built through integration with the patient record, workstation ergonomics, data quality, human supervision, and measurement of results.

Opportunities are numerous. An assistant can propose a differential diagnosis during triage, verify that a serious hypothesis has not been forgotten, prepare a summary before the visit, or suggest the most discriminating examinations. On a telemedicine platform, it can help direct the patient to the appropriate level of care. In a hospital, it can serve as a second look when time pressure and cognitive load increase the risk of error.

But a credible product cannot be limited to an interface connected to an API. It must define a precise indication, identify its user, explain the moment when the tool intervenes, and measure the expected benefit. Does it reduce diagnostic delay? Does it avoid unnecessary examinations? Does it improve detection of serious situations? Does it decrease readmissions? Without a clinical or organizational indicator, model performance remains a demonstration without a sustainable business model.

The same logic applies to institutions. A hospital should not start by selecting a model, but by identifying a precise decision to improve, defining acceptable risk, choosing outcome indicators, and determining who retains final authority. Reversing the approach in this way keeps a promising innovation from turning into an experiment with no measurable impact.

Interaction design also becomes decisive. An overly assertive AI can reinforce automation bias. An overly cautious AI can become unusable. The challenge is to present hypotheses, a level of uncertainty, and elements that could invalidate the recommendation. The product must support clinical judgment, not short-circuit it.
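One way to make this requirement concrete is to fix the shape of what the assistant is allowed to show. The sketch below is a hypothetical data structure, not taken from the study or from any existing product: every class, field, and label is an assumption, and the example entries are placeholders without clinical meaning.

```python
from dataclasses import dataclass, field


@dataclass
class Hypothesis:
    """One entry in a differential diagnosis as shown to the clinician (hypothetical)."""
    label: str
    confidence: str  # coarse band such as "low", "moderate", "high" (assumed scale)
    supporting: list[str] = field(default_factory=list)
    would_rule_out: list[str] = field(default_factory=list)  # findings that would invalidate it


def render(hypotheses: list[Hypothesis]) -> str:
    """Format hypotheses so that uncertainty and invalidating findings are always visible."""
    lines = []
    for rank, h in enumerate(hypotheses, start=1):
        lines.append(f"{rank}. {h.label} (confidence: {h.confidence})")
        lines.append("   supports: " + ("; ".join(h.supporting) or "none recorded"))
        lines.append("   would rule out: " + ("; ".join(h.would_rule_out) or "none recorded"))
    lines.append("Final decision rests with the clinician.")
    return "\n".join(lines)


# Placeholder content for illustration only.
example = [
    Hypothesis("Condition A", "moderate", ["finding X"], ["normal result on test Y"]),
    Hypothesis("Condition B", "low", ["finding Z"]),
]
print(render(example))
```

Making the invalidating findings a required part of the display, rather than an optional detail, is one design choice that may help limit automation bias without muting the tool.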

This requirement is confirmed by the randomized trial published in 2024 by Goh and colleagues. Access to GPT-4 did not produce a statistically significant improvement in the diagnostic reasoning of 50 physicians compared to conventional resources, even though the model used alone achieved a higher score in the protocol. The message is decisive for businesses: a high-performing AI does not automatically produce a high-performing human-AI team. The quality of the interface, training, timing of intervention, and the professional’s ability to challenge the recommendation determine much of the real value.

Regulatory Compliance Becomes a Strategic Asset

In Europe, AI software intended for medical use may fall under both the Medical Device Regulation and the AI Act. However, not all medical software using AI is automatically classified as high risk: qualification depends notably on the intended purpose, the system’s role in the product, and the applicable conformity assessment procedure. When a system falls into this category, requirements notably concern risk management, data quality, documentation, user information, and human oversight. Medical device rules continue, in parallel, to frame software qualification, classification, clinical evaluation, and post-market surveillance.

The European timeline is phased. The AI Act entered into force on August 1, 2024. Obligations do not all apply on the same date, and the timeline for high-risk systems related to products covered by Union harmonization legislation can extend until August 2, 2028, according to transitional provisions and implementing instruments. Leaders must therefore follow the texts applicable to their use case rather than reasoning from a single date.

This overlap should not be considered solely as a cost. It can become a barrier to entry that favors companies which develop solid documentation, version traceability, a cybersecurity strategy, and a performance monitoring system early on. In healthcare, regulatory trust is part of the product.

Responsibilities must also be defined. Who is responsible for an erroneous recommendation? The model provider, the solution publisher, the institution, or the physician? At what point does an update modify the system enough to require a new evaluation? These questions cannot be left until after launch. They must shape technical architecture, contracts, and governance from the design stage.

Toward Augmented Medicine, Conditional on Proof

This study does not describe the end of the physician. Rather, it heralds a form of medicine in which practitioners can check their reasoning against a rapid second analysis that is always available and can cover a very broad diagnostic space. The potential benefit is considerable, particularly in underserved areas, saturated services, or rare situations.

Success will depend, however, on how organizations absorb this capability. A high-performing model can produce little value if it increases the number of alerts, slows down work, or blurs responsibilities. Conversely, a less spectacular tool can transform care if it intervenes at the right time, with reliable information and an adapted interface.

The next steps must therefore be clinical and organizational. The authors call for prospective trials, studies on collaboration between humans and models, and robust surveillance frameworks. This is where the difference between a scientific advance and a genuinely useful innovation will be determined.

What Leaders Should Remember

The study published in Science probably does not mark the beginning of medicine without doctors.

More significantly, it marks the entry of artificial intelligence into a new phase of its development: that of reasoning. This evolution extends far beyond the healthcare sector. It raises questions for all professions where value creation relies on analysis, interpretation, and decision-making. For entrepreneurs, the question is therefore no longer whether to use artificial intelligence. The questions become more demanding:

  • Which reasoning steps can be augmented by AI?
  • Which risks must continue to fall exclusively to humans?
  • How can the value created by this collaboration be measured?

Organizations that answer these questions first will likely have a lasting competitive advantage. Not because they will have chosen the best model, but because they will have learned to integrate artificial intelligence where it truly creates value.

Main References

Brodeur, P. G., et al. (2026). Performance of a large language model on the reasoning tasks of a physician. Science, 392(6797), 524-527. DOI: 10.1126/science.adz4433.

Brodeur, P. G., et al. (2024). Superhuman performance of a large language model on the reasoning tasks of a physician. arXiv:2412.10849.

Cabral, S., et al. (2024). Clinical Reasoning of a Generative Artificial Intelligence Model Compared With Physicians. JAMA Internal Medicine, 184(5), 581-583.

Goh, E., et al. (2024). Large language model influence on diagnostic reasoning: a randomized clinical trial. JAMA Network Open, 7(10), e2440969.

European Commission. Artificial Intelligence in healthcare; MDCG 2025-6, FAQ on the articulation between medical device regulations and the AI Act; AI Act, application calendar and guidelines on high-risk systems. Official sources consulted on July 3, 2026.

European Commission (2026). Guidelines for providers and deployers of AI high-risk systems; Standardisation of the AI Act. Consulted on July 3, 2026.