Large Language Models don't reason better than physicians

❝

Dr. Shantanu Nundy, MD, MBA, is a primary care physician and an FDA AI advisor.

Dr. Shantanu Nundy is back to unpack whether large language models truly reason better than physicians.

Every two weeks, Nundy dissects a study that’s making the rounds among his peers, breaking down the methodology and the paper’s impact.

His second review looks at a study of the performance of a large language model on the reasoning tasks of a physician, published in Science on April 30, 2026. The full article is behind a paywall and available only to subscribers.

If you have a study you’d like Dr. Nundy to tackle, please do reach out to us and we’ll ensure he gets your note!

🔥 HOT TAKE

❝

A landmark study in Science has been touted as demonstrating that LLMs now reason better than physicians—but what it actually tested was differential diagnosis, not clinical reasoning.

❝

Methodology: 9/10
Importance: 6/10
Likely impact: LLMs have come a long way, but we can’t yet declare that they are better than doctors at clinical reasoning.
Ideal follow-up study: A turn-by-turn evaluation of how LLMs approach common diagnostic cases, assessing what questions they ask and how they synthesize extraneous information, including cases where the patient has no discernible diagnosis.

📋 CLIFF NOTES

Brodeur, Buckley, Kanjee, Goh et al. published what is arguably the most ambitious evaluation of LLM clinical performance to date in Science on April 30, 2026. The study team, spanning Harvard, Stanford, Beth Israel, and Microsoft Research, conducted five structured experiments comparing OpenAI’s o1 model against hundreds of physicians, then added a real-world arm with 79 randomly selected patients from the Beth Israel Deaconess emergency department in Boston.

The five experiments tested differential diagnosis on NEJM clinicopathological conference cases, NEJM Healer cases, triage differential, probabilistic reasoning, and management reasoning. In the real-world ED arm, o1 and two attending physicians gave blinded second opinions at three touchpoints: triage, initial evaluation, and admission. Evaluating physicians could not tell human from AI in over 83% of cases. In all experiments, o1 outperformed physicians.

Table: Selected results from Brodeur et al., Science (2026). o1 outperformed physician baselines across all experiments and in real-world emergency department cases.

🌐 BIGGER PICTURE

Let’s start with what this study got right, because there is a lot. The authors tested across multiple case types, included hundreds of physician comparators, conducted real blinding with a formal check, and brought the evaluation into an actual emergency department with real patients. This is methodologically closer to how clinical AI should be evaluated than most of what has come before.

But the study’s central claim that it evaluated “reasoning tasks of a physician” deserves scrutiny. Clinical reasoning, as Gruppen defines it in the Western Journal of Emergency Medicine, is fundamentally about the acquisition of information: the iterative process by which a physician decides what to ask, what to examine, what to order, and what not to. It is what every patient experiences when a visit starts with the doctor asking a series of questions. In this study, the LLM never does any of that. Every case arrived after an expert human had already synthesized the history, chosen the exam findings, and decided which labs to include. The AI entered the loop after most of the cognitive work was done. It was generating differential diagnoses on a synthesized case, not using clinical reasoning to understand a real-world patient.

That LLMs now outperform physicians in differential diagnosis is a genuine technological achievement. Just a few years ago, it would have seemed implausible. But outperforming physicians on a pre-synthesized case is not the same as outperforming them on the full diagnostic process—and this paper never claimed to test the latter. The headlines did.

Reserve Your Spot for Upcoming Webinars!

Webinar Topic	Panelists	Timing	Registration
Privacy AI and the future of HIPAA with the former founding director of ONC	Jodi Daniel, Christina Farr	June 3rd, 2026 at 12:00 PM (ET)	*Register Here*
Freeing Data From the EHR	Lisa Bari, Ryan Howells, Ruth Reader	June 17th, 2026 at 12:00 PM (ET)	*Register Here*
Not everyone can access the Top 1% of physicians. Will AI change that?	Daniel Stein, Christina Farr, Fred Thiele	June 23rd, 2026 at 12:00 PM (ET)	*Register Here*
