Shantanu Nundy, MD, MBA, is a Primary Care Physician and an FDA AI Advisor
Readers, we’re always looking to evolve how we cover the intersection of health and technology. So we couldn’t say no when AI expert Dr. Shantanu Nundy offered to review the latest research from around the world and unpack what matters most. There’s an immense amount of noise in the market today and not enough signal.
So every two weeks, he will unpack a study that’s making the rounds amongst his peer set and dig into the methodology and likely impact.
If you have a study you’d like Dr. Nundy to tackle, please do reach out to us and we’ll ensure he gets your note!
His first review looks at a study titled “Automation Bias in LLM-Assisted Diagnostic Reasoning Among AI-Trained Physicians,” published on April 23, 2026. The full article is available only to subscribers behind a paywall.
🔥 HOT TAKE
A new RCT in NEJM AI shows that even doctors trained to use AI are biased by bad AI recommendations. This suggests the problem isn't with physicians; it's that we have no independent way to know when AI is wrong in the first place.
Methodology: 8/10
Importance: 10/10
Likely impact of the study: Should make health systems more wary about doctor-facing AI without rigorous independent benchmarking
Ideal follow-up study: Replicating study in US context with specialty-matched cases and doctors
📋 CLIFF NOTES
Qazi et al. conducted a single-blind randomized clinical trial enrolling 44 physicians across multiple institutions in Pakistan, all of whom had completed a rigorous 20-hour AI literacy training covering LLM capabilities, prompt engineering, and critical evaluation of AI outputs. Physicians were randomized 1:1 to diagnose six clinical vignettes with either accurate ChatGPT-4o suggestions (control) or suggestions containing deliberately introduced errors in 3 of 6 cases (treatment).
Importantly, LLM consultation was entirely voluntary. Physicians could choose to consult, modify, or ignore AI output at any point. Despite their training, physicians exposed to erroneous AI showed a 14 percentage-point drop in diagnostic accuracy.
To me, more interesting than the headline result are two subgroup analyses: 1) more experienced physicians who at baseline had higher diagnostic accuracy had a greater degradation in their diagnostic performance than less experienced physicians (statistically significant); 2) physicians who reported more frequent LLM use prior to the study tended towards greater degradation in diagnostic performance (not statistically significant). These results, if confirmed in other studies, would suggest that as physicians use these tools more, we may paradoxically see worse outcomes, not better.

Table: Key outcome from Qazi et al., NEJM AI (2026). Physicians with identical AI literacy training performed significantly worse when exposed to erroneous LLM recommendations, even when consultation was optional.
Reserve Your Spot for Upcoming Webinars!
| Webinar Topic | Panelists | Timing | Registration |
|---|---|---|---|
| What will AI do for employer healthcare and benefits? | Nick Reber | May 19th, 2026 | |
| Privacy, AI, and the future of HIPAA with the former founding director of ONC | Jodi Daniel, Christina Farr | June 3rd, 2026 | |
| Freeing Data From the EHR | Lisa Bari | June 17th, 2026 | |
| Not everyone can access the Top 1% of physicians. Will AI change that? | Daniel Stein | June 23rd, 2026 | |
🌐 BIGGER PICTURE
This finding doesn't exist in isolation. A growing body of evidence — from Goh et al. in JAMA Network Open to the "From Tool to Teammate" RCT published last month in npj Digital Medicine — consistently shows that how AI is deployed matters as much as its performance in isolation. But this paper adds a darker dimension: you can train physicians on how to use AI, give them autonomy on when and how to use it, and it still may not be enough.
So what does that mean for healthcare leaders and investors? The U.S. clinical AI market is projected to exceed $45 billion by 2030, yet much of the investment and attention is still going to building clinical AI systems and demonstrating their performance on curated benchmarks. What the field hasn't built is the infrastructure to evaluate those systems in conditions that reflect clinical reality: adversarial inputs, diverse patient populations, and real physician workflows. That’s the gap the field needs to close.

💬 MY LONGER TAKE
I've spent the better part of two decades at the intersection of human and machine intelligence in medicine. The finding that jumps out to me isn't the 14-point accuracy drop. It's that it happened in physicians who chose to consult the AI. These were engaged, trained clinicians exercising judgment about when AI was appropriate to use, and even then, the AI misled them. That tells me the problem is upstream of the physician entirely.

