A few weeks ago, I woke up with a sore, pink eye. As a mother of young kids, my immediate thought was that it must be pink eye. My husband reassured me, and I turned to ChatGPT for further guidance. I got a list of the most likely causes, including conjunctivitis, dry eye, or a minor scratch, but also a nudge to see a doctor in case of "severe pain," vision changes, light sensitivity, or other more severe symptoms.
As the pain started to increase, I texted a friend of my husband's who happens to be an ophthalmology resident, and he urged me to get seen. Within a few hours, I was diagnosed with scleritis, a painful inflammatory condition of the eye's outer layer, and prescribed a combined antibiotic and steroid eye drop. Scleritis, if left untreated, can lead to vision loss and structural damage to the eye.
Every day, patients are making these triage decisions largely on their own, often turning to LLMs like ChatGPT or to friends and family. Most of us aren't lucky enough to have text-based access to a specialist. So LLMs are filling that gap for millions of people, even as OpenAI has made clear they are not diagnostic tools.
A new study out in Nature Medicine from researchers at institutions including Mount Sinai and the University of Miami's School of Medicine provides a glimpse into how ChatGPT steers patients to different clinical settings, based on a set of test scenarios spanning a range of acuity levels. It is important to acknowledge that this study provides a window into how LLMs are performing today, and that the technology is constantly evolving.
The Study at a Glance
The study is the first independent safety evaluation of the large language model ChatGPT Health since its launch in January 2026. It was fast-tracked, given its societal importance, according to Mount Sinai’s news release, and included reviews from well-known medical experts like Dr. Eric Topol.
More than 40 million people per week are using OpenAI's ChatGPT Health for health information and advice. That scale motivated the researchers to ask whether ChatGPT Health is safe for patients to use when they're facing a true medical emergency.
Per Isaac S. Kohane, Chair of the Department of Biomedical Informatics at Harvard Medical School, these kinds of studies are critical as “the stakes are extraordinarily high.” Kohane was not involved with the research.
The study is titled “ChatGPT Health performance in a structured test of triage recommendations,” and it is paywalled.
The Methodology
The team created 60 structured clinical scenarios spanning 21 specialties. Three independent physicians determined the correct level of urgency for each case based on evidence-based guidelines.
The takeaways:
With respect to suicide-risk alerts, ChatGPT was designed to direct people to the 988 Suicide & Crisis Lifeline in high-risk circumstances. But these alerts appeared "inconsistently," including failing to appear when patients described plans for self-harm. Per the authors, the model's behavior in mental health contexts requires particular attention.
ChatGPT performed well in textbook emergencies, such as stroke or anaphylaxis, a severe allergic reaction.
For edge cases, 96% of the responses fell within the acceptable range, defined as at or above the lowest triage level considered clinically safe. But when both "urgent" and "emergency" were acceptable answers, the model chose the less urgent option 60.8% of the time.
The most "concerning" finding, per the authors, was the 51.6% under-triage rate for true emergencies, since missed emergencies can result in patient harm. There was also a 64.8% over-triage rate for non-urgent cases, which the authors found less dangerous, though they noted it would increase medical utilization at scale.
To be fair…
The study's authors acknowledge the limitations of their approach. A major one is the lack of a control group. The three doctors who determined the right course of action may have had far more time to think through the appropriate next step in a controlled environment. In reality, many patients lack access to that level of information and rely on tools like Google and WebMD, or on their friends and family.
Moreover, even the best human doctors and nurses make mistakes when it comes to triage. Making matters more complex, some patients tend to under-report their symptoms, while others over-report. There are also common misunderstandings related to language barriers, and symptom reporting may vary by race, gender, and other demographics.
That could provide an advantage for AI over time, assuming it’s used appropriately. As LLMs become more attuned to the individual, they may have a distinct edge when it comes to discerning whether a patient is likely to accurately self-report.
Bottom line: The physicians in my network acknowledge that triage-related mistakes are common with or without LLMs. Every day, patients who don't need emergency care show up at the emergency room, while others who desperately need it stay home.
The next step?
The researchers don’t think we should stop using ChatGPT Health altogether (this simply is not viable in today’s world). But the group suggests the findings may warrant “premarket safety evaluation requirements analogous to medical devices.”
The authors also acknowledge that ChatGPT explicitly says it’s for health and wellness guidance, and not for diagnosis and treatment. But they correctly point out that this is unlikely to stop millions of people who already are using it as a tool for triage.
Two physician takes
“This study is highlighting a distinction that becomes more important every day - medical AI requires medical-grade engineering, evaluation, and continuous improvement,” said Dr. Byron Crowe, chief medical officer at Doctronic, a company that is rolling out an AI physician trained by top medical experts. “The wrong conclusion from this study is that frontier LLMs can’t accurately triage emergencies - they can do that quite well, including in other scenarios tested in this study. But complex systems fail in complex ways. A focus on systems thinking, understanding how AI interacts with the real world, and how to mitigate these types of failure modes, is what will separate high-performing medical AI from other tools over the long term.”
“As we’ve been saying online for years, Context is King, and it still is. Even here, when the authors changed phrases or added or omitted additional context, the model changed its recommendation. A human might ask additional questions before making a recommendation, because we know we need to gather additional context,” noted Dr. Graham Walker, an emergency medicine physician and a founder of Offcall and UpToDate.
“The other thing that stood out is the model’s under-triage of Black vs. White patients in diabetic ketoacidosis. While the study wasn’t powered to confirm a statistical difference, in Table 1 of the paper, you see that it under-triaged Black patients 4x more often. We’ve seen similar problems in numerous other papers that include race, sex/gender, class, and other demographic social variables. The models get triggered by certain words and can give markedly biased, discriminatory answers,” noted Walker, who has access to the paper and has been in touch with the study’s authors.
An idea from me
One path OpenAI could consider is more partnerships with the existing digital health providers that are already taking on triage with humans in the loop. There are already services that reach patients at scale, in all 50 states, across pediatrics, women's health, asthma, diabetes, and more. That's an idea first proposed by physician and Scrub Capital GP Rebecca Mitchell on LinkedIn.
With more humans in the loop, there isn't as much of a requirement, at least in the United States, to treat this technology like a highly regulated medical device.
*Dr. Byron Crowe is a member of Second Opinion’s Editorial Advisory Council.
Want to support Second Opinion?
🌟 Leave a review for the Second Opinion Podcast
📧 Share this email with other friends in the healthcare space!
💵 Become a paid subscriber!
📢 Become a sponsor

