Kidney Week

Abstract: FR-PO0025

Performance of Large Language Models in Analyzing Common Hypertension Scenarios in Clinical Practice

Session Information

Category: Artificial Intelligence, Digital Health, and Data Science

  • 300 Artificial Intelligence, Digital Health, and Data Science

Authors

  • Miao, Jing, Mayo Clinic Minnesota, Rochester, Minnesota, United States
  • Zand, Jaleh, Mayo Clinic Minnesota, Rochester, Minnesota, United States
  • Hommos, Musab S., Mayo Clinic Arizona, Scottsdale, Arizona, United States
  • Schwartz, Gary L., Mayo Clinic Minnesota, Rochester, Minnesota, United States
  • Taler, Sandra J., Mayo Clinic Minnesota, Rochester, Minnesota, United States
  • Nejat, Peyman, Mayo Clinic Minnesota, Rochester, Minnesota, United States
  • Cheungpasitporn, Wisit, Mayo Clinic Minnesota, Rochester, Minnesota, United States
  • Zoghby, Ziad, Mayo Clinic Minnesota, Rochester, Minnesota, United States
Background

Hypertension is the most prevalent chronic disease in primary care and a leading cause of cardiovascular morbidity and mortality. Despite existing guidelines, therapeutic inertia and suboptimal control persist. Large language models (LLMs) offer a potentially valuable means of augmenting clinical decision-making, yet their reliability for guideline-driven tasks remains unverified. This study evaluated the accuracy and safety of hypertension management recommendations generated by three LLMs compared with expert responses.

Methods

Fifty-one clinical vignettes representing 17 core hypertension management concepts were constructed by hypertension experts. Each case was submitted to three LLMs (GPT-4, Gemini, MedLM), and a hypertension expert also wrote the "gold standard" answers. Three blinded expert reviewers rated each response for accuracy, rated safety on a binary (safe/unsafe) scale, and attempted to identify the source (LLM vs. expert) of each response. Ratings were analyzed using mean scores, percentages of accurate and safe responses, and inter-rater agreement.
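The inter-rater agreement reported in the Results is an intraclass correlation coefficient (ICC). The abstract does not specify which ICC form was used; the sketch below is a minimal illustration assuming a two-way random-effects, absolute-agreement, single-rater model (ICC(2,1)) applied to a subjects-by-raters matrix of accuracy ratings. All variable names and the example data are illustrative, not from the study.

```python
import numpy as np

def icc2_1(ratings):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    ratings: (n_subjects, n_raters) array of scores, e.g. accuracy
    ratings of each response by each blinded reviewer.
    """
    x = np.asarray(ratings, dtype=float)
    n, k = x.shape
    grand = x.mean()
    row_means = x.mean(axis=1)            # per-subject (response) means
    col_means = x.mean(axis=0)            # per-rater means
    # Mean squares from the two-way ANOVA decomposition
    msr = k * np.sum((row_means - grand) ** 2) / (n - 1)  # between subjects
    msc = n * np.sum((col_means - grand) ** 2) / (k - 1)  # between raters
    sse = np.sum((x - grand) ** 2) - (n - 1) * msr - (k - 1) * msc
    mse = sse / ((n - 1) * (k - 1))                       # residual
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

# Hypothetical ratings: 4 responses scored by 3 reviewers
ratings = [[4, 3, 4],
           [2, 2, 1],
           [5, 5, 4],
           [3, 2, 3]]
print(round(icc2_1(ratings), 2))
```

Higher values indicate that reviewers ranked and scored responses consistently; the pattern in the Results (highest ICC for expert responses, lowest for MedLM) would correspond to LLM outputs being harder to rate consistently.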

Results

GPT-4 had the highest accuracy (83%) and safety (86%) scores among the LLMs but remained inferior to expert responses (92% accuracy, 93% safety). Gemini and MedLM performed significantly worse (accuracy: 64% and 35%; safety: 73% and 39%, respectively). GPT-4 generated the most guideline-concordant responses (46%) of the three LLMs (Gemini 35%, MedLM 14%), but its rate remained lower than that of expert responses (68%). Evaluators misidentified LLM responses as expert-written in 10–25% of cases, particularly with GPT-4. Inter-rater reliability for accuracy ratings was highest for expert-generated responses (ICC 0.81), with progressively lower agreement for GPT-4 (0.76), Gemini (0.70), and MedLM (0.68). A similar pattern was observed for safety and source-discrimination ratings. Agreement was strongest for safety assessments and weakest for source discrimination.

Conclusion

Among the three tested LLMs, GPT-4 demonstrated the closest agreement with expert decisions, showing the greatest potential for supporting hypertension management. However, current LLM versions frequently produce inaccurate or unsafe recommendations and remain inferior to expert judgment. Human-in-the-loop supervision remains essential when deploying LLMs for clinical decision-making.