Kidney Week

Abstract: FR-PO047

Performance of Large Language Models in Self-Assessment Questions for Nephrology Board Renewal: Comparative Study of ChatGPT (GPT-3.5, GPT-4) and Bard

Session Information

Category: Augmented Intelligence, Digital Health, and Data Science

  • 300 Augmented Intelligence, Digital Health, and Data Science

Authors

  • Noda, Ryunosuke, St. Marianna University School of Medicine, Kawasaki, Kanagawa, Japan
  • Izaki, Yuto, St. Marianna University School of Medicine, Kawasaki, Kanagawa, Japan
  • Kitano, Fumiya, St. Marianna University School of Medicine, Kawasaki, Kanagawa, Japan
  • Komatsu, Jun, St. Marianna University School of Medicine, Kawasaki, Kanagawa, Japan
  • Ichikawa, Daisuke, St. Marianna University School of Medicine, Kawasaki, Kanagawa, Japan
  • Shibagaki, Yugo, St. Marianna University School of Medicine, Kawasaki, Kanagawa, Japan
Background

The GPT series of Large Language Models (LLMs), pre-trained on vast data, has strongly influenced recent advances in Natural Language Processing. While GPT-4 has demonstrated high accuracy on US law and medical exams, its performance in specialized areas such as nephrology is unclear. This study aimed to compare the performance of ChatGPT (GPT-3.5, GPT-4) and Bard and to assess their potential clinical applications in nephrology.

Methods

In this study, 99 questions from the "Self-Assessment Questions for Nephrology Board Renewal" from the years 2018-2022 were presented to two versions of ChatGPT Plus (GPT-3.5 and GPT-4) and to Bard. The prompts were presented in Japanese, beginning with "I will now present a problem related to kidneys. Please answer in the form of 'answer', 'explanation'." For questions that included images, only the text of the question was used as input. We calculated the overall correct answer rate across the five years and for each year, and checked whether each rate exceeded the pass criterion of ≥ 60% correct. We also compared correct answer rates by question category and by image presence. Statistical analysis was performed using Chi-square tests and Fisher's exact tests.

Results

The overall correct answer rates for GPT-3.5, GPT-4, and Bard were 31.3% (31/99), 54.5% (54/99), and 32.3% (32/99), respectively; GPT-4 thus showed significantly higher accuracy than GPT-3.5 (p < 0.01) and Bard (p < 0.01). While GPT-3.5 and Bard did not meet the pass criterion in any year, GPT-4 met it in three of the five years. GPT-4 also showed significantly higher accuracy on clinical questions and non-image questions compared with GPT-3.5 (p = 0.01, p < 0.01) and Bard (p = 0.02, p < 0.01). No significant differences were observed between GPT-3.5 and Bard in any analysis.
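The overall comparison above can be reproduced from the reported correct/incorrect counts. Below is a minimal, stdlib-only sketch of a Pearson chi-square test (without continuity correction) on the 2x2 table for GPT-4 vs. GPT-3.5; the abstract does not state which software or test variant the authors used, so this is an illustrative reconstruction, not their exact analysis.

```python
from math import erfc, sqrt

def chi2_2x2(a, b, c, d):
    """Pearson chi-square test (no continuity correction) for the
    2x2 table [[a, b], [c, d]]. Returns (statistic, p-value); with
    df = 1, the survival function of chi2 is erfc(sqrt(x / 2))."""
    n = a + b + c + d
    stat = n * (a * d - b * c) ** 2 / (
        (a + b) * (c + d) * (a + c) * (b + d)
    )
    p = erfc(sqrt(stat / 2))
    return stat, p

# Correct / incorrect counts from the abstract (99 questions each)
gpt4 = (54, 45)    # 54.5% correct
gpt35 = (31, 68)   # 31.3% correct

stat, p = chi2_2x2(gpt4[0], gpt4[1], gpt35[0], gpt35[1])
print(f"chi2 = {stat:.2f}, p = {p:.4f}")
```

Running this on the reported counts yields p < 0.01, consistent with the significance level stated above; a Fisher's exact test on the same table gives a p-value of similar magnitude.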

Conclusion

GPT-4 significantly outperformed GPT-3.5 and Bard in overall accuracy and on clinical and non-image questions. It met the Japanese Nephrology Board renewal standard in three of five years, and future improvements can be expected from image input support and nephrology-specific fine-tuning. These findings highlight both the potential and the current limitations of LLMs in nephrology. As LLMs advance, nephrologists should understand their performance and reliability before applying them in practice.