Kidney Week

Abstract: FR-PO0028

Performance of Next-Generation Reasoning Models on Self-Assessment Questions for Nephrology Board Recertification

Session Information

Category: 300 Artificial Intelligence, Digital Health, and Data Science

Authors

  • Masaki, Mamoru, Sei Marianna Ika Daigaku, Kawasaki, Kanagawa Prefecture, Japan
  • Noda, Ryunosuke, Sei Marianna Ika Daigaku, Kawasaki, Kanagawa Prefecture, Japan
  • Kitano, Fumiya, Sei Marianna Ika Daigaku, Kawasaki, Kanagawa Prefecture, Japan
  • Ichikawa, Daisuke, Sei Marianna Ika Daigaku, Kawasaki, Kanagawa Prefecture, Japan
  • Shibagaki, Yugo, Sei Marianna Ika Daigaku, Kawasaki, Kanagawa Prefecture, Japan

Group or Team Name

  • Division of Nephrology and Hypertension, Department of Internal Medicine

Background

Large language models (LLMs) show promise in medicine. Standard LLMs primarily rely on pattern recognition learned from vast datasets, whereas newer reasoning models are designed to strengthen multi-step logical inference. Whether this architectural difference translates into superior performance by reasoning models in nephrology remains underexplored. We therefore compared cutting-edge reasoning models against standard LLMs on nephrology multiple-choice questions.

Methods

We used 209 self-assessment questions for nephrology board recertification from the Japanese Society of Nephrology (2014-2023). Reasoning models (OpenAI's o3, o3-2025-04-16; Google's Gemini 2.5 Pro, gemini-2.5-pro-preview-03-25) and standard models (OpenAI's GPT-4o, gpt-4o-2024-11-20; Google's Gemini 2.0 Flash, gemini-2.0-flash-001) were evaluated for accuracy via their APIs. Accuracy was also analyzed by question characteristics (taxonomy, question type, image inclusion, subspecialty) and compared with chi-squared or Fisher's exact tests (significance at p<0.05).
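
As a rough illustration of this setup, the sketch below shows how such a per-question API evaluation might be scripted for the OpenAI models (the Gemini models would be queried analogously through Google's API). The prompt wording, question schema, and answer parsing here are illustrative assumptions, not the authors' actual pipeline.

```python
# A minimal sketch, assuming an OpenAI-style chat API; prompt format,
# question schema, and answer parsing are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Model identifiers taken from the abstract (OpenAI models only).
MODELS = ["o3-2025-04-16", "gpt-4o-2024-11-20"]

def ask(model: str, stem: str, choices: list[str]) -> str:
    """Pose one multiple-choice question; return the model's answer letter."""
    options = "\n".join(f"{chr(65 + i)}. {c}" for i, c in enumerate(choices))
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user",
                   "content": f"{stem}\n{options}\nAnswer with a single letter."}],
    )
    return resp.choices[0].message.content.strip()[:1].upper()

def accuracy(model: str, questions: list[dict]) -> float:
    """Fraction of questions answered with the keyed letter."""
    hits = sum(ask(model, q["stem"], q["choices"]) == q["answer"]
               for q in questions)
    return hits / len(questions)
```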

Results

Reasoning models o3 (89.5%) and Gemini 2.5 Pro (83.7%) had significantly higher overall accuracy than standard GPT-4o (69.9%) and Gemini 2.0 Flash (62.7%) (all p<0.001). No significant difference was found between the two reasoning models (p=0.114). Reasoning models met the passing threshold (≥60%) in all 10 years; standard models did so in 7 of 10 years. Reasoning models were superior on recall, problem-solving, general, clinical, and non-image questions. Performance varied by subspecialty; reasoning models generally outperformed standard models, but no significant differences were observed for interpretation or image questions.
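
For intuition, the reported head-to-head difference can be checked with a standard 2x2 chi-squared test. The sketch below back-calculates correct-answer counts from the reported accuracies on 209 questions; this is an approximation, since the exact counts are not given in the abstract.

```python
# Back-of-the-envelope check of the o3 vs. GPT-4o comparison; counts
# are approximated from the reported accuracies (89.5% and 69.9%).
from scipy.stats import chi2_contingency

n = 209
o3_correct = round(0.895 * n)      # ~187 correct
gpt4o_correct = round(0.699 * n)   # ~146 correct
table = [[o3_correct, n - o3_correct],
         [gpt4o_correct, n - gpt4o_correct]]
chi2, p, dof, _ = chi2_contingency(table)
print(f"chi2 = {chi2:.1f}, p = {p:.1e}")  # p well below 0.001, as reported
```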

Conclusion

Reasoning models significantly outperformed standard LLMs on nephrology multiple-choice questions, suggesting strong potential as educational and research support tools in nephrology. Responsible implementation will require further validation and expert oversight.
