Kidney Week

Abstract: FR-PO0028

Performance of Next-Generation Reasoning Models on Self-Assessment Questions for Nephrology Board Recertification

Session Information

Category: 300 Artificial Intelligence, Digital Health, and Data Science

Authors

  • Masaki, Mamoru, Sei Marianna Ika Daigaku, Kawasaki, Kanagawa Prefecture, Japan
  • Noda, Ryunosuke, Sei Marianna Ika Daigaku, Kawasaki, Kanagawa Prefecture, Japan
  • Kitano, Fumiya, Sei Marianna Ika Daigaku, Kawasaki, Kanagawa Prefecture, Japan
  • Ichikawa, Daisuke, Sei Marianna Ika Daigaku, Kawasaki, Kanagawa Prefecture, Japan
  • Shibagaki, Yugo, Sei Marianna Ika Daigaku, Kawasaki, Kanagawa Prefecture, Japan

Group or Team Name

  • Division of Nephrology and Hypertension, Department of Internal Medicine

Background

Large language models (LLMs) show promise in medicine. Standard LLMs primarily rely on pattern recognition learned from vast datasets, whereas newer reasoning models are designed to strengthen multi-step logical inference. Whether this architectural difference translates into superior performance by reasoning models in nephrology remains underexplored. We therefore compared cutting-edge reasoning models against standard LLMs on nephrology multiple-choice questions.

Methods

We used 209 self-assessment questions for nephrology board recertification from the Japanese Society of Nephrology (2014-2023). Reasoning models (OpenAI's o3, o3-2025-04-16; Google's Gemini 2.5 Pro, gemini-2.5-pro-preview-03-25) and standard models (OpenAI's GPT-4o, gpt-4o-2024-11-20; Google's Gemini 2.0 Flash, gemini-2.0-flash-001) were evaluated for accuracy via their APIs. Accuracy was also analyzed by question characteristics (taxonomy, question type, image inclusion, subspecialty) and compared with chi-squared or Fisher's exact tests (significance at p<0.05).
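
As a rough illustration of this setup, the sketch below shows how such a per-question API evaluation might be scripted for the OpenAI models (the Gemini models would be queried analogously through Google's API). The prompt wording, question schema, and answer parsing here are illustrative assumptions, not the authors' actual pipeline.

```python
# A minimal sketch, assuming an OpenAI-style chat API; prompt format,
# question schema, and answer parsing are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Model identifiers taken from the abstract (OpenAI models only).
MODELS = ["o3-2025-04-16", "gpt-4o-2024-11-20"]

def ask(model: str, stem: str, choices: list[str]) -> str:
    """Pose one multiple-choice question; return the model's answer letter."""
    options = "\n".join(f"{chr(65 + i)}. {c}" for i, c in enumerate(choices))
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user",
                   "content": f"{stem}\n{options}\nAnswer with a single letter."}],
    )
    return resp.choices[0].message.content.strip()[:1].upper()

def accuracy(model: str, questions: list[dict]) -> float:
    """Fraction of questions answered with the keyed letter."""
    hits = sum(ask(model, q["stem"], q["choices"]) == q["answer"]
               for q in questions)
    return hits / len(questions)
```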

Results

Reasoning models o3 (89.5%) and Gemini 2.5 Pro (83.7%) had significantly higher overall accuracy than standard GPT-4o (69.9%) and Gemini 2.0 Flash (62.7%) (all p<0.001). No significant difference was found between the two reasoning models (p=0.114). Reasoning models met the passing threshold (≥60%) in all 10 years; standard models did so in 7 of 10 years. Reasoning models were superior on recall, problem-solving, general, clinical, and non-image questions. Performance varied by subspecialty; reasoning models generally outperformed standard models, but no significant differences were observed for interpretation or image questions.
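
For intuition, the reported head-to-head difference can be checked with a standard 2x2 chi-squared test. The sketch below back-calculates correct-answer counts from the reported accuracies on 209 questions; this is an approximation, since the exact counts are not given in the abstract.

```python
# Back-of-the-envelope check of the o3 vs. GPT-4o comparison; counts
# are approximated from the reported accuracies (89.5% and 69.9%).
from scipy.stats import chi2_contingency

n = 209
o3_correct = round(0.895 * n)      # ~187 correct
gpt4o_correct = round(0.699 * n)   # ~146 correct
table = [[o3_correct, n - o3_correct],
         [gpt4o_correct, n - gpt4o_correct]]
chi2, p, dof, _ = chi2_contingency(table)
print(f"chi2 = {chi2:.1f}, p = {p:.1e}")  # p well below 0.001, as reported
```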

Conclusion

Reasoning models significantly outperformed standard LLMs on nephrology multiple-choice questions, suggesting strong potential as educational and research support tools in nephrology. Responsible implementation will require further validation and expert oversight.
