Abstract: FR-PO0028
Performance of Next-Generation Reasoning Models on Self-Assessment Questions for Nephrology Board Recertification
Session Information
- Artificial Intelligence and Digital Health at the Bedside
November 07, 2025 | Location: Exhibit Hall, Convention Center
Abstract Time: 10:00 AM - 12:00 PM
Category: Artificial Intelligence, Digital Health, and Data Science
- 300 Artificial Intelligence, Digital Health, and Data Science
Authors
- Masaki, Mamoru, Sei Marianna Ika Daigaku, Kawasaki, Kanagawa Prefecture, Japan
- Noda, Ryunosuke, Sei Marianna Ika Daigaku, Kawasaki, Kanagawa Prefecture, Japan
- Kitano, Fumiya, Sei Marianna Ika Daigaku, Kawasaki, Kanagawa Prefecture, Japan
- Ichikawa, Daisuke, Sei Marianna Ika Daigaku, Kawasaki, Kanagawa Prefecture, Japan
- Shibagaki, Yugo, Sei Marianna Ika Daigaku, Kawasaki, Kanagawa Prefecture, Japan
Group or Team Name
- Division of Nephrology and Hypertension, Department of Internal Medicine.
Background
Large language models (LLMs) show promise in medicine. Standard LLMs primarily leverage pattern recognition over vast training datasets, whereas newer reasoning models are designed to perform explicit multi-step logical inference. However, whether this fundamental difference translates into superior performance by reasoning models in nephrology remains underexplored. We compared cutting-edge reasoning models against standard LLMs on nephrology multiple-choice questions.
Methods
We used 209 self-assessment questions for nephrology board recertification from the Japanese Society of Nephrology (2014-2023). Reasoning models (OpenAI's o3, o3-2025-04-16; Google's Gemini 2.5 Pro, gemini-2.5-pro-preview-03-25) and standard models (OpenAI's GPT-4o, gpt-4o-2024-11-20; Google's Gemini 2.0 Flash, gemini-2.0-flash-001) were evaluated for accuracy via their respective APIs. Accuracy was also analyzed by question characteristics (taxonomy, question type, image inclusion, subspecialty) and compared between models using chi-squared or Fisher's exact tests (significance at p<0.05).
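The sketch below illustrates the kind of model-vs-model accuracy comparison described in the Methods, assuming SciPy is available. The helper function `compare_models` and the fallback rule for small expected counts are illustrative assumptions, not the authors' published analysis code; the correct-answer counts in the example are reconstructed approximately from the reported percentages over 209 questions.

```python
# Illustrative sketch (not the authors' code) of comparing two models'
# accuracy on the same 209-question set with a chi-squared test on a
# 2x2 contingency table, falling back to Fisher's exact test when any
# expected cell count is small.
from scipy.stats import chi2_contingency, fisher_exact

N_QUESTIONS = 209  # total questions in the 2014-2023 self-assessment set


def compare_models(correct_a: int, correct_b: int, n: int = N_QUESTIONS) -> float:
    """Return the p-value for the difference in accuracy between two models."""
    table = [
        [correct_a, n - correct_a],  # model A: correct, incorrect
        [correct_b, n - correct_b],  # model B: correct, incorrect
    ]
    _, p, _, expected = chi2_contingency(table)
    if (expected < 5).any():  # small expected counts -> use the exact test
        _, p = fisher_exact(table)
    return p


# Example with counts reconstructed from the reported accuracies:
# o3 ~89.5% -> ~187/209 correct; GPT-4o ~69.9% -> ~146/209 correct.
p_value = compare_models(187, 146)
print(f"p = {p_value:.4g}")  # expected to be well below 0.001, as reported
```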
Results
Reasoning models o3 (89.5%) and Gemini 2.5 Pro (83.7%) had significantly higher overall accuracy than standard GPT-4o (69.9%) and Gemini 2.0 Flash (62.7%) (all p<0.001). No significant difference was found between the two reasoning models (p=0.114). Reasoning models met the passing threshold (≥60%) in all 10 years; standard models did so in 7 of 10 years. Reasoning models showed superiority for recall, problem-solving, general, clinical, and non-image questions. Performance varied across subspecialties; reasoning models generally outperformed standard ones, but no significant differences were noted for interpretation or image questions.
Conclusion
Reasoning models significantly outperformed standard LLMs on nephrology multiple-choice questions, demonstrating high potential as educational and research support tools in nephrology. Responsible implementation requires further validation and expert oversight.