Abstract: FR-PO0030
Human-Rated Dialogue Evaluation of Large Language Model (LLM) Agents vs. Human Responses in a Nephrology Objective Structured Clinical Examination Scenario
Session Information
- Artificial Intelligence and Digital Health at the Bedside
November 07, 2025 | Location: Exhibit Hall, Convention Center
Abstract Time: 10:00 AM - 12:00 PM
Category: Artificial Intelligence, Digital Health, and Data Science
- 300 Artificial Intelligence, Digital Health, and Data Science
Authors
- Yi, Yongjin, Dankook University Hospital, Cheonan-si, Chungcheongnam-do, Korea (the Republic of)
- Kim, Sejoong, Seoul National University Bundang Hospital, Seongnam-si, Gyeonggi-do, Korea (the Republic of)
Background
Objective Structured Clinical Examinations (OSCEs) are essential for medical education, but standardized patients (SPs) are costly and require intensive training. To address this, we developed a large language model (LLM)-based virtual patient and evaluated its performance in an OSCE scenario involving a patient with red-colored urine.
Methods
A virtual patient was created using prompt-based clinical scenarios simulating hematuria. We used 23 standardized history-taking questions and generated responses from four LLMs (ChatGPT-4o, Claude-3.5 Sonnet, Llama-3.1 70B, and HyperCLOVA X), plus human responses as a reference. Medical students rated anonymized dialogue sets using 5-point Likert scales for fluency, accuracy, and relevance. Complete dialogue sets were also rated for comprehensibility, hallucination, coherence, consistency, engagement, and overall satisfaction.
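As a minimal sketch of how per-source mean Likert scores like those reported here could be tabulated, the following Python snippet groups ratings by response source and metric and averages them. The `ratings` rows, the source labels (`model_a`, `human`), and the `mean_likert` helper are hypothetical illustrations, not the study's data or analysis code; the abstract does not state which software or aggregation procedure was used.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical ratings (NOT study data): (source, metric, 1-5 Likert score).
ratings = [
    ("model_a", "fluency", 5),
    ("model_a", "fluency", 4),
    ("model_a", "accuracy", 4),
    ("model_a", "accuracy", 4),
    ("human", "fluency", 5),
    ("human", "fluency", 5),
    ("human", "accuracy", 5),
    ("human", "accuracy", 4),
]


def mean_likert(rows):
    """Return {(source, metric): mean score}, rejecting scores outside 1-5."""
    grouped = defaultdict(list)
    for source, metric, score in rows:
        if not 1 <= score <= 5:
            raise ValueError(f"Likert score out of range: {score}")
        grouped[(source, metric)].append(score)
    return {key: mean(scores) for key, scores in grouped.items()}


summary = mean_likert(ratings)
for (source, metric), value in sorted(summary.items()):
    print(f"{source:8s} {metric:9s} {value:.2f}")
```

Keeping the raters anonymous to the response source, as described above, only affects how ratings are collected; the aggregation itself is a plain grouped mean.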
Results
Thirteen senior medical students (5 female; mean OSCE experience 57.0 hrs) participated. HyperCLOVA X scored highest in fluency (mean 4.85), Claude-3.5 Sonnet in accuracy (4.79), and Llama-3.1 70B in relevance (4.81). Human responses showed fluency/accuracy/relevance scores of 4.90/4.78/4.87. Overall metrics are illustrated as radar plots in Figure 2.
Conclusion
In this OSCE scenario, LLM-based virtual patients performed comparably to human responses. These findings support the potential use of LLM agents for clinical skills education in nephrology-focused OSCE training.
Mean 5-point Likert scores for fluency, accuracy, and relevance
Radar plots