Kidney Week

Abstract: FR-PO0030

Human-Rated Dialogue Evaluation of Large Language Model (LLM) Agents vs. Human Responses in a Nephrology Objective Structured Clinical Examination Scenario

Session Information

Category: Artificial Intelligence, Digital Health, and Data Science

  • 300 Artificial Intelligence, Digital Health, and Data Science

Authors

  • Yi, Yongjin, Dankook University Hospital, Cheonan-si, Chungcheongnam-do, Korea (the Republic of)
  • Kim, Sejoong, Seoul National University Bundang Hospital, Seongnam-si, Gyeonggi-do, Korea (the Republic of)

Background

Objective Structured Clinical Examinations (OSCEs) are essential for medical education, but standardized patients (SPs) require high costs and intensive training. To overcome this, we developed a large language model (LLM)-based virtual patient and evaluated its performance in an OSCE scenario involving a patient with red-colored urine.

Methods

A virtual patient was created using prompt-based clinical scenarios simulating hematuria. We used 23 standardized history-taking questions and generated responses from four LLMs (ChatGPT-4o, Claude-3.5 Sonnet, Llama-3.1 70B, and HyperCLOVA X), plus human responses as a reference. Medical students rated anonymized dialogue sets using 5-point Likert scales for fluency, accuracy, and relevance. Complete dialogue sets were also rated for comprehensibility, hallucination, coherence, consistency, engagement, and overall satisfaction.
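The rating step can be illustrated with a minimal sketch of how per-source mean Likert scores might be computed from individual ratings. The record layout, the function name `mean_likert`, and the rating values are all hypothetical placeholders for illustration; they are not the authors' pipeline or data.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical placeholder ratings, NOT data from the study.
# Each record: (rater_id, response_source, metric, score on a 1-5 Likert scale)
ratings = [
    ("rater_1", "source_a", "fluency", 5),
    ("rater_2", "source_a", "fluency", 4),
    ("rater_1", "source_b", "fluency", 4),
    ("rater_2", "source_b", "fluency", 3),
    ("rater_1", "source_a", "accuracy", 5),
    ("rater_2", "source_a", "accuracy", 5),
]


def mean_likert(records):
    """Return {(source, metric): mean score}, rejecting scores outside 1-5."""
    grouped = defaultdict(list)
    for _rater, source, metric, score in records:
        if not 1 <= score <= 5:
            raise ValueError(f"score out of range: {score}")
        grouped[(source, metric)].append(score)
    return {key: mean(scores) for key, scores in grouped.items()}


means = mean_likert(ratings)
for (source, metric), value in sorted(means.items()):
    print(f"{source:10s} {metric:10s} {value:.2f}")
```

Because raters see anonymized dialogue sets, the source label in such a table would only be attached after rating, when scores are grouped for analysis.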

Results

Thirteen senior medical students (5 female; mean OSCE experience 57.0 hours) participated. Among the LLMs, HyperCLOVA X scored highest in fluency (mean 4.85), Claude-3.5 Sonnet in accuracy (4.79), and Llama-3.1 70B in relevance (4.81). Human responses showed fluency/accuracy/relevance scores of 4.90/4.78/4.87. Overall metrics are illustrated as radar plots in Figure 2.
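As a worked comparison, the gap between the top-scoring LLM and the human reference on each metric can be computed directly from the means reported above. This is a descriptive sketch only: the function name `gap_to_human` is hypothetical, and the abstract reports no variances or significance tests, so the differences say nothing about statistical significance.

```python
# Means as reported in the abstract; only the top-scoring LLM per metric is given.
human = {"fluency": 4.90, "accuracy": 4.78, "relevance": 4.87}
top_llm = {
    "fluency": ("HyperCLOVA X", 4.85),
    "accuracy": ("Claude-3.5 Sonnet", 4.79),
    "relevance": ("Llama-3.1 70B", 4.81),
}


def gap_to_human(llm_scores, human_scores):
    """Return {metric: LLM mean minus human mean}, rounded to 2 decimals."""
    return {
        metric: round(score - human_scores[metric], 2)
        for metric, (_name, score) in llm_scores.items()
    }


gaps = gap_to_human(top_llm, human)
for metric, gap in gaps.items():
    name = top_llm[metric][0]
    print(f"{metric:10s} {name:18s} {gap:+.2f}")
```

On these reported means, the best LLM sits within 0.06 points of the human reference on every metric, and slightly above it on accuracy.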

Conclusion

In this OSCE scenario, LLM-based virtual patients performed comparably to human responses. These findings support the potential use of LLM agents for clinical skills education in nephrology-focused OSCE training.

Mean 5-point Likert scores for fluency, accuracy, and relevance

Radar plots

Digital Object Identifier (DOI)