Abstract: SA-PO1227
Automated Structured Medical Data Extraction from Audio Recordings of Outpatient Nephrology Encounters Using Large Language Models
Session Information
- CKD: Biomarkers and Emerging Tools for Diagnosis and Monitoring
November 08, 2025 | Location: Exhibit Hall, Convention Center
Abstract Time: 10:00 AM - 12:00 PM
Category: CKD (Non-Dialysis)
- 2302 CKD (Non-Dialysis): Clinical, Outcomes, and Trials
Authors
- Neri, Luca, Renal Research Institute, New York, New York, United States
- Kovarova, Vratislava, Fresenius Medical Care AG, Bad Homburg, HE, Germany
- Morillo Navarro, Kevin, Fresenius Medical Care AG, Bad Homburg, HE, Germany
- Silvestre-Llopis, Jordi, Fresenius Medical Care AG, Bad Homburg, HE, Germany
- Nehezova, Katarina, Fresenius Medical Care AG, Bad Homburg, HE, Germany
- Barbieri, Carlo, Fresenius Medical Care Italia SpA, Palazzo Pignano, Lombardia, Italy
- Bellocchio, Francesco, Renal Research Institute, New York, New York, United States
- Usvyat, Len A., Renal Research Institute, New York, New York, United States
- Casana-Eslava, Raul Vicente, Fresenius Medical Care AG, Bad Homburg, HE, Germany
Background
Physician documentation often imposes a substantial time burden and can be incomplete. We developed and tested an end-to-end tool leveraging a large language model for structured medical data extraction from encounter recordings, with the goal of generating complete visit summaries while minimizing manual post-editing.
Methods
Fifteen outpatient follow-up visits for non-dialysis-dependent chronic kidney disease, conducted in a Nephrocare Clinic in the Czech Republic, were processed sequentially: (1) audio-to-text transcription with Whisper; (2) machine translation to English via GPT-4; and (3) extraction of 25 ontology-defined data elements (Visit Type, History, Condition Evaluation, Vitals, Medication Review) using 33 GPT-4 prompts that specified each element's required fields. All elements were counted in the evaluation. Accuracy was tested against annotations made by the attending physicians.
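The per-element extraction step could be sketched as follows. This is a minimal illustration only: the element names, required fields, and prompt wording below are assumptions for demonstration, not the study's actual ontology or prompts, and the live Whisper/GPT-4 calls are omitted.

```python
import json

# Hypothetical subset of the 25 ontology-defined elements; the field
# names here are illustrative assumptions, not the study's ontology.
ONTOLOGY = {
    "Vitals": ["systolic_bp_mmHg", "diastolic_bp_mmHg", "heart_rate_bpm"],
    "Medication Review": ["name", "dose", "frequency"],
}

def build_extraction_prompt(element: str, transcript: str) -> str:
    """Compose one per-element prompt (the study used 33 such prompts)."""
    fields = ", ".join(ONTOLOGY[element])
    return (
        f"From the visit transcript below, extract the element '{element}' "
        f"with required fields: {fields}. Reply with JSON only; use null for "
        f"any field not mentioned (do not guess).\n\nTranscript:\n{transcript}"
    )

def parse_reply(element: str, reply: str) -> dict:
    """Validate a model reply, keeping only the element's required fields."""
    data = json.loads(reply)
    return {field: data.get(field) for field in ONTOLOGY[element]}
```

Instructing the model to return null for unmentioned fields, rather than guessing, is one plausible way to favor precision over recall, consistent with the zero-false-positive result reported below.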
Results
Against gold-standard annotations, the pipeline achieved a macro-averaged F1 of 0.87, with 100% precision and 78% recall overall. Visit type, vitals, anthropometrics, recommendations, and some elements of the physical examination and medical history exceeded an F1 of 0.92. Lower accuracy was obtained for laboratory values and medication names (F1 = 0.71 and 0.67, respectively). We observed zero false positives (i.e., no hallucinations) across 5 hours of recordings.
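For reference, F1 is the harmonic mean of precision and recall. Note that the reported 0.87 is a macro average of per-element F1 scores, which need not equal the F1 implied by the pooled precision and recall:

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

With the pooled figures above, f1(1.00, 0.78) is approximately 0.876, close to the macro-averaged 0.87.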
Conclusion
GPT-4 can reliably automate transcription, translation, and structured extraction from non-English clinical audio, achieving high accuracy and minimal hallucination without manual correction. Future work will validate on larger, multilingual cohorts, perform granular error analyses, and pilot real-time scribing with physician oversight.