Abstract: PUB045
GPT-4o Assists Systematic Review Screening with High Precision
Session Information
Category: 300 Artificial Intelligence, Digital Health, and Data Science
Authors
- Bergling, Karin, Renal Research Institute, New York, New York, United States
- Yueh, Sheng-Han, Renal Research Institute, New York, New York, United States
- Lama, Suman Kumar, Renal Research Institute, New York, New York, United States
- Willetts, Joanna, Renal Research Institute, New York, New York, United States
- Blankenship, Derek, Renal Research Institute, New York, New York, United States
- Usvyat, Len A., Renal Research Institute, New York, New York, United States
- Winter, Anke, Renal Research Institute, New York, New York, United States
- Raimann, Jochen G., Fresenius Medical Care North America, New York, New York, United States
- Zhang, Hanjie, Renal Research Institute, New York, New York, United States
Background
Screening abstracts for systematic reviews is labor-intensive and time-consuming. Here, we examined the ability of GPT-4o (OpenAI) to assist abstract screening without loss of quality.
Methods
A total of 1,628 abstracts previously screened in a systematic review on adverse effects of systemic heparin during maintenance hemodialysis, each labeled “keep” or “discard” per Population–Intervention–Comparator–Outcome–Study design (PICOS) criteria, were split into a validation set (n=277; 17%) and a test set (n=1,351; 83%). During validation, the PICOS criteria were translated into seven questions (e.g., “Does the abstract describe systemic heparin use?”; simplified example). GPT-4o was prompted to answer each question with “yes,” “no,” or “uncertain.” The pattern of responses determined the final classification as “keep,” “discard,” or “uncertain.” To prioritize precision in discarding non-relevant abstracts, “uncertain” classifications were mapped to “keep.” Performance was measured against independent review by two researchers.
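A minimal sketch of how such a question-based screening step could be implemented with the OpenAI Python client is shown below. The question wording, the response parsing, and the rule mapping answer patterns to a final label are illustrative assumptions, not the exact protocol used in this study.

```python
# Illustrative sketch only: question wording, parsing, and the
# answer-to-label rule are assumptions, not the study's exact protocol.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical PICOS-derived screening questions (simplified examples);
# the study used seven such questions.
QUESTIONS = [
    "Does the abstract describe systemic heparin use?",
    "Does the abstract describe a maintenance hemodialysis population?",
]

def ask(abstract: str, question: str) -> str:
    """Ask GPT-4o one screening question; expect 'yes', 'no', or 'uncertain'."""
    response = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,
        messages=[
            {"role": "system",
             "content": "Answer with exactly one word: yes, no, or uncertain."},
            {"role": "user",
             "content": f"Abstract:\n{abstract}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content.strip().lower()

def classify(abstract: str) -> str:
    """Map the answer pattern to a label; 'uncertain' defaults to 'keep'."""
    answers = [ask(abstract, q) for q in QUESTIONS]
    if any(a == "no" for a in answers):
        return "discard"
    if all(a == "yes" for a in answers):
        return "keep"
    return "keep"  # any 'uncertain' answer is resolved to 'keep' to protect recall
```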
Results
In the validation set, GPT-4o achieved 89% accuracy. For “discard,” precision and recall were 0.97 and 0.90; for “keep,” 0.53 and 0.82, reflecting the decision to classify uncertain cases as “keep” to avoid excluding relevant studies. In the test set, GPT-4o classified 34 abstracts as “discard” that human reviewers had labeled “keep”; on re-evaluation, 17 of these proved irrelevant, and the ground truth was updated. Accuracy was 85%. For “discard,” precision and recall were 0.98 and 0.84 (F1: 0.91); for “keep,” 0.38 and 0.87. With “keep” as the positive class, the confusion matrix comprised 1,025 true negatives, 117 true positives, 192 false positives, and 17 false negatives (Fig. 1).
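As a consistency check, the reported test-set metrics follow directly from the confusion-matrix counts, with “keep” as the positive class; the short script below re-derives them (the variable names are ours):

```python
# Re-derive the reported test-set metrics from the confusion matrix
# (Fig. 1), with "keep" as the positive class.
tp, tn, fp, fn = 117, 1025, 192, 17  # counts reported in the abstract

accuracy = (tp + tn) / (tp + tn + fp + fn)             # 1142/1351 ~ 0.85

keep_precision = tp / (tp + fp)                        # 117/309   ~ 0.38
keep_recall    = tp / (tp + fn)                        # 117/134   ~ 0.87

discard_precision = tn / (tn + fn)                     # 1025/1042 ~ 0.98
discard_recall    = tn / (tn + fp)                     # 1025/1217 ~ 0.84
discard_f1 = (2 * discard_precision * discard_recall
              / (discard_precision + discard_recall))  # ~ 0.91

print(f"accuracy={accuracy:.2f}, "
      f"discard P/R/F1={discard_precision:.2f}/{discard_recall:.2f}/{discard_f1:.2f}, "
      f"keep P/R={keep_precision:.2f}/{keep_recall:.2f}")
```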
Conclusion
GPT-4o showed promising precision in discarding irrelevant studies, and the recall-focused handling of uncertain cases preserved relevant ones. These findings suggest large language models may reduce manual screening effort while maintaining inclusion quality.
Fig. 1. Agreement between GPT-4o and human reviewers. Left: overlap in “keep” decisions; right: overlap in “discard” decisions. Numbers indicate abstract counts.
Funding
- Commercial Support – Renal Research Institute LLC