Kidney Week

Abstract: PUB045

GPT-4o Assists Systematic Review Screening with High Precision

Session Information

Category: 300 Artificial Intelligence, Digital Health, and Data Science

Authors

  • Bergling, Karin, Renal Research Institute, New York, New York, United States
  • Yueh, Sheng-Han, Renal Research Institute, New York, New York, United States
  • Lama, Suman Kumar, Renal Research Institute, New York, New York, United States
  • Willetts, Joanna, Renal Research Institute, New York, New York, United States
  • Blankenship, Derek, Renal Research Institute, New York, New York, United States
  • Usvyat, Len A., Renal Research Institute, New York, New York, United States
  • Winter, Anke, Renal Research Institute, New York, New York, United States
  • Raimann, Jochen G., Fresenius Medical Care North America, New York, New York, United States
  • Zhang, Hanjie, Renal Research Institute, New York, New York, United States
Background

Screening abstracts for systematic reviews is labor-intensive and time-consuming. Here we examine the ability of GPT-4o (OpenAI) to assist screening without loss of quality.

Methods

A total of 1,628 abstracts previously screened in a systematic review on adverse effects of systemic heparin during maintenance hemodialysis, each labeled “keep” or “discard” per Population–Intervention–Comparator–Outcome–Study (PICOS) criteria, were split into a validation set (n=277, 17%) and a test set (n=1,351, 83%). During validation, the PICOS criteria were translated into seven questions (e.g., “Does the abstract describe systemic heparin use?”; simplified example). GPT-4o was prompted to answer each question with “yes,” “no,” or “uncertain,” and the pattern of responses determined the final classification as “keep,” “discard,” or “uncertain.” To prioritize precision in discarding non-relevant abstracts, “uncertain” classifications were resolved to “keep.” Performance was measured against independent review by two researchers.
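For illustration, a minimal sketch of how such a question-based screening step could be implemented with the openai Python SDK. The question texts and the decision rule below are assumptions for demonstration only; the abstract does not specify the study’s actual prompts or classification logic.

    # Minimal sketch of question-based abstract screening with GPT-4o.
    # Question texts and decision rule are illustrative assumptions,
    # not the study's actual prompts or logic.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    # Hypothetical PICOS-derived questions (the study used seven).
    QUESTIONS = [
        "Does the abstract describe systemic heparin use?",
        "Does the abstract report on maintenance hemodialysis patients?",
    ]

    def ask(abstract: str, question: str) -> str:
        """Ask one screening question; expect 'yes', 'no', or 'uncertain'."""
        resp = client.chat.completions.create(
            model="gpt-4o",
            temperature=0,
            messages=[
                {"role": "system",
                 "content": "Answer strictly with 'yes', 'no', or 'uncertain'."},
                {"role": "user",
                 "content": f"Abstract:\n{abstract}\n\nQuestion: {question}"},
            ],
        )
        return resp.choices[0].message.content.strip().lower()

    def classify(abstract: str) -> str:
        """Map the pattern of answers to a screening decision."""
        answers = [ask(abstract, q) for q in QUESTIONS]
        if any(a == "no" for a in answers):
            return "discard"  # fails at least one PICOS criterion
        # All "yes", or some "uncertain": resolve to "keep" so that
        # potentially relevant studies are not excluded.
        return "keep"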

Results

In the validation set, GPT-4o achieved 89% accuracy. For “discard,” precision and recall were 0.97 and 0.90; for “keep,” 0.53 and 0.82, the lower precision reflecting the decision to classify uncertain cases as “keep” so that relevant studies would not be excluded. In the test set, GPT-4o classified 34 abstracts as “discard” that human reviewers had labeled “keep”; re-evaluation found that 17 of these were in fact irrelevant, and the ground truth was updated accordingly. Accuracy was 85%. For “discard,” precision and recall were 0.98 and 0.84 (F1: 0.91); for “keep,” 0.38 and 0.87. The confusion matrix comprised 1,025 true negatives, 117 true positives, 192 false positives, and 17 false negatives (Fig. 1).
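As a consistency check, the reported test-set metrics follow directly from these confusion-matrix counts; the short sketch below reproduces them, treating “keep” as the positive class (which matches the reported figures).

    # Recompute the reported test-set metrics from the confusion-matrix
    # counts, treating "keep" as the positive class.
    tp, fp = 117, 192    # abstracts kept by GPT-4o: correct / incorrect
    tn, fn = 1025, 17    # abstracts discarded by GPT-4o: correct / incorrect

    accuracy = (tp + tn) / (tp + tn + fp + fn)        # 1142/1351 = 0.845 -> 85%

    prec_discard = tn / (tn + fn)                     # 0.98
    rec_discard = tn / (tn + fp)                      # 0.84
    f1_discard = (2 * prec_discard * rec_discard
                  / (prec_discard + rec_discard))     # 0.91

    prec_keep = tp / (tp + fp)                        # 0.38
    rec_keep = tp / (tp + fn)                         # 0.87

    print(f"accuracy={accuracy:.0%}  "
          f"discard P/R/F1={prec_discard:.2f}/{rec_discard:.2f}/{f1_discard:.2f}  "
          f"keep P/R={prec_keep:.2f}/{rec_keep:.2f}")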

Conclusion

GPT-4o showed promising precision in discarding irrelevant studies, and the recall-focused handling of uncertain cases preserved relevant ones. These findings suggest that large language models may reduce manual screening effort while maintaining inclusion quality.

Figure 1: Agreement between GPT-4o and human reviewers. Left: overlap in “keep” decisions; right: overlap in “discard” decisions. Numbers indicate abstract counts.

Funding

  • Commercial Support – Renal Research Institute LLC
