Kidney Week

Abstract: SA-PO004

AI in the Loop: Using Ensemble Model Agreement as a Surrogate for Segmentation Confidence in Renal Stone CT Evaluations

Session Information

  • Bioengineering
    November 05, 2022 | Location: Exhibit Hall, Orange County Convention Center, West Building
    Abstract Time: 10:00 AM - 12:00 PM

Category: Bioengineering

  • 300 Bioengineering


  • Kline, Timothy L., Mayo Foundation for Medical Education and Research, Rochester, Minnesota, United States

Deep learning-based semantic segmentation has been shown to perform at the level of human readers in a wide range of medical image processing tasks. However, the ability to automatically (i) flag out-of-domain cases or (ii) identify cases where a model may be less confident has received much less attention. Here we develop an approach to provide insights into model confidence that can be built on top of common approaches for model ensembling. We show the utility of the approach in a highly imbalanced problem: segmentation of both kidneys as well as renal stones.


A total of 400 non-contrast CT images were curated from our institution's image archive. Reference segmentations of both kidneys and renal stones were obtained by quality review of the output of previously developed segmentation algorithms. A deep learning framework was used to develop a 5-fold ensemble model in both 2D and 3D (300 cases for training/validation, 100 for testing). The individual folds and ensemble models were evaluated by similarity metrics. The variability of the models between different folds and input image dimensionality was used to establish our framework for creating an ‘AI in the Loop’ method to automatically flag cases the models were less confident about.
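
The flagging idea described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes agreement is measured as the mean pairwise Dice between the binary masks predicted by each model for one case, and that a case is flagged when that value falls below a cutoff. The function names and the 0.8 threshold are hypothetical; the abstract does not state which agreement measure or cutoff was used.

```python
from itertools import combinations

import numpy as np


def dice(a, b):
    """Dice overlap between two binary masks of the same shape."""
    a = np.asarray(a, dtype=bool)
    b = np.asarray(b, dtype=bool)
    total = a.sum() + b.sum()
    if total == 0:
        # Assumption: two empty masks count as perfect agreement.
        return 1.0
    return float(2.0 * np.logical_and(a, b).sum() / total)


def ensemble_agreement(masks):
    """Mean pairwise Dice over all model predictions for one case."""
    scores = [dice(a, b) for a, b in combinations(masks, 2)]
    return float(np.mean(scores))


def flag_for_review(masks, threshold=0.8):
    """True when the models disagree enough to warrant human review.

    The threshold is a hypothetical value chosen for illustration.
    """
    return ensemble_agreement(masks) < threshold


# Toy case: two models agree exactly, a third predicts a shifted region.
m1 = np.zeros((4, 4), dtype=bool)
m1[1:3, 1:3] = True
m2 = m1.copy()
m3 = np.zeros((4, 4), dtype=bool)
m3[2:4, 2:4] = True

print(ensemble_agreement([m1, m2, m3]))
print(flag_for_review([m1, m2, m3]))
```

In a setup like the one described, `masks` would hold one prediction per fold and per input dimensionality (2D and 3D) for the same CT case.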


The automated models achieved excellent performance for segmentation of kidneys and renal stones on the hold-out test set. The mean±SD of Dice was 0.97±0.03 and 0.89±0.15 for kidney and stones, respectively. Comparing individual models to each other demonstrated how disagreement between models could be used as a surrogate for model confidence.
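
For readers unfamiliar with the metric, the Dice coefficient for two masks A and B is 2|A∩B| / (|A| + |B|). The sketch below shows one way per-structure Dice and a mean±SD summary could be computed from multi-label segmentation maps. The label values (1 for kidney, 2 for stone), the function names, and the use of the sample standard deviation are assumptions made for illustration; the abstract does not specify them.

```python
import numpy as np

# Hypothetical label encoding, not taken from the abstract.
KIDNEY, STONE = 1, 2


def dice_for_label(pred, ref, label):
    """Dice between predicted and reference maps for a single label."""
    p = np.asarray(pred) == label
    r = np.asarray(ref) == label
    total = p.sum() + r.sum()
    if total == 0:
        # Assumption: a label absent from both maps scores 1.0.
        return 1.0
    return float(2.0 * np.logical_and(p, r).sum() / total)


def mean_sd(scores):
    """Mean and sample standard deviation (ddof=1) of per-case scores."""
    s = np.asarray(scores, dtype=float)
    return float(s.mean()), float(s.std(ddof=1))


# Toy 2D example: small stone regions are penalized heavily by a
# single extra voxel, which is one reason stone Dice varies more.
ref = np.array([[1, 1, 0],
                [1, 2, 0]])
pred = np.array([[1, 1, 0],
                 [0, 2, 2]])

kidney_dice = dice_for_label(pred, ref, KIDNEY)
stone_dice = dice_for_label(pred, ref, STONE)
print(kidney_dice, stone_dice)
```

Applying `mean_sd` to the list of per-case scores on a test set would give a summary in the same mean±SD form as the values reported above.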


We developed a framework for automatically assessing model confidence by comparing models trained on different data subsets and with different model architectures. This approach will have utility in automated pipelines to draw attention to potential failure cases.


  • NIDDK Support