Abstract: TH-PO0735
Deep Learning Detection of Stain- and Center-Specific Bias in Assessing Multicenter Lupus Nephritis Whole-Slide Images
Session Information
- Glomerular Innovations: Artificial Intelligence, Multiomics, and Biomarkers
November 06, 2025 | Location: Exhibit Hall, Convention Center
Abstract Time: 10:00 AM - 12:00 PM
Category: Glomerular Diseases
- 1402 Glomerular Diseases: Clinical, Outcomes, and Therapeutics
Authors
- Daouk, Mohammad, University of Houston, Houston, Texas, United States
- Becker, Jan U., Universitatsklinikum Koln Klinische Infektiologie, Cologne, NRW, Germany
- Kambham, Neeraja, Stanford University, Stanford, California, United States
- Chang, Anthony, The University of Chicago, Chicago, Illinois, United States
- Mohan, Chandra, University of Houston, Houston, Texas, United States
- Nguyen, Hien V, University of Houston, Houston, Texas, United States
Background
Convolutional neural networks (CNNs) built to grade Lupus Nephritis (LN) glomeruli may learn slide-specific “shortcuts” such as stain or scanner brand, rather than pathology. We measured the size of this bias in a multi-institutional cohort.
Methods
From 363 WSIs (4 stains: H&E, PAS, Masson trichrome, and silver; 3 centers) we extracted 9,674 glomerular patches at three magnifications, z-normalized them, and split them by WSI into 85% train and 15% hold-out sets. On the 85% set we ran 5-fold cross-validation (each fold: 80% train, 20% validation). ResNet-18 (ImageNet-pretrained) was fine-tuned in three regimes:
Single-head predicting stain type.
Single-head predicting center type.
Dual-head with a lesion head (proliferative vs non-proliferative) plus a bias head (stain or center) whose loss was scaled by λ ∈ {10⁻¹, …, 10⁻⁴, 0, −10⁻⁴, …, −10⁻¹}. The weight λ controls how strongly the model relies on the bias head during training; a negative λ inverts the bias target. Uncertainty was estimated via 50 Monte Carlo dropout passes.
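The λ-weighted dual-head objective can be illustrated with a minimal sketch. It assumes the total loss is the lesion cross-entropy plus λ times the bias cross-entropy, which is one plausible reading of "scaled by λ" and is not confirmed by the abstract. The function names (`cross_entropy`, `dual_head_loss`) and the assumption that the elided grid steps are the intermediate powers of ten are hypothetical. Under this reading, a negative λ rewards a higher bias-head loss.

```python
import math


def cross_entropy(logits, target):
    """Softmax cross-entropy for one sample, using a stable log-sum-exp."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_z - logits[target]


def dual_head_loss(lesion_logits, lesion_target, bias_logits, bias_target, lam):
    """Assumed objective: lesion loss plus lam times bias loss.

    lam > 0 encourages the shared features to predict stain/center,
    lam = 0 ignores the bias head, and lam < 0 penalizes features that
    make stain/center predictable.
    """
    lesion = cross_entropy(lesion_logits, lesion_target)
    bias = cross_entropy(bias_logits, bias_target)
    return lesion + lam * bias


# Assumed grid: 10^-1 .. 10^-4, 0, -10^-4 .. -10^-1 (nine values).
LAMBDAS = (
    [10.0 ** -k for k in range(1, 5)]
    + [0.0]
    + [-(10.0 ** -k) for k in range(4, 0, -1)]
)
```

With nine λ values and five cross-validation folds, this grid would give 45 trained models per bias target, which is consistent with the model count reported in the Results, although the abstract does not state how that count was reached.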
Results
At λ=10⁻¹, stain=1.00, center=0.99, lesion=0.87. At λ=10⁻⁴, stain=0.30, center=0.88, lesion=0.86. For λ≤−10⁻³ the bias head was silenced (≤0.05 accuracy) and lesion dropped to 0.80. Across 45 models, stain vs lesion: Spearman r=0.57 (p<0.001); center vs lesion: r=0.49 (p=0.001).
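The rank correlation between bias-head and lesion-head performance can be computed as in the following sketch. It implements Spearman's coefficient as the Pearson correlation of average ranks, which is the standard definition. The helper names (`average_ranks`, `spearman`) are hypothetical, and the abstract does not say which software the authors used.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole group of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)
```

Here `x` would hold one bias-head score per trained model and `y` the matching lesion-head score, so each list would have 45 entries for the analysis described above.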
Conclusion
Standard CNN pipelines readily exploit stain- and center-specific artefacts embedded in multi-center LN datasets, creating an illusion of high diagnostic performance. Mandatory bias audits, stain-invariant color augmentation, domain-adversarial or contrastive training, and cross-center external validation are essential before clinical deployment. Paradoxically, the strong stain signal also suggests that stain-dependent morphological information could be harnessed deliberately, provided that models are insulated from confounding. Rigorous characterization and mitigation of technical shortcuts are prerequisites for trustworthy AI-assisted renal pathology.
Funding
- Other NIH Support