Translating AI to Clinical Practice: Overcoming Data Shift with Explainability

Explainability provides useful techniques for translating AI into clinical practice by detecting and mitigating data shift, the data distribution mismatch between model training and real environments that reduces model performance.

Translating artificial intelligence (AI) algorithms into clinical practice requires that models generalize to real-world data. One of the main obstacles to generalizability is data shift, a data distribution mismatch between model training and real environments. Explainable AI techniques offer tools to detect and mitigate the data shift problem and to develop reliable AI for clinical practice. Most medical AI is trained with datasets gathered from limited environments, such as restricted disease populations and center-dependent acquisition conditions. The data shift that commonly accompanies such a limited training set often causes a significant decrease in performance in the deployment environment. When developing a medical application, it is important to detect potential data shift and assess its impact on clinical translation. Throughout the AI training stages, from premodel analysis to in-model and post hoc explanations, explainability can play a key role in detecting model susceptibility to data shift, which is otherwise hidden because the test data have the same biased distribution as the training data. Without enriched test sets from external environments, performance-based model assessments cannot effectively reveal whether a model has overfit to training data bias. In the absence of such external data, explainability techniques can aid in translating AI to clinical practice as a tool to detect and mitigate potential failures due to data shift.
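The point that an internal test set can hide susceptibility to data shift can be illustrated with a small synthetic sketch. It is not taken from the article. It assumes a hypothetical two-feature dataset in which a "shortcut" feature (standing in for something like a site-specific marker) agrees with the label at the development site but is uninformative at an external site. The helper names (`make_site`, `fit_logistic`, `accuracy`) and all numeric settings are illustrative assumptions, and inspecting the weights of a linear model is used here only as a very simple stand-in for an in-model explanation.

```python
import numpy as np


def make_site(n, shortcut_agreement, rng):
    """Simulate one site (hypothetical): a weak true signal plus a shortcut.

    shortcut_agreement is the probability that the shortcut feature
    matches the label at this site.
    """
    y = rng.integers(0, 2, size=n)
    # True signal: shifted by +/-0.5 depending on the label, with unit noise.
    signal = (2 * y - 1) * 0.5 + rng.normal(size=n)
    # Shortcut: equals the label (coded as -1/+1) with the given probability.
    agree = rng.random(n) < shortcut_agreement
    shortcut = np.where(agree, y, 1 - y) * 2.0 - 1.0
    return np.column_stack([signal, shortcut]), y


def fit_logistic(X, y, lr=0.1, steps=2000):
    """Plain gradient-descent logistic regression (no regularization)."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        grad = p - y
        w -= lr * (X.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b


def accuracy(X, y, w, b):
    """Fraction of correct predictions at a 0.5 probability threshold."""
    return float((((X @ w + b) > 0).astype(int) == y).mean())


rng = np.random.default_rng(0)

# Development site: shortcut agrees with the label 95% of the time (assumed).
X_dev, y_dev = make_site(4000, 0.95, rng)
X_train, y_train = X_dev[:3000], y_dev[:3000]
X_int, y_int = X_dev[3000:], y_dev[3000:]  # internal test set, same bias

# External site: shortcut is unrelated to the label (assumed).
X_ext, y_ext = make_site(1000, 0.5, rng)

w, b = fit_logistic(X_train, y_train)
acc_internal = accuracy(X_int, y_int, w, b)
acc_external = accuracy(X_ext, y_ext, w, b)

print("weights [signal, shortcut]:", w)
print("internal accuracy:", acc_internal)
print("external accuracy:", acc_external)
```

In this toy setting the internal accuracy stays high while the external accuracy drops sharply, and the learned weight on the shortcut feature is larger than the weight on the true signal. The weight inspection needs no external data, which is the sense in which an explanation can flag a susceptibility that a performance metric on the internal test set does not.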

©RSNA, 2023

Quiz questions for this article are available in the supplemental material.


Article History

Received: Apr 28 2022
Revision requested: Aug 15 2022
Revision received: Sept 9 2022
Accepted: Sept 23 2022
Published online: Apr 27 2023