Methodologic Guide for Evaluating Clinical Performance and Effect of Artificial Intelligence Technology for Medical Diagnosis and Prediction

In this article, we review key methodologic points that should be considered in the clinical evaluation of artificial intelligence tools for medical diagnosis and prediction.

The use of artificial intelligence in medicine is currently an issue of great interest, especially with regard to the diagnostic or predictive analysis of medical images. Adoption of an artificial intelligence tool in clinical practice requires careful confirmation of its clinical utility. Herein, the authors explain key methodologic points involved in the clinical evaluation of artificial intelligence technology for use in medicine, especially high-dimensional or overparameterized diagnostic or predictive models in which artificial deep neural networks are used, mainly from the standpoints of clinical epidemiology and biostatistics. First, statistical methods for assessing the discrimination and calibration performance of a diagnostic or predictive model are summarized. Next, the effects of disease manifestation spectrum and disease prevalence on performance results are explained. The difference between evaluating performance with internal versus external datasets is then discussed, including the importance of using an adequate external dataset obtained from a well-defined clinical cohort so that clinical performance is not overestimated because of overfitting in high-dimensional or overparameterized classification models or because of spectrum bias, as well as the essentials for achieving a more robust clinical evaluation. Finally, the authors review the role of clinical trials and observational outcome studies in the ultimate clinical verification of diagnostic or predictive artificial intelligence tools through patient outcomes, beyond performance metrics, and how to design such studies.
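
As an illustration of the performance measures summarized above, the following is a minimal sketch, not taken from the article, assuming Python with numpy and scikit-learn; the test data, the probability model, and the sensitivity and specificity values of 0.90 are all hypothetical. It computes the area under the ROC curve (discrimination), tabulates a simple reliability curve (calibration), and shows how positive and negative predictive values shift with disease prevalence even when sensitivity and specificity are held fixed.

```python
# Minimal sketch (not from the article): discrimination (AUC), calibration
# (reliability curve), and the effect of disease prevalence on predictive
# values, using hypothetical data with numpy and scikit-learn.
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.calibration import calibration_curve

rng = np.random.default_rng(0)

# Hypothetical test set: 1 = diseased, 0 = not diseased, with
# model-predicted probabilities of disease.
y_true = rng.integers(0, 2, size=1000)
y_prob = np.clip(0.3 * y_true + rng.normal(0.35, 0.2, size=1000), 0, 1)

# Discrimination: area under the ROC curve (the c statistic).
auc = roc_auc_score(y_true, y_prob)

# Calibration: compare mean predicted probability with the observed
# frequency of disease within probability bins (a tabular reliability diagram).
obs_freq, mean_pred = calibration_curve(y_true, y_prob, n_bins=5)

# Prevalence effect: sensitivity and specificity are properties of the test,
# but PPV and NPV change with disease prevalence (Bayes theorem).
def predictive_values(sensitivity, specificity, prevalence):
    ppv = (sensitivity * prevalence) / (
        sensitivity * prevalence + (1 - specificity) * (1 - prevalence))
    npv = (specificity * (1 - prevalence)) / (
        specificity * (1 - prevalence) + (1 - sensitivity) * prevalence)
    return ppv, npv

print(f"AUC = {auc:.3f}")
for bin_obs, bin_pred in zip(obs_freq, mean_pred):
    print(f"mean predicted {bin_pred:.2f} -> observed {bin_obs:.2f}")
for prev in (0.05, 0.20, 0.50):
    ppv, npv = predictive_values(0.90, 0.90, prev)
    print(f"prevalence {prev:.0%}: PPV {ppv:.2f}, NPV {npv:.2f}")
```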

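The optimism that results from evaluating a high-dimensional, overparameterized model on its own development data can also be sketched in code. The example below is hypothetical and not from the article: it fits a random forest to noise-only features (so the true AUC is about 0.5) and contrasts the apparent, resubstitution AUC with the AUC obtained on independent data. A proper external validation would additionally draw the test dataset from a different, well-defined clinical cohort to address spectrum bias.

```python
# Minimal sketch (not from the article): an overparameterized classifier
# looks far better on its own development data than on independent data,
# using hypothetical noise-only features (true AUC ~0.5).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)

# Hypothetical development and independent test sets: 200 cases,
# 500 features carrying no real signal about the outcome.
X_dev, y_dev = rng.normal(size=(200, 500)), rng.integers(0, 2, size=200)
X_ext, y_ext = rng.normal(size=(200, 500)), rng.integers(0, 2, size=200)

model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_dev, y_dev)

# Resubstitution (apparent) performance is grossly optimistic ...
auc_apparent = roc_auc_score(y_dev, model.predict_proba(X_dev)[:, 1])
# ... while independent data reveal that the model is uninformative.
auc_external = roc_auc_score(y_ext, model.predict_proba(X_ext)[:, 1])

print(f"apparent AUC = {auc_apparent:.2f}, external AUC = {auc_external:.2f}")
```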
© RSNA, 2018

Article History

Published online: Jan 08 2018
Published in print: Mar 2018