Methodologic Guide for Evaluating Clinical Performance and Effect of Artificial Intelligence Technology for Medical Diagnosis and Prediction
Abstract
In this article, we review some of the key methodologic points that should be considered with regard to clinical evaluation of artificial intelligence tools for use in medical diagnosis and prediction.
The use of artificial intelligence in medicine is currently an issue of great interest, especially with regard to the diagnostic or predictive analysis of medical images. Adoption of an artificial intelligence tool in clinical practice requires careful confirmation of its clinical utility. Herein, the authors explain key methodology points involved in the clinical evaluation of artificial intelligence technology for use in medicine, especially high-dimensional or overparameterized diagnostic or predictive models in which artificial deep neural networks are used, mainly from the standpoints of clinical epidemiology and biostatistics. First, statistical methods for assessing the discrimination and calibration performance of a diagnostic or predictive model are summarized. Next, the effects of disease manifestation spectrum and disease prevalence on performance results are explained. The authors then discuss the difference between evaluating performance with internal and external datasets; the importance of using an adequate external dataset obtained from a well-defined clinical cohort, so as to avoid overestimating clinical performance as a result of spectrum bias or of overfitting in high-dimensional or overparameterized classification models; and the essentials for achieving a more robust clinical evaluation. Finally, the authors review the role of clinical trials and observational outcome studies in the ultimate clinical verification of diagnostic or predictive artificial intelligence tools through patient outcomes, beyond performance metrics, and how to design such studies.
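The two ideas above can be made concrete with a small sketch (not taken from the article; all numbers are hypothetical): discrimination can be estimated as the probability that a randomly chosen diseased case receives a higher model score than a randomly chosen nondiseased case (the AUC, equivalent to the C statistic for binary outcomes), and the dependence of predictive values on prevalence follows directly from Bayes' theorem.

```python
def auc(scores_pos, scores_neg):
    """AUC as the probability that a random diseased case scores higher
    than a random nondiseased case (Mann-Whitney U / (n_pos * n_neg));
    ties count as half."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


def ppv(sensitivity, specificity, prevalence):
    """Positive predictive value via Bayes' theorem:
    TP / (TP + FP) at the given pretest probability."""
    tp = sensitivity * prevalence
    fp = (1 - specificity) * (1 - prevalence)
    return tp / (tp + fp)


# Hypothetical model scores for diseased and nondiseased cases:
diseased = [0.9, 0.8, 0.7, 0.6]
nondiseased = [0.5, 0.4, 0.6, 0.2]
print(round(auc(diseased, nondiseased), 3))  # → 0.969

# A test with 90% sensitivity and 90% specificity performs very
# differently at 30% prevalence (an enriched research cohort) than
# at 1% prevalence (a screening population):
print(round(ppv(0.9, 0.9, 0.30), 3))  # → 0.794
print(round(ppv(0.9, 0.9, 0.01), 3))  # → 0.083
```

The steep drop in PPV at low prevalence illustrates why performance reported on enriched case-control datasets can overstate real-world clinical utility.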
© RSNA, 2018
Article History
Published online: Jan 08 2018
Published in print: Mar 2018