One Algorithm May Not Fit All: How Selection Bias Affects Machine Learning Performance
Abstract
Machine learning (ML) algorithms have demonstrated high diagnostic accuracy in identifying and categorizing disease on radiologic images. Despite the results of initial research studies that report ML algorithm diagnostic accuracy similar to or exceeding that of radiologists, the results are less impressive when the algorithms are installed at new hospitals and are presented with new images. This phenomenon is potentially the result of selection bias in the data that were used to develop the ML algorithm. Selection bias has long been described by clinical epidemiologists as a key consideration when designing a clinical research study, but this concept has largely been unaddressed in the medical imaging ML literature. The authors discuss the importance of selection bias and its relevance to ML algorithm development to prepare the radiologist to critically evaluate ML literature for potential selection bias and understand how it might affect the applicability of ML algorithms in real clinical environments.
©RSNA, 2020
SA-CME LEARNING OBJECTIVES
After completing this journal-based SA-CME activity, participants will be able to:
■ Discuss the importance of selection bias in ML research, specifically with regard to its effect on external validity.
■ Define relevant clinical epidemiology and ML terminology needed to critically assess ML studies for selection bias.
■ Recognize and address selection bias in ML research.
Introduction
Selection bias is a well-known consideration one should take into account when designing a clinical study, as it can affect whether study results will be applicable to a real patient population (1,2). However, selection bias has been largely unaddressed in the machine learning (ML) literature, despite growing evidence of its detrimental effect on algorithm performance when implemented at a new institution (3,4). As ML algorithms are adopted into the clinical radiology workflow, it is important to recognize the potential for selection bias when critically evaluating published studies of ML algorithms and deciding if these algorithms are ready for daily clinical practice. Keeping basic clinical epidemiology principles in mind will help to ensure that ML algorithms are evaluated with the same rigor as that of any other clinical diagnostic test or risk stratification tool.
In this article, we review relevant concepts and terminology in both clinical epidemiology and ML and how these concepts are related. We aim to provide an appreciation for the effect of selection bias in ML development and how it can be identified and addressed in clinical ML studies.
Relevant Terms from Clinical Epidemiology
There are several terms used in clinical epidemiology that are relevant to our discussion of selection bias. These terms are related to issues that arise from the design and interpretation of clinical research studies involving populations (Table 1).
Target and Study Populations
The target population is defined as the set of all people to whom the results of a study will be applied. The target population of a study should be clearly defined and is usually based on clinical and demographic factors. The study population is defined as the subset of the target population that is available to a researcher, as it is usually impossible to study the entire target population. When even the entire study population cannot be studied, a study sample may be selected from it.
For example, in a hypothetical study of the effect of daily low-dose aspirin treatment on heart attack prevention in African American adults, the target population is African American adults, the study population could be African American adults in the researcher’s city, and the study sample could be African American adults who see primary care providers at the researcher’s institution.
Selection Bias
Selection bias is defined by epidemiologists as error owing to systematic differences in characteristics between those participants who are selected for a study and those who are not (1,2). A source of confusion when using the term selection bias is the difference in usage between epidemiologists and clinical trialists. Epidemiologists view selection bias as one of three major categories of study bias, the other two being confounding bias and information bias (also known as measurement or misclassification bias) (5,6). Clinical trialists view selection bias as one of the six major categories in the Cochrane risk of bias tool, the other five categories being performance bias, detection bias, attrition bias, reporting bias, and other (5,7). The Cochrane definition of selection bias is the systematic differences between baseline characteristics of comparison groups in a study. This differs from the epidemiology definition in a subtle but crucial way: epidemiologists refer to selection bias as systematic differences between the entire study population and the intended target population rather than differences between comparison groups within the study population. The epidemiology definition does not involve an explicit control group and focuses on whether an observed association can be applied to the target population.
When discussing imaging-based ML research, we propose using the epidemiology definition of selection bias. While clinical trials are designed to investigate the effect of an intervention, epidemiologic studies investigate associations between exposure and disease. Similar to epidemiologic studies, much of medical imaging ML research focuses on disease associations—specifically associations of imaging features with disease diagnosis—rather than the effect of an intervention.
External and Internal Validity
External validity refers to whether the results of a study, derived from the study population, will remain valid when applied to the target population. If selection bias is present, a study might not be externally valid. In contrast, internal validity refers to whether a study was correctly designed to evaluate the association the researchers intended to study and whether the results accurately reflect the participants in the study.
To illustrate an internally valid study that may not be externally valid, consider a study of intravenous contrast material exposure and acute kidney injury performed in the inpatient setting. With careful design and data collection, such a study would be internally valid, but it may have limited external validity when applied to outpatients, as patients admitted to the hospital typically have more comorbidities that could amplify the severity of kidney injury. Therefore, in this example, the inpatient study has limited generalizability, and its results should not be extrapolated to the management of outpatients.
Relevant Terms from ML
ML researchers have their own terminology for many analogous concepts from clinical epidemiology, which we defined in the previous section. Confusion can arise because of subtle but important differences, notably when defining test data in ML compared to the target population in clinical epidemiology (Table 2).
Training and Test Data
Training data, or the training set, is defined as the set of images fed into an ML algorithm from which the algorithm learns. A diagnostic ML algorithm uses these training data to identify distinguishing imaging features between diagnostic categories and to discover a weighting of those features that optimizes diagnostic performance. Everything the algorithm knows is drawn from the training data. Usually, a subset of the training data is identified as validation data, which are used for conducting preliminary tests during the training stage and guiding weighting adjustments.
Test data, or the test set, is defined as a set of images that have been excluded from the training data. They are unseen cases that are used to evaluate algorithm performance after training has been completed. The critical nuance to understand is that test data in the ML literature are usually not external data from an independent source. Instead, they are usually drawn from the same source of images used for the training data.
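To make this split concrete, the following is a minimal Python sketch using invented single-institution case identifiers. Note that the held-out test set comes from the same source as the training and validation sets, so it is internal, not external, data.

```python
import random

# Hypothetical sketch: splitting one institution's image list into
# training, validation, and test sets. All three subsets come from the
# SAME source, so the test set measures internal validity only.
def split_dataset(case_ids, train_frac=0.7, val_frac=0.15, seed=42):
    """Shuffle case IDs and split into train, validation, and test lists."""
    rng = random.Random(seed)
    ids = list(case_ids)
    rng.shuffle(ids)
    n = len(ids)
    n_train = int(n * train_frac)
    n_val = int(n * val_frac)
    train = ids[:n_train]
    val = ids[n_train:n_train + n_val]
    test = ids[n_train + n_val:]   # held out from training, but NOT external
    return train, val, test

cases = [f"site_A_case_{i:04d}" for i in range(1000)]
train, val, test = split_dataset(cases)
print(len(train), len(val), len(test))   # 700 150 150
```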
Generalization and Overfitting
Generalization in ML refers to whether the performance of the algorithm on the training set is maintained on the test set. Overfitting causes poor generalization. It occurs when an ML algorithm fits the training data too well, including not only salient features but also noise within the training set, resulting in diminished performance on unseen test data. A simplistic illustration of this concept is teaching a toddler where to put shirts and sweaters in her closet. After showing the toddler several times, the toddler can categorize her shirts and sweaters perfectly. However, if the toddler only practices with the same shirts and sweaters, she may simply memorize her exact shirts and sweaters (perhaps by color) and where each item goes in her closet. In this example, the toddler is overfitting rather than learning the difference between a shirt and a sweater.
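The toddler analogy can be mirrored in a toy sketch with synthetic data: a one-nearest-neighbor "memorizer" scores perfectly on the items it has seen but worse on unseen items, because it absorbs label noise along with the true rule. Everything below is invented for illustration and is not a clinical model.

```python
import random

def make_data(rng, n):
    """Synthetic cases: feature x in [0,1]; true rule is label 1 if x > 0.5,
    but 20% of labels are flipped (noise) for the memorizer to absorb."""
    data = []
    for _ in range(n):
        x = rng.random()
        label = 1 if x > 0.5 else 0
        if rng.random() < 0.2:
            label = 1 - label
        data.append((x, label))
    return data

def predict_1nn(train, x):
    # pure memorization: return the label of the closest training point
    return min(train, key=lambda p: abs(p[0] - x))[1]

def accuracy(memory, eval_data):
    hits = sum(predict_1nn(memory, x) == y for x, y in eval_data)
    return hits / len(eval_data)

rng = random.Random(0)
train = make_data(rng, 200)
test = make_data(rng, 200)
train_acc = accuracy(train, train)
test_acc = accuracy(train, test)
print(f"training accuracy: {train_acc:.2f}")  # 1.00: every training item memorized
print(f"test accuracy: {test_acc:.2f}")       # lower: the noise, not the rule, was learned
```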
Commonly used methods to reduce overfitting and thus improve generalization include regularization techniques (eg, early stopping, dropout, and activity regularization), cross-validation, and data augmentation. While radiologists do not need to know the technical details of such methods, it is helpful to recognize the terms and know their place in algorithm development.
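Of the methods just listed, cross-validation is the simplest to sketch. In k-fold cross-validation, every case serves as validation data exactly once, giving a more stable performance estimate than any single split. This is a minimal illustration, not a production implementation.

```python
def k_fold_indices(n_cases, k=5):
    """Yield (train_indices, validation_indices) for each of k folds.
    For simplicity, any remainder cases (when n_cases is not divisible
    by k) stay in the training portion of every fold."""
    fold_size = n_cases // k
    indices = list(range(n_cases))
    for fold in range(k):
        start, stop = fold * fold_size, (fold + 1) * fold_size
        val_idx = indices[start:stop]
        train_idx = indices[:start] + indices[stop:]
        yield train_idx, val_idx

for train_idx, val_idx in k_fold_indices(100, k=5):
    print(len(train_idx), len(val_idx))   # 80 20, printed five times
```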
Clinical Epidemiology Concepts in ML: What Can We Learn?
Study and Target Populations versus Training and Test Data
In most ML studies, the study population is the ML dataset itself, which is most commonly split into training, validation, and test sets. The target population is often unstated or implied. The rapid advances in ML have in part been driven by the availability of large public datasets, but the clinical characteristics, clinical setting, and other selection criteria are often not available for these datasets. If the target population in an ML algorithm is poorly defined, it is difficult to judge whether the studied performance of the algorithm can be applied to a clinically relevant population, which may be different with respect to disease severity, disease comorbidities, risk factors, the distribution of imaging findings, the type of imaging equipment used, and the availability of specialized care.
External Validity versus Generalization
The term generalization in classic ML specifically refers to evaluation of internal validity, because test sets are drawn from the same study population as the training and validation sets. Evaluation of external validity requires data drawn from the same target population but independent of the study population. When used, these external data are explicitly referred to as an external dataset or an external test set in the current medical imaging ML literature.
Although the terms “generalizing well,” “generalizable,” and “generalizability” are often used in the clinical literature to indicate external validity of results, generalization in ML is not the same as external validity (Figure). This difference is a potential source of confusion for radiologists who are exposed to both the clinical and ML literature. A recent commentary by the Radiology editorial board (8) on the evaluation of artificial intelligence research contains a recommendation for using external datasets from outside institutions as the final measure of performance to determine if the model “will generalize.” Thus, the use of this term in medical imaging ML may be shifting toward denoting external validity rather than internal validity. To avoid confusion, we recommend that any usage of terms derived from the root word “generalize” be accompanied by an indication of whether external or internal validity is meant.

Figure. Diagram demonstrates external validity versus generalization. Generalization (black arrow) refers to algorithm performance on the test set compared to on the training and validation datasets, all of which are obtained from the same study population. External validity (white arrow) refers to algorithm performance on an external dataset, which is separate from the original study population.
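The distinction in the figure can be sketched in code: the same model is scored once on an internal test set (generalization) and once on an external dataset from a different source (external validity). The threshold "model" and both datasets below are entirely synthetic; the external set is deliberately shifted to mimic a population difference.

```python
def evaluate(model, dataset):
    """Return the fraction of (features, label) cases the model labels correctly."""
    correct = sum(model(features) == label for features, label in dataset)
    return correct / len(dataset)

# toy rule learned on the study population: positive if feature > 0.5
model = lambda x: 1 if x > 0.5 else 0

internal_test = [(0.2, 0), (0.7, 1), (0.9, 1), (0.4, 0)]   # same source as training
external_test = [(0.55, 0), (0.6, 0), (0.8, 1), (0.3, 0)]  # different source: shifted cutoff

print("internal accuracy:", evaluate(model, internal_test))  # 1.0
print("external accuracy:", evaluate(model, external_test))  # 0.5
```

Good performance on the internal test set (generalization) does not guarantee good performance on the external set, which is the point the figure makes.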
How to Address Selection Bias
It is well and good to present a detailed discussion of selection bias and its potential effect on external validity, but what can be done to address the problem? Two general strategies should be considered. First, the obvious and most direct strategy to address selection bias is to employ an independent external dataset as the final test set.
The second strategy for addressing potential selection bias is indirect. It involves providing as much detail as possible about the study and target populations. This information enables the audience to judge if the study population was actually representative of the target population or, perhaps more importantly, whether the study population matches another population of interest, such as that at another institution. In imaging ML studies, the study population is usually the entire training dataset. Ideally, a full description of the dataset would include how cases were selected for the dataset, known to epidemiologists as the inclusion criteria. Inclusion criteria may involve specific demographic characteristics, geographic location, date range, clinical diagnoses, and clinical setting. Following the inclusion criteria is a description of the distribution of cases that are actually in the dataset. As with all clinical studies, the distribution should be described in terms of demographic variables (eg, age, gender, and race), comorbidities, disease prevalence, disease severity, and disease subtypes.
Imaging-based ML studies have additional factors to consider in the distribution, such as the spectrum of imaging manifestations of disease, the spectrum of normal appearances, and the imaging equipment and protocols used (Table 3). An imbalance in any of these variables may lead to selection bias if the dataset is not properly stratified.
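The dataset-description step above can be sketched as a simple summary table of key variables, which readers could then compare against their own population. The cases and variable names below are invented for illustration.

```python
from collections import Counter

# Hypothetical case records; a real dataset description would cover the
# demographic, clinical, and imaging variables discussed in the text.
cases = [
    {"age_group": "18-40", "sex": "F", "setting": "outpatient"},
    {"age_group": "41-65", "sex": "M", "setting": "inpatient"},
    {"age_group": "41-65", "sex": "F", "setting": "inpatient"},
    {"age_group": "66+",   "sex": "M", "setting": "inpatient"},
]

def describe(cases, variable):
    """Return the proportion of cases in each category of a variable."""
    counts = Counter(case[variable] for case in cases)
    total = len(cases)
    return {value: round(n / total, 2) for value, n in counts.items()}

for variable in ("age_group", "sex", "setting"):
    print(variable, describe(cases, variable))
```

A reader at an outpatient imaging center, for example, could see at a glance that this hypothetical dataset is dominated by inpatients and judge the match to their own population accordingly.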
Different spectrums of imaging manifestations in the study and target populations may lead to improper weighting of one imaging manifestation over another. For example, the spectrum of radiographic manifestations of osteomyelitis may include osseous erosion, localized demineralization, and localized periosteal reaction. An algorithm intended to detect osteomyelitis on radiographs may not be sensitive to all imaging manifestations of osteomyelitis if the training set contained mostly radiographs depicting only osseous erosion.
The spectrum of normal or nondisease appearances contained in the training data is just as crucial as the spectrum of imaging manifestations of disease in the data. An appropriate spectrum of nondisease may include images of similar-appearing disease processes, artifacts, and external medical devices. For example, in a study by Singh et al (4), the presence of infusion port catheters led to the detection of numerous false-positive lung abnormalities when the Qure.ai (Mumbai, India) chest radiograph algorithm, which was trained on millions of chest radiographs from hospitals in India, was installed at their institution in Boston, Mass. This lack of external validity occurred because the training data had few infusion ports, so the algorithm was unable to learn the distinction between a true lung opacity and a medical device.
Factors involving imaging equipment and protocols, such as the proportion of portable radiographs in a chest radiograph dataset, may also lead to selection bias and a lack of external validity. For example, a study by Zech et al (3) found that ML algorithms could accurately classify chest radiographs according to their institution of origin and even according to different services (inpatient service versus emergency department) within the same institution. If pneumothorax occurs with a higher prevalence in the emergency department than in the inpatient service, the idiosyncratic imaging features associated with each clinical service might be weighted more heavily than the actual disease-related features.
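The Zech et al finding suggests a simple sanity check, sketched below with invented numbers: if even a trivial threshold classifier can identify the institution of origin from a non-disease image property (here, a hypothetical mean pixel intensity), that property is available to the ML algorithm as a shortcut instead of the pathology itself.

```python
# Hypothetical mean pixel intensities for images from two sites; the
# systematic brightness difference stands in for equipment/protocol
# idiosyncrasies such as portable versus fixed radiography.
site_a = [0.30, 0.32, 0.28, 0.35]   # eg, portable radiographs, darker
site_b = [0.60, 0.58, 0.65, 0.62]   # eg, fixed equipment, brighter

# midpoint between the two site means as a trivial decision threshold
threshold = (sum(site_a) / len(site_a) + sum(site_b) / len(site_b)) / 2
predict_site = lambda intensity: "A" if intensity < threshold else "B"

correct = sum(predict_site(x) == "A" for x in site_a) + \
          sum(predict_site(x) == "B" for x in site_b)
site_accuracy = correct / (len(site_a) + len(site_b))
print("site-identification accuracy:", site_accuracy)   # 1.0: a red flag
```

Perfect site identification from a feature unrelated to disease is a warning that a diagnostic algorithm trained on these data could exploit the same shortcut.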
The recently published Checklist for Artificial Intelligence in Medical Imaging (CLAIM) (13) includes multiple items that cover both the direct and indirect methods that we describe to address selection bias. The direct method of testing algorithm performance in an external dataset is represented by CLAIM item 32, and the indirect method of including detailed dataset characteristics and eligibility criteria is addressed in items 7 and 8. However, not all characteristics associated with potential selection bias in imaging-based ML datasets are specified in the checklist. Not mentioned are imaging-specific factors (Table 3) such as the spectrum of pathologic imaging manifestations, spectrum of normal or nondisease appearances, and imaging equipment and protocol used.
Unique Challenges of Selection Bias in ML
The “black box” nature of ML algorithms leads to an additional challenge in identifying potential selection bias and assessing the external validity of these algorithms. Popular ML methods like deep learning do not require radiologists to identify relevant imaging features in the data. Instead, ML algorithms are able to identify distinguishing features and weight these features automatically.
The black box refers to our inability to be sure of which features the algorithm is using to make a diagnosis. The training data may be unbiased with respect to all of the demographic and clinical variables previously mentioned, but they may be biased with respect to unperceived or unexpected imaging features that the ML algorithm extracts from the training set. As a result, the ML algorithm may exhibit poor external validity. For example, consider a situation in which all breast cancer cases in a mammographic dataset happen to have the same lead image marker. The presence of the image marker will become incorporated into and heavily weighted in the algorithm, although this feature is not actually relevant to the diagnosis of breast cancer. While techniques such as class activation mapping (14) can reveal some of the more obvious anomalies, more subtle high-level features that deep learning algorithms discover and employ remain unknown, making identification of possible sources of selection bias especially problematic.
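The core arithmetic behind class activation mapping is simple enough to sketch: the map is a weighted sum of the final convolutional feature maps, using the output-layer weights for the class of interest. The tiny feature maps and weights below are synthetic stand-ins for the hundreds of channels in a real network.

```python
def class_activation_map(feature_maps, class_weights):
    """feature_maps: list of HxW grids (one per channel);
    class_weights: one output-layer weight per channel for one class.
    Returns the HxW weighted sum, highlighting regions that drove the class score."""
    h, w = len(feature_maps[0]), len(feature_maps[0][0])
    cam = [[0.0] * w for _ in range(h)]
    for weight, fmap in zip(class_weights, feature_maps):
        for i in range(h):
            for j in range(w):
                cam[i][j] += weight * fmap[i][j]
    return cam

maps = [
    [[1.0, 0.0], [0.0, 0.0]],   # channel active at top left
    [[0.0, 0.0], [0.0, 1.0]],   # channel active at bottom right
]
weights = [0.9, 0.1]            # this class relies mostly on channel 0
print(class_activation_map(maps, weights))  # [[0.9, 0.0], [0.0, 0.1]]
```

If the highlighted region turned out to be a lead marker rather than the breast tissue, the map would expose exactly the kind of shortcut described above.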
Another significant challenge in controlling selection bias in medical imaging ML studies is the increasing availability and utilization of large public-access imaging datasets. For technical reasons or confidentiality, these large datasets often provide few of the demographic and clinical variables that would be necessary to detect selection bias and assess external validity.
Conclusion
Understanding the concepts and terminology from clinical epidemiology not only highlights the challenges that ML applications face in assessing real-world clinical populations, but it also helps suggest steps to meet these challenges. When selection bias is present in training data, the resultant external validity of the ML algorithm is adversely affected, and performance may be diminished for new cases drawn from the same target population. This limitation can be overcome through the use of external test data, explicit definition of target populations, and detailed descriptions of the clinical, demographic, and other characteristics of training data.
Presented as an education exhibit at the 2019 RSNA Annual Meeting.
For this journal-based SA-CME activity, the authors, editor, and reviewers have disclosed no relevant relationships.
References
- 1. A Dictionary of Epidemiology. Oxford, England: Oxford University Press, 2001.
- 2. Basic Epidemiology. Geneva, Switzerland: World Health Organization, 2006.
- 3. Variable generalization performance of a deep learning model to detect pneumonia in chest radiographs: A cross-sectional study. PLoS Med 2018;15(11):e1002683.
- 4. Deep learning in chest radiography: Detection of findings and presence of change. PLoS One 2018;13(10):e0204155.
- 5. Biases in Randomized Trials: A Conversation Between Trialists and Epidemiologists. Epidemiology 2017;28(1):54–59.
- 6. Validity and Generalizability in Epidemiologic Studies. In: Encyclopedia of Biostatistics. Chichester, England: Wiley, 2005.
- 7. Cochrane Statistical Methods Group. The Cochrane Collaboration’s tool for assessing risk of bias in randomised trials. BMJ 2011;343:d5928.
- 8. Assessing Radiology Research on Artificial Intelligence: A Brief Guide for Authors, Reviewers, and Readers—From the Radiology Editorial Board. Radiology 2020;294(3):487–489.
- 9. Design characteristics of studies reporting the performance of artificial intelligence algorithms for diagnostic analysis of medical images: Results from recently published papers. Korean J Radiol 2019;20(3):405–410.
- 10. Methodologic guide for evaluating clinical performance and effect of artificial intelligence technology for medical diagnosis and prediction. Radiology 2018;286(3):800–809.
- 11. Artificial intelligence for medical image analysis: A guide for authors and reviewers. AJR Am J Roentgenol 2019;212(3):513–519.
- 12. Peering into the black box of artificial intelligence: Evaluation metrics of machine learning methods. AJR Am J Roentgenol 2019;212(1):38–43.
- 13. Checklist for Artificial Intelligence in Medical Imaging (CLAIM): A Guide for Authors and Reviewers. Radiol Artif Intell 2020;2(2):e200029.
- 14. Learning Deep Features for Discriminative Localization. Proc IEEE Comput Soc Conf Comput Vis Pattern Recognit 2016;2921–2929.
Article History
Received: Mar 15 2020
Revision requested: May 7 2020
Revision received: May 22 2020
Accepted: May 28 2020
Published online: Sept 25 2020
Published in print: Nov 2020