Range of Radiologist Performance in a Population-based Screening Cohort of 1 Million Digital Mammography Examinations
There is great interest in developing artificial intelligence (AI)–based computer-aided detection (CAD) systems for use in screening mammography. Comparative performance benchmarks from true screening cohorts are needed.
To determine the range of human first-reader performance measures within a population-based screening cohort of 1 million screening mammograms to gauge the performance of emerging AI CAD systems.
Materials and Methods
This retrospective study consisted of all screening mammograms obtained in women aged 40–74 years in Stockholm County, Sweden, who underwent screening with full-field digital mammography between 2008 and 2015. There were 110 interpreting radiologists, of whom 24 were defined as high-volume readers (ie, those who interpreted at least 5000 screening mammograms annually). A true-positive finding was defined as the presence of a pathology-confirmed cancer within 12 months. Performance benchmarks included sensitivity and specificity, examined per quartile of radiologists’ performance. First-reader sensitivity was determined for each tumor subgroup, overall and by quartile of high-volume reader sensitivity. Screening outcomes were examined based on the first reader’s sensitivity quartile with 10 000 screening mammograms per quartile. Linear regression models were fitted to test for a linear trend across quartiles of performance.
A total of 418 041 women (mean age, 54 years ± 10 [standard deviation]) were included, and 1 186 045 digital mammograms were evaluated, with 972 899 assessed by high-volume readers. Overall sensitivity was 73% (95% confidence interval [CI]: 69%, 77%), and overall specificity was 96% (95% CI: 95%, 97%). The mean values per quartile of high-volume reader performance ranged from 63% to 84% for sensitivity and from 95% to 98% for specificity. The sensitivity difference was largest for basal cancers, with the least sensitive and most sensitive high-volume readers detecting 53% and 89% of cancers, respectively (P < .001).
Benchmarks showed a wide range of performance differences between high-volume readers. Sensitivity varied by tumor characteristics.
© RSNA, 2020
In a screening cohort, overall sensitivity of breast cancer detection ranged from 63% in the least sensitive quartile to 84% in the most sensitive quartile of high-volume radiologists; the largest sensitivity difference between quartiles occurred with basal cancers.
■ Sensitivity was between 63% for the least sensitive and 84% for the most sensitive quartile of high-volume readers (those who interpreted at least 5000 screening mammograms annually).
■ Specificity was between 95% for the least specific quartile of high-volume readers and 98% for the most specific quartile.
■ The sensitivity difference was largest for basal cancers, with the least sensitive and most sensitive high-volume readers detecting 53% and 89% of cancers, respectively (P < .001).
Breast cancer is the most common cancer type among women in the Western world, and its incidence is increasing (1). The mortality reduction resulting from mammography screening was estimated at 20% in a meta-analysis and up to 40% in a single study (2,3). For women who undergo biennial screening, around 28% of breast cancers are interval cancers (4). These interval cancers are more aggressive and have a higher mortality rate (5).
Even if breast cancer screening programs deliver important health benefits, the overall costs are substantial owing to the large number of participants. The estimated cost of mammography screening in the United States in 2010 was $7.8 billion (6). To enhance screening performance, computer-aided detection (CAD) has been developed to facilitate detection of tumors in screening mammograms. However, traditional CAD, based on human-specified formulas for identification of tumors, results in many false-positive findings per image (7). There is increasing optimism for use of image analysis methods based on artificial intelligence (AI)—specifically deep neural networks—to reach human-level performance in the detection of suspicious findings on mammograms (8,9).
AI CAD systems might increase diagnostic performance and lower costs through use as stand-alone readers. However, before AI CAD systems for screening mammography can be clinically implemented, they must be validated and their performance compared with that of radiologists. Since the introduction of full-field digital mammography, there have been a few studies of radiologist performance in large screening cohorts, but none divided performance into quartiles, and none was performed in a European setting (10–13).
Our aim with this study was to establish benchmarks of human first-reader performance, in a geographically defined population-based screening cohort of around 1 million screening mammograms, for comparison with stand-alone AI CAD systems intended to replace the first reader.
Materials and Methods
The ethics review board approved the research for this retrospective study and waived the need for written informed consent.
Our retrospective multicenter cohort study was performed with anonymized full-field digital mammograms obtained during screening mammography. The study population was the Swedish Cohort of Screen-Aged Women, which consisted of all women aged 40–74 in Stockholm County who were invited for screening mammograms from 2008 to 2015. Screening was performed every 18–24 months, with a 70% participation rate. All participants in the current study had been included in a prior publication that described the collection method of the Swedish Cohort of Screen-Aged Women and gave an overview of the population characteristics (14). The prior article dealt with an overall description of the data set, whereas here we report relevant benchmarks after having examined the performance of high-volume readers. Data on cancer diagnoses including tumor characteristics and radiologic assessments were obtained through linkage with the Regional Cancer Centre Stockholm Gotland breast cancer quality and screening registry; the Swedish personal identification numbers were used.
We included all consecutive screening mammographic examinations performed between January 1, 2008, and September 30, 2015. Women were considered to have had breast cancer at the time of screening if the breast cancer quality registry indicated the diagnosis within 12 months after the mammogram had been obtained.
High- versus Low-Volume Readers
There were 110 known interpreting radiologists. We distinguished between low- and high-volume readers based on the number of annual screening mammograms: those who read 5000 or more screening mammograms in at least 1 year within the time range of our data set were considered high-volume readers, and all others were considered low-volume readers.
Mammography Screening System
The general mammography screening system in Sweden includes two-view mammography of each breast assessed with double reading. If a suspicious finding is observed by one or both radiologists, the mammogram is flagged for consensus discussion. If the consensus discussion concludes that there is radiologic suspicion for an abnormal lesion, the woman is recalled. Another reason for recall is if the woman reports breast symptoms at the time of screening. The screening assessments were performed at five different institutions—two private and three public. The total number of screening mammograms per institution ranged from 151 360 to 344 306. Mammography devices from Hologic (Marlborough, Mass), Philips (Amsterdam, the Netherlands), Sectra (Linköping, Sweden), Fujifilm (Minato, Japan), GE Healthcare (Boston, Mass), and Siemens (Munich, Germany) were used.
We determined performance levels for each radiologist for the following measures: sensitivity, specificity, abnormal interpretation rate, cancer detection rate, false-negative rate, accuracy, and positive predictive value. Measures based on first-reader assessments form the basis for the main tables, whereas measures based on second-reader and consensus assessments can be found in Table E1 (online). For the false-negative screenings, the cancer was detected either by the second reader or clinically (ie, interval cancer). In Sweden, clinical detection is mainly based on self-detection because clinical breast examination of healthy women is rarely performed. The mean numbers reported in all tables for first and second readers were calculated as the average of radiologist-level measures, whereas the mean number reported for the consensus discussion was calculated as the overall average (because the consensus discussion involved various combinations of radiologists). To make a more robust estimate of the individual measures, we grouped into a single-radiologist category all examinations by radiologists who identified fewer than 10 cancers among the mammograms they had assessed. We reported performance benchmarks overall and for each quartile of high-volume readers, wherein quartile one (Q1) included the worst-performing ones and quartile four (Q4) included the best-performing ones for each metric. Minimum and maximum values per quartile were determined. Based on a screening population of 10 000 women, we calculated screening outcomes separately for each quartile of high-volume readers grouped according to sensitivity levels to enable future comparisons with AI CAD systems operating at different sensitivity levels. Finally, to facilitate nuanced future evaluations of AI CAD systems, we determined specific sensitivity levels for various tumor characteristics: molecular subtype and histologic type including tumor size and invasiveness. 
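The quartile-based benchmarking described above can be sketched as follows; this is a minimal illustration, and the per-reader counts in it are hypothetical, not the study data.

```python
# Sketch of the quartile benchmarking described above: compute a
# radiologist-level metric, sort readers, split them into quartiles
# (Q1 = worst performing, Q4 = best performing), and report the mean
# metric per quartile. The reader counts below are illustrative only.

def reader_sensitivity(tp, fn):
    """Sensitivity = true-positive cancers / all cancers assessed by this reader."""
    return tp / (tp + fn)

def quartile_means(metrics):
    """Sort radiologist-level metrics, split into four quartiles, and
    return the mean of each quartile (Q1 worst to Q4 best)."""
    ordered = sorted(metrics)
    n = len(ordered)
    bounds = [round(i * n / 4) for i in range(5)]
    return [sum(ordered[bounds[i]:bounds[i + 1]]) / (bounds[i + 1] - bounds[i])
            for i in range(4)]

# Hypothetical per-reader (true positive, false negative) counts
readers = [(50, 30), (60, 25), (70, 20), (80, 15),
           (90, 12), (95, 10), (100, 8), (110, 6)]
sens = [reader_sensitivity(tp, fn) for tp, fn in readers]
q1, q2, q3, q4 = quartile_means(sens)
print(f"Q1 mean sensitivity: {q1:.2f}, Q4 mean sensitivity: {q4:.2f}")
```

Note that, as in the article, the per-quartile means are averages of radiologist-level measures rather than pooled case-level rates.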
The tumor size was based on pathologic measurements and was classified as minimal (≤10 mm), small (11–19 mm), or large (≥20 mm). Invasiveness was classified based on pathologic analysis as either in situ only or invasive (with possible in situ components). The histologic type was divided into ductal, lobular, or other. Ductal type included ductal-only tumors and mixed tumors with a ductal component. The molecular subtypes were defined according to the St Gallen classification (15).
Statistical Analysis
Computer software (Stata Statistical Software, release 15.1; StataCorp, College Station, Tex) was used for all statistical analyses. All statistical tests were two sided. The level for statistical significance was set at P < .05. To test for an association across quartiles of radiologist performance, we fitted linear regression models with the quartile as the predictor. The 95% confidence interval (CI) for sensitivity and specificity was estimated through bootstrapping.
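The bootstrapped CIs can be illustrated with a minimal percentile-bootstrap sketch; this is not the authors' Stata code, and the 73-of-100 example counts are hypothetical.

```python
import random

def bootstrap_ci(outcomes, n_boot=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap CI for sensitivity.
    outcomes: one entry per cancer case, 1 if detected and 0 if missed.
    Resample cases with replacement, recompute sensitivity each time,
    and take the alpha/2 and 1 - alpha/2 percentiles."""
    rng = random.Random(seed)
    n = len(outcomes)
    stats = sorted(
        sum(rng.choice(outcomes) for _ in range(n)) / n
        for _ in range(n_boot)
    )
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# Hypothetical example: 73 of 100 cancers detected (point estimate 0.73)
cases = [1] * 73 + [0] * 27
lo, hi = bootstrap_ci(cases)
print(f"sensitivity 0.73, 95% CI ({lo:.2f}, {hi:.2f})")
```

The same resampling scheme applies to specificity, with one entry per cancer-free examination.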
Results
Participant and Mammogram Characteristics
From 2008 to 2015, 504 566 women were invited to undergo screening; 83 225 did not participate and thus were not part of our study population. We excluded 23 033 screening mammograms in 3300 women owing to unknown radiologist identification. The final study population included 1 186 045 screening mammograms in 418 041 women; 972 899 mammograms were read by high-volume readers, and the remaining 213 146 were read by low-volume readers. The study population flowchart is presented in Figure 1. Participant and cancer characteristics are shown in Table 1. Among women who underwent screening, 4723 were diagnosed with breast cancer at screening or within 12 months thereafter. There were 3514 true-positive screenings, 1209 false-negative screenings, 1 138 619 true-negative screenings, and 41 969 false-positive screenings. Example images are shown in Figure 2. The mean age at screening was 54 years ± 9.5 (standard deviation), and the mean age at diagnosis was 59 years ± 10.1.
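As a sanity check, the pooled (case-level) accuracy implied by these counts can be computed directly. The article's reported 73% and 96% are averages of radiologist-level measures, so the pooled values differ slightly.

```python
# Counts reported in the study population (screenings over 12-month follow-up)
tp, fn = 3514, 1209              # true-positive and false-negative screenings
tn, fp = 1_138_619, 41_969       # true-negative and false-positive screenings

# Pooled case-level sensitivity and specificity
sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
print(f"pooled sensitivity: {sensitivity:.1%}")  # 74.4%
print(f"pooled specificity: {specificity:.1%}")  # 96.4%
```

These pooled figures sit within a point or so of the radiologist-averaged 73% and 96% reported in the tables.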
Reader Performance Measures
Table 2 reports first-reader performance measures. There were 24 high-volume readers and 86 low-volume readers (Table E3 [online]). For high-volume readers, the maximum, mean, and minimum annualized volumes were 19 850, 8243, and 3168 mammograms, respectively. Sensitivities were 73% (95% CI: 69%, 77%), 75% (95% CI: 67%, 81%), and 73% (95% CI: 68%, 77%) for all, low-volume, and high-volume readers, respectively. The corresponding values for specificity were 96% (95% CI: 95%, 97%), 97% (95% CI: 94%, 97%), and 96% (95% CI: 95%, 97%), respectively. With regard to the performance of high-volume readers, the mean values per quartile ranged from 63% to 84% for sensitivity and from 95% to 98% for specificity. Of the cancers missed by the first reader, 44% were detected by the second reader and 56% were detected clinically within 12 months. Performance measures for second readers and for the consensus discussion can be found in Table E1 (online); sensitivities were 80% and 85% for second readers and consensus discussion, respectively. Performance is further stratified according to mammographic density and age in Table E2 (online).
In Table 3, we examine the screening outcomes according to sensitivity quartile of the first reader based on 10 000 screening mammograms per quartile. For the most sensitive radiologists (Q4), 45 cancers were diagnosed at screening or within 12 months thereafter, of which 37 (82%) were detected and seven were missed by the first reader (per 10 000 screened). For the least sensitive radiologists (Q1), there were 38 cancers, of which 24 (63%) were detected and 14 were missed (per 10 000 screened). There were 497 and 281 abnormal interpretations for Q4 and Q1, respectively (per 10 000 screenings).
Reader Sensitivity for Each Tumor Subgroup
First-reader sensitivity for each tumor subgroup is shown in Table 4. The results are presented overall and by quartile of sensitivity of high-volume readers (each quartile contains the same radiologists as those in Table 3). The overall sensitivity for high-volume readers was 77% for ductal cancers and 73% for lobular cancers, 76% for invasive cancers and 83% for in situ only (any grade) cancers, 77% for luminal A, and 69% for basal cancers. Analysis by quartile of high-volume readers showed that the sensitivities for the most (Q4) and the least (Q1) sensitive high-volume readers were 85% and 67%, respectively, for ductal cancers; 84% and 63%, respectively, for lobular cancers; 85% and 67%, respectively, for all invasive cancers; 93% and 75%, respectively, for in situ cancers only (any grade); 99% and 75%, respectively, for high-grade in situ cancers only; 85% and 67%, respectively, for luminal A cancers; and 89% and 53%, respectively, for basal cancers (P < .001 for the comparison of sensitivity levels between quartiles for each of the aforementioned subgroups).
Discussion
Our multicenter study provided performance benchmarks for first readers of screening mammograms in Sweden based on more than 1 million full-field digital mammograms acquired between 2008 and 2015. For first readers, second readers, and consensus discussion, the sensitivities were 73%, 80%, and 85%, respectively, and the specificities were 96%, 96%, and 98%, respectively. For high-volume (≥5000 mammograms per year) first readers, sensitivity ranged from 63% to 84% for the least sensitive and most sensitive quartile of radiologists, respectively. In a standardized screening population of 10 000 women, the most sensitive quartile of radiologists and the least sensitive quartile would miss seven and 14 cancers, respectively, that were later detected by the second reader or through clinical detection during 12 months after screening. The most sensitive radiologists were consistently better across tumor subgroups.
When our high-volume first-reader results are compared with the most recent update from the Breast Cancer Surveillance Consortium, the sensitivity is lower (73% vs 87%) and the specificity is higher (96% vs 89%) (13). Even though the first-reader sensitivity is markedly lower, the entire double-reading process shows a sensitivity of 85% after consensus discussion. Higher sensitivity and lower specificity in the United States compared with European countries is well known and has been reported in previous studies (16). The main analysis of this study was performed without accounting for the repeated observations over time for each woman. As a secondary analysis, we fitted a generalized estimating equation that took repeated measures into account. This showed only a minimal difference in estimated sensitivity (lower by 0.3 percentage points) and no change in specificity.
Both sensitivity and specificity were similar between low- and high-volume readers. There have been conflicting data on the association between the experience level of radiologists and their accuracy in interpreting mammograms (17–19). However, we observed large differences in sensitivity between individual high-volume readers. Major differences in mammography assessment accuracy were also reported in a recent study that showed a range of sensitivities between 76% and 84% and a range of specificities between 49% and 79% (9).
Our analysis revealed a lower number of first-reader false-negative screenings (ie, potential interval cancers) in readers operating at a higher sensitivity level at the expense of a higher rate of abnormal interpretations (ie, potential recalls). These findings are in line with the association between increased recall rate and decreased interval cancer rate determined in a prior study (20).
The aforementioned studies determined performance benchmarks, but none examined sensitivity differences based on tumor characteristics. The top quartile of radiologists achieved a sensitivity level of 99% for high-grade in situ cancers compared with 85% for invasive cancers. The higher sensitivity for in situ cancers might be explained by their mammographic manifestation as calcifications, which are relatively easy to see, as well as by their frequent absence of clinical manifestations, which makes them less likely to be detected clinically (21–23). Lobular cancers are detected at a larger size than ductal cancers, both at screening and clinically on the basis of symptoms. The sensitivity figures reported in our study concern the ability of screening mammography to facilitate detection of cancers otherwise detected clinically within 12 months. Cancers that are hard to detect clinically, such as lobular cancers or small cancers, will have an inflated sensitivity measure (because the cancers will not cause clinical symptoms until after 12 months have elapsed). Our observation that first-reader sensitivity was lower for larger tumors than for smaller tumors might in part be explained by larger tumors being more likely to be self-detected by women during the 12-month interval after a negative screening result. In our study, the most sensitive radiologists were better across most tumor characteristics. The largest performance difference was observed for the basal molecular subtype, for which the least sensitive quartile of radiologists detected only 53% of cancers. Basal cancers often have benign or indeterminate mammographic findings, potentially making them easier to miss (24).
There is much interest in developing and implementing AI-based CAD systems (9). Our results showed that the performance requirements for an AI CAD system are considerably higher for replacing both double reading and consensus discussion than for merely replacing the first reader.
Our study had some limitations. The generalizability was limited by the ethnic composition of Stockholm, screening age range of 40–74 years, biennial screening schedule, double-reading setting, breast density, and some missing data regarding tumor characteristics, such as tumor histologic and molecular subtypes. Also, tumor size was based on pathologic measurements without consideration of neoadjuvant therapy, so the size at the time of screening could have been larger than at the time of pathologic postoperative assessment. This should not affect the comparison across reader quartiles but could affect the absolute performance measures. Our study was conducted based on two-dimensional screening mammography, and the corresponding metrics in a three-dimensional mammographic setting are likely to be different.
In conclusion, our study determined a range of screening mammography benchmarks, which can be useful in evaluating the performance and choosing the operating point of stand-alone artificial intelligence–based computer-aided detection systems.

Disclosures of Conflicts of Interest: M.S. disclosed no relevant relationships. K.D. disclosed no relevant relationships. M.E. disclosed no relevant relationships. P.L. disclosed no relevant relationships. F.S. Activities related to the present article: disclosed no relevant relationships. Activities not related to the present article: is a consultant for Collective Minds Radiology. Other relationships: disclosed no relevant relationships.
Author contributions: Guarantor of integrity of entire study, F.S.; study concepts/study design or data acquisition or data analysis/interpretation, all authors; manuscript drafting or manuscript revision for important intellectual content, all authors; approval of final version of submitted manuscript, all authors; agrees to ensure any questions related to the work are appropriately resolved, all authors; literature research, M.S., F.S.; clinical studies, M.S., P.L., F.S.; statistical analysis, F.S., M.S., M.E.; and manuscript editing, M.S., F.S.
Supported by Stockholm County Council (Dnr 20170802).
- 1. International Variation in Female Breast Cancer Incidence and Mortality Rates. Cancer Epidemiol Biomarkers Prev 2015;24(10):1495–1506.
- 2. Screening for breast cancer in 2018-what should we be doing today? Curr Oncol 2018;25(Suppl 1):S115–S124.
- 3. The benefits and harms of breast cancer screening: an independent review. Br J Cancer 2013;108(11):2205–2240.
- 4. A pooled analysis of interval cancer rates in six European countries. Eur J Cancer Prev 2010;19(2):87–93.
- 5. Risk factors and tumor characteristics of interval cancers by mammographic density. J Clin Oncol 2015;33(9):1030–1037.
- 6. Aggregate cost of mammography screening in the United States: comparison of current practice and advocated guidelines. Ann Intern Med 2014;160(3):145.
- 7. Diagnostic Accuracy of Digital Screening Mammography With and Without Computer-Aided Detection. JAMA Intern Med 2015;175(11):1828–1837.
- 8. Predicting Breast Cancer by Applying Deep Learning to Linked Health Records and Mammograms. Radiology 2019;292(2):331–342.
- 9. Stand-Alone Artificial Intelligence for Breast Cancer Detection in Mammography: Comparison With 101 Radiologists. J Natl Cancer Inst 2019;111(9):916–922.
- 10. Variability in interpretive performance at screening mammography and radiologists’ characteristics associated with accuracy. Radiology 2009;253(3):641–651.
- 11. Performance benchmarks for screening mammography. Radiology 2006;241(1):55–66.
- 12. Recall and Cancer Detection Rates for Screening Mammography: Finding the Sweet Spot. AJR Am J Roentgenol 2017;208(1):208–213.
- 13. National Performance Benchmarks for Modern Screening Digital Mammography: Update from the Breast Cancer Surveillance Consortium. Radiology 2017;283(1):49–58.
- 14. A Multi-million Mammography Image Dataset and Population-Based Screening Cohort for the Training and Evaluation of Deep Neural Networks-the Cohort of Screen-Aged Women (CSAW). J Digit Imaging 2020;33(2):408–413.
- 15. Personalizing the treatment of women with early breast cancer: highlights of the St Gallen International Expert Consensus on the Primary Therapy of Early Breast Cancer 2013. Ann Oncol 2013;24(9):2206–2223.
- 16. Cross-national comparison of screening mammography accuracy measures in U.S., Norway, and Spain. Eur Radiol 2016;26(8):2520–2528.
- 17. Association of volume and volume-independent factors with accuracy in screening mammogram interpretation. J Natl Cancer Inst 2003;95(4):282–290.
- 18. Association between radiologists’ experience and accuracy in interpreting screening mammograms. BMC Health Serv Res 2008;8(1):91.
- 19. Improving the accuracy of mammography: volume and outcome relationships. J Natl Cancer Inst 2002;94(5):369–375.
- 20. Association between Screening Mammography Recall Rate and Interval Cancers in the UK Breast Cancer Service Screening Program: A Cohort Study. Radiology 2018;288(1):47–54.
- 21. Breast tumor characteristics as predictors of mammographic detection: comparison of interval- and screen-detected cancers. J Natl Cancer Inst 1999;91(23):2020–2028.
- 22. Detection of ductal carcinoma in situ in women undergoing screening mammography. J Natl Cancer Inst 2002;94(20):1546–1554.
- 23. Ductal carcinoma in situ: mammographic findings and clinical implications. Radiology 1989;170(2):411–415.
- 24. Multimodality imaging of triple receptor-negative tumors with mammography, ultrasound, and MRI. AJR Am J Roentgenol 2010;194(4):1160–1166.
Article History
Received: Oct 3 2019
Revision requested: Nov 4 2019
Revision received: May 25 2020
Accepted: June 5 2020
Published online: July 28 2020
Published in print: Oct 2020