AI-based Strategies to Reduce Workload in Breast Cancer Screening with Mammography and Tomosynthesis: A Retrospective Evaluation
Background
The workflow of breast cancer screening programs could be improved given the high workload and the high number of false-positive and false-negative assessments.
Purpose
To evaluate if using an artificial intelligence (AI) system could reduce workload without reducing cancer detection in breast cancer screening with digital mammography (DM) or digital breast tomosynthesis (DBT).
Materials and Methods
Consecutive screening-paired and independently read DM and DBT images acquired from January 2015 to December 2016 were retrospectively collected from the Córdoba Tomosynthesis Screening Trial. The original reading settings were single or double reading of DM or DBT images. An AI system computed a cancer risk score for DM and DBT examinations independently. Each original setting was compared with a simulated autonomous AI triaging strategy (the least suspicious examinations for AI are not human-read; the rest are read in the same setting as the original, and examinations not recalled by radiologists but graded as very suspicious by AI are recalled) in terms of workload, sensitivity, and recall rate. The McNemar test with Bonferroni correction was used for statistical analysis.
Results
A total of 15 987 DM and DBT examinations (which included 98 screening-detected and 15 interval cancers) from 15 986 women (mean age ± standard deviation, 58 years ± 6) were evaluated. In comparison with double reading of DBT images (568 hours needed, 92 of 113 cancers detected, 706 recalls in 15 987 examinations), AI with DBT would result in 72.5% less workload (P < .001, 156 hours needed), noninferior sensitivity (95 of 113 cancers detected, P = .38), and 16.7% lower recall rate (P < .001, 588 recalls in 15 987 examinations). Similar results were obtained for AI with DM. In comparison with the original double reading of DM images (222 hours needed, 76 of 113 cancers detected, 807 recalls in 15 987 examinations), AI with DBT would result in 29.7% less workload (P < .001), 25.0% higher sensitivity (P < .001), and 27.1% lower recall rate (P < .001).
Conclusion
Digital mammography and digital breast tomosynthesis screening strategies based on artificial intelligence systems could reduce workload up to 70%.
Published under a CC BY 4.0 license.
Summary
Digital mammography and digital breast tomosynthesis screening strategies based on artificial intelligence systems could reduce workload up to 70% without reducing sensitivity by 5% or more.
Key Results
■ In a retrospective study of 15 987 mammograms, artificial intelligence (AI) reduced screening workload up to 70% for both digital mammography (DM)– and digital breast tomosynthesis (DBT)–based screening programs without reducing sensitivity by 5% or more.
■ Using AI to transition from DM screening to DBT screening would yield a reduction of 30% in workload, a 25% improvement in sensitivity, and a reduction of 27% in recall rate.
Earlier detection by means of mammography-based screening results in a 20%–35% reduction in breast cancer mortality (1,2). Consequently, breast cancer screening programs have been established in many countries to diagnose the disease as early as possible.
However, mammography-based screening programs are subject to some limitations. First, mammography sensitivity is lower with higher breast density (3). This leads to up to 20%–30% of breast cancers not being detected during screening and later manifesting symptomatically as interval cancers (4). Second, it is estimated that at least one out of three women participating in screening will have a false-positive recall during her lifetime (5), which not only adds harm to women but also increases the cost and workload of health care systems.
In general, the screening workload is high. The vast majority of mammograms in asymptomatic women will have a normal outcome, and no further action will be taken (6). Double reading detects between 9% and 20% more breast cancers, but it also results in more false-positive recalls and adds extra reading workload to screening (4).
More recently, digital breast tomosynthesis (DBT) has been shown to improve breast cancer screening detection rates by 30%–90% compared with digital mammography (DM), with a diverse impact on recall rate (7–10). Nevertheless, reading a DBT examination approximately doubles reading time for radiologists, which could be a barrier for implementing DBT as a screening modality in some settings (7,10–13).
In recent years, deep learning–based artificial intelligence (AI) systems have been quickly evolving in the field of breast imaging, surpassing the performance and clinical value of traditional computer-aided detection systems for mammography (14). Some of these systems can automatically detect breast cancer in two-dimensional mammograms and DBT images, with a performance level comparable with that of radiologists (15–17).
Some studies have investigated whether AI systems can be used in screening programs to reduce radiologists’ workload without negatively affecting the quality of outcomes (18–20). However, these are limited and have only investigated the use of AI to reduce workload in DM-based screening programs.
In this study, we retrospectively evaluate how AI could be used to reduce workload without reducing cancer detection in different screening settings, whether the screening is based on single or double reading of DM or DBT images.
Materials and Methods
This retrospective study was compliant with the Health Insurance Portability and Accountability Act. The study included anonymized and retrospectively collected screening examinations. Women were included from a single institution. The retrospective use of these anonymized data was approved by our hospital’s institutional review board, and the requirement for informed consent was waived. The study was not financially supported by any grant or company. ScreenPoint Medical provided the software for the study. The authors who were not employees of or consultants for ScreenPoint Medical had control of the data and information submitted for publication at all times.
The data for this study were retrospectively collected from the Córdoba Tomosynthesis Screening Trial (21), a prospective screening trial that collected consecutive examinations in 16 067 women (one woman had a bilateral breast cancer and had two different examinations included in the original trial) who were screened with both two-view DM and two-view DBT between January 2015 and December 2016. The images were acquired with a Selenia Dimensions device (Hologic). This paired trial compared the screening performance of DM alone versus that of DBT (added to DM or with synthetic mammography) in terms of recall rate and cancer detection rate. These data only overlap with those reported in the original publication of the trial (21).
Age, breast density, histopathologic results of biopsy procedures, and interval cancer diagnosis were retrieved from the medical records. Race was not individually recorded, but the majority of the population was White. In addition to the original trial exclusion criteria (21), examinations were excluded if there were problems retrieving the mammograms from the picture archiving and communication system prior to the AI processing.
Original Screening Reading Settings
The DM and DBT images were independently read by four out of five dedicated breast radiologists (including J.L.R.P. and S.R.M., 15 and 3 years of experience, respectively) in four reading arms as described in Figure 1. The readers were blinded to the outcomes of the other arms.
Because of this interpretation setting, it was possible to compute the performance of the following settings, hereinafter referred to as original screening settings: double reading of DM images (if either reader recalls, the case is recalled), double reading of DBT images (if either reader recalls, the case is recalled), and single reading of DBT images (with synthetic mammograms).
AI System
The AI system used in this study (Transpara, version 1.6.0; ScreenPoint Medical) was previously investigated in other publications (17,19,20,22–24). This system uses deep learning to detect lesions suspicious for breast cancer on DM and DBT images. The most suspicious findings detected by the system are marked on every image and assigned a score between 1 and 100. Based on the maximum suspicious finding present in the examination, a proprietary conversion table generates an examination score from 1 to 10, indicating the increasing likelihood that a visible cancer is present on the mammogram. The DBT images and the DM images of each examination were independently processed by the AI system, resulting in two AI scores per examination: an AI-DM score and an AI-DBT score.
AI-based Screening Strategy
For each original setting, an autonomous AI triaging screening strategy was retrospectively simulated, aiming to reduce workload while maintaining sensitivity (detailed in Fig 2).
In this AI strategy, the least suspicious examinations for AI would not be human-read, and the rest of the examinations would be read as in the original setting (single or double reading of DM or DBT images). The least suspicious examinations were those assumed very likely normal, with an AI score of 7 or lower (approximately 70% of examinations according to the device specifications). This cutoff was chosen based on previous research (20) indicating that replacing double reading with single reading for these very likely normal cases would not reduce screening sensitivity by more than 5%. Additionally, the examinations not recalled by radiologists but within the 2% most suspicious examinations as graded by AI would be automatically recalled in order to potentially improve sensitivity (this cutoff was chosen taking into account radiologists’ recall rate at this site).
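The triage rule described above can be sketched in a few lines of Python. This is only an illustrative sketch: the names `Exam`, `triage`, and the `in_ai_top_2pct` flag are hypothetical and are not part of the AI system or the study software. The score cutoff of 7 and the automatic recall of the 2% most suspicious examinations are taken from the text.

```python
from dataclasses import dataclass

LOW_SUSPICION_MAX_SCORE = 7  # exams scored 1-7 are not human-read (from the text)


@dataclass
class Exam:
    ai_score: int             # AI examination score, 1-10
    in_ai_top_2pct: bool      # hypothetical flag: among the 2% most suspicious for AI
    radiologist_recall: bool  # recall decision in the original reading setting


def triage(exam: Exam) -> dict:
    """Return whether the exam is human-read and whether it is recalled."""
    if exam.ai_score <= LOW_SUSPICION_MAX_SCORE:
        # Least suspicious exams: no human reading, hence no recall.
        return {"human_read": False, "recalled": False}
    # Remaining exams are read as in the original setting; AI adds recalls
    # for the most suspicious exams that radiologists did not recall.
    recalled = exam.radiologist_recall or exam.in_ai_top_2pct
    return {"human_read": True, "recalled": recalled}
```

Note that in this sketch a radiologist recall on an examination scored 1–7 is discarded, which is how a cancer detected in the original setting can be missed by the simulated strategy.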
The output of the AI triaging was analyzed by a panel of radiologists (J.L.R.P., S.R.M., E.E.C., and M.Á.B., with 20, 8, 3, and 20 years of experience, respectively), and findings were considered true-positive only if the system correctly localized them and assigned them the highest suspicion score at the examination (on the region suspicion scale of 1–100).
Statistical Analysis
First, the distribution of AI examination scores in DM and DBT was computed for different groups of examinations based on ground truth (95% CIs were computed using the Wilson binomial method).
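For readers unfamiliar with the Wilson binomial method, a minimal sketch of the standard Wilson score interval follows. The function name `wilson_interval` is an assumption made for illustration, and the study's own software is not described in the article. With two events in 76 examinations, the sketch should reproduce, to rounding, the interval reported in the Results for DM-based screening-detected cancers scored 1–7 (0.72, 9.10).

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.959964) -> tuple:
    """Wilson score interval for a binomial proportion (z = 1.96 for 95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# Two of 76 DM-based screening-detected cancers were scored 1-7.
low, high = wilson_interval(2, 76)
print(f"{100 * low:.2f}% to {100 * high:.2f}%")
```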
The screening reading workload, sensitivity (including screening-detected and interval cancers), and recall rate (ie, the number of examinations recalled by either the AI system or radiologists divided by total examinations) were compared between each original screening setting and the AI-based screening strategy by using the McNemar test for paired data, with an α of .05 indicating statistical significance. Additionally, the AI-based strategy in DBT was compared with the original double reading of DM.
Screening workload was defined as the number of readings, and an estimate in hours was computed using the average reading time per examination originally reported in this cohort (21): 25 seconds for a DM examination and 64 seconds for a DBT plus DM or synthetic mammography examination.
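The conversion from number of readings to hours can be illustrated as follows. This is a sketch under the definition above; the function name `reading_hours` and the dictionary are hypothetical, while the reading times and the number of examinations come from the text.

```python
SECONDS_PER_READ = {"DM": 25, "DBT": 64}  # average reading times from the text


def reading_hours(n_reads: int, modality: str) -> float:
    """Estimated reading time in hours for a given number of readings."""
    return n_reads * SECONDS_PER_READ[modality] / 3600


n_exams = 15_987
# Double reading means two readings per examination.
double_dm_hours = reading_hours(2 * n_exams, "DM")
double_dbt_hours = reading_hours(2 * n_exams, "DBT")
print(round(double_dm_hours), round(double_dbt_hours))
```

Rounded to whole hours, these estimates agree with the 222 hours (DM) and 568 hours (DBT) reported for the original double reading settings.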
To control for multiple comparisons (four in total; see Fig 2), Bonferroni correction was applied. P < .013 (ie, .05/4 = .0125, rounded) was considered to indicate a significant difference after Bonferroni correction. To control for multiple comparisons of the end point metrics (workload, sensitivity, and recall rate), these were tested sequentially for each comparison.
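The Bonferroni adjustment amounts to dividing the significance level by the number of comparisons. A minimal sketch, with hypothetical function names, is:

```python
def bonferroni_alpha(alpha: float, n_comparisons: int) -> float:
    """Per-comparison significance threshold after Bonferroni correction."""
    return alpha / n_comparisons


def is_significant(p_value: float, alpha: float = 0.05, n_comparisons: int = 4) -> bool:
    """True if the P value falls below the Bonferroni-adjusted threshold."""
    return p_value < bonferroni_alpha(alpha, n_comparisons)


print(bonferroni_alpha(0.05, 4))
```

Under this threshold, a P value of .38 is not significant, whereas any P value below .001 is.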
The hypothesis was that in the AI-based strategy, workload could be significantly reduced, with noninferior sensitivity and recall rate (prespecified noninferiority margin difference of 5%, in relative terms). Noninferiority was concluded if the sensitivity or the recall rate was superior (higher sensitivity, lower recall rate) in the AI-based setting, and the lower limit of the 95% CI of the difference was greater than the negative value of the prespecified noninferiority margin. If noninferiority was concluded, superior sensitivity and recall rate in the AI-based strategies were sequentially tested using the McNemar test.
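A literal reading of this noninferiority rule can be sketched as below. The function name and the sign convention are assumptions: relative differences are expressed in percent and oriented so that a positive value favors the AI-based strategy (higher sensitivity or lower recall rate).

```python
def is_noninferior(relative_diff_pct: float, ci_lower_pct: float,
                   margin_pct: float = 5.0) -> bool:
    """Noninferiority check under the stated rule.

    Both inputs are relative differences in percent, oriented so that
    positive values favor the AI-based strategy (an assumed convention).
    """
    favors_ai = relative_diff_pct > 0
    within_margin = ci_lower_pct > -margin_pct
    return favors_ai and within_margin


# Sensitivity, double reading of DM: relative difference 2.63%,
# lower 95% CI limit -4.9% (values reported in the Results).
print(is_noninferior(2.63, -4.9))
```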
Results
Participant and Examination Characteristics
From the 16 067 women in the cohort, 15 987 examinations in 15 986 women (mean age ± standard deviation, 58 years ± 6) were included (99.5%) (Fig 3). Eighty-one examinations (five noncancer recalled examinations and 76 normal examinations) from 81 women were excluded because of problems retrieving the data from the picture archiving and communication system. In total, 113 examinations were labeled as showing cancers (98 screening-detected and 15 interval). The characteristics of the selected cohort are included in Tables 1 and 2.
Distribution of AI Scores
The distribution of AI scores across the different groups of examinations in the cohort is shown in Figure 4, computed for both DM and DBT examinations independently. The distribution of AI scores is homogeneous for all screening examinations (approximately 10% in each score category), whereas only a minority of screening-detected cancers were scored 1–7: two of 76 DM-based screening-detected cancers (2.6%; 95% CI: 0.72, 9.10) and one of 92 DBT-based screening-detected cancers (1.1%; 95% CI: 0.19, 5.90). At the same time, AI examinations scored 1–7 comprise 11 437 of 15 987 of the DM-based screening volume (71.5%; 95% CI: 70.8, 72.2) and 11 572 of 15 987 of the DBT-based screening volume (72.4%; 95% CI: 71.7, 73.1).
Given that this group of cases with scores 1–7 includes less than 5% of screening-detected cancers, it was estimated that this is an optimal cutoff point to differentiate likely normal examinations in the proposed AI-based strategies (negative predictive value, 99.98% [95% CI: 99.94, 99.99] in DM and 99.99% [95% CI: 99.95, 99.99] in DBT), similar to previous studies (20).
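The reported negative predictive values can be approximately reproduced from the counts above. The sketch assumes that only screening-detected cancers scored 1–7 are counted as false-negative results; the article does not state this explicitly, but it is the reading that matches the reported figures. The function name `npv_pct` is hypothetical.

```python
def npv_pct(n_scored_low: int, cancers_scored_low: int) -> float:
    """Negative predictive value (%) of an AI score of 1-7."""
    return 100 * (n_scored_low - cancers_scored_low) / n_scored_low


# Counts from the text: exams scored 1-7 and screening-detected cancers among them.
npv_dm = npv_pct(11_437, 2)
npv_dbt = npv_pct(11_572, 1)
print(f"DM: {npv_dm:.2f}%, DBT: {npv_dbt:.2f}%")
```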
Simulated AI-based Strategy
The comparison of the original screening strategy with the AI-based strategy is presented in Table 3. Consistently across DM-based and DBT-based double reading, workload reductions of 71.5% (9100 reads with AI vs 31 974 without AI; 95% CI: 70.6, 72.4) and 72.4% (8830 reads with AI vs 31 974 without AI; 95% CI: 71.9, 72.9), respectively, were observed with the AI-based screening strategy.
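The read counts behind these percentages follow from the number of examinations scored 1–7, which are removed from human reading. A short sketch, with hypothetical function names, for the double reading settings:

```python
def reads_with_ai(n_exams: int, n_low_suspicion: int, readers_per_exam: int = 2) -> int:
    """Readings left when exams scored 1-7 are not human-read."""
    return readers_per_exam * (n_exams - n_low_suspicion)


def workload_reduction_pct(reads_ai: int, reads_original: int) -> float:
    """Relative reduction (%) in the number of readings."""
    return 100 * (reads_original - reads_ai) / reads_original


n_exams = 15_987
original_reads = 2 * n_exams                 # double reading of every exam
dm_reads = reads_with_ai(n_exams, 11_437)    # 11 437 DM exams scored 1-7
dbt_reads = reads_with_ai(n_exams, 11_572)   # 11 572 DBT exams scored 1-7
print(dm_reads, round(workload_reduction_pct(dm_reads, original_reads), 1))
print(dbt_reads, round(workload_reduction_pct(dbt_reads, original_reads), 1))
```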
The AI-based strategy resulted in noninferior sensitivity across different screening settings: 76 of 113 cancers detected in double reading of DM images versus 78 of 113 when using AI (relative difference, 2.63%; 95% CI: –4.9, 11.4; P = .68); 92 of 113 cancers detected in double reading of DBT images versus 95 of 113 when using AI (relative difference, 3.26%; 95% CI: –2.2, 9.4; P = .38); and 87 of 113 cancers detected in single reading of DBT images versus 90 of 113 when using AI (relative difference, 3.45%; 95% CI: –1.2, 9.8; P = .38).
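As an illustration of the paired comparison, the sketch below implements a two-sided exact McNemar test on discordant pairs. Two points are assumptions rather than statements from the article: the exact (binomial) variant is used, and the discordant counts are inferred from the text (four cancers gained through AI-only recalls; two DM and one DBT screening-detected cancers lost because they were scored 1–7). Under these assumptions the P values come out close to the reported .68 and .38.

```python
from math import comb


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar P value from the two discordant counts."""
    n = b + c
    k = min(b, c)
    # Binomial tail probability P(X <= k) with success probability 0.5.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2**n
    return min(1.0, 2 * tail)


# Double reading of DM: 4 cancers gained, 2 lost (76 - 2 + 4 = 78).
print(mcnemar_exact(4, 2))
# Double reading of DBT: 4 cancers gained, 1 lost (92 - 1 + 4 = 95).
print(mcnemar_exact(4, 1))
```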
When compared with double readings, the AI-based strategy was associated with an overall reduction in recall rate of 16.9% (671 of 15 987 women recalled with AI vs 807 of 15 987 without AI; 95% CI: 11.0, 24.0; P < .001) and 16.7% (588 of 15 987 women recalled with AI vs 706 of 15 987 without AI; 95% CI: 8.6, 23.4; P < .001) in DM and DBT double readings settings, respectively. When compared with single reading of DBT images, recall rate showed a nonsignificant increment (499 of 15 987 women recalled with AI vs 481 of 15 987 without AI; relative difference, 3.74%; 95% CI: –3.77, 12.83; P = .41).
Examinations Recalled Only by AI
In double reading of DM images, AI additionally recalled a total of 210 examinations (four of which were true-positive results). In double reading of DBT images, AI additionally recalled a total of 206 examinations (four of which were true-positive results). In single reading of DBT images, AI additionally recalled a total of 218 examinations (four of which were true-positive results). Therefore, in the group of additional cases recalled by AI only, the positive predictive value ranged from 1.8% to 1.9%.
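The positive predictive values quoted above are the fraction of AI-only recalls that were true-positive results. A minimal sketch (the function name is hypothetical):

```python
def ppv_pct(true_positives: int, recalls: int) -> float:
    """Positive predictive value (%) among examinations recalled only by AI."""
    return 100 * true_positives / recalls


# AI-only recalls per setting, each with four true-positive results (from the text).
ai_only_recalls = {"double DM": 210, "double DBT": 206, "single DBT": 218}
for setting, recalls in ai_only_recalls.items():
    print(setting, round(ppv_pct(4, recalls), 1))
```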
The four cancers added by AI at DM examinations were all originally screening-detected with DBT only (two of the four were ductal carcinoma in situ, one was a low-grade invasive ductal cancer, and one was a high-grade invasive ductal cancer).
Among the four cancers added by AI at DBT examinations, two were originally screening-detected at DM and two were interval cancers (in total, three of the four were ductal carcinoma in situ and one was a high-grade invasive ductal cancer). Thirteen of 15 interval cancers were not detected with any AI-based strategy (not present in the top 2% of suspicion among AI scores), although nine of these interval cancers are included in the group of examinations with AI scores of 8–10 (the top 30% of suspicion among AI scores).
Comparison of Unaided Double Reading of DM Images with AI-based Double Reading of DBT Images
When comparing the AI-based strategy of DBT to the original double reading of DM (Table 4), it was observed that AI-based DBT screening would have been carried out with a smaller workload (156 hours vs 222 hours, a relative workload reduction of 29.7% [95% CI: 23.8, 36.2], P < .001). The sensitivity would have been 25.0% higher in relative terms (95% CI: 15.8, 36.3; P < .001), with 95 of 113 cancers detected with AI-DBT screening (84.1%; 95% CI: 76.2, 89.7) and 76 of 113 with unaided DM screening (67.3%; 95% CI: 58.2, 75.2). Moreover, the recall rate would have been 27.1% lower in relative terms (95% CI: 24.1, 30.3; P < .001), with 588 of 15 987 women recalled with AI-DBT screening (3.7%; 95% CI: 3.4, 4.0) and 807 of 15 987 women recalled with unaided DM screening (5.1%; 95% CI: 4.7, 5.4).
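The relative differences in this comparison can be recomputed from the reported counts and hours. The helper `relative_change_pct` is a hypothetical name used only for this sketch; negative values indicate a reduction.

```python
def relative_change_pct(new: float, old: float) -> float:
    """Relative change (%) of `new` with respect to `old`."""
    return 100 * (new - old) / old


# AI-based DBT screening vs original double reading of DM (values from the text).
workload = relative_change_pct(156, 222)               # hours
sensitivity = relative_change_pct(95 / 113, 76 / 113)  # cancers detected of 113
recall = relative_change_pct(588, 807)                 # recalls in 15 987 exams
print(round(workload, 1), round(sensitivity, 1), round(recall, 1))
```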
Discussion
Current breast cancer screening programs have a high workload for radiologists and an objectionable number of false-positive and false-negative assessments. Our findings highlight how an artificial intelligence (AI) system could reduce up to 70% of the workload in digital mammography (DM)– and digital breast tomosynthesis (DBT)–based breast cancer screening without reducing the sensitivity by 5% or more, indicating that workload can be reduced while maintaining the overall program sensitivity. This was achieved when this proportion of least suspicious examinations for AI would not be read by radiologists, while, at the same time, AI could be used as an additional complementary reader to recall cases not recalled by radiologists. Letting radiologists read this group of the 70% least suspicious examinations led to more recalls. Moreover, AI could be used to transition from DM screening to DBT screening with a 30% reduction in workload (P < .001), a 25% improvement in sensitivity (P < .001), and a 27% reduction in recall rate (P < .001).
Although several studies investigated how AI could reduce workload in screening programs with DM (18–20,25), to our knowledge, this is the first study to investigate AI-based strategies to reduce workload in DBT using real screening cohorts. Furthermore, because our study uses paired DM and DBT examinations, it was possible to determine AI-based strategies for DBT that could replace standard DM screening without increasing workload, one of the biggest limitations of introducing DBT into clinical practice. To our knowledge, our results have not been reported in any other comparison between DM- and DBT-based screening where transitioning to DBT is always associated with an increase in workload (11,26).
Previous studies in DM have suggested that it could be safe (ie, no sensitivity reduction) to use AI to reduce screening workload between 20% and 50% (18–20). In our study, we found this to be 70%. This threshold of 70% to define the optimal group of least suspicious examinations was proposed by Balta et al (20) using the same AI system in a DM screening cohort and could also be reproduced in our study (including DBT). In comparison, earlier studies using previous versions of the same AI system found that the 20% least suspicious examinations would be the optimal threshold (19), which suggests that the continuous development of AI systems could keep raising this threshold in the future.
Our study has limitations. It was only performed with data from a single site and a single mammography and AI vendor. Moreover, because it was a retrospective study and the AI scenarios were simulated, it is not possible to know the impact on radiologists’ performance in the setting where they would, for example, read only the 30% most suspicious screening examinations. In addition, the AI system analyzed each examination without access to prior examinations, as opposed to the radiologists’ screening assessments; further analyses are required to understand the clinical impact of using AI in screening when AI does not include prior information. Finally, although our results suggest that no human reading of low-suspicion examinations would be optimal for the cost-efficiency of the screening program, further legal discussions would be needed to establish a framework where this strategy is safe for all the parties involved in screening.
In conclusion, our study shows a strategy with an artificial intelligence (AI) system where screening workload could be safely reduced up to 70% for both digital mammography (DM)– and digital breast tomosynthesis (DBT)–based programs, as well as allow the transition from DM- to DBT-based screening without an increase in workload. Given the increasing lack of expert breast radiologists as well as the increased workload associated with the introduction of DBT, new strategies potentially using AI could be necessary to maintain the cost-efficiency of screening programs. Further prospective studies are needed to validate our findings.
Disclosures of Conflicts of Interest: J.L.R.P. disclosed no relevant relationships. S.R.M. disclosed no relevant relationships. E.E.C. disclosed no relevant relationships. A.G.M. Activities related to the present article: disclosed no relevant relationships. Activities not related to the present article: is an employee of ScreenPoint Medical. Other relationships: disclosed no relevant relationships. A.R.R. Activities related to the present article: disclosed no relevant relationships. Activities not related to the present article: is an employee of ScreenPoint Medical. Other relationships: disclosed no relevant relationships. M.Á.B. disclosed no relevant relationships.
The authors thank the Department of Informatics at Hospital Universitario Reina Sofía for their help in retrieving images from the picture archiving and communication system and their support in processing them.
Author contributions: Guarantors of integrity of entire study, J.L.R.P., M.Á.B.; study concepts/study design or data acquisition or data analysis/interpretation, all authors; manuscript drafting or manuscript revision for important intellectual content, all authors; approval of final version of submitted manuscript, all authors; agrees to ensure any questions related to the work are appropriately resolved, all authors; literature research, J.L.R.P., E.E.C., A.R.R., M.Á.B.; clinical studies, J.L.R.P., E.E.C., A.R.R., M.Á.B.; experimental studies, A.G.M.; statistical analysis, J.L.R.P.; and manuscript editing, J.L.R.P., S.R.M., A.G.M., A.R.R., M.Á.B.
The study was funded by the Hospital Universitario Reina Sofía in Córdoba, Spain.
- 1. Cancer screening: evidence and practice in Europe 2008. Eur J Cancer 2008;44(10):1404–1413.
- 2. Cancer screening in the United States, 2010: a review of current American Cancer Society guidelines and issues in cancer screening. CA Cancer J Clin 2010;60(2):99–119.
- 3. Mammographic density and the risk and detection of breast cancer. N Engl J Med 2007;356(3):227–236.
- 4. Breast cancer screening: current status [in Spanish]. Radiología 2013;55(4):305–314.
- 5. Cumulative false positive recall rate and association with participant related factors in a population based breast cancer screening programme. J Epidemiol Community Health 2006;60(4):316–321.
- 6. GLOBOCAN. Cancer Today. International Agency for Research on Cancer. World Health Organization, 2018. http://gco.iarc.fr/today. Accessed August 21, 2020.
- 7. Prospective trial comparing full-field digital mammography (FFDM) versus combined FFDM and tomosynthesis in a population-based screening programme using independent double reading with arbitration. Eur Radiol 2013;23(8):2061–2071.
- 8. Integration of 3D digital mammography with tomosynthesis for population breast-cancer screening (STORM): a prospective comparison study. Lancet Oncol 2013;14(7):583–589.
- 9. Performance of one-view breast tomosynthesis as a stand-alone breast cancer screening modality: results from the Malmö Breast Tomosynthesis Screening Trial, a population-based study. Eur Radiol 2016;26(1):184–190.
- 10. Digital Mammography versus Digital Mammography Plus Tomosynthesis for Breast Cancer Screening: The Reggio Emilia Tomosynthesis Randomized Trial. Radiology 2018;288(2):375–385.
- 11. Digital Breast Tomosynthesis with Synthesized Two-Dimensional Images versus Full-Field Digital Mammography for Population Screening: Outcomes from the Verona Screening Program. Radiology 2018;287(1):37–46.
- 12. Application of breast tomosynthesis in screening: incremental effect on mammography acquisition and reading time. Br J Radiol 2012;85(1020):e1174–e1178.
- 13. A randomized controlled trial of digital breast tomosynthesis versus digital mammography in population-based screening in Bergen: interim analysis of performance indicators from the To-Be trial. Eur Radiol 2019;29(3):1175–1186.
- 14. Diagnostic Accuracy of Digital Screening Mammography With and Without Computer-Aided Detection. JAMA Intern Med 2015;175(11):1828–1837.
- 15. Changes in cancer detection and false-positive recall in mammography using artificial intelligence: a retrospective, multireader study. Lancet Digit Health 2020;2(3):e138–e148.
- 16. International evaluation of an AI system for breast cancer screening. Nature 2020;577(7788):89–94.
- 17. Stand-Alone Artificial Intelligence for Breast Cancer Detection in Mammography: Comparison With 101 Radiologists. J Natl Cancer Inst 2019;111(9):916–922.
- 18. A Deep Learning Model to Triage Screening Mammograms: A Simulation Study. Radiology 2019;293(1):38–46.
- 19. Can we reduce the workload of mammographic screening by automatic identification of normal exams with artificial intelligence? A feasibility study. Eur Radiol 2019;29(9):4825–4832.
- 20. Going from double to single reading for screening exams labeled as likely normal by AI: what is the impact? In: Bosmans H, Marshall N, Van Ongeval C, eds. Proceedings of SPIE: 15th International Workshop on Breast Imaging (IWBI2020). Vol 11513. Bellingham, Wash: International Society for Optics and Photonics, 2020; 115130D.
- 21. Prospective study aiming to compare 2D mammography and tomosynthesis + synthesized mammography in terms of cancer detection and recall. From double reading of 2D mammography to single reading of tomosynthesis. Eur Radiol 2018;28(6):2484–2491.
- 22. Detection of Breast Cancer with Mammography: Effect of an Artificial Intelligence Support System. Radiology 2019;290(2):305–314.
- 23. Artificial intelligence for breast cancer detection in mammography: experience of use of the ScreenPoint Medical Transpara system in 310 Japanese women. Breast Cancer 2020;27(4):642–651.
- 24. The effect of breast density on the performance of deep learning-based breast cancer detection methods for mammography. In: Bosmans H, Marshall N, Van Ongeval C, eds. Proceedings of SPIE: 15th International Workshop on Breast Imaging (IWBI2020). Vol 11513. Bellingham, Wash: International Society for Optics and Photonics, 2020; 1151324.
- 25. Improving Workflow Efficiency for Mammography Using Machine Learning. J Am Coll Radiol 2020;17(1 Pt A):56–63.
- 26. Addition of tomosynthesis to conventional digital mammography: effect on image interpretation time of screening examinations. Radiology 2014;270(1):49–56.
Article History
Received: Aug 31 2020
Revision requested: Oct 23 2020
Revision received: Jan 5 2021
Accepted: Jan 14 2021
Published online: May 04 2021
Published in print: July 2021