Effect of Radiologists’ Diagnostic Work-up Volume on Interpretive Performance
Purpose
To examine radiologists’ screening performance in relation to the number of diagnostic work-ups performed after abnormal findings are discovered at screening mammography, whether the work-up is performed by the same radiologist or by a different radiologist.
Materials and Methods
In an institutional review board–approved HIPAA-compliant study, the authors linked 651 671 screening mammograms interpreted from 2002 to 2006 by 96 radiologists in the Breast Cancer Surveillance Consortium to cancer registries (standard of reference) to evaluate the performance of screening mammography (sensitivity, false-positive rate [FPR], and cancer detection rate [CDR]). Logistic regression was used to assess the association between the volume of recalled screening mammograms (“own” mammograms, where the radiologist who interpreted the diagnostic image was the same radiologist who had interpreted the screening image, and “any” mammograms, where the radiologist who interpreted the diagnostic image may or may not have been the radiologist who interpreted the screening image) and screening performance and whether the association between total annual volume and performance differed according to the volume of diagnostic work-up.
Results
Annually, 38% of radiologists performed the diagnostic work-up for 25 or fewer of their own recalled screening mammograms, 24% performed the work-up for 26–50, and 39% performed the work-up for more than 50. For the work-up of recalled screening mammograms from any radiologist, 24% of radiologists performed the work-up for 0–50 mammograms, 32% performed the work-up for 51–125, and 44% performed the work-up for more than 125. With increasing numbers of radiologist work-ups of their own recalled mammograms, the sensitivity (P = .039), FPR (P = .004), and CDR (P < .001) of screening mammography increased, yielding a stepped increase in women recalled per cancer detected, from 17.4 for 25 or fewer mammograms to 24.6 for more than 50 mammograms. Increases in work-ups for any radiologist yielded significant increases in FPR (P = .011) and CDR (P = .001) and a nonsignificant increase in sensitivity (P = .15). Radiologists with a lower annual volume of any work-ups had consistently lower FPR, sensitivity, and CDR at all annual interpretive volumes.
Conclusion
These findings support the hypothesis that radiologists may improve their screening performance by performing the diagnostic work-up of their own recalled screening mammograms, thereby receiving direct feedback on the outcomes of their initial decisions to recall. Arranging for radiologists to work up a minimum number of their own recalled cases could improve screening performance but would require systems that facilitate this workflow.
© RSNA, 2014
A 2005 Institute of Medicine report (1) noted that the technical quality of mammography has improved since the 1992 Mammography Quality Standards Act but that optimal sensitivity and specificity have not been achieved—a conclusion reinforced by recent investigations (2). The Institute of Medicine report called for additional research on the relationship between interpretive volume and performance (1). Results on the association between mammography performance and volume, although inconsistent, generally suggest that higher-volume readers have lower false-positive rates (FPRs); findings on sensitivity are mixed (3). To address these gaps between observed and optimal screening accuracy (1), previous studies examined the relationship between interpretive volume and screening and diagnostic performance (4,5).
Contrary to the hypothesis suggested by the Institute of Medicine report that a higher interpretive volume would improve mammography performance, a study of a sample of U.S. radiologists found that volume did not explain much of the observed interradiologist variability in screening or diagnostic performance (4,5). The FPRs of radiologists with higher annual volumes were clinically and statistically significantly lower than those of their lower-volume colleagues; however, the sensitivities were similar (4,5). Interpretive volume composition (ratio of screening volume to total volume) had the strongest influence on screening and diagnostic performance; a higher screening focus (ratio of screening to diagnostic mammograms) was associated with significantly lower screening sensitivity, cancer detection rate (CDR), and FPR (4,5), which suggests that some element of diagnostic work-up could increase sensitivity and CDR. To our knowledge, only one study has examined whether radiologists’ accuracy (defined as positive predictive value for biopsy recommendation) was influenced by following a woman’s images throughout the diagnostic process; that study found no significant influence (6). These findings indicate that interpretive volume alone is not the principal influence on performance; rather, volume might affect performance by giving radiologists the opportunity to enhance their interpretive skills through the work-up of diagnostic images that result from recalled screening mammograms, whether those mammograms were interpreted by themselves or by other radiologists.
The purpose of this study was to examine radiologists’ screening performance in relation to the number of diagnostic work-ups performed after abnormal findings are discovered at screening mammography, whether the work-up is performed by the same radiologist or by a different radiologist. In addition, we determined whether work-up of abnormal screening mammograms modified the association between annual interpretive volume and screening performance.
Materials and Methods
The Breast Cancer Surveillance Consortium (BCSC) registries and Statistical Coordinating Center received institutional review board approval for active or passive consenting processes and a Federal Certificate of Confidentiality and other protections for participating women, physicians, and facilities. All procedures were compliant with the Health Insurance Portability and Accountability Act (7).
BCSC registries collect information about mammography performed at participating facilities in their defined catchment areas and link this information to state tumor registries or regional Surveillance, Epidemiology, and End Results programs to obtain population-based cancer data (8,9). Demographic and breast cancer risk factor data, including age, first-degree family history, and time since last mammographic examination, are collected with use of a self-reported questionnaire completed at each screening mammography examination. This study included data from six BCSC mammography registries (in California, North Carolina, New Hampshire, Vermont, Washington, and New Mexico). Because the planned analysis required complete capture of all screening and diagnostic images for each radiologist, we limited our sample to radiologists who interpreted mammograms only in BCSC facilities (436 reader-years, 106 radiologists) (4,5). Eligible radiologists from the six registries who interpreted screening mammograms from 2005 to 2006 were invited to complete a self-administered mailed survey between January 1, 2006, and September 30, 2007 (10), and survey information was linked to BCSC data. We excluded 10 radiologists (43 reader-years) who interpreted mammograms at facilities with incomplete BCSC data on diagnostic mammograms during the study years. The radiologists and reader-years included in these analyses are a subset of those previously reported (5).
The two primary exposures of interest were “own” work-ups and “any” work-ups. The measurement of “own” started at a recalled screening examination and determined the number of those recalled screening mammograms with a diagnostic work-up (mammography with or without ultrasonography [US], counted as one examination) within 60 days (11) interpreted by the same radiologist who recalled the screening mammogram. The measurement of “any” started with the interpretation of any diagnostic mammogram (with or without US, counted as one examination), regardless of which radiologist recalled the screening mammogram. We followed the Breast Imaging Reporting and Data System (BI-RADS) lexicon and collected one overall assessment for diagnostic examinations with or without US (12,13). Because any work-ups included all diagnostic mammograms obtained for work-up of a positive screening examination, unlike the own work-ups, we did not require linkage to the recalled screening examination or that the diagnostic work-up be performed within 60 days of the screening examination. The two exposure measures overlap. For example, a work-up would be counted as both own and any if the same radiologist recalled the screening mammogram and interpreted the diagnostic follow-up mammogram within 60 days of the screening examination. Therefore, most diagnostic follow-up mammograms classified as own work-ups were also included as any work-ups (except when the only diagnostic follow-up was US).
Annual interpretive volume for 2001–2005 was collected and summed across all facilities for total, screening, and diagnostic volumes. Examination type was defined by using radiologists’ indications for examinations (5). Diagnostic examinations included additional evaluation of a previous abnormal screening mammogram, short-interval follow-up, or evaluation of a breast symptom or mammographic abnormality with or without US.
Screening performance was based on the radiologist’s initial assessment (positive or negative) of the screening mammogram linked to invasive carcinoma or ductal carcinoma in situ (DCIS) diagnoses collected from tumor registries and pathology databases and diagnosed within the follow-up period (1 year after the screening mammogram and before the next screening mammography examination) (13). Registry data were used to characterize the tumors with regard to histologic characteristic (DCIS vs invasive), stage (0–IV), tumor size, axillary lymph node involvement (negative or positive), grade (well differentiated to undifferentiated), and estrogen receptor status. We defined minimal detected and early-stage cancers in three ways, as follows: (a) DCIS or invasive cancer 10 mm or smaller (12), (b) DCIS or invasive cancer smaller than 15 mm and node negative (4,5), or (c) DCIS or invasive cancer 10 mm or smaller and node negative (4,5).
Performance measures (sensitivity, FPR, and CDR) were derived from 651 671 screening mammograms (404 538 unique women, asymptomatic subjects with routine screening indication) interpreted from 2002 to 2006 by using standard BI-RADS and BCSC definitions (5). The mammograms and unique women reported herein are a subset of those previously reported (5). Sensitivity was defined as the proportion of screening mammograms interpreted as positive (defined as BI-RADS categories 0 [needs additional assessment], 4 [suspicious abnormality], 5 [highly suggestive of malignancy], or 3 [probably benign when associated with a recommendation for immediate follow-up, ie, more imaging, clinical examination, biopsy]) (12) diagnosed within the follow-up period. The FPR was defined as the proportion of positive screening examinations among all women without a breast cancer diagnosis within the follow-up period. The CDR was defined as the number of cancers detected within the follow-up period per 1000 screening mammograms interpreted.
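The performance definitions above reduce to simple counts over per-examination records. The following sketch (a hypothetical record layout, not BCSC code) shows how sensitivity, FPR, CDR, and the number of women recalled per cancer detected follow from pairs of (initial assessment, linked cancer outcome):

```python
# Illustrative sketch (not the BCSC implementation) of the screening
# performance measures defined in the text. A screening examination is
# "positive" if the initial assessment was BI-RADS 0, 4, 5, or 3 with a
# recommendation for immediate follow-up; "cancer" means invasive cancer
# or DCIS diagnosed within the 1-year follow-up period.

def screening_performance(exams):
    """exams: list of (positive, cancer) boolean pairs, one per mammogram.

    Returns (sensitivity, FPR, CDR per 1000, women recalled per cancer
    detected)."""
    tp = sum(1 for pos, ca in exams if pos and ca)          # screen-detected cancers
    fp = sum(1 for pos, ca in exams if pos and not ca)      # false-positive recalls
    cancers = sum(1 for _, ca in exams if ca)
    non_cancers = len(exams) - cancers
    sensitivity = tp / cancers if cancers else float("nan")
    fpr = fp / non_cancers if non_cancers else float("nan")
    cdr_per_1000 = 1000.0 * tp / len(exams)
    recalls_per_cancer = (tp + fp) / tp if tp else float("nan")
    return sensitivity, fpr, cdr_per_1000, recalls_per_cancer
```

The last quantity makes the sensitivity/FPR trade-off concrete: it is the ratio of all positive screening examinations to screen-detected cancers, the measure reported in the Results as women recalled per cancer detected.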
The two work-up volume measures (own and any) and annual total interpretive volume measures for each year were linked to screening performance in the following year (eg, 2005 volume was linked to 2006 performance). The Pearson correlation coefficient was used to estimate the strength of the linear relationship between the continuous measures of work-up volume. Two breast imaging specialists (E.A.S. and B.S.M., with 38 and 34 years of experience, respectively) evaluated the data with all coauthors to assess feasibility for measurement and implementation and classified own work-up volume as low (≤25 mammograms), medium (26–50 mammograms), or high (>50 mammograms) and any work-up volume as low (≤50 mammograms), medium (51–125 mammograms), or high (>125 mammograms). We calculated unadjusted screening performance by using these categoric diagnostic work-up volume measures. To assess the potential trade-off between sensitivity and FPR, we calculated the number of women recalled for each cancer detected. All P values are two sided.
Because a radiologist’s case-mix distribution (average age and screening intervals) might have an effect on results, we computed adjusted performance measures by using internal standardization (14) to account for differences in radiologists’ case-mix distributions (5). Internal standardization works by reweighting mammograms according to the relative difference between the radiologist’s specific distribution of potential confounders (age and time since last mammography examination) and the corresponding distribution in the overall analytic sample. This process enables calculation of performance measures for radiologists as if their case mixes were the same as that in the overall population. To assess the relationship between the continuous work-up measures and adjusted performance, we stratified according to cancer status, fitting separate models for each performance measure by using the radiologist’s initial mammographic assessment (positive or negative) as the binary outcome variable. Continuous diagnostic work-up measures were included in the regression models by using restricted cubic smoothing splines (15) to allow for nonlinearity and to permit a flexible shape for the relationship between the continuous volume measure and interpretive performance. We fit logistic models by using generalized estimating equations with robust standard errors to account for correlation between multiple observations from the same radiologist. Because the diagnostic work-up measures were heavily skewed with sparse data in high volumes, we restricted the range before fitting the models to ensure stable estimates of model parameters; therefore, model estimation excluded outliers (radiologists with >250 own and >600 any recalled mammograms). Model results are presented graphically with 95% confidence intervals (CIs), with the curves interpreted directly as the mean adjusted performance as a function of the exposure measure. 
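Internal standardization as described above can be illustrated with a short sketch. This is a generic, single-confounder version under simplified assumptions (each examination carries one combined age/screening-interval stratum label), not the authors’ implementation:

```python
from collections import Counter

# Illustrative sketch (not the BCSC implementation) of internal
# standardization: each of a radiologist's examinations is reweighted by
# the ratio of the overall sample's stratum proportion to the
# radiologist's own stratum proportion, so the weighted rate reflects
# what the radiologist's performance would be under the overall case mix.

def standardized_rate(radiologist_exams, overall_strata, outcome):
    """radiologist_exams: list of (stratum_label, record) pairs;
    overall_strata: stratum labels for every exam in the analytic sample;
    outcome: function mapping a record to 0/1 (eg, positive assessment)."""
    n_all = len(overall_strata)
    n_rad = len(radiologist_exams)
    p_all = {s: c / n_all for s, c in Counter(overall_strata).items()}
    p_rad = {s: c / n_rad
             for s, c in Counter(s for s, _ in radiologist_exams).items()}
    # weight = overall stratum proportion / radiologist's stratum proportion
    weighted = [(p_all[s] / p_rad[s], rec) for s, rec in radiologist_exams]
    total_w = sum(w for w, _ in weighted)
    return sum(w * outcome(rec) for w, rec in weighted) / total_w
```

For example, a radiologist whose caseload over-represents one age stratum has that stratum down-weighted (and the others up-weighted) until the weighted distribution matches the overall sample, which is exactly the reweighting the text describes.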
P values for the estimated curves correspond to omnibus tests of whether there is any relationship between mean adjusted performance and work-up volume.
Similar methods were used to test the hypothesis that the relationship between total interpretive volume and screening performance is different for radiologists with low versus high volumes of diagnostic work-ups. Logistic regression models with cubic smoothing splines were used to estimate performance as a function of total interpretive volume. We restricted the range of total volume to 6000 or fewer mammograms because of sparse data in the tails. Interaction terms were included in the model to estimate separate curves for low and high levels of diagnostic work-up. P values correspond to omnibus tests of whether there is a difference in the shape (interaction term to assess effect modification) of the volume-performance relationship for radiologists with low versus high volumes of work-ups of recalled screening mammograms. Model results are presented graphically, with separate curves for low and high diagnostic work-up volume; the curves are the mean adjusted performance as a function of the total annual volume.
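The restricted cubic spline expansion underlying these models can be sketched directly. The function below is a generic implementation of the standard natural cubic spline basis (Harrell’s parameterization), offered as an illustration rather than the authors’ code; the returned nonlinear terms enter a logistic model alongside the linear volume term, and crossing all terms with a low/high work-up indicator produces the separate interaction curves described above:

```python
# Illustrative sketch: restricted (natural) cubic spline basis in
# Harrell's parameterization. For k sorted knots, the expansion of x is
# the linear term x plus the k-2 nonlinear terms returned here; the
# resulting fitted curve is constrained to be linear beyond the two
# boundary knots.

def rcs_basis(x, knots):
    """Return the k-2 nonlinear restricted cubic spline terms for x."""
    k = len(knots)
    t_last, t_pen = knots[-1], knots[-2]  # last and penultimate knots

    def pos_cube(u):
        # truncated cubic: (u)+^3
        return max(u, 0.0) ** 3

    terms = []
    for j in range(k - 2):
        tj = knots[j]
        term = (pos_cube(x - tj)
                - pos_cube(x - t_pen) * (t_last - tj) / (t_last - t_pen)
                + pos_cube(x - t_last) * (t_pen - tj) / (t_last - t_pen))
        terms.append(term)
    return terms
```

Because the basis is linear beyond the boundary knots, the fitted performance curve cannot oscillate wildly in the sparse tails of the volume distribution, which is consistent with the authors’ additional step of truncating the exposure range before model fitting.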
Results
The 96 radiologists in the study had a median age of 53 years (range, 37–72 years). Most radiologists worked full time (76%), had at least 20 years of experience (53%), and did not have fellowship training in breast imaging (95%) (Table 1). Time spent on breast imaging varied: it was less than 20% for 24% of radiologists and 80%–100% for 32%. Thirty-eight percent of radiologists worked up 25 or fewer of their own recalled screening mammograms a year, 24% worked up 26–50, and 39% worked up more than 50. Twenty-four percent of radiologists worked up 0–50 of any recalled screening mammograms, 32% worked up 51–125, and 44% worked up more than 125.
Radiologists who performed the work-up for a greater number of own or any recalled screening mammograms were more likely to have completed fellowship training, to have greater annual interpretive volumes, and to spend more than 40% of their time on breast imaging (Table 1). Associations of own and any work-up volume with total and diagnostic interpretive volume were similar: higher-volume readers worked up more own and any recalled screening mammograms. The work-up of own and any mammograms showed a positive correlation (Pearson correlation coefficient = 0.49, P < .01) (Fig E1 [online]).
The low-, medium-, and high-volume categories for the work-up of own recalled mammograms included 25%, 21%, and 53% of the screening mammograms used to calculate screening performance; the low-, medium-, and high-volume categories for the work-up of any recalled mammograms included 13%, 27%, and 60% of the screening mammograms used to calculate screening performance (Table 2). The characteristics of the women according to age, first-degree family history, or time since last mammographic examination did not differ according to low-, medium-, or high-volume category for either exposure measure (ie, own or any work-up). Most screening mammograms included in the performance outcome measures were obtained in women aged 40–59 years (60%), with 3% obtained in women younger than 40 years and 5% in women aged at least 80 years. The characteristics of women who had their mammograms interpreted at academic medical centers were not different from those of women whose mammograms were interpreted at nonacademic facilities (Table E1 [online]).
There were 3101 cancers in the study population; 2646 were detected with screening mammography. Among invasive cancers, stage distribution and median tumor size did not vary according to either exposure measure (own or any work-up) (Table 3). Of the 455 interval-detected cancers, 89% were invasive; compared with screening-detected cancers, interval cancers had a larger median size (19 mm), and a higher fraction (25%) were estrogen receptor negative (Table E2 [online]).
The unadjusted mean sensitivity was 85.3% (95% CI: 83.6%, 86.9%), the FPR was 9.1% (95% CI: 8.0%, 10.3%), and the CDR was 4.1 per 1000 screening mammograms (95% CI: 3.7, 4.5) (Table 4). As the number of own work-ups increased, the adjusted sensitivity, FPR, and CDR significantly increased (Fig 1), yielding a stepped increase in the number of women recalled per cancer detected, from 17.4 for 25 or fewer mammograms to 24.6 for more than 50 mammograms (Table 4). Improved sensitivity and CDR were accompanied by an increase in the FPR with each category of own work-ups, which was consistent with the figures showing little improvement for volumes of more than 50 own work-ups or more than 125 any work-ups (Fig 2). The one exception was CDR, which decreased significantly (P = .039) with increasing annual volume among radiologists who worked up fewer than 50 of their own recalled mammograms.
Unadjusted sensitivity increased from 80.8% for radiologists with 50 or fewer any work-ups to 86.5% for those with more than 125 any work-ups (Table 4); however, the association between adjusted sensitivity and volume of any work-up was not significant (P = .15) (Fig 1d). An increase in the volume of any work-up yielded statistically significant increases in the FPR and CDR (Fig 1e, 1f).
Overall, 22.2 women were recalled for each cancer detected. The lowest number of women recalled per cancer detected was among radiologists who worked up the fewest own and any recalled mammograms; however, these radiologists also had the lowest sensitivity. Radiologists with the highest sensitivity, CDR, and FPR had worked up more than 25 of their own recalled mammograms or more than 50 of any recalled mammograms.
In general, the shape of the relationship between total interpretive volume and screening performance did not differ according to a low versus high volume of diagnostic follow-up (Fig 2). However, readers with fewer own or any work-ups had consistently lower sensitivity, FPR, and CDR at any given total volume. The stratified analysis also showed decreased FPRs with increasing total annual volume to a threshold of 2000.
Discussion
We found that radiologists with a higher annual volume of work-ups for recalled screening mammograms they initially interpreted had consistently higher screening sensitivities and CDRs; however, these performance improvements were accompanied by higher FPRs. We expected that a higher volume of diagnostic work-ups performed by the radiologist who interpreted the screening mammogram would be associated with better screening performance because of the radiologist’s involvement throughout a case, possibly including interventional procedures (18,19); this involvement constitutes direct feedback on the radiologist’s clinical decisions. In analyses that used continuous measures and accounted for potential confounders, sensitivity was improved for radiologists who annually worked up diagnostic examinations resulting from at least 50 of their own recalled mammograms, and CDR was higher for radiologists who annually worked up more than 125 of any recalled mammograms. Despite variability in performance measures, on average, radiologists who worked up fewer recalled mammograms had consistently lower sensitivity, CDRs, and FPRs at any given total volume. Current U.S. Food and Drug Administration regulations require U.S. physicians to have interpreted 960 mammograms within the previous 24 months to meet continuing experience requirements. However, the regulations place no requirements on the indication for the examination (ie, all could be screening examinations or all could be diagnostic examinations).
We previously examined annual interpretive volume and screening (4) and diagnostic (5) performance and reported wide, unexplained variability in screening and diagnostic performance across radiologists within volume levels. We had expected total interpretive volume (screening plus diagnostic images) or screening volume to be most strongly associated with screening performance and total volume or diagnostic volume to be most strongly associated with diagnostic performance. Instead, the composition of interpretive volume (ratio of screening volume to total volume) was the most important factor influencing screening and diagnostic performance. Radiologists with higher annual volumes had clinically and statistically significantly lower FPRs than their lower-volume colleagues but similar sensitivities (4,5). These earlier findings (4,5), combined with the current findings, suggest that increasing the current U.S. Mammography Quality Standards Act requirements for interpretive volume and requiring a minimum number of diagnostic work-ups of a radiologist’s own recalled screening mammograms could improve screening performance.
Despite previous findings suggesting that the proportion of screening or diagnostic examinations is most strongly associated with screening performance (4,5), we chose to investigate the total number of examinations rather than proportions because tracking examination numbers might be more practical for practices and radiologists. Tracking proportions requires more robust data collection, including total numbers of examinations according to type. Many mammography facilities cannot provide complete volume data according to interpretation type (screening vs diagnostic), and facilities also might not be able to link recalled mammograms with the radiologist who worked up the examinations. Nevertheless, these findings suggest new approaches to organizing clinical work, and tracking screening and diagnostic volume may be worthwhile given the potential to improve interpretive performance.
Increasing the minimum number of interpretations might cause some radiologists with lower annual volumes to stop interpreting mammograms. Conversely, these findings may motivate those radiologists to increase their volumes. Workforce issues may also be less relevant today, given the increasing use of digital mammography, which allows radiologists to interpret examinations remotely. Our data support a minimum annual interpretive volume coupled with annual work-up of at least 50 of a radiologist’s own recalled mammograms. This recommendation would require changes in how facilities capture and report current Mammography Quality Standards Act interpretation requirements and may require some facilities to reorganize their workflow. In addition, in the absence of a national reporting registry, tracking interpretive requirements across facilities, particularly if volume requirements span multiple years, would be challenging. An alternative would be to have radiologists review the work-up of a portion of their own recalled mammograms (BI-RADS category 0), even if they were not the radiologist who performed the work-up.
A common assumption is that improvements in sensitivity come at the expense of specificity, and vice versa, as reflected in traditional receiver operating characteristic curve analysis. However, this is not always the case: both measures can improve up to a threshold beyond which further improvement in one diminishes the other (20). Thus, the increases in FPR that accompanied the improvements in sensitivity and CDR could potentially be reduced with other strategies to improve interpretive performance, such as interventions for radiologists (21–23), application of performance thresholds (20), additional audit feedback from review of the lesion that was sampled for biopsy, or additional feedback aimed at improving specificity (24–27). Some women who undergo screening mammography may regard the small increase in the FPR as an acceptable trade-off for improved sensitivity (28–31).
Disentangling the factors that influence interpretive performance (for mammography or any technology) requires in-depth longitudinal examination of large populations so that cause and effect can be established. When the current Mammography Quality Standards Act requirements were established, no evidence supported their specifics; they were a well-intentioned judgment call. Years later, we now have evidence that the combination of higher volume and direct involvement in working up one’s own recalled screening mammograms is associated with higher sensitivity and CDR.
Our study had limitations. Mammography performance was derived from examinations performed between 2002 and 2006, when computer-assisted detection and digital mammography were not as ubiquitous as they are now; however, few studies have shown large clinical improvements in performance with these newer technologies (32–37). In addition, during the study period computer-assisted detection was not commonly used in the BCSC (only 29% of screening mammograms). Ours was not a trial in which we manipulated work-up volumes to test whether changing the composition would improve an individual radiologist’s performance. Our analysis did not include double reading of screening or diagnostic mammograms; however, only 3% of radiologists reported double reading for 20% or more of their screening mammograms. Rather than prespecifying the exposure categories, we determined cutpoints after examining the data. We took this approach to address feasibility for measurement, implementation, and policy. For example, with recall rates in the range of 10% and a minimum interpretive volume of 960 mammograms during a 2-year period, it would be reasonable for a radiologist to be eligible to review 50 recalled screening mammograms. We also picked cutoff values that we thought would be feasible for implementation rather than basing them solely on the variable distribution; in other words, we used round values (eg, 50 rather than 53 mammograms). Finally, given the small number of breast imaging specialists, we could not examine the joint influence of own recalled mammograms and specialty training. We also had no information on how radiologists collaborate during the work-up process and therefore could not examine how these interactions affected performance metrics.
The results of our analyses suggest that radiologists’ screening performance could improve with work-up of more than 50 of their own recalled screening mammograms. Our findings support the strong BI-RADS recommendations to track all recalled screening mammograms, to audit screening and diagnostic examinations separately, and to perform more extensive auditing. This study, combined with previous investigations (4,5), supports an increase in annual volume requirements and a minimum diagnostic volume of recalled screening cases for U.S. radiologists who interpret mammograms.
Advances in Knowledge
■ Radiologists who interpreted a greater annual number of diagnostic mammograms that resulted from recall of screening mammograms they interpreted had consistently higher sensitivity (81.1% for 0–25 mammograms to 87.0% for >50 mammograms, P = .039) and cancer detection rates (CDRs) (3.1 per 1000 screening mammograms for 0–25 mammograms to 4.5 per 1000 screening mammograms for >50 mammograms, P < .001) than radiologists who interpreted fewer of these mammograms; however, the false-positive rate (FPR) was higher (6.7% for 0–25 mammograms to 10.3% for >50 mammograms, P = .004).
■ These performance changes resulted in a stepped increase in the number of women recalled per cancer detected, ranging from 17.4 for radiologists who interpreted 25 or fewer of their recalled mammograms per year to 24.6 for radiologists who interpreted more than 50 of their recalled mammograms per year.
■ Radiologists with a lower annual number of work-ups of recalled screening mammograms (0–50 vs >125 mammograms) had consistently lower FPRs (7.0% vs 10.3%, P = .011), sensitivity (80.8% vs 86.5%, P = .15), and CDRs (2.9 vs 4.4 per 1000 screening mammograms, P = .001) at all annual interpretive volumes.
Implication for Patient Care
■ Arranging for radiologists to perform a minimum number of diagnostic work-ups that resulted from recall of screening mammograms they interpreted could improve screening mammography performance in the United States.
The collection of cancer and vital status data used in this study was supported in part by several state public health departments and cancer registries throughout the United States. For a full description of these sources, please see http://www.breastscreening.cancer.gov/work/acknowledgement.html. We thank the BCSC investigators, participating women, mammography facilities, and radiologists for the data they have provided for this study. A list of the BCSC investigators and procedures for requesting BCSC data for research purposes are provided at: http://breastscreening.cancer.gov/. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Cancer Institute or the National Institutes of Health.
Author contributions: Guarantors of integrity of entire study, D.S.M.B., T.L.O.; study concepts/study design or data acquisition or data analysis/interpretation, all authors; manuscript drafting or manuscript revision for important intellectual content, all authors; manuscript final version approval, all authors; literature research, D.S.M.B., R.A.S., T.L.O.; clinical studies, D.S.M.B., D.L.M., B.C.Y., T.L.O.; experimental studies, D.S.M.B., B.S.M.; statistical analysis, D.S.M.B., M.L.A., R.A.S., T.L.O.; and manuscript editing, all authors.
Article History
Received December 9, 2013; revision requested January 10, 2014; revision received March 24; accepted April 4; final version accepted April 18.
Published online: June 24, 2014