Double Reading in Breast Cancer Screening: Cohort Evaluation in the CO-OPS Trial

Purpose To investigate the effect of double readings by a second radiologist on recall rates, cancer detection, and characteristics of cancers detected in the National Health Service Breast Screening Program in England. Materials and Methods In this retrospective analysis, 805 206 women were evaluated through screening and diagnostic test results by extracting 1 year of routine data from 33 English breast screening centers. Centers used double reading of digital mammograms, with arbitration if there were discrepant interpretations. Information on reader decisions, with results of follow-up tests, was used to explore the effect of the second reader. The statistical tests used were the test for equality of proportions, the χ² test for independence, and the t test. Results The first reader recalled 4.76% of women (38 295 of 805 206 women; 95% confidence interval [CI]: 4.71%, 4.80%). Two readers recalled 6.19% of women in total (49 857 of 805 206 women; 95% CI: 6.14%, 6.24%), but arbitration of discordant readings reduced the recall rate to 4.08% (32 863 of 805 206 women; 95% CI: 4.04%, 4.12%; P < .001). A total of 7055 cancers were detected, of which 627 (8.89%; 95% CI: 8.22%, 9.55%; P < .001) were detected by the second reader only. These additional cancers were more likely to be ductal carcinoma in situ (30.5% [183 of 600] vs 22.0% [1344 of 6114]; P < .001), and additional invasive cancers were smaller (mean size, 14.2 vs 16.7 mm; P < .001), had fewer involved nodes, and were likely to be lower grade. Conclusion Double reading with arbitration reduces recall and increases cancer detection compared with single reading. Cancers detected only by the second reader were smaller, of lower grade, and had less nodal involvement. © RSNA, 2018.

BREAST IMAGING: Double Reading in Breast Cancer Screening Taylor-Phillips et al

Breast cancer is a leading cause of cancer in women (1), and many countries have implemented screening programs. Despite concerns about the balance of benefits and harms of these programs, results of randomized controlled trials indicate that screening reduces mortality from breast cancer (2,3).
In many European countries, the interpretation of mammograms is performed by two readers. Recall occurs (a) if either reader suggests it, (b) through consensus, or (c) after arbitration by a third reader (or more additional readers) (4,5). In the United States, mammograms are typically interpreted by a single reader accompanied by computer-aided detection (6). There is debate about the benefits and costs of single versus double reader programs. Some film mammography studies indicate that double reading increases the number of cancers detected (7–12) but results in the recall of more women (9,11–14) and requires more resources (15). It might increase detection of small (<15 mm) cancers (16) and identify a higher ratio of ductal carcinoma in situ (DCIS) to invasive cancers (15), which is potentially undesirable because of the association between DCIS and overdiagnosis (17,18).
Other studies report no differences in the size or stage of cancers between single and double reader programs (9,19). Digital mammography has replaced film mammography in routine clinical practice (20,21). Yet despite the widespread use of digital mammography and double reading, there is little published data on their combined effects. Three small studies, and a meta-analysis of these studies, found no statistically significant difference in cancer detection rates between single and double reader strategies (22–25). Extra cancers identified by second readers were more likely to be DCIS than invasive carcinomas (23). Posso and colleagues (23,24) have tentatively suggested that single reader screening could reduce costs in breast cancer programs without decreasing cancer detection rates. However, these results may reflect small sample sizes: the second reader detected an extra 10% of cancers (n = 24), but this difference was not statistically significant (24).
The key limitations of the evidence base are that most data come from film mammography studies (which does not reflect modern breast screening) and that studies using digital mammography have had relatively small samples (maximum = 57 157) and detected few cancers (limiting their power to detect differences). Our purpose was to examine the impact of double reading on recall and cancer detection rates and the characteristics of the additional cancers identified by double reading in the National Health Service Breast Screening Program in England.

Study Design and Participants
This is a population-based cohort study nested within the Changing Case Order to Optimize Patterns of Performance in Screening (CO-OPS) Trial, which included 1 194 147 women between 47 and 73 years of age at 46 screening centers, all between December [...] Women with symptoms at presentation and women who were tested because of familial or other risk factors were excluded. In this cohort study, we analyzed data from 33 centers; 13 centers were excluded because they used arbitration after both readers agreed to recall. A total of 805 665 women were included in the analysis, all of whom have previously been reported in an analysis of radiologist performance with time on task (25) but none of whom have previously been reported in a comparison of single and double reading.

Procedures
In the United Kingdom, women between the ages of 50 and 70 years are invited to mammographic screening every 3 years, with a trial of age extension from 47 to 73 years. Two views of each breast are obtained: mediolateral oblique and craniocaudal. Mammograms are reviewed by two readers from the same breast screening center using digital mammography without computer-aided detection. They are instructed to read batches of women's mammograms independently but can view the other readers' decisions in patient records. They are aware of whether they are the first or second reader in the workflow processes. Twelve of 33 centers used workflow systems designed to blind the second reader to the decision of the first reader. Disagreements between readers were resolved either through a single third reader (n = 11 centers) or by group consensus (n = 22 centers). Arbitration was performed by qualified readers from the same screening center. All readers were accredited by the National Health Service Breast Screening Program; readers undergo formal training, read a minimum of 5000 women's mammograms per year, participate in assessment clinics, audit their own performance, and maintain continuing professional development (4). Each service is expected to perform within set parameters, including cancer detection and recall rates (4). Readers take 35 seconds on average to examine each woman's digital mammograms in the NHS Breast Screening Program (27). Women recalled after screening are offered further tests at assessment, according to national guidelines (28).

Outcomes
The main outcomes were recall and cancer detection rates. Cancer was defined as histologically confirmed invasive cancer or DCIS. Absence of cancer was confirmed either through arbitration by expert readers or follow-up tests including ultrasonography (US), magnetic resonance (MR) imaging, and biopsy. Where 3-year follow-up data were available (for women attending 10 of the first centers to complete the trial), interval cancer rates between screening rounds and cancer detection rates at the following screening were measured as an alternative reference standard to determine the absence of cancer. Secondary outcomes were characteristics of cancers detected, specifically the proportion that included any invasive cancer (rather than DCIS only), the grade, the number of involved nodes, the pathologic size for women with invasive cancer, and the grade for women with DCIS only.

Data Collection
Data were extracted from the National Breast Screening Service electronic database. We extracted the decisions of the first and second readers (and arbitration, where used) for whether the patient should be recalled for further tests, which are recorded automatically at the point of making the decision. The decision of arbitration was final, and to confirm this we checked against records scheduling the follow-up appointments. For all follow-up appointments, we extracted whether the woman had a biopsy, the biopsy result (pathologic finding), and the result of other follow-up tests used (additional mammography, clinical breast examination, US, and/or MR imaging). We extracted pathologic results after any subsequent surgery. This was used to confirm biopsy results and to report grade, size, and number of involved nodes. We extracted interval cancer rates between screening rounds and cancer detection rates at the following screening round 3 years later for women attending 10 of the first centers to complete the trial, as in these centers sufficient time has elapsed to extract these data. This was used to investigate whether women with discordant readings who were not recalled after arbitration were at increased risk for later cancer detection (which may be an indication of errors in arbitration and potential underestimation of the extra cancers detected by the second reader).

Statistical Analysis
The analysis compared the recall rate from three screening approaches. The first (double reading plus arbitration) was what was used in clinical practice. Two readers independently examined each case and indicated whether they thought the woman should be recalled for further tests. If they disagreed, then expert arbitration was used to make the final decision. The second approach (single reader) derives results from whether the first reader alone judged that the woman should be recalled. The third (recall if either reader suggests) counts every woman recalled by either reader as recalled.
The number of cancers detected with double reading plus arbitration was compared with the number detected by reader 1 alone. The characteristics of the extra cancers detected by the second reader alone (missed by the first reader) were compared with those of cancers detected by reader 1. The number of involved nodes was grouped into none, one to two, and three or more, as these categories relate to prognosis. The statistical tests used were the test for equality of proportions, the χ² test for independence, and the t test. We performed a sensitivity analysis assuming all missing data were extreme cases (invasive disease not present, lowest grade, without nodal involvement, or vice versa). The analysis was performed by using R statistical software, version 3.4.1, in RStudio, version 1.0.153 (29).
For women at 10 of the first centers to complete the trial, we report the 3-year interval cancer rate and cancer detection rate at their subsequent screening examination 3 years later. These results were divided into three groups: women who were recalled by the first reader but not by the second reader and arbitration at the current screening, women who were recalled by the second reader but not by the first reader and arbitration at the current screening, and all other women who were not recalled at the current screening (recalled by neither reader). Comparisons between these groups were made by using the test for equality of proportions. If women who had a discordant reading but who were not recalled at the current round had a higher cancer detection rate in the subsequent round, this may indicate that arbitration was incorrect and cancers were missed at the current round. However, it may also be caused by discordant cases having other risk factors for developing cancers between screening rounds, such as increased breast density.
As a sensitivity analysis, the number of additional cancers detected by the second reader was recalculated assuming that the difference between cancer detection rate in discordant and nondiscordant readings at the subsequent round was due entirely to cancers missed by arbitration at the current round, and that the differences at the 10 centers would not differ from those across the whole data set.

Results
The flow of women through the study is detailed in Figure 1. Of the 805 665 women screened, 805 206 had complete records of first and second reader screening decisions. A total of 459 women (0.1%) were excluded from further analysis because 44 were examined by a single reader only and recalled for further tests, and 425 were examined by a single reader only and not recalled for further tests. All women had complete records for whether they were recalled for further tests and for whether the results of those further tests showed any type of cancer (DCIS or invasive). The median age of the women included was 59 years (interquartile range, 53–65 years), and 169 753 women (21.1%) were attending their first ever screening appointment.
There were 7055 cancers detected by the system of double reading plus arbitration. If there had been only single reading (the first reader decision only), then fewer cancers (n = 6425) would have been detected (P < .001, test of two proportions). The second reader detected an additional 627 cancers that were not detected by the first reader.
The additional cancers detected by the second reader (which were not detected by the first reader) were less likely to contain invasive disease (69.5% [...] For cancers where DCIS only was present (no invasive disease), DCIS grade was lower in cancers detected [...]
A total of 7055 cancers were detected. Details of missing data are provided in the [...] addition to 6536 detected by the first reader.
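The three recall strategies defined under Statistical Analysis can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' R code: the `Reading` record layout, the function names, and the ten toy records are all hypothetical and are not study data. It assumes, as in the 33 included centers, that arbitration is used only when the two readers disagree.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Reading:
    """One woman's screening decisions (hypothetical record layout)."""
    reader1_recall: bool
    reader2_recall: bool
    # Arbitration outcome; None when the readers agreed and no arbitration took place.
    arbitration_recall: Optional[bool] = None
    cancer_confirmed: bool = False


def final_recall(r: Reading) -> bool:
    """Double reading plus arbitration: agreement stands, disagreement is arbitrated."""
    if r.reader1_recall == r.reader2_recall:
        return r.reader1_recall
    if r.arbitration_recall is None:
        raise ValueError("discordant reading without an arbitration decision")
    return r.arbitration_recall


def recall_rates(readings):
    """Recall rate under each of the three strategies compared in the analysis."""
    n = len(readings)
    return {
        "double_plus_arbitration": sum(final_recall(r) for r in readings) / n,
        "single_reader": sum(r.reader1_recall for r in readings) / n,
        "either_reader": sum(r.reader1_recall or r.reader2_recall for r in readings) / n,
    }


def second_reader_only_cancers(readings):
    """Cancers found under double reading that reader 1 alone would have missed."""
    return sum(
        1
        for r in readings
        if r.cancer_confirmed and final_recall(r) and not r.reader1_recall
    )


# Toy records (assumed for illustration, not study data).
readings = [
    Reading(True, True, cancer_confirmed=True),          # both readers recall
    Reading(True, False, arbitration_recall=False),      # reader 1 only, overruled
    Reading(True, False, arbitration_recall=False),      # reader 1 only, overruled
    Reading(False, True, arbitration_recall=True, cancer_confirmed=True),  # reader 2 only, upheld
    Reading(False, True, arbitration_recall=False),      # reader 2 only, overruled
] + [Reading(False, False) for _ in range(5)]            # neither reader recalls

rates = recall_rates(readings)
extra_cancers = second_reader_only_cancers(readings)
```

On these toy records, arbitration brings the recall rate below that of reader 1 alone while one cancer is credited to the second reader, which mirrors the direction, not the magnitude, of the reported findings.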

Discussion
In this large population-based cohort study nested within a trial we found that the addition of a second reader to interpret breast screening mammograms, plus arbitration of discordant examinations, reduced recall rate and increased cancer detection rate. The second reader detected an extra 627 (of 7055 [8.9%]) cancers not detected by the first reader, but these were smaller and lower grade and were less likely to be invasive or have involved nodes. These characteristics are indicative of earlier detection and a potential benefit from less aggressive, more successful treatment, but are also suggestive of overdiagnosis of disease. While overdiagnosis is more associated with smaller, lower grade, noninvasive disease without involved nodes, we cannot accurately predict which individual cancers will develop symptomatically. Previous studies using digital mammography found higher recall rates using double reading (4.8%–4.9%) than single reading (4.6%) (23,24). A recent analysis (23) suggested that double reading may not be cost effective. We found that with effective arbitration of discordant examinations, a second reader can reduce recall rates, but a formal cost-benefit analysis would be needed to assess the incremental benefit of the time involved in the second round of interpretations in the optimal strategy. Previous digital mammography studies have been small and have revealed no statistically significant difference in cancer detection rates or the size, grade, and type of cancer between single and double reading (22–24). Our study is an order of magnitude larger than these studies and indicates that the addition of a second reader increases cancer detection rates, although the additional cancers detected are smaller, and of a lower grade and stage. The inconsistencies observed between our study and previous studies may reflect the greater statistical power of our study to detect small differences.
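The headline proportions can be checked from the counts reported in the abstract. The sketch below is a hedged illustration in Python rather than the authors' R analysis: it assumes a normal-approximation (Wald) interval and a pooled two-sided z test without continuity correction, neither of which is specified in the text, and the function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def wald_ci(successes, n, level=0.95):
    """Proportion with a normal-approximation (Wald) confidence interval."""
    p = successes / n
    z = _STD_NORMAL.inv_cdf(0.5 + level / 2)
    half_width = z * sqrt(p * (1 - p) / n)
    return p, p - half_width, p + half_width


def two_proportion_z_test(x1, n1, x2, n2):
    """Pooled two-sided z test for equality of two independent proportions."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = 2 * (1 - _STD_NORMAL.cdf(abs(z)))
    return z, p_value


# Counts as reported in the abstract.
first_reader = wald_ci(38_295, 805_206)     # reported: 4.76% (4.71%, 4.80%)
either_reader = wald_ci(49_857, 805_206)    # reported: 6.19% (6.14%, 6.24%)
after_arbitration = wald_ci(32_863, 805_206)  # reported: 4.08% (4.04%, 4.12%)
second_reader_only = wald_ci(627, 7055)     # reported: 8.89% (8.22%, 9.55%)

# DCIS share: second-reader-only cancers vs cancers detected by reader 1.
dcis_z, dcis_p = two_proportion_z_test(183, 600, 1344, 6114)

for label, (p, low, high) in [
    ("first reader recall", first_reader),
    ("either reader recall", either_reader),
    ("recall after arbitration", after_arbitration),
    ("cancers found by second reader only", second_reader_only),
]:
    print(f"{label}: {100 * p:.2f}% ({100 * low:.2f}%, {100 * high:.2f}%)")
print(f"DCIS share comparison: z = {dcis_z:.2f}, P < .001: {dcis_p < 0.001}")
```

The z test is applied only to the DCIS comparison because those two groups of cancers are disjoint; the recall-rate comparisons involve the same women under different strategies, so an independent-samples test would be a rougher approximation there.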
Policy makers routinely evaluate how to deliver breast screening in optimal ways. In France, a recommendation has been made to expand the use of second readers on the basis of increased cancer detection (from older film mammography studies) and quality assurance (30). Conversely, Spanish researchers have suggested that double reading may not be cost effective (on the basis of results of small digital mammography studies suggesting increased recall rates) (23). Our findings indicate an increase in cancer detection with a second reader using digital mammography and that recall rates can be reduced with effective arbitration.
Fully understanding the difference in outcomes between screening programs using single or double reading requires a randomized controlled trial. Future research may also investigate the effect of a second reader when using breast tomosynthesis.
This study had limitations. First, some women recalled by one reader received only a reference standard of arbitration, and only those recalled by arbitration received further testing (eg, diagnostic biopsy). It is possible that some women not recalled by arbitration did have cancer that was not detected by this reference standard. In a study that predominantly used film mammography, Hofvind and colleagues (31) reported that the rate of interval cancers was higher among women who had discordant interpretations of their mammograms (ie, where one reader recommended recall and the other did not) and who were not recalled than among the whole screening population (2.9 per 1000 vs 1.7 per 1000). Similarly, in our study, interval cancer rates were higher in women whose cases were arbitrated and not recalled at the current round (6.1 per 1000 women in those recalled by reader 2 only and 5.5 per 1000 women in those recalled by reader 1 only) than in other women not recalled at the current round (1.9 per 1000 women screened). Ascertainment of interval cancers is unlikely to be complete (particularly from year 3) because of delays in data transfer from the English cancer registries to screening units. For the same groups of women whose cases were arbitrated and not recalled at the current round, we found a similar excess in cancer detection rates at the subsequent screening round. This excess may be due to a combination of cancers missed at the current screening round by arbitration and cases with a [...] operate at a different standard or recall threshold if there is no second reader to pick up missed cancers and no arbitration to reduce false-positive recalls. Third, although performing this study in a trial setting minimized missing data, there remained some missing information about cancer characteristics. The results of the sensitivity analysis assuming that all missing data were extreme cases did not alter the overall results. Finally, while readers independently examined mammograms, they could access the decision of the first reader by examining notes. This is not a normal part of reading in a busy population screening program, but if it occurred would support our null hypothesis and underestimate the incremental value of a second reader (if second readers were aligning their results with that of the first reader, cancers detected by the second reader would not have different characteristics from those detected by the first reader).
In conclusion, in this large population-based cohort study, the use of a second reader plus arbitration in mammography reduced recall rates and improved cancer detection. The extra cancers detected were smaller and lower grade and were less likely to be invasive or have involved nodes. Detecting these extra cancers may be associated with detecting important pathologic findings earlier, but it may also be associated with increased overdiagnosis from screening. Further analysis of follow-up data on outcomes is required to understand the balance of the benefits and harms of detecting these extra cancers. Policy makers should consider the overall harms and benefits when deciding whether to use a second reader, bearing in mind that a single reader might not perform the same way a first reader working as part of a team might.