When deciding whether to recall a woman for additional diagnostic examinations, experienced radiologists performed significantly better on average, and, equally important, more consistently, in the clinic than in the laboratory when interpreting the same examinations.

Purpose

To compare radiologists' performance during interpretation of screening mammograms in the clinic with their performance when reading the same mammograms in a retrospective laboratory study.

Materials and Methods

This study was conducted under an institutional review board–approved, HIPAA-compliant protocol; the need for informed consent was waived. Using a screening Breast Imaging Reporting and Data System (BI-RADS) rating scale, nine experienced radiologists rated an enriched set of mammograms that they had personally read in the clinic (the “reader-specific” set), mixed with an enriched “common” set of mammograms that none of the participants had previously read in the clinic. The original clinical recommendations to recall the women for a diagnostic work-up, for both the reader-specific and common sets, were compared with the radiologists' recommendations during the retrospective experiment. The results are presented in terms of reader-specific and group-averaged sensitivity and specificity levels and the dispersion (spread) of the reader-specific performance estimates.
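As an illustration of the performance measures described above, the following minimal sketch computes reader-specific sensitivity and specificity from binary recall decisions and summarizes the interreader dispersion of those estimates. All names and values are hypothetical and do not reproduce the study's data or analysis.

```python
import statistics

# Hypothetical recall decisions: for each reader, one
# (recalled, cancer_present) pair per examination.
# Illustrative values only, not the study's data.
readers = {
    "reader_1": [(True, True), (False, True), (True, False), (False, False)],
    "reader_2": [(True, True), (True, True), (False, False), (True, False)],
}

def sensitivity_specificity(decisions):
    # Sensitivity: fraction of cancer cases that were recalled.
    # Specificity: fraction of non-cancer cases that were not recalled.
    tp = sum(1 for recalled, cancer in decisions if recalled and cancer)
    fn = sum(1 for recalled, cancer in decisions if not recalled and cancer)
    tn = sum(1 for recalled, cancer in decisions if not recalled and not cancer)
    fp = sum(1 for recalled, cancer in decisions if recalled and not cancer)
    return tp / (tp + fn), tn / (tn + fp)

# Reader-specific estimates, their group average, and their spread.
estimates = {name: sensitivity_specificity(d) for name, d in readers.items()}
sensitivities = [sens for sens, _ in estimates.values()]
print("group-averaged sensitivity:", statistics.mean(sensitivities))
print("interreader dispersion (SD):", statistics.stdev(sensitivities))
```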

Results

On average, the radiologists' performance was significantly better in the clinic than in the laboratory (P = .035). Interreader dispersion of the computed performance levels was significantly lower during the clinical interpretations (P < .01).
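The abstract does not state which statistical procedures produced these P values. Purely as a hypothetical sketch using SciPy, a paired comparison of mean performance and a Levene-type comparison of spread might look like the following; note that Levene's test treats the two samples as independent, so a paired-scale test would be more appropriate for matched readers.

```python
from scipy import stats

# Hypothetical per-reader performance levels (e.g., sensitivity) for the
# same nine readers in the clinic and in the laboratory; values are
# illustrative only and do not reproduce the study's results.
clinic = [0.82, 0.79, 0.85, 0.80, 0.83, 0.81, 0.84, 0.78, 0.82]
laboratory = [0.70, 0.88, 0.64, 0.75, 0.91, 0.66, 0.73, 0.86, 0.69]

# Paired comparison of average performance across readers.
t_stat, p_mean = stats.ttest_rel(clinic, laboratory)
print(f"paired difference in means: P = {p_mean:.3f}")

# Brown-Forsythe variant of Levene's test for a difference in spread
# (interreader dispersion) between the two settings.
w_stat, p_disp = stats.levene(clinic, laboratory, center="median")
print(f"difference in dispersion:   P = {p_disp:.3f}")
```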

Conclusion

Retrospective laboratory experiments may not accurately represent either the performance levels or the interreader variability to be expected when the same set of mammograms is interpreted in the clinical environment.

© RSNA, 2008

Article History

Received November 19, 2007; revision requested December 21, 2007; final revision received January 17, 2008; accepted March 28, 2008.
Published in print: 2008