Statistical Concepts Series

Measurement of Observer Agreement

Statistical measures used in diagnostic imaging to express observer agreement on categorical data are described. These measures characterize the reliability of imaging methods and the reproducibility of disease classifications and, occasionally and with great care, serve as a surrogate for accuracy. The review concentrates on the chance-corrected indices, κ and weighted κ. Examples from the imaging literature illustrate the method of calculation and the effects of both disease prevalence and the number of rating categories. Other, less frequently used measures of agreement, including multiple-rater κ, are referenced and described briefly.
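The chance-corrected indices named above follow the standard definitions: κ = (p_o − p_e)/(1 − p_e), where p_o is the observed proportion of agreement and p_e the agreement expected by chance from the marginal rating frequencies; weighted κ additionally credits partial agreement between ordered categories. A minimal sketch of both calculations (the 2 × 2 agreement table below is hypothetical, not taken from the article's examples):

```python
import numpy as np

def cohen_kappa(table):
    """Cohen's kappa for an R x R inter-observer agreement table of counts.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    (sum of diagonal proportions) and p_e is the chance-expected agreement
    (sum of products of the row and column marginal proportions).
    """
    p = np.asarray(table, dtype=float)
    p /= p.sum()
    p_o = np.trace(p)                      # observed agreement
    p_e = p.sum(axis=1) @ p.sum(axis=0)    # chance-expected agreement
    return (p_o - p_e) / (1 - p_e)

def weighted_kappa(table, weights=None):
    """Weighted kappa; by default uses linear agreement weights
    w_ij = 1 - |i - j| / (R - 1), which credit near-miss ratings."""
    p = np.asarray(table, dtype=float)
    r = p.shape[0]
    if weights is None:
        i, j = np.indices((r, r))
        weights = 1 - np.abs(i - j) / (r - 1)
    p /= p.sum()
    p_e = np.outer(p.sum(axis=1), p.sum(axis=0))
    # Equivalent form: 1 - (weighted observed disagreement /
    #                       weighted chance disagreement)
    return 1 - ((1 - weights) * p).sum() / ((1 - weights) * p_e).sum()

# Hypothetical example: two readers classify 100 cases positive/negative.
t = [[40, 10],
     [5, 45]]
print(round(cohen_kappa(t), 3))   # p_o = 0.85, p_e = 0.50 -> kappa = 0.7
```

For a 2 × 2 table the linear weights reduce to simple agreement/disagreement, so `weighted_kappa` returns the same value as `cohen_kappa`; the two diverge only when there are three or more ordered categories.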

© RSNA, 2003

Article History

Published in print: Aug 2003