Statistical Concepts Series

Measurement of Observer Agreement

This review describes the statistical measures used in diagnostic imaging to express observer agreement on categorical data. These measures characterize the reliability of imaging methods and the reproducibility of disease classifications and, occasionally and with great care, serve as a surrogate for accuracy. The review concentrates on the chance-corrected indices κ and weighted κ. Examples from the imaging literature illustrate the method of calculation and the effects of both disease prevalence and the number of rating categories. Less frequently used measures of agreement, including multiple-rater κ, are referenced and described briefly.
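As a rough illustration of the chance-corrected indices named above, the following minimal sketch computes κ and a linearly weighted κ for two readers using the standard textbook definitions. The function names, the linear weighting scheme, and the rating counts are assumptions made for illustration only; none of them is taken from the article or its examples.

```python
from collections import Counter


def cohen_kappa(ratings_a, ratings_b):
    """Unweighted kappa: (p_o - p_e) / (1 - p_e) for two readers."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("need two equal-length, non-empty rating lists")
    n = len(ratings_a)
    # Observed agreement: fraction of cases both readers rated identically.
    p_o = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Chance agreement from each reader's marginal category frequencies.
    count_a, count_b = Counter(ratings_a), Counter(ratings_b)
    p_e = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if p_e == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (p_o - p_e) / (1 - p_e)


def weighted_kappa(ratings_a, ratings_b, categories):
    """Linearly weighted kappa; `categories` lists the ordinal levels in order."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("need two equal-length, non-empty rating lists")
    n = len(ratings_a)
    rank = {c: i for i, c in enumerate(categories)}
    count_a, count_b = Counter(ratings_a), Counter(ratings_b)
    # Observed disagreement, weighted by the distance between the two ratings.
    d_o = sum(abs(rank[a] - rank[b]) for a, b in zip(ratings_a, ratings_b)) / n
    # Disagreement expected by chance if the readers rated independently.
    d_e = sum(
        abs(rank[a] - rank[b]) * count_a[a] * count_b[b]
        for a in count_a
        for b in count_b
    ) / (n * n)
    if d_e == 0:
        raise ValueError("weighted kappa is undefined when expected disagreement is 0")
    return 1 - d_o / d_e


def expand(pair_counts):
    """Turn {(rating_a, rating_b): count} into two parallel rating lists."""
    pairs = [pair for pair, count in pair_counts.items() for _ in range(count)]
    return [a for a, _ in pairs], [b for _, b in pairs]


# Hypothetical counts: both tables have 80% observed agreement on 100 cases.
balanced_a, balanced_b = expand(
    {("pos", "pos"): 40, ("neg", "neg"): 40, ("pos", "neg"): 10, ("neg", "pos"): 10}
)
rare_a, rare_b = expand(
    {("pos", "pos"): 5, ("neg", "neg"): 75, ("pos", "neg"): 10, ("neg", "pos"): 10}
)

kappa_balanced = cohen_kappa(balanced_a, balanced_b)  # about 0.60
kappa_rare = cohen_kappa(rare_a, rare_b)  # about 0.22
```

In this made-up example the observed agreement is identical in both tables, yet κ falls from about 0.60 to about 0.22 when positive findings are rare, which is the prevalence effect the abstract refers to. With only two categories, the weighted and unweighted indices coincide.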

© RSNA, 2003



Article History

Published in print: Aug 2003