Reviews and Commentary

Assessing Bone Age: A Paradigm for the Next Generation of Artificial Intelligence in Radiology

See also the article by Eng et al in this issue.

Dr Rubin is an adjunct professor of radiology at the New York University Grossman School of Medicine, president of All Pro Orthopedic Imaging Consultants, and a practicing musculoskeletal radiologist for Radsource. He is a fellow of the American College of Radiology and an associate editor of Radiology.

The diagnosis, monitoring, and treatment planning for several musculoskeletal conditions, including various endocrinopathies, abnormal stature, scoliosis, and limb length discrepancies, rely on an accurate assessment of skeletal maturity. Skeletal maturity is typically estimated with a hand and wrist radiograph. A common method uses an atlas developed by Greulich and Pyle (1). The atlas is based on serial examinations performed in 1000 healthy boys and girls in the Cleveland, Ohio, area from 1931 through 1942—a “big data” experiment performed in the precomputer age. The participants were White, born in the United States, and mostly of Northern European descent and high socioeconomic status. Plates in the atlas show the most representative image from 100 radiographs of children at the age and sex of the reference standard. A radiologist makes multiple subjective judgements comparing individual bones in a patient with those depicted in the reference standard. He or she then assigns a bone (skeletal) age, which is defined as the chronologic age at which children on whom the standards were based would attain the same degree of skeletal maturity (1).

Bone age determination seems like an ideal application for artificial intelligence (AI): It is based on one standardized posteroanterior radiograph. There is only one “diagnosis” (ie, the estimated bone age), unlike other applications, where, for example, an algorithm built to detect pneumothorax cannot also evaluate a chest radiograph for nodules, infiltrates, or heart failure. Many radiologists consider the task tedious and time consuming, and expert reading requires experience. Reliability and reproducibility are paramount, especially because sequential examinations are often performed in clinical practice. These reasons likely explain the multiple AI products already developed to estimate bone age (2) and the results of a 2017 Radiological Society of North America challenge that garnered 105 entries (3).

In this issue of Radiology, Eng et al (4) conducted a multi-institutional randomized investigation of an AI technique trained to predict bone age based on a ground truth established by a four-expert panel, applying a robust statistical analysis. Two unique and laudable features of the study stand out. First, the multicenter design involved 93 radiologists, simulating a real-world scenario instead of an artificial one that testing in one academic center might have produced. Second, the authors compared the accuracy attained by radiologists with access to an AI-generated bone age against that of the same radiologists working unaided. Radiologists were shown the AI results but then could accept or override them. This approach, rather than testing the accuracy of AI alone compared with that of the radiologist alone, mimics how most practices would apply this technology. In addition, this method addresses a fear of some radiologists that they may be replaced instead of enhanced by AI in the future.

The results showed that for five of the six included centers, the AI-aided method resulted in improved performance compared with performance without AI, with the mean absolute difference from the ground truth reduced from 6.0 to 5.4 months. The number of examinations assigned bone ages more than 12 months different from the ground truth also decreased in the AI-assisted scenario from 13.0% to 9.3%. Average interpretation time was 40 seconds faster in the AI-aided scenario.
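The two accuracy measures quoted above can be illustrated with a minimal sketch. The function names and the five bone ages below are hypothetical and chosen only for illustration; they are not data from Eng et al, and the sketch makes no claim about how that study performed its analysis.

```python
from statistics import mean


def mean_absolute_difference(estimates, ground_truth):
    """Mean of |estimate - ground truth|, in the same units as the inputs (months here)."""
    if len(estimates) != len(ground_truth):
        raise ValueError("estimates and ground_truth must have the same length")
    return mean(abs(e - g) for e, g in zip(estimates, ground_truth))


def fraction_beyond(estimates, ground_truth, threshold_months=12):
    """Fraction of examinations whose estimate differs from ground truth by more than the threshold."""
    if len(estimates) != len(ground_truth):
        raise ValueError("estimates and ground_truth must have the same length")
    diffs = [abs(e - g) for e, g in zip(estimates, ground_truth)]
    return sum(d > threshold_months for d in diffs) / len(diffs)


# Hypothetical bone ages in months (illustrative only, not study data).
ground_truth = [120, 96, 150, 60, 132]
unaided = [128, 90, 164, 57, 130]
ai_aided = [125, 92, 158, 58, 131]

mad_unaided = mean_absolute_difference(unaided, ground_truth)
mad_aided = mean_absolute_difference(ai_aided, ground_truth)
outliers_unaided = fraction_beyond(unaided, ground_truth)
outliers_aided = fraction_beyond(ai_aided, ground_truth)
```

With real data, each list would hold one bone age per examination, and the two measures would be computed separately for the aided and unaided readings.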

Additionally, the interpretations provided by the radiologists working together with the AI input were more accurate than those assigned by AI alone, which had a mean absolute difference of 6.2 months. The authors provide several explanations. Essentially, most radiologists were more likely to overrule an inaccurate AI-assigned bone age than to incorrectly change a correct one. Furthermore, AI and radiologist performance were complementary, with certain cases more accurately assessed by either a human or a machine.

The study did identify one center as an outlier, where results of the AI-aided interpretations were less accurate. The authors found that while the radiologists acting alone at this center outperformed their peers at the other locations, they were also more likely to modify an initial highly accurate estimate (at most 3 months different from the ground truth) provided by the AI. Radiology practices embracing AI software must understand that individual behavior can potentially negate the benefits of AI. Remember: Your results may vary.

While there were significant improvements in accuracy and interpretation times, the actual magnitude of the improvement was small: A mean improvement compared with the ground truth of 0.6 months (18 days) is unlikely to be clinically relevant and may not justify the implementation of an AI tool for many practices. This is especially true in practices with low volumes of requested bone age examinations, where a 40-second savings a few times a day may not be meaningful.
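The conversion behind the 18-day figure can be written out as follows. The 30-day month is an assumption made here for the arithmetic; the text does not state which month length was used.

```latex
\[
6.0 - 5.4 = 0.6 \text{ months}, \qquad
0.6 \text{ months} \times 30 \ \frac{\text{days}}{\text{month}} = 18 \text{ days}
\]
```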

The current study also discusses automation bias—the tendency of some radiologists to overly trust the AI even when it presents inaccurate estimates—which may result in time savings at the cost of decreased accuracy. The effect is somewhat analogous to that seen when other (non-AI) data are known before image interpretation. For example, observers first told the chronologic age of a patient are more likely to assign a bone age within 2 standard deviations of that age, compared with when the chronologic age is withheld (5). Again, users incorporating AI into their current or future practices need to be aware of this potential unconscious bias.

So, what are the logical next steps for developing AI radiology applications? Continuing advances in computer architecture and programming techniques will incrementally improve performance and speed, although one can argue that the current accuracy of many AI algorithms (or ensembles of algorithms) already exceeds what is needed clinically. Does a mean absolute difference of 5.4 months in assigned bone age affect clinical care when the standard deviation for healthy children older than 3.5 years (1) already exceeds this amount? The ability of AI-assisted techniques to identify meaningful changes in sequential examinations still needs to be proven, but given the current accuracy, it is very likely that sensitivity to change will at least equal that available with human interpretations.

I believe it is time to think about eliminating the human-based ground truth for future applications. While expert consensus was a necessary initial step in evaluating new algorithms, it is possible that some AI already outperforms radiologists, but current study design (using a human-based reference standard) makes that impossible to show. In essence, we are not training algorithms to find the most correct answer but rather to best predict what the radiologist-based diagnosis would be. Recently, Pan et al (6) showed that AI could be trained and validated using radiographs obtained from a diverse pediatric trauma population, where each patient’s chronologic age was used as the ground truth.

One major critique of using the Greulich and Pyle atlas as a reference standard is that it may not be equally applicable to children of different ethnic and racial backgrounds (7,8). Another is that changes in nutrition, physical fitness, and overall health may mean that normal ranges present in the 1930s no longer adequately apply to current populations. Why not leverage AI to develop new standards, using collections of radiographs obtained from otherwise healthy children for nonendocrine indications? The goal would be to predict each child’s actual chronologic age on the day of radiography, not the bone age assigned by a cohort of expert radiologists. With current computing power, it would be possible to process thousands of images divided into groups by ethnicity, geography, or other factors to establish multiple new norms, in a small fraction of the 12 years needed to create the original data used for the Greulich and Pyle atlas.

Large digital databases provide an opportunity to develop AI that moves beyond predicting how radiologists would interpret an image. For example, a recent study (9) investigated how AI could be trained not only to predict the Kellgren-Lawrence grade of knee osteoarthritis assigned by radiologists but also to prognosticate the risk of future joint replacement based on a knee radiograph. Only by shedding the limitations imposed by training using a human-based ground truth can researchers develop applications that will enable clinically relevant forecasts that are currently beyond the abilities of non–AI-aided radiologists.

Disclosures of Conflicts of Interest: D.A.R. is on the ImageBiopsy Lab medical advisory board; is an associate editor of Radiology.


  • 1. Greulich WW, Pyle SL. Radiographic atlas of skeletal development of the hand and wrist. 2nd ed. Stanford, Calif: Stanford University Press, 1959.
  • 2. Dallora AL, Anderberg P, Kvist O, Mendes E, Diaz Ruiz S, Sanmartin Berglund J. Bone age assessment with various machine learning techniques: A systematic literature review and meta-analysis. PLoS One 2019;14(7):e0220242.
  • 3. Halabi SS, Prevedello LM, Kalpathy-Cramer J, et al. The RSNA Pediatric Bone Age Machine Learning Challenge. Radiology 2019;290(2):498–503.
  • 4. Eng DK, Khandwala NB, Long J, et al. Artificial intelligence algorithm improves radiologist performance in skeletal age assessment: a prospective multicenter randomized controlled trial. Radiology 2021. Published online September 28, 2021.
  • 5. Berst MJ, Dolan L, Bogdanowicz MM, Stevens MA, Chow S, Brandser EA. Effect of knowledge of chronologic age on the variability of pediatric bone age determined using the Greulich and Pyle standards. AJR Am J Roentgenol 2001;176(2):507–510.
  • 6. Pan I, Baird GL, Mutasa S, et al. Rethinking Greulich and Pyle: A Deep Learning Approach to Pediatric Bone Age Assessment Using Pediatric Trauma Hand Radiographs. Radiol Artif Intell 2020;2(4):e190198.
  • 7. Zhang A, Sayre JW, Vachon L, Liu BJ, Huang HK. Racial differences in growth patterns of children assessed on the basis of bone age. Radiology 2009;250(1):228–235.
  • 8. Alshamrani K, Messina F, Offiah AC. Is the Greulich and Pyle atlas applicable to all ethnicities? A systematic review and meta-analysis. Eur Radiol 2019;29(6):2910–2923.
  • 9. Leung K, Zhang B, Tan J, et al. Prediction of Total Knee Replacement and Diagnosis of Osteoarthritis by Using Deep Learning on Knee Radiographs: Data from the Osteoarthritis Initiative. Radiology 2020;296(3):584–593.

Article History

Received: May 25 2021
Revision requested: June 14 2021
Revision received: June 15 2021
Accepted: June 17 2021
Published online: Sept 28 2021
Published in print: Dec 2021