
Artificial Intelligence Algorithm Improves Radiologist Performance in Skeletal Age Assessment: A Prospective Multicenter Randomized Controlled Trial

Published Online: https://doi.org/10.1148/radiol.2021204021

Abstract

Background

Previous studies suggest that use of artificial intelligence (AI) algorithms as diagnostic aids may improve the quality of skeletal age assessment, though these studies lack evidence from clinical practice.

Purpose

To compare the accuracy and interpretation time of skeletal age assessment on hand radiograph examinations with and without the use of an AI algorithm as a diagnostic aid.

Materials and Methods

In this prospective randomized controlled trial, skeletal age assessment on hand radiograph examinations was performed with (n = 792) and without (n = 739) the AI algorithm as a diagnostic aid. For examinations with the AI algorithm, the radiologist was shown the AI interpretation as part of their routine clinical work and was permitted to accept or modify it. Hand radiographs were interpreted by 93 radiologists from six centers. The primary efficacy outcome was the mean absolute difference between the skeletal age dictated into the radiologists’ signed report and the average interpretation of a panel of four radiologists not using a diagnostic aid. The secondary outcome was the interpretation time. A linear mixed-effects regression model with random center- and radiologist-level effects was used to compare the two experimental groups.

Results

Overall mean absolute difference was lower when radiologists used the AI algorithm than when they did not (5.36 months vs 5.95 months; P = .04). The proportions of examinations for which the absolute difference exceeded 12 months (9.3% vs 13.0%; P = .02) and 24 months (0.5% vs 1.8%; P = .02) were lower with the AI algorithm than without it. Median radiologist interpretation time was shorter with the AI algorithm than without it (102 seconds vs 142 seconds; P = .001).

Conclusion

Use of an artificial intelligence algorithm improved skeletal age assessment accuracy and reduced interpretation times for radiologists, although differences were observed between centers.

Clinical trial registration no. NCT03530098

© RSNA, 2021

Online supplemental material is available for this article.

See also the editorial by Rubin in this issue.

Summary

Use of an artificial intelligence algorithm as a diagnostic aid improved the quality of skeletal age assessment in a prospective multicenter setting.

Key Results

  • Skeletal age was assessed on hand radiographs by 93 radiologists at six centers without (n = 739 radiographs) and with (n = 792) an artificial intelligence (AI) algorithm. Comparison was made to a reference panel of four experts.

  • Use of the AI algorithm showed a smaller difference in skeletal age compared with the reference panel (5.4 months with AI vs 6.0 months without AI, P = .04). This improvement was seen at five of the six radiology centers.

Introduction

The accurate determination of a child’s developmental status is required for proper treatment of various growth disorders (1) and scoliosis (2). Other parameters, such as height, weight, secondary sexual characteristics, chronologic age, and dental age, correlate with developmental status, but skeletal age has been considered the most reliable method (3–5). The standard of care for this assessment calls for radiologists to identify the reference standard in an atlas of hand radiographs that most closely resembles an anteroposterior or posteroanterior radiograph of the participant’s left hand. The most common atlas used as a reference standard is the Radiographic Atlas of Skeletal Development of the Hand and Wrist, published in 1959 (6).

As part of the process of implementing an artificial intelligence (AI) algorithm in clinical practice, it is critical to properly determine its effects. However, different study designs may yield different findings about the same assistive technologies. For example, the same commercially available computer-aided detection system for detecting pulmonary nodules on chest CT scans produced different findings in studies completed within a year of each other (7–9). Findings on potential computer-aided diagnosis systems for mammography were inconsistent and contradictory for more than a decade before definitive evaluations found no established benefit to women (10–12). Evaluations that suggested computer-aided detection improved diagnostic accuracy of mammography were designed as reader studies of enriched case sets, prospective sequential reading clinical studies, and retrospective observational studies using historical controls (12). One of the first evaluations to dispute the benefits was a prospective multicenter randomized trial (13).

Despite many examples in medical literature demonstrating the potential applications for AI in medicine (14–19), few studies have evaluated performance in a prospective or multicenter setting. In a meta-analysis of 516 studies published between January and August 2018 about the performance of AI algorithms analyzing medical images, none of the studies adopted a diagnostic cohort design in a prospective multicenter setting (20).

Previous studies have suggested that using an AI algorithm as a diagnostic aid may improve the quality of skeletal age assessment (21–23). Evaluation of the algorithm in representative clinical conditions would advance understanding of its true benefits and harms (ie, deferral to an inaccurate AI interpretation or overriding an accurate AI interpretation).

The purpose of this study was to determine whether the use of an AI algorithm as a diagnostic aid for radiologists assessing the skeletal maturity of pediatric participants improved their accuracy and interpretation time, compared with the standard of care, using a superiority trial design in a prospective multicenter setting.

Materials and Methods

Trial Oversight

We conducted the study from September 2018 through August 2019 in six radiology departments in the United States: Harvard Medical School and Boston Children’s Hospital, Cincinnati Children’s Hospital Medical Center, Children’s Hospital of Philadelphia, New York University School of Medicine, Stanford University School of Medicine, and Yale University School of Medicine. We refer to Stanford University School of Medicine as the reference center and the remaining centers as centers 1 through 5.

The institutional review boards at all centers approved the study. Verbal informed consent was obtained from each radiologist whose interpretations were included in the study, and consent was waived for each pediatric participant whose hand radiograph was interpreted in the study, in accordance with the institutional review board decision at all centers. A total of 93 radiologists were included in this study. No regulatory clearance or approval from the Food and Drug Administration was sought or obtained for the AI algorithm; the study is investigational device exempt under 21 CFR 812 §812.2 (c).

Algorithm Training

The AI algorithm used at all centers in this study was trained using deep learning methods on the open-source training data set released for the Radiological Society of North America (RSNA) Pediatric Bone Age Machine Learning Challenge, which was composed of radiographs obtained from Children’s Hospital Colorado (Aurora, Colo) and Lucile Packard Children’s Hospital Stanford (Palo Alto, Calif) (24). The training data set consisted of 12611 hand radiographs, including 3477 hand radiographs from Lucile Packard Children’s Hospital Stanford. A total of 5778 hand radiographs were of girls and 6833 were of boys. The mean chronologic age was 10 years 7 months. Additional details about the composition of the training data set are available in the original study (24).

Before evaluating the model as a diagnostic aid, the trained AI algorithm was evaluated on the open-source test data set released for the RSNA challenge. The original test data set consisted of 200 hand radiographs that were withheld from the training set from Lucile Packard Children’s Hospital Stanford. Additional details about the composition of the test data set are also available in the original study (24).

The AI algorithm was designed with a similar architecture as the winning algorithm in the RSNA challenge, using both pixel and sex information (25). The Inception V3 architecture was applied to the pixel information (resized to a consistent image size of 500 × 500 pixels) and concatenated with the binarized participant sex before being passed through several fully connected layers to output the predicted skeletal age (in months) (26). The algorithm was trained using Adam optimization (β1 = 0.9, β2 = 0.999, ε = 1 × 10−8), with an initial learning rate of 1 × 10−4, using a mean absolute error regression loss between the numeric prediction and the label. Data augmentation with random rotations, distortions, zooms, and flips was applied to improve the robustness of the AI algorithm. The AI algorithm was implemented using TensorFlow (https://www.tensorflow.org/) and Keras (https://keras.io/).
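
The description above maps onto a standard Keras model. The following is a minimal sketch, not the authors’ code: the layer widths, ImageNet pretraining, and replication of the grayscale radiograph to three channels are illustrative assumptions.

```python
# Minimal sketch of the described architecture: Inception V3 on the pixels,
# concatenated with binarized sex, followed by fully connected layers that
# regress skeletal age in months with an MAE loss and Adam optimization.
import tensorflow as tf
from tensorflow.keras import layers, Model

def build_bone_age_model(img_size=500):
    image_in = layers.Input(shape=(img_size, img_size, 3), name="radiograph")
    sex_in = layers.Input(shape=(1,), name="sex")  # binarized participant sex

    # Inception V3 backbone applied to the pixel information
    backbone = tf.keras.applications.InceptionV3(
        include_top=False, weights="imagenet", pooling="avg",
        input_shape=(img_size, img_size, 3))
    features = backbone(image_in)

    # Concatenate image features with sex, then several fully connected layers
    x = layers.Concatenate()([features, sex_in])
    x = layers.Dense(512, activation="relu")(x)
    x = layers.Dense(512, activation="relu")(x)
    skeletal_age = layers.Dense(1, name="skeletal_age_months")(x)

    model = Model(inputs=[image_in, sex_in], outputs=skeletal_age)
    model.compile(
        optimizer=tf.keras.optimizers.Adam(
            learning_rate=1e-4, beta_1=0.9, beta_2=0.999, epsilon=1e-8),
        loss="mean_absolute_error")  # MAE regression loss, as described
    return model
```

The data augmentation described in the text (random rotations, distortions, zooms, and flips) would be applied in the training input pipeline rather than inside the model itself.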

Trial Design and Implementation

The trial was designed as a prospective, multicenter randomized controlled clinical trial (ClinicalTrials.gov registration, NCT03530098). We included radiologists who interpreted pediatric skeletal age examinations and verbally consented to participate. We included participating radiologists’ interpretations for examinations that contained a procedure code or study description indicative of a skeletal age examination. Examinations that met the criteria were forwarded by the site’s picture archiving and communication system to a virtual machine provisioned on site. Before initiation of the trial, we installed software on the virtual machine that received Digital Imaging and Communications in Medicine (DICOM) C-STORE requests and randomly assigned in a 1:1 ratio to have the radiology report that accompanied the examination prepopulated (with-AI group) or not (control group) with the AI estimate. Additional visual aids, such as saliency maps, were not shown because these visual aids may artificially improve the performance of the radiologist, independent of the effect of the AI algorithm. Participating radiologists were not blinded to chronologic age in either the with-AI or control group to best simulate the prospective setting. Image characteristics, including shape, detector, distance to source, exposure, field of view, and manufacturer, were extracted from DICOM metadata.
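
As an illustration of the metadata extraction step, the sketch below shows one way to read the listed image characteristics from a DICOM header with pydicom; this is an assumed illustration rather than the trial software, and attribute availability varies by manufacturer and detector.

```python
# Illustrative only: read the image characteristics named above from DICOM
# metadata. Missing attributes are returned as None.
import pydicom

def image_characteristics(path):
    ds = pydicom.dcmread(path, stop_before_pixels=True)  # header only
    return {
        "shape": (ds.get("Rows"), ds.get("Columns")),
        "detector": ds.get("DetectorType"),
        "distance_to_source": ds.get("DistanceSourceToDetector"),
        "exposure": ds.get("Exposure"),
        "field_of_view": ds.get("FieldOfViewDimensions"),
        "manufacturer": ds.get("Manufacturer"),
    }
```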

Participating radiologists were instructed to read a script before the study (Appendix E1 [online]). The script addressed frequently asked questions and stipulated expectations pertaining to proper use of the AI algorithm for the with-AI group.

Examinations containing more than one radiograph were not randomized because the instance containing the proper anteroposterior or posteroanterior hand radiograph expected by the AI algorithm could not be verified in real time. No further exclusion criteria were applied based on image quality metrics or manufacturers. No exclusion criteria were applied based on participant chronologic age. Participating radiologists were not blinded to the trial group assignments because prepopulated AI estimates were present in their radiology reports for examinations randomly assigned to the with-AI group (Movie E1 [online]). Simple randomization was performed with the Python random package (https://docs.python.org/3/library/random.html). Before primary analysis, we excluded the participating radiologists’ interpretations for examinations for which a trainee provided a preliminary interpretation. There were no other requirements regarding the experience of radiologists.
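
The 1:1 simple randomization described above amounts to an independent coin flip per incoming examination; the following is a minimal sketch using the Python random package named in the text, not the deployed trial software.

```python
# Minimal sketch of 1:1 simple randomization per incoming examination.
import random

def randomize_examination() -> str:
    # "with_ai": the radiology report is prepopulated with the AI estimate.
    # "control": the report is left unchanged.
    return random.choice(["with_ai", "control"])
```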

Movie E1: Examinations that met the criteria were forwarded by the site’s picture archiving and communication system (PACS) to a virtual machine (VM) provisioned on site. Prepopulated AI estimates were presented in participating radiologists’ radiology reports for examinations randomly assigned to the with-AI group, as shown in the video. Additional visual aids, such as saliency maps, were not shown because these visual aids may artificially improve the performance of the radiologist, independent of the effect of the AI algorithm.

The trial was ended at a center when 300 examinations were enrolled (before applying exclusion criteria) or the total sample size determined in the power calculation was reached.

During the design phase of the trial, statistical power was estimated for testing the group difference on the primary efficacy outcome from 1000 simulated data sets. Each data set, with a sample size of 1600 examinations, was analyzed using a multilevel regression for a Wald test with a significance level of .05. The test achieved a power of 0.92 to detect a difference of 0.62 months in the primary outcome between the with-AI and control groups (Appendix E2 [online], Table E1 [online]). The actual trial size of 1531 examinations addressed our primary objective with a power of 0.89.
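
The power estimate above follows the usual simulation recipe: generate data under the hypothesized group difference, analyze each simulated data set, and count rejections. The sketch below illustrates that recipe under stated assumptions only; a two-sample t test stands in for the authors’ multilevel regression Wald test and the outcome distribution is a placeholder, so it is not expected to reproduce the reported power of 0.92.

```python
# Simulation-based power estimation, simplified for illustration.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2018)
n_sim = 1000          # number of simulated data sets
n_per_group = 800     # 1600 examinations split 1:1
effect = 0.62         # hypothesized group difference in months
mu, sd = 6.0, 4.0     # assumed mean/SD of per-examination absolute differences

rejections = 0
for _ in range(n_sim):
    control = rng.normal(mu, sd, n_per_group)
    with_ai = rng.normal(mu - effect, sd, n_per_group)
    rejections += stats.ttest_ind(with_ai, control).pvalue < 0.05

print(f"Estimated power: {rejections / n_sim:.2f}")
```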

Outcomes

The primary efficacy outcome was the mean absolute difference (MAD) between the skeletal age dictated into the radiologists’ signed report and the average interpretation of a panel of four radiologists not using a diagnostic aid. Radiologists involved in ground truth labeling were blinded to the trial group assignment of the examinations they were labeling and to each other. Data from previous studies supported this approach for establishing the ground truth (Appendix E3a [online]; Figs E1–E2 [online]), and initial data from this study further corroborated this decision (Appendix E3b [online]). A digital atlas was used to ensure standardization for ground truth labeling (Fig E3 [online]).

For each radiograph, panelists for ground truth used the digital atlas to either provide a label or indicate that the radiograph contained insufficient information to provide a label. Panelists were shown a standardized instructional video for using the digital atlas (Movie E2 [online]). Radiographs for which three or four of the panelists indicated that the radiograph contained insufficient information were excluded from primary analysis. Three radiographs depicting deformed hands and one radiograph depicting a knee were excluded from primary analysis. Other radiographs were relabeled by one or two adjudicators who were not part of the original panel. Agreement between panelists was verified (Table E2 [online]).

Movie E2: For each radiograph, panelists for ground truth used the digital atlas to either provide a label or indicate that the radiograph contained insufficient information to provide a label. Panelists were shown a standardized instructional video for using the digital atlas, as shown in the video.

The secondary outcome was the median interpretation time. To complete each interpretation, the participating radiologist opened and closed the radiology report at least once. The duration of each session was collected automatically from the system time stamps and summed to determine the interpretation time. Median interpretation time was used to reduce the effect of extreme outliers.
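
A minimal sketch of the two outcome measures, assuming hypothetical data structures (one dictated age per examination, the four panel labels, and a list of report session durations):

```python
import numpy as np

def mean_absolute_difference(reported_months, panel_labels_months):
    """Primary outcome: mean |dictated age - panel average|, in months."""
    reported = np.asarray(reported_months, dtype=float)
    panel_avg = np.asarray(panel_labels_months, dtype=float).mean(axis=1)
    return float(np.mean(np.abs(reported - panel_avg)))

def interpretation_time(session_durations_sec):
    """Secondary outcome input: total time across all open/close report sessions."""
    return sum(session_durations_sec)

# Toy example: one examination with four panel labels and two report sessions.
print(mean_absolute_difference([130.0], [[120, 126, 123, 129]]))  # 5.5 months
print(interpretation_time([75, 30]))                              # 105 seconds
```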

Statistical Analysis

Means and 95% CIs are presented for continuous variables, whereas numbers and proportions are presented for categorical variables. All unadjusted comparisons between the with-AI and control groups were performed using a Student t test or Wilcoxon rank sum test for continuous variables and a χ2 test or Fisher exact test for categorical variables. All statistical tests, where applicable, are two sided. All statistical analyses used SAS software (version 9.3; SAS Institute).

To evaluate the difference in our primary outcomes between with-AI and control groups, accounting for variabilities among centers and radiologists, a linear mixed-effects regression model was used with random center-level and radiologist-level effects (Appendix E4 [online]). From the mixed model, both overall and center-level differences between the two groups were estimated.
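
The authors fit this model in SAS; the sketch below expresses the same structure in Python with statsmodels as an assumed, illustrative alternative, with a random center-level intercept and a radiologist-level variance component nested within center (the column names are hypothetical).

```python
# Illustrative sketch only, not the trial's analysis code.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("trial_examinations.csv")  # hypothetical analysis data set with
# columns: abs_diff_months, group ("with_ai"/"control"), center, radiologist

model = smf.mixedlm(
    "abs_diff_months ~ group",                         # fixed effect: trial group
    data=df,
    groups=df["center"],                               # random center-level intercept
    re_formula="1",
    vc_formula={"radiologist": "0 + C(radiologist)"},  # radiologist effect nested in center
)
result = model.fit()
print(result.summary())  # the group coefficient's Wald test gives the adjusted comparison
```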

Results

Retrospective AI Evaluation

The performance of the trained AI algorithm was first evaluated on the open-source test data set released for the RSNA Pediatric Bone Age Machine Learning Challenge (24), which consisted of 200 hand radiographs. In this retrospective evaluation, an MAD of 4.9 months between the skeletal age predicted by the algorithm and the average interpretation of a panel of six radiologists was achieved, consistent with the performance of the set of winning algorithms in the RSNA challenge (24).

Standalone AI Evaluation

The performance of the AI algorithm in the prospective setting at the reference center did not differ from its performance on the RSNA challenge test data set (MAD, 5.3 months [prospective] vs 4.9 months [retrospective]; P = .37; Fig E4 [online]). The overall standalone performance of the AI algorithm across with-AI and control groups deteriorated at center 2 relative to the reference center but remained consistent at centers 1, 3, 4, and 5, where it did not differ from the reference center (MAD: center 1, 5.5 months [P = .27]; center 2, 6.4 months [P = .04]; center 3, 5.9 months [P = .09]; center 4, 6.0 months [P = .10]; center 5, 5.1 months [P = .82]; Fig 1).

Figure 1: Prospective artificial intelligence (AI) performance at external centers. Prospective performance at the reference center (n = 212) was compared with center 1 (n = 284), center 2 (n = 299), center 3 (n = 291), center 4 (n = 300), and center 5 (n = 145). The overall standalone performance of the AI algorithm across with-AI and control groups was worse at center 2 external to the reference center but remained consistent at centers 1, 3, 4, and 5, where it did not differ significantly (center 1, P = .27; center 2, P = .04; center 3, P = .09; center 4, P = .10; center 5, P = .82). The middle bar is positioned at the median, and the lower and upper bars are positioned at Q1–1.5 × IQR and Q3 + 1.5 × IQR, respectively. IQR = interquartile range, MAD = mean absolute difference, Q1 = first quartile, Q3 = third quartile.

Image characteristics, including shape, detector, distance to source, exposure, field of view, and manufacturer, were well varied across centers (Table E3 [online]), and hand radiographs with severe insets, rotations, or anatomic disorders were observed to negatively affect the performance of the AI algorithm (Fig E5 [online]). Performance of the AI algorithm did not differ between boys and girls (Fig E6 [online]).

Clinical Trial Evaluation

Of 1922 consecutive skeletal age assessment examinations assessed for eligibility, 1903 were randomized (964 in the with-AI group and 939 in the control group) at six centers. Of the randomized examinations, 1531 were considered for primary analysis. Randomized examinations were excluded from primary analysis if a trainee provided a preliminary interpretation, the interpreting radiologist was not participating in this study, or the radiograph contained insufficient information to provide a skeletal age label as determined by three or four panelists (Fig 2). Baseline demographics of examinations used for primary analysis did not differ between with-AI and control groups (Table 1).

Figure 2: Participant flowchart for inclusion. Of 1922 consecutive skeletal age assessment examinations assessed for eligibility between September 2018 and August 2019, 1903 were randomized (964 in the with-AI [artificial intelligence] group and 939 in the control group) at six centers.

Table 1: Baseline Characteristics of Participants for Primary Analysis

Analysis of the primary outcome showed a lower MAD between the skeletal age dictated into the radiologists’ signed report and the average interpretation of the four-radiologist panel in the with-AI group than in the control group (5.36 months vs 5.95 months; P = .04) (Table 2). The proportions of examinations for which the absolute difference exceeded 12 months and 24 months were also lower in the with-AI group than in the control group (12 months: 9.3% vs 13.0% [P = .02]; 24 months: 0.5% vs 1.8% [P = .02]) (Table 2). The absolute difference adjusted for the standard deviation at the participant’s chronologic age was likewise lower in the with-AI group than in the control group for both the Greulich and Pyle and the Brush Foundation standard deviations (Greulich and Pyle, P = .02; Brush Foundation, P = .03; Table 2).

Table 2: Primary and Secondary Outcomes in a Study of the Effect of AI Used as a Diagnostic Aid for Skeletal Age Assessment versus Current Standard of Care

In an adjusted mixed-effects model, the primary outcome showed lower diagnostic error in the with-AI group than in the control group (Fig 3).

Figure 3: Mixed-effects model. A linear mixed-effects regression model was used with random center-level and radiologist-level effects to evaluate the difference of our primary outcomes between with–artificial intelligence (AI) and control groups, accounting for variabilities among centers and radiologists. From the mixed model, both overall and center-level differences between the two groups were estimated. Primary outcomes were significant. For each center, unadjusted differences (solid lines) were compared with adjusted differences produced by the mixed-effects model (dotted lines) for with-AI (●) and control (○) groups. MAD = mean absolute difference.

Although the radiologists and AI algorithm performed the same in isolation (MAD, 6.0 months [radiologist] vs 6.2 months [AI]; P = .51) (Fig E7 [online]), use of the AI algorithm significantly improved the quality of the radiologists’ final impressions. Several behaviors produced this overall improvement. Throughout these analyses, accurate predictions were defined as those deviating at most 6 months from the ground-truth label and inaccurate predictions as those deviating at least 12 months. First, radiologists deferred to accurate AI predictions presented to them more often than they deferred to inaccurate predictions presented to them (70.0% vs 42.9%; P < .001). Second, radiologists improved inaccurate AI predictions that were presented to them more often than they worsened accurate predictions presented to them (51.7% vs 17.2%; P < .001). Third, the AI algorithm and radiologists performed well in different situations, meaning there were examinations for which AI predictions were accurate and radiologists were inaccurate and vice versa (Fig E8 [online]); sample examinations illustrate both situations qualitatively (Fig 4). Fourth, the AI algorithm produced accurate predictions more often than it produced inaccurate predictions (66.0% vs 11.5%; P < .001). Fifth, although the radiologists and AI algorithm performed the same in isolation, radiologist supervision improved the AI algorithm performance (MAD, 5.4 months [with radiologist] vs 6.2 months [control AI]; P = .01).

Figure 4: Sample examinations. (A) Computed radiograph in a girl randomly assigned to the control group at center 2 with low artificial intelligence (AI) error and high radiologist error (AI = 46 months, chronologic age = 76 months, radiologist = 69 months, panel = 41 months). (B) Computed radiograph in a boy randomly assigned to the control group at center 4 with high AI error and low radiologist error (AI = 126 months, chronologic age = 140 months, radiologist = 105 months, panel = 100.5 months).

Although use of the AI algorithm resulted in significantly lower diagnostic error overall, this effect varied across centers (test for variation from the mixed-effects model, P = .007). At the outlier center 5, use of AI did not result in lower diagnostic error (MAD, 5.3 months [with-AI] vs 4.8 months [control]; P = .21) (Fig E9 [online]). Two behaviors emerged at this outlier center that differed from the other centers; here, highly accurate AI predictions were defined as those deviating at most 3 months from the ground-truth label. First, radiologists at center 5 worsened highly accurate AI predictions presented to them more often than did radiologists at the other centers (40.6% vs 21.2%; P = .01) (Fig E10 [online]). The proportion of highly accurate predictions did not differ between the with-AI group at center 5 and the other centers (47.1% vs 39.8%; P = .24). Second, radiologists working in isolation at center 5 outperformed radiologists at the other centers (MAD, 4.8 months vs 6.1 months; P = .04) (Fig E11 [online]).

Diagnostic error was higher when inaccurate AI predictions were presented to radiologists in the with-AI group than when inaccurate predictions were generated but not presented in the control group, an indication of automation bias, although this difference did not reach statistical significance (MAD, 10.9 months [with-AI] vs 9.4 months [control]; P = .06) (Fig E12 [online]).

Analysis of the secondary outcome showed lower median time spent in the with-AI group than in the control group (102 seconds [with-AI] vs 142 seconds [control]; P < .001) (Table 2).

Discussion

Technical advancements in the form of deep learning methods offer renewed optimism for the potential of diagnostic aids in radiology. Unsurprisingly, the performance of the artificial intelligence (AI) algorithm in the prospective setting at the reference center did not differ from its performance on the Radiological Society of North America challenge test data set, which consisted of retrospective radiographs from the reference center (5.3 months vs 4.9 months; P = .37). The overall standalone performance of the AI algorithm across with-AI and control groups deteriorated at center 2 external to the reference center but remained consistent at centers 1, 3, 4, and 5 (center 1, P = .27; center 2, P = .04; center 3, P = .09; center 4, P = .10; center 5, P = .82), which offers reason for optimism that some AI algorithms may generalize to centers external to those that contributed to their training set without refinement, in contrast with prior evidence (27–29).

Our findings substantiate the importance of prospective multicenter randomized trials in the design of evaluation studies for AI algorithms. First, randomization into experimental and control groups allowed for the isolation of interactive effects between human radiologists and the AI algorithm. Despite the well-accepted strengths of a randomized and controlled trial design (30), regulatory and reimbursement pathways are increasingly supporting real-world evidence drawn from real-world data, as part of an effort to reduce trial costs (31). It remains unclear to what extent an uncontrolled trial design leveraging real-world evidence would have identified patterns that emerged only through comparison with a control group, such as the divergent behaviors at center 5.

Second, the prospective aspect of this study mitigated the influence of laboratory effects. Gur et al (32) concluded that retrospective laboratory experiments may not represent expected performance levels or interreader variability during interpretations of the same set of studies in the clinical environment, suggesting the importance of reproducing true clinical conditions.

Third, the multicenter nature of this study captured diverse clinical environments and behaviors. The effect of using an AI algorithm as a diagnostic aid at center 5 underscores the potential for an AI algorithm to result in higher diagnostic error for some examinations, radiologists, or centers, even if it might result in lower diagnostic error overall. The AI algorithm resulting in higher diagnostic error when inaccurate AI predictions were presented to radiologists in the with-AI group is an indication of automation bias. The effect of automation bias varied among participating radiologists. For some radiologists, increased acceptance of AI predictions co-occurred with diagnostic decline, whereas for other radiologists, increased acceptance co-occurred with diagnostic improvement (Fig E13 [online]). Whether a lower diagnostic error overall justifies a higher diagnostic error for certain examinations or radiologists remains an important ethical consideration for the use of any diagnostic aid.

The presence of automation bias may call into question the potential benefits of the reduction in interpretation time. It seems plausible that radiologists influenced by automation bias would accept AI predictions without adequate scrutiny or even persuade themselves to accept predictions without adequate justification. This behavior may shorten interpretation time at the expense of diagnostic accuracy, raising concern that the group-level difference in median interpretation time was a side effect of automation bias. However, it is likely that the reduction in interpretation time would still exist in the absence of automation bias (Appendix E5 [online]).

This study had limitations. First, the primary outcomes measured the diagnostic error of participating radiologists, not participant outcomes. Because of the multitude of unrelated clinical end points for skeletal age assessment, measuring participant outcomes instead of the diagnostic error of participating radiologists would have limited the eligibility criteria to a subset of the possible clinical end points. Furthermore, defining primary outcome measures in terms of the diagnostic error of participating radiologists aligned with the broader ambition to improve collective understanding about the effect of AI on diagnostic medicine. Second, this trial had an unblinded design; in a blinded design, radiologists would have been unaware of the presence or absence of an AI estimate in their reports, for example through a placebo group whose reports were prepopulated with random skeletal ages. However, such a blinded trial design would obscure the effect of automation bias and would raise ethical concerns about increasing overall diagnostic error. Because the trial was unblinded, it is possible that the performance of radiologists in the control group was artificially accurate because of pressure to outperform the with-AI group, a criticism that has also been leveled at other unblinded studies (33).

In this prospective multicenter randomized controlled trial comparing use of an artificial intelligence (AI) algorithm as a diagnostic aid with the current standard of care, overall diagnostic error was significantly decreased when the AI algorithm was used compared with when it was not. Diagnostic error was decreased with use of the AI algorithm at some but not all centers, including an outlier center where diagnostic error was increased. Taken together, these findings support careful consideration of AI for use as a diagnostic aid for radiologists and reinforce the importance of interactive effects between human radiologists and AI algorithms in determining potential benefits and harms of assistive technologies in clinical medicine.

Disclosures of Conflicts of Interest: D.K.E. disclosed no relevant relationships. N.B.K. is the cofounder of Bunkerhill Health. J.L. disclosed no relevant relationships. N.R.F. disclosed no relevant relationships. S.V.L. disclosed no relevant relationships. N.A.S. disclosed no relevant relationships. S.S.M. disclosed no relevant relationships. R.W.F. is on the Bunkerhill Health advisory board in exchange for a 0.5% stake in the company, received an honorarium from the Korean Society of Radiology for speaking at the Korean Congress of Radiology in 2019. S.E.S. disclosed no relevant relationships. A.J.T. is a consultant for Applied Radiology; institution received grants from Guerbet and the Cystic Fibrosis Foundation; receives royalties from Elsevier. M.L.F. disclosed no relevant relationships. S.L.K. disclosed no relevant relationships. K.E. disclosed no relevant relationships. S.P.P. disclosed no relevant relationships. B.J.D. disclosed no relevant relationships. B.M.E. disclosed no relevant relationships. C.G.A. receives book royalties from Amirsys. M.E.B. disclosed no relevant relationships. R.D. disclosed no relevant relationships. D.B.L. institution receives research support from Siemens Healthineers, holds stock in Bunkerhill Health. J.M.S. disclosed no relevant relationships. C.T.S. disclosed no relevant relationships. A.R.Z. disclosed no relevant relationships. C.P.L. received stock in Bunkerhill Health in return for board membership; received stock options in whiterabbit.ai, Nines.com, GalileoCDS, and Sirona Medical for service on their advisory boards; institution received grants from Bunkerhill Health, Carestream, GE Healthcare, Google Cloud, IBM, IDEXX, Lambda, Lunit, Nines, Subtle Medical, whiterabbit.ai, Bayer, Fuji, and Kheiron; received honoraria for virtual presentations at radiology meetings from Canon and auntminnie.com; was reimbursed for travel by Siemens. M.P.L. is on the board of Carestream and Nines Radiology; is a consultant for Bayer and Microsoft; holds stock in Bunkerhill Health. S.S.H. disclosed no relevant relationships.

Author Contributions

Author contributions: Guarantors of integrity of entire study, D.K.E., N.B.K., J.L., S.S.M., S.S.H.; study concepts/study design or data acquisition or data analysis/interpretation, all authors; manuscript drafting or manuscript revision for important intellectual content, all authors; approval of final version of submitted manuscript, all authors; agrees to ensure any questions related to the work are appropriately resolved, all authors; literature research, D.K.E., N.B.K., J.L., A.J.T., J.M.S., C.P.L., S.S.H.; clinical studies, D.K.E., N.B.K., J.L., N.R.F., R.W.F., S.E.S., A.J.T., M.L.F., S.L.K., K.E., S.P.P., C.G.A., M.E.B., R.D., A.R.Z., S.S.H.; statistical analysis, D.K.E., N.B.K., J.L., M.P.L.; and manuscript editing, D.K.E., N.B.K., J.L., S.S.M., R.W.F., S.E.S., A.J.T., S.L.K., K.E., S.P.P., B.M.E., C.G.A., R.D., D.B.L., J.M.S., C.T.S., A.R.Z., C.P.L., M.P.L., S.S.H.

References

  • 1. De Sanctis V, Di Maio S, Soliman AT, Raiola G, Elalaily R, Millimaggi G. Hand x-ray in pediatric endocrinology: skeletal age assessment and beyond. Indian J Endocrinol Metab 2014;18(7 Suppl 1):S63–S71.
  • 2. Dimeglio A, Canavese F. Progression or not progression? How to deal with adolescent idiopathic scoliosis during puberty. J Child Orthop 2013;7(1):43–49.
  • 3. Tanner JM, Whitehouse RH. Assessment of skeletal maturity and prediction of adult height (TW2 method). London, England: Academic Press; 1975.
  • 4. Gilsanz V, Ratib O. Hand bone age: a digital atlas of skeletal maturity. Berlin, Germany: Springer-Verlag; 2005.
  • 5. Gaskin CM, Kahn SL, Bertozzi JC, Bunch PM. Skeletal development of the hand and wrist. Oxford, England: Oxford University Press; 2011.
  • 6. Greulich WW, Pyle SI. Radiographic atlas of skeletal development of the hand and wrist. Stanford, Calif: Stanford University Press; 1959.
  • 7. Yuan R, Vos PM, Cooperberg PL. Computer-aided detection in screening CT for pulmonary nodules. AJR Am J Roentgenol 2006;186(5):1280–1287.
  • 8. Das M, Mühlenbruch G, Mahnken AH, et al. Small pulmonary nodules: effect of two computer-aided detection systems on radiologist performance. Radiology 2006;241(2):564–571.
  • 9. Lee IJ, Gamsu G, Czum J, Wu N, Johnson R, Chakrapani S. Lung nodule detection on chest CT: evaluation of a computer-aided detection (CAD) system. Korean J Radiol 2005;6(2):89–93.
  • 10. Fenton JJ, Taplin SH, Carney PA, et al. Influence of computer-aided detection on performance of screening mammography. N Engl J Med 2007;356(14):1399–1409.
  • 11. Fenton JJ, Abraham L, Taplin SH, et al. Effectiveness of computer-aided detection in community mammography practice. J Natl Cancer Inst 2011;103(15):1152–1161.
  • 12. Lehman CD, Wellman RD, Buist DSM, et al. Diagnostic accuracy of digital screening mammography with and without computer-aided detection. JAMA Intern Med 2015;175(11):1828–1837.
  • 13. Gilbert FJ, Astley SM, Gillan MG, et al. Single reading with computer-aided detection for screening mammography. N Engl J Med 2008;359(16):1675–1684.
  • 14. McKinney SM, Sieniek M, Godbole V, et al. International evaluation of an AI system for breast cancer screening. Nature 2020;577(7788):89–94.
  • 15. Yala A, Lehman C, Schuster T, Portnoi T, Barzilay R. A deep learning mammography-based model for improved breast cancer risk prediction. Radiology 2019;292(1):60–66.
  • 16. Sim Y, Chung MJ, Kotter E, et al. Deep convolutional neural network–based software improves radiologist detection of malignant lung nodules on chest radiographs. Radiology 2020;294(1):199–209.
  • 17. Park A, Chute C, Rajpurkar P, et al. Deep learning-assisted diagnosis of cerebral aneurysms using the HeadXNet model. JAMA Netw Open 2019;2(6):e195600.
  • 18. Ardila D, Kiraly AP, Bharadwaj S, et al. End-to-end lung cancer screening with three-dimensional deep learning on low-dose chest computed tomography. Nat Med 2019;25(6):954–961. [Published correction appears in Nat Med 2019;25(8):1319.]
  • 19. Varma M, Lu M, Gardner R, et al. Automated abnormality detection in lower extremity radiographs using deep learning. Nat Mach Intell 2019;1(12):578–583.
  • 20. Kim DW, Jang HY, Kim KW, Shin Y, Park SH. Design characteristics of studies reporting the performance of artificial intelligence algorithms for diagnostic analysis of medical images: results from recently published papers. Korean J Radiol 2019;20(3):405–410.
  • 21. Lee H, Tajmir S, Lee J, et al. Fully automated deep learning system for bone age assessment. J Digit Imaging 2017;30(4):427–441.
  • 22. Larson DB, Chen MCC, Lungren MP, Halabi SS, Stence NV, Langlotz CP. Performance of a deep-learning neural network model in assessing skeletal maturity on pediatric hand radiographs. Radiology 2018;287(1):313–322.
  • 23. Tajmir SH, Lee H, Shailam R, et al. Artificial intelligence-assisted interpretation of bone age radiographs improves accuracy and decreases variability. Skeletal Radiol 2019;48(2):275–283.
  • 24. Halabi SS, Prevedello LM, Kalpathy-Cramer J, et al. The RSNA Pediatric Bone Age Machine Learning Challenge. Radiology 2019;290(2):498–503.
  • 25. Cicero M, Bilbily A. Machine learning and the future of radiology: how we won the 2017 RSNA ML Challenge. 16 Bit. https://www.16bit.ai/blog/ml-and-future-of-radiology. Published November 23, 2017. Accessed May 2018.
  • 26. Szegedy C, Vanhoucke V, Ioffe S, Shlens J, Wojna Z. Rethinking the inception architecture for computer vision. Cornell University. https://arxiv.org/abs/1512.00567. Posted December 2, 2015. Accessed May 2018.
  • 27. Ting DSW, Cheung CY, Lim G, et al. Development and validation of a deep learning system for diabetic retinopathy and related eye diseases using retinal images from multiethnic populations with diabetes. JAMA 2017;318(22):2211–2223.
  • 28. Li X, Zhang S, Zhang Q, et al. Diagnosis of thyroid cancer using deep convolutional neural network models applied to sonographic images: a retrospective, multicohort, diagnostic study. Lancet Oncol 2019;20(2):193–201.
  • 29. Zech JR, Badgeley MA, Liu M, Costa AB, Titano JJ, Oermann EK. Variable generalization performance of a deep learning model to detect pneumonia in chest radiographs: a cross-sectional study. PLoS Med 2018;15(11):e1002683.
  • 30. Rothwell PM. External validity of randomised controlled trials: “to whom do the results of this trial apply?” Lancet 2005;365(9453):82–93.
  • 31. Hernandez-Boussard T, Monda KL, Crespo BC, Riskin D. Real world evidence in cardiovascular medicine: ensuring data validity in electronic health record-based studies. J Am Med Inform Assoc 2019;26(11):1189–1194.
  • 32. Gur D, Bandos AI, Cohen CS, et al. The “laboratory” effect: comparing radiologists’ performance and variability during prospective clinical and laboratory mammography interpretations. Radiology 2008;249(1):47–53.
  • 33. Zhou Q, Cao YH, Chen ZH. Optimizing the study design of clinical trials to identify the efficacy of artificial intelligence tools in clinical practices. EClinicalMedicine 2019;16(10):11.

Article History

Received: Jan 14 2021
Revision requested: Mar 16 2021
Revision received: June 24 2021
Accepted: July 22 2021
Published online: Sept 28 2021
Published in print: Dec 2021