Development and Validation of a Convolutional Neural Network for Automated Detection of Scaphoid Fractures on Conventional Radiographs
Abstract
Purpose
To compare the performance of a convolutional neural network (CNN) to that of 11 radiologists in detecting scaphoid bone fractures on conventional radiographs of the hand, wrist, and scaphoid.
Materials and Methods
At two hospitals (hospitals A and B), three datasets consisting of conventional hand, wrist, and scaphoid radiographs were retrospectively retrieved: a dataset of 1039 radiographs (775 patients [mean age, 48 years ± 23 {standard deviation}; 505 female patients], period: 2017–2019, hospitals A and B) for developing a scaphoid segmentation CNN, a dataset of 3000 radiographs (1846 patients [mean age, 42 years ± 22; 937 female patients], period: 2003–2019, hospital B) for developing a scaphoid fracture detection CNN, and a dataset of 190 radiographs (190 patients [mean age, 43 years ± 20; 77 female patients], period: 2011–2020, hospital A) for testing the complete fracture detection system. Both CNNs were applied consecutively: The segmentation CNN localized the scaphoid and then passed the relevant region to the detection CNN for fracture detection. In an observer study, the performance of the system was compared with that of 11 radiologists. Evaluation metrics included the Dice similarity coefficient (DSC), Hausdorff distance (HD), sensitivity, specificity, positive predictive value (PPV), and area under the receiver operating characteristic curve (AUC).
Results
The segmentation CNN achieved a DSC of 97.4% ± 1.4 with an HD of 1.31 mm ± 1.03. The detection CNN had sensitivity of 78% (95% CI: 70, 86), specificity of 84% (95% CI: 77, 92), PPV of 83% (95% CI: 77, 90), and AUC of 0.87 (95% CI: 0.81, 0.91). There was no difference between the AUC of the CNN and that of the radiologists (0.87 [95% CI: 0.81, 0.91] vs 0.83 [radiologist range: 0.79–0.85]; P = .09).
Conclusion
The developed CNN achieved radiologist-level performance in detecting scaphoid bone fractures on conventional radiographs of the hand, wrist, and scaphoid.
Keywords: Convolutional Neural Network (CNN), Deep Learning Algorithms, Machine Learning Algorithms, Feature Detection-Vision-Application Domain, Computer-Aided Diagnosis
See also the commentary by Li and Torriani in this issue.
Supplemental material is available for this article.
©RSNA, 2021
Summary
A convolutional neural network achieved radiologist-level performance in detecting scaphoid fractures on conventional radiographs of the hand, wrist, and scaphoid.
Key Points
■ A deep learning system based on convolutional neural networks was developed to detect scaphoid bone fractures on conventional radiographs of the hand, wrist, and scaphoid and had sensitivity of 78%, specificity of 84%, positive predictive value of 83%, and an area under the receiver operating characteristic curve of 0.87.
■ The deep learning system was able to detect scaphoid fractures just as well as 11 radiologists did, achieving a comparable area under the receiver operating characteristic curve (0.87 vs 0.83 [average of all radiologists]; P = .09).
■ Class activation maps were found to overlap with fracture lines in the scaphoid bone and hence could be used for localizing potential fractures.
Introduction
Scaphoid bone fractures are the most common fractures of the carpal bones (82%–89%) and comprise 2%–7% of all skeletal fractures (1). They may be difficult to detect in the acute phase: the proportion of radiographically occult scaphoid fractures has been estimated at 7%–21% (2,3) and, in a recent prospective study, was reported to be as high as 50% (4). It is important to diagnose scaphoid fractures at an early stage because nonunion may occur in up to 12% of patients if an occult fracture remains untreated (5). Nonunion can be prevented when the fracture is diagnosed within the first week and treated with plaster immobilization (6). Nonunion may lead to complications such as avascular necrosis, carpal instability, osteoarthritis, and the need for surgical repositioning and fixation or resection of the proximal carpal row, and it may ultimately result in functional loss (6–8). Because of these risks, more than half of the patients clinically suspected of having a scaphoid fracture receive unnecessary wrist immobilization as a precaution, which increases health expenditure and decreases patients’ productivity (9,10). Conventional radiography is the imaging technique of choice for diagnosing scaphoid fractures because it is readily available and cost-efficient, but its sensitivity is low (66%–81%) (11,12).
Recently, Langerhuizen et al (13) demonstrated the feasibility of applying convolutional neural networks (CNNs) for the detection of scaphoid fractures on conventional radiographs. Their experimental CNN achieved accuracy and sensitivity similar to those of five orthopedic surgeons in scaphoid fracture detection but tended to miss obvious fractures. The authors therefore concluded that the CNN was still inferior to human observers at identifying scaphoid fractures on radiographs, and they recommended follow-up research with larger datasets and further algorithm refinements.
Our hypothesis was that CNNs could achieve expert-level performance in detecting scaphoid fractures if these recommendations were addressed, which may reduce the risk of missing a fracture, reduce the costs of additional imaging studies and unnecessary therapy, speed up diagnosis, and allow earlier treatment. We investigated this hypothesis by developing a fully automated CNN-based system for scaphoid fracture detection. The purpose of this study was (a) to develop a scaphoid segmentation model and a fracture detection model and (b) to validate the complete system by comparing its performance with that of radiologists in detecting scaphoid fractures on conventional radiographs of the hand, wrist, and scaphoid.
Materials and Methods
Datasets
This retrospective study was approved by the medical ethical review boards of the hospitals Jeroen Bosch Ziekenhuis (JBZ) and Radboud University Medical Center (Radboudumc) in the Netherlands. Informed written consent was waived, and data collection and storage were performed in accordance with local guidelines. Three different datasets consisting of hand, wrist, and scaphoid radiographs were prepared for training and testing a scaphoid segmentation CNN (dataset 1: 1039 radiographs [from 775 patients] acquired during 2017–2019 at JBZ and Radboudumc), for training a fracture detection CNN (dataset 2: 3000 radiographs [in 1846 patients] acquired during 2003–2019 at Radboudumc), and for testing the entire fracture detection system (dataset 3: 190 radiographs [in 190 patients] acquired during 2011–2020 at JBZ). Datasets 2 and 3 were gathered at different hospitals to assess the generalization performance of the fracture detection CNN. All radiographs were extracted from the picture archiving and communication system of JBZ and Radboudumc and were de-identified by removing metadata. Details of the datasets are provided in Table 1.
The datasets were manually selected and annotated by N.H. in consultation with a musculoskeletal radiologist with 30 years of experience (M.R.). For datasets 1 and 2, the VGG Image Annotator software (Visual Geometry Group; https://www.robots.ox.ac.uk/~vgg/software/via/) (14) was used to create scaphoid segmentation masks and bounding boxes. For datasets 2 and 3, binary labels for scaphoid fracture detection were derived from the original radiology reports. Dubious scaphoid fractures in dataset 2 were reevaluated by radiologist M.R. The diagnoses of all fractures in dataset 3 were confirmed with a follow-up CT scan within 4 weeks, the results of which were considered the ground truth.
Only anteroposterior and posteroanterior radiographs were selected for the datasets, because these showed the least amount of overlap between the scaphoid and other bones. Radiographs were excluded when accurate annotation or fracture diagnosis was impossible as a result of screws or other implants, resection, excessive damage or malformation, or a cast. Moreover, radiographs depicting old scaphoid fractures were excluded from dataset 3, because the focus of the current study was on early fracture diagnosis. Additional patient information and an overview of all criteria are provided in Appendixes E1 and E2 (supplement).
Model Pipeline
Figure 1 shows an overview of the model pipeline, which consisted of two main components: a scaphoid segmentation CNN and a fracture detection CNN (hereafter, the segmentation CNN and the detection CNN). A radiograph was first preprocessed by normalizing the scaphoid size and fixing the image size by padding or cropping. The segmentation CNN then localized the scaphoid on the radiograph, and irrelevant regions were removed by cropping. A lightweight custom architecture was designed to preserve the original image resolution as much as possible, because the segmentation task is complicated by the overlap of the scaphoid with surrounding carpal bones. After the cropping operation, the image was rescaled to a fixed size, and its contrast was normalized. A detailed description of the architecture, preprocessing steps, and training procedure is provided in Appendixes E3–E5 (supplement).
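The two-stage pipeline can be summarized in code form. The following sketch is illustrative only: the function, variable names, crop size, thresholds, and contrast normalization are assumptions, and the actual architectures and preprocessing steps are those described in Appendixes E3–E5 (supplement).

```python
import torch
import torch.nn.functional as F

def detect_scaphoid_fracture(radiograph, segmentation_cnn, detection_cnn,
                             crop_size=224, threshold=0.5):
    """radiograph: preprocessed tensor of shape (1, 1, H, W)."""
    with torch.no_grad():
        # Stage 1: segment the scaphoid and derive a bounding box from the mask.
        mask = torch.sigmoid(segmentation_cnn(radiograph)) > 0.5
        ys, xs = torch.nonzero(mask[0, 0], as_tuple=True)
        if ys.numel() == 0:
            return None  # segmentation failure: no pixels segmented
        y0, y1 = ys.min().item(), ys.max().item() + 1
        x0, x1 = xs.min().item(), xs.max().item() + 1
        crop = radiograph[:, :, y0:y1, x0:x1]

        # Stage 2: rescale the crop, normalize its contrast, and classify it.
        crop = F.interpolate(crop, size=(crop_size, crop_size),
                             mode="bilinear", align_corners=False)
        crop = (crop - crop.mean()) / (crop.std() + 1e-8)
        fracture_probability = torch.sigmoid(detection_cnn(crop)).item()
    return fracture_probability, fracture_probability >= threshold
```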
Next, the detection CNN processed the cropped image and returned a probability of whether the scaphoid contained a fracture. This CNN was based on a DenseNet-121 architecture (15). A class activation map was then calculated using the Smooth Grad-CAM++ method (16), which showed regions in the image that were most relevant for predicting whether a fracture was present (see Appendix E6 [supplement] for implementation details). These regions were expected to correlate with any existing fracture line. The class activation map was projected on the input image as a heatmap for easy localization of potential fractures (Fig 1).
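For readers unfamiliar with class activation mapping, the sketch below illustrates the general principle with a plain Grad-CAM variant on a DenseNet-121 backbone. The study itself used Smooth Grad-CAM++ (16); the target layer, the three-channel input, and the variable names here are assumptions rather than the implementation described in Appendix E6 (supplement).

```python
import torch
import torch.nn.functional as F
from torchvision.models import densenet121

model = densenet121(num_classes=1)  # stand-in for the detection CNN
activations, gradients = {}, {}

def fwd_hook(module, inp, out):
    activations["feat"] = out

def bwd_hook(module, grad_in, grad_out):
    gradients["feat"] = grad_out[0]

layer = model.features.denseblock4  # last dense block as an example target
layer.register_forward_hook(fwd_hook)
layer.register_full_backward_hook(bwd_hook)

def class_activation_map(image):
    # image: (1, 3, H, W); a grayscale crop would be replicated to 3 channels.
    logit = model(image)[0, 0]       # fracture logit (single output unit)
    model.zero_grad()
    logit.backward()
    # Global-average-pool the gradients to weight the feature maps.
    weights = gradients["feat"].mean(dim=(2, 3), keepdim=True)
    cam = F.relu((weights * activations["feat"]).sum(dim=1, keepdim=True))
    cam = F.interpolate(cam, size=image.shape[-2:], mode="bilinear",
                        align_corners=False)
    return (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)  # heatmap in [0, 1]
```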
Implementation Details
The CNNs were implemented on a graphics processing unit (Nvidia RTX Titan) using the PyTorch machine learning framework (17). Through supervised learning, the segmentation CNN was trained on a random subset of dataset 1 (80%), and the detection CNN was trained on all samples from dataset 2. The Adam optimizer (18) was used to update the weights, with a linearly decaying learning rate schedule. A detailed account of the training procedure is provided in Appendix E5 (supplement).
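In PyTorch, such an optimizer setup could look like the following minimal sketch; the base learning rate and number of epochs are placeholders, and the actual hyperparameters are given in Appendix E5 (supplement).

```python
import torch

def make_optimizer(model, base_lr=1e-4, total_epochs=100):
    # Adam optimizer (18) with a learning rate that decays linearly to zero.
    optimizer = torch.optim.Adam(model.parameters(), lr=base_lr)
    scheduler = torch.optim.lr_scheduler.LambdaLR(
        optimizer, lr_lambda=lambda epoch: 1.0 - epoch / total_epochs)
    return optimizer, scheduler  # call scheduler.step() once per epoch
```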
Observer Study
To compare the performance of the fracture detection system with that of radiologists, an observer study was conducted among 11 radiologists (three of whom were radiology residents; B.V., E.S., M.d.J., B.M., S.D., S. Bollen, A.S., S. Bruijnen, S.S., M.R., M.d.R. [random order]). The radiologists independently assessed the 190 radiographs from dataset 3 for the presence of a scaphoid fracture and indicated their confidence for each radiograph on a continuous scale from 0 to 1.0, where 1.0 indicated absolute certainty of a fracture. A score of 0.5 was used as the cutoff for deciding whether a fracture was present. We defined an average radiologist by averaging the scores of the 11 readers.
Statistical Analysis
The segmentation CNN was evaluated on an unseen subset of dataset 1 (20%, no patient overlap) by calculating the Dice similarity coefficient (DSC) and the symmetric Hausdorff distance (HD). The detection CNN was internally evaluated on dataset 2 by means of fivefold cross-validation (manual scaphoid crops, no patient overlap) by calculating the sensitivity, specificity, positive predictive value (PPV), and area under the receiver operating characteristic curve (AUC). The complete fracture detection system was evaluated on the hold-out dataset 3 (automated scaphoid crops, no patient overlap with datasets 1 and 2) by calculating the sensitivity, specificity, PPV, and AUC. The classification threshold was set at 0.5 for both CNNs.
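As an illustration of how these metrics could be computed with the libraries named in the next paragraph, a minimal sketch is given below; the mask variables, pixel spacing, and threshold are assumptions rather than the study's exact code.

```python
import numpy as np
from medpy.metric.binary import dc, hd
from sklearn.metrics import confusion_matrix, roc_auc_score

def segmentation_metrics(pred_mask, true_mask, pixel_spacing_mm=0.1):
    # Dice similarity coefficient and symmetric Hausdorff distance (in mm).
    return (dc(pred_mask, true_mask),
            hd(pred_mask, true_mask, voxelspacing=pixel_spacing_mm))

def detection_metrics(y_true, y_prob, threshold=0.5):
    # Binarize the fracture probabilities and derive the reported metrics.
    y_pred = (np.asarray(y_prob) >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return {"sensitivity": tp / (tp + fn),
            "specificity": tn / (tn + fp),
            "ppv": tp / (tp + fp),
            "auc": roc_auc_score(y_true, y_prob)}
```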
The evaluation metrics were calculated using the scikit-learn machine learning library (version 0.22.1, 2020) (19) and the MedPy medical image processing library (version 0.4.0, 2019) (20) for Python. Stratified bootstrapping with 5000 iterations was applied for estimating 95% CIs. To compare both the AUC of the CNN with that of the radiologists and the AUC of the attending radiologists with that of the radiology residents, multiple-reader, multiple-case (MRMC) receiver operating characteristic (ROC) analyses were conducted using the iMRMC software (version 1.2.0, 2020) (21). These analyses were based on a t test, where the degrees of freedom were estimated as proposed by Obuchowski et al (22). A difference with a P value smaller than .05 was considered significant.
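The stratified bootstrap could be implemented along the following lines, resampling positive and negative cases separately for 5000 iterations; the function below is a sketch under these assumptions, not the exact procedure used in the study.

```python
import numpy as np

def bootstrap_ci(y_true, y_prob, metric_fn, n_iter=5000, alpha=0.05, seed=0):
    # Stratified bootstrap: resample positives and negatives independently,
    # recompute the metric, and report the percentile-based 95% CI.
    rng = np.random.default_rng(seed)
    y_true, y_prob = np.asarray(y_true), np.asarray(y_prob)
    pos, neg = np.where(y_true == 1)[0], np.where(y_true == 0)[0]
    scores = []
    for _ in range(n_iter):
        idx = np.concatenate([rng.choice(pos, size=pos.size, replace=True),
                              rng.choice(neg, size=neg.size, replace=True)])
        scores.append(metric_fn(y_true[idx], y_prob[idx]))
    return np.percentile(scores, [100 * alpha / 2, 100 * (1 - alpha / 2)])
```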
Model Availability
This system is freely available at https://grand-challenge.org/algorithms/scaphoid-fracture-detection/, where it can be run in a web browser in real time.
Results
Segmentation CNN Results
The segmentation CNN achieved an average DSC of 97.4% ± 1.4 (standard deviation) with an HD of 1.31 mm ± 1.03 on the test set (subset of dataset 1). One test sample led to a segmentation failure in which no pixels were segmented, and for this sample no HD could be calculated. The average height and width of the ground truth scaphoid segmentation masks were 23.12 mm ± 3.84 and 14.30 mm ± 2.03, respectively, which were estimated by calculating the major and minor axis length of ellipses fitted to the masks. The total training time of this network was approximately 15 hours.
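One way to obtain such size estimates is to fit an ellipse to each ground truth mask via region properties, as sketched below with scikit-image; the pixel spacing and variable names are assumptions, and the study's exact implementation may differ.

```python
import numpy as np
from skimage.measure import label, regionprops

def scaphoid_size_mm(mask, pixel_spacing_mm=0.1):
    # Fit an ellipse to the largest segmented region and convert the
    # major and minor axis lengths from pixels to millimeters.
    region = max(regionprops(label(mask.astype(int))), key=lambda r: r.area)
    height = region.major_axis_length * pixel_spacing_mm
    width = region.minor_axis_length * pixel_spacing_mm
    return height, width
```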
Detection CNN Results
On dataset 2, the detection CNN achieved a mean sensitivity of 66% ± 3, specificity of 90% ± 1, PPV of 81% ± 2, and AUC of 0.86 ± 0.01 over the validation folds. The mean ROC curve from the fivefold cross-validation procedure on this set is provided in Appendix E7 (supplement). This ROC curve indicates that, at the specificity measured on the test set (dataset 3; 84%), the CNN had a sensitivity of 73%.
The training time of the detection CNN was approximately 2 hours. The average processing time per test sample (dataset 3) was 1.7 seconds (including all preprocessing steps, as listed in Appendix E4 [supplement]). The average reading time per test sample of the radiologists was 13.9 seconds.
Comparison of the CNN to Radiologists
Table 2 presents the sensitivity, specificity, PPV, and AUC with their 95% CIs for the CNN and the radiologists on the test set (dataset 3). The ROC curves for the CNN (with the 95% CI band), the average of the radiologists, and the radiologists with the highest and lowest AUCs are shown in Figure 2A. The same ROC curve for the CNN, together with the individual operating points of the CNN and the radiologists, is plotted in Figure 2B. These operating points indicate the sensitivity and false-positive rate at a threshold of 50%. The CNN and the radiologists achieved similar performance for fracture detection (AUC, 0.87 [95% CI: 0.81, 0.91] vs 0.83 [range, 0.79–0.85]; P = .09 with MRMC ROC analysis). At a fixed false-positive rate of 5.0%, the CNN achieved a sensitivity of 65% (61 of 95 patients with a scaphoid fracture would be recommended for follow-up procedures). In comparison, at the same false-positive rate, the average of the radiologists (n = 11) achieved a sensitivity of 55% (51 of 95 patients). Among the radiologists, the attending radiologists (n = 8) and the radiology residents (n = 3) achieved similar performance (AUC, 0.83 [range, 0.79–0.85] vs 0.82 [range, 0.81–0.84]; P = .86 with MRMC analysis).
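For reference, the sensitivity at a fixed false-positive rate can be read off the ROC curve as sketched below; the linear interpolation between operating points is an implementation choice and is not taken from the article.

```python
import numpy as np
from sklearn.metrics import roc_curve

def sensitivity_at_fpr(y_true, y_prob, target_fpr=0.05):
    # Interpolate the true-positive rate of the ROC curve at the target
    # false-positive rate (e.g., 5.0%, corresponding to 95% specificity).
    fpr, tpr, _ = roc_curve(y_true, y_prob)
    return float(np.interp(target_fpr, fpr, tpr))
```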
To inspect test cases that were misdiagnosed by either the radiologists or the CNN, the fracture confidence score of the CNN is plotted against the average fracture confidence score of the radiologists per case in Figure 3. The CNN produced four false-negative results and 14 false-positive results not made by the average radiologist. Among these misdiagnoses, the CNN missed one fracture with high certainty (confidence score, 0.01) and misclassified six nonfractures as fractures with high certainty (confidence scores ≥ 0.8); all of these cases were easily identified by the radiologists (confidence scores ≥ 0.8 and ≤ 0.2, respectively). The average radiologist produced 18 false-negative results and one false-positive result not made by the CNN. Among the false-negative cases, the average radiologist missed five fractures with high certainty (confidence scores ≤ 0.2) that were confidently identified by the CNN (confidence scores ≥ 0.8). Additional examples of scaphoid fractures that were occult to the average radiologist but not to the CNN are provided in Appendix E8 (supplement).
Fracture Localization with Class Activation Maps
Examples of class activation maps that were generated for localizing (potential) fractures in the radiographs from dataset 3 are shown in Figure 4. The image crops that were fed into the detection CNN and the corresponding class activation maps are presented in pairs. On manual inspection, highlighted regions in the class activation maps were found to overlap with the fracture lines in the scaphoid, as illustrated in Figure 4.
Discussion
In this study, we assessed whether a CNN could achieve human-level performance in detecting scaphoid fractures on conventional radiographs of the hand, wrist, and scaphoid. To this end, an experimental scaphoid fracture detection system was developed that consisted of two networks: a segmentation CNN for scaphoid localization and a detection CNN for fracture detection. The CNN-based system segmented the scaphoid with high accuracy (DSC of 97.4% with an HD of 1.31 mm) and had an AUC (0.87) comparable with that of 11 radiologists (mean AUC, 0.83) in detecting scaphoid fractures. In the only segmentation failure case, the radiograph was tightly cropped around the scaphoid, possibly causing the system to miss the scaphoid because of a lack of contextual information. Moreover, an inspection of the fracture detection failures revealed that, at a classification threshold of 0.5, the system was more sensitive but less specific than the average radiologist (Fig 3). The CNN output high confidence scores for five fractures that were largely overlooked by the radiologists, albeit at the cost of making six high-confidence false-positive predictions for cases that were evident to the average radiologist. This observation is in line with the finding of Langerhuizen et al (13) that their CNN-based system was less specific than the human observers, although our system displayed no tendency to miss obvious fractures. However, at a fixed specificity of 95%, the system achieved higher sensitivity than the average radiologist (65% vs 55%), indicating that the system could help reduce the number of overlooked fractures in clinical practice.
A qualitative analysis of the generated class activation maps showed that regions that were important to the decision of the system were correlated with the scaphoid fractures. These class activation maps could be provided to the radiologist to explain which regions the CNN has identified as possible fracture lines. Explainability is often mentioned as a prerequisite before artificial intelligence solutions can be deployed in the clinic (23,24). The maps could provide a level of explainability and may help the radiologist to discard false-positive results. Furthermore, they might form a good alternative to segmentation or bounding box detection of scaphoid fractures. Although the most accurate localization of scaphoid fractures can be theoretically achieved by training a system with pixel-level or object-level annotations, generating such annotations is highly challenging and time-intensive, which prevents the curation of larger datasets. It can be expected that the quality of the class activation maps will improve as the diagnostic performance of the deep learning model gradually increases.
The strengths of our current study were the use of clinical data from two hospitals, the participation of 11 radiologists in the observer study, the automatic detection of the scaphoid, and the added transparency to the fracture detection system. However, this study also had three main limitations. First, the use of a single radiograph view for diagnosis of scaphoid fractures may have limited the performance of the radiologists and the system. In daily practice, radiologists use multiple radiographic views, because scaphoid fractures are frequently not visible in all directions. Follow-up studies may involve the use of multiple radiographic views to investigate the extent to which these additional views benefit the task at hand.
Second, the test set could contain a selection bias because all patients with radiographs also underwent a follow-up CT scan. CT scans are often performed when a clinical or radiologic suspicion for a scaphoid fracture cannot be confirmed on conventional radiographs. Therefore, it is plausible that the selected radiographs were more difficult to assess than most radiographs in clinical practice. These limitations may explain why the reported sensitivity values for diagnosing scaphoid fractures differ from the average sensitivity reported in the literature (11,12).
Finally, not all CT scans that were negative for fracture and served as the reference standard in the test set were followed by MRI. Gibney et al (4) demonstrated that clinically relevant scaphoid fractures can be missed even on CT scans, and they recommend complementing CT with MRI for the best possible sensitivity.
In conclusion, our findings supported the hypothesis that a CNN can achieve human-level performance in detecting scaphoid fractures on conventional radiographs of the hand, wrist, and scaphoid bone. A CNN may be able to assist residents, radiologists, or other physicians by acting as a first or second reader, or as a triage tool to prioritize worklists, potentially reducing the risk of missing a fracture. Future research should investigate to what extent CNNs could improve the diagnostic performance of radiologists.
Acknowledgments
We would like to acknowledge the resources provided by the hospitals Jeroen Bosch Ziekenhuis and Radboud University Medical Center for conducting this study. We also thank Chris Peters, MSc, PhD, and Peter Pijnenburg for their assistance in data acquisition, and Willem Huijbers for his feedback and guidance when setting up the research project.
Author Contributions
Author contributions: Guarantors of integrity of entire study, N.H., B.M.; study concepts/study design or data acquisition or data analysis/interpretation, all authors; manuscript drafting or manuscript revision for important intellectual content, all authors; approval of final version of submitted manuscript, all authors; agrees to ensure any questions related to the work are appropriately resolved, all authors; literature research, N.H., B.M., M.d.J., L.L.S.O., E.P., B.v.G., M.R.; clinical studies, E.S., S. Bruijnen, B.M., M.d.J., S.D., S.S., T.S., B.v.G., M.R.; experimental studies, N.H., B.V., B.M., M.d.J., S. Bollen, S.S., M.d.J., L.L.S.O., B.v.G., M.R.; statistical analysis, N.H., B.M., M.d.J., L.L.S.O., E.P., B.v.G., M.R.; and manuscript editing, N.H., E.S., B.V., S. Bruijnen, B.M., M.d.J., S.D., S.S., M.d.J., W.H., L.L.S.O., E.P., B.v.G., M.R.
Authors declared no funding for this work.
References
- 1. Current methods of diagnosis and treatment of scaphoid fractures. Int J Emerg Med 2011;4:4.
- 2. Comparison of MRI, CT and bone scintigraphy for suspected scaphoid fractures. Eur J Trauma Emerg Surg 2016;42(6):725–731.
- 3. Radiography and scintigraphy of suspected scaphoid fracture. A long-term study in 160 patients. J Bone Joint Surg Br 1993;75(1):61–65.
- 4. Incorporating cone-beam CT into the diagnostic algorithm for suspected radiocarpal fractures: A new standard of care? AJR Am J Roentgenol 2019;213(5):1117–1123.
- 5. Scaphoid fractures and nonunions: diagnosis and treatment. J Orthop Sci 2006;11(4):424–431.
- 6. Diagnosis and treatment of scaphoid fractures, can non-union be prevented? Arch Orthop Trauma Surg 1999;119(7-8):428–431.
- 7. The presentation of scaphoid non-union. Injury 2003;34(1):65–67.
- 8. On resection of the proximal carpal row. Clin Orthop Relat Res 1986;(202):12–15.
- 9. The suspected scaphoid injury: resource implications in the absence of magnetic resonance imaging. Scott Med J 2013;58(3):143–148.
- 10. The diagnosis of recent scaphoid fractures: review of the literature [in French]. J Radiol 2007;88(5 Pt 2):741–759.
- 11. Wrist fractures: sensitivity of radiography, prevalence, and patterns in MDCT. Emerg Radiol 2015;22(3):251–256.
- 12. MDCT and radiography of wrist fractures: radiographic sensitivity and fracture patterns. AJR Am J Roentgenol 2008;190(1):10–16.
- 13. Is Deep Learning On Par with Human Observers for Detection of Radiographically Visible and Occult Fractures of the Scaphoid? Clin Orthop Relat Res 2020;478(11):2653–2659.
- 14. The VIA Annotation Software for Images, Audio and Video. In: MM ’19: Proceedings of the 27th ACM International Conference on Multimedia. New York, NY: ACM, 2019; 2276–2279.
- 15. Densely connected convolutional networks. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, July 21–26, 2017. Piscataway, NJ: IEEE, 2017.
- 16. Smooth Grad-CAM++: An Enhanced Inference Level Visualization Technique for Deep Convolutional Neural Network Models. ArXiv [preprint]. http://arxiv.org/abs/1908.01224. Posted August 3, 2019. Accessed October 27, 2020.
- 17. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In: Advances in Neural Information Processing Systems, Vol 32. Red Hook, NY: Curran Associates, 2019; 8024–8035. http://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf.
- 18. Adam: A method for stochastic optimization. ArXiv [preprint]. https://arxiv.org/abs/1412.6980. Posted December 22, 2014. Updated January 30, 2017. Accessed October 27, 2020.
- 19. Scikit-learn: Machine Learning in Python. J Mach Learn Res 2011;12:2825–2830. http://www.jmlr.org/papers/volume12/pedregosa11a/pedregosa11a.pdf.
- 20. MedPy: Medical image processing in Python. 2016.
- 21. Generalized Roe and Metz receiver operating characteristic model: analytic link between simulated decision scores and empirical AUC variances and covariances. J Med Imaging (Bellingham) 2014;1(3):031006.
- 22. Multi-reader ROC studies with split-plot designs: a comparison of statistical methods. Acad Radiol 2012;19(12):1508–1517.
- 23. What do we need to build explainable AI systems for the medical domain? ArXiv [preprint]. http://arxiv.org/abs/1712.09923. Posted December 28, 2017. Accessed October 27, 2020.
- 24. The Algorithmic Audit: Working with Vendors to Validate Radiology-AI Algorithms-How We Do It. Acad Radiol 2020;27(1):132–135.
Article History
Received: Oct 27 2020
Revision requested: Dec 3 2020
Revision received: Mar 19 2021
Accepted: Mar 30 2021
Published online: Apr 28 2021