Artificial Intelligence for Classification of Soft-Tissue Masses at US

Published Online: https://doi.org/10.1148/ryai.2020200125

Abstract

Purpose

To train convolutional neural network (CNN) models to classify benign and malignant soft-tissue masses at US and to differentiate three commonly observed benign masses.

Materials and Methods

In this retrospective study, US images obtained between May 2010 and June 2019 from 419 patients (mean age, 52 years ± 18 [standard deviation]; 250 women) with histologic diagnosis confirmed at biopsy or surgical excision (n = 227) or masses that demonstrated imaging characteristics of lipoma, benign peripheral nerve sheath tumor, and vascular malformation (n = 192) were included. Images in patients with a histologic diagnosis (n = 227) were used to train and evaluate a CNN model to distinguish malignant and benign lesions. Twenty percent of cases were withheld as a test dataset, and the remaining cases were used to train the model with a 75%-25% training-validation split and fourfold cross-validation. Performance of the model was compared with retrospective interpretation of the same dataset by two experienced musculoskeletal radiologists, blinded to clinical history. A second group of US images from 275 of the 419 patients containing the three common benign masses was used to train and evaluate a separate model to differentiate between the masses. The models were trained on the Keras machine learning platform (version 2.3.1), with a modified pretrained VGG16 network. Performance metrics of the model and of the radiologists were compared by using the McNemar test, and 95% CIs for performance metrics were estimated by using the Clopper-Pearson method (accuracy, recall, specificity, and precision) and the DeLong method (area under the receiver operating characteristic curve).

Results

The model trained to classify malignant and benign masses demonstrated an accuracy of 79% (95% CI: 68, 88) on the test data, with an area under the receiver operating characteristic curve of 0.91 (95% CI: 0.84, 0.98), matching the performance of two expert readers. Performance of the model distinguishing three benign masses was lower, with an accuracy of 71% (95% CI: 61, 80) on the test data.

Conclusion

The trained CNN was capable of differentiating between benign and malignant soft-tissue masses depicted on US images, with performance matching that of two experienced musculoskeletal radiologists.

Keywords: Convolutional Neural Network (CNN), Diagnosis, Neoplasms-Primary, Soft Tissues/Skin, Ultrasound

© RSNA, 2020

Summary

A convolutional neural network trained by using a limited dataset can distinguish between benign and malignant superficial soft-tissue masses depicted on US images, with performance matching that of two experienced radiologists.

Key Points

  • Despite a limited training dataset, the convolutional neural network model demonstrated high recall (90%) for classifying malignant lesions and high precision (95%) for classifying benign lesions, as well as an area under the receiver operating characteristic curve of 0.91.

  • There was no difference in accuracy, sensitivity, and specificity of the model classifying malignant versus benign masses compared with those of two experienced musculoskeletal radiologists interpreting the same test dataset.

  • A separate model trained to classify three common benign masses (lipoma, benign nerve sheath tumor, and vascular malformation) demonstrated high recall and precision for classifying lipomas (93% and 78%, respectively) but lower recall and precision for classifying benign nerve sheath tumor (42% and 71%, respectively) and vascular malformation (64% and 60%, respectively).

Introduction

Localized swelling and palpable soft-tissue masses are common clinical presentations that prompt patients to seek care. The use of focused US as a supplement to clinical examination is increasingly common as the first-line diagnostic tool (1–4). While the majority of masses are found to be benign, many patients are referred for further imaging and work-up to exclude malignancy. With the social and economic burden of rising medical costs, there is increasing demand for a cost-effective method for triage of these patients (5–7). The reliability of US for soft-tissue mass characterization within the musculoskeletal system is dependent on the experience of the sonographer and the interpreting radiologist, which can be variable across institutions and practices. An automated tool that can perform basic classification tasks, such as distinguishing malignant and benign lesions, could be of value to sonographers and radiologists by providing clinical decision support and increasing diagnostic confidence.

Deep convolutional neural networks (CNNs) have demonstrated high levels of accuracy at visual recognition tasks (8). Successful implementation of CNNs has been demonstrated in a variety of clinical imaging tasks, including but not limited to skin cancer detection (9), lung nodule characterization at screening CT (10), tuberculosis assessment at chest radiography (11), and pneumonia detection at chest radiography (12). There is also an evolving body of literature on the potential use of machine learning in soft-tissue lesion evaluation at diagnostic US. The application of artificial intelligence and neural networks has been described for thyroid nodule recognition and diagnosis (13–15), axillary node and lesion assessment in breast cancer (16,17), endometrial thickness measurement (18), contrast-enhanced hepatocellular carcinoma evaluation (19), and prostate segmentation (20). There have also been several studies using artificial intelligence and CNNs in the musculoskeletal system, with respect to assessment of human skeletal muscle (21) and segmentation of the myotendinous junction (22), as well as for the assessment of bicipital peritendinous effusions to grade inflammation severity (23).

The development of a CNN model for characterizing US images of soft-tissue masses in the musculoskeletal system remains largely unexplored. In this study, we present two CNN models that automatically classify soft-tissue masses depicted on US images. One model was trained to distinguish between benign and malignant masses, and the second model was trained to classify the three most common benign soft-tissue masses found at diagnostic US at our institution—lipoma, benign nerve sheath tumor, and vascular malformation. We hypothesize that with a limited dataset, CNNs can be trained to perform these classification tasks.

Materials and Methods

Patient Population

This retrospective study was approved by the institutional review board, and a waiver of consent was obtained. A systematic search of the radiology report database was performed by a single musculoskeletal radiologist (L.P. [1 year of experience]) to identify musculoskeletal US examinations performed for evaluation of a soft-tissue mass at the institution between May 2010 and June 2019. Included examinations comprised patients with masses with a histologic diagnosis confirmed by the results of a biopsy or surgical excision (n = 227) or masses that demonstrated pathognomonic imaging characteristics of lipoma, benign peripheral nerve sheath tumor, or vascular malformation (n = 192), as determined by a consensus read of three experienced musculoskeletal radiologists (L.P., C.B. [7 years of experience], R.S.A. [30 years of experience]). Examinations of benign masses without histologic diagnosis were included, as a large proportion of masses demonstrating pathognomonic features of a benign lesion at the time of imaging do not undergo subsequent biopsy or surgical excision. A total of 419 patients who underwent US were identified (mean age, 52 years ± 18 [standard deviation]; 250 women). The anatomic distribution of the masses included 163 in the upper extremities (38.9%), 165 in the lower extremities (39.4%), 78 in the trunk (18.6%), and 13 in the head and neck region (3.1%). Each patient included had one soft-tissue mass. The distribution of the pathologic findings is listed in Table 1.

Table 1: Distribution of Pathologic Findings of Soft-Tissue Masses from 419 Patients


US Imaging

For each patient, up to two grayscale and two Doppler images of the same mass at differing locations were saved from the picture archiving and communication system for training and evaluating the CNN models. The large majority of examinations were acquired by using Acuson S2000 or Acuson S3000 US systems (Siemens Healthineers) with a 14-MHz or 9-MHz linear transducer. As the images were extracted from examinations retrospectively, some patients had fewer than two grayscale or Doppler images available, either because a second image was not acquired at the time of the examination or because the additional images were of poor quality. Additionally, grayscale and Doppler images were not obtained in a coregistered manner, and variations in scanning parameters such as scanning depth, gain, and Doppler flow sensitivity could not be controlled for.

Data Preparation

Images were de-identified and stored in the tagged image file format, or TIFF. Individual images were cropped manually by a single musculoskeletal radiologist (L.P.) in Adobe Photoshop (Adobe Systems) into a rectangle that best fit the mass and excluded surrounding background soft tissue. Zero padding was achieved by overlaying the cropped image onto a 512 × 512-pixel black canvas in Photoshop.

To train the CNNs with both grayscale and Doppler information simultaneously, pairs of cropped grayscale and Doppler images were concatenated side by side to generate up to two 1024 × 512-pixel images per patient. For patients with fewer than two grayscale or Doppler images, only one image was generated.
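The padding and pairing steps described above can be sketched in a few lines of Python with NumPy. The function names and the centered placement of the crop on the canvas are illustrative assumptions; the study performed cropping and zero padding manually in Photoshop.

```python
import numpy as np

def zero_pad(img: np.ndarray, size: int = 512) -> np.ndarray:
    """Overlay a cropped image onto a size x size black canvas (zero padding)."""
    h, w = img.shape[:2]
    canvas = np.zeros((size, size) + img.shape[2:], dtype=img.dtype)
    top, left = (size - h) // 2, (size - w) // 2  # center placement (assumed)
    canvas[top:top + h, left:left + w] = img
    return canvas

def pair_images(gray: np.ndarray, doppler: np.ndarray, size: int = 512) -> np.ndarray:
    """Concatenate padded grayscale and Doppler crops side by side (size x 2*size)."""
    return np.concatenate([zero_pad(gray, size), zero_pad(doppler, size)], axis=1)
```

A 400 × 300-pixel grayscale crop and a 250 × 200-pixel Doppler crop would each be padded to 512 × 512 and joined into a single 1024 × 512-pixel training image.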

Two separate groups were created for the purpose of training and evaluating the two different CNN models. The first CNN model was trained to classify between benign and malignant masses. Since histologic diagnosis is considered the reference standard for tumor characterization, only masses with a histologic diagnosis were selected for training and evaluation of this model, which accounted for 227 masses (344 images). For the purpose of categorization, a small subset of intermediate-grade lesions (n = 5) was considered as malignant for training and evaluation of the model. In total, there were 147 benign masses (245 images) and 80 malignant masses (99 images). Twenty percent of the dataset (46 masses [68 images]) was withheld as a holdout test dataset. The remaining 181 masses (276 images) were used for training the CNN model, with a 75%-25% training-validation split and fourfold cross-validation. Patient clinical characteristics and relative distribution of patients across the training, validation, and test datasets are summarized in Table 2.
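A mass-level split of this kind can be sketched with scikit-learn's group-aware splitters, so that multiple images of the same mass never straddle a split. This is an illustrative translation with hypothetical stand-in arrays; the study's actual splitting code is not described.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, GroupShuffleSplit

# Hypothetical stand-ins: one row per image, grouped by mass so that images
# of the same mass stay together (344 images from 227 masses, as in the study).
rng = np.random.default_rng(0)
mass_ids = rng.integers(0, 227, size=344)
labels = rng.integers(0, 2, size=344)  # 0 = benign, 1 = malignant

# Withhold ~20% of masses as the holdout test set.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(labels, groups=mass_ids))

# Fourfold cross-validation on the remainder; each fold is a 75%-25%
# training-validation split at the mass level.
folds = GroupKFold(n_splits=4)
for tr, va in folds.split(train_idx, groups=mass_ids[train_idx]):
    pass  # train on train_idx[tr], validate on train_idx[va]
```

Grouping by mass prevents leakage of near-duplicate images of the same lesion between training and evaluation sets.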

Table 2: Clinical Characteristics Across Training, Validation, and Test Datasets


A second CNN model was trained to classify the three most common benign masses depicted on US images at our institution (lipoma, benign nerve sheath tumor, and vascular malformation). These masses were chosen as they are the only masses seen with sufficient frequency (>50 patients out of 419 patients) to reasonably train a CNN classifier. For this group, both histologically proven cases and imaging pathognomonic cases were included, for a total of 275 patients with soft-tissue masses (480 images in total consisting of 219 lipomas [from 119 patients], 102 benign nerve sheath tumors [from 65 patients], and 159 vascular malformations [from 91 patients], Table 1). Twenty percent of the dataset (55 masses [94 images]) was withheld as a holdout test dataset. The remaining 220 masses (386 images) were used to train the CNN model with a 75%-25% training-validation split and fourfold cross-validation. Patient clinical characteristics and relative distribution of benign masses across the training, validation, and test datasets are summarized in Table 2.

Model Architecture

We selected the VGG16 network architecture pretrained on the ImageNet database to build the two CNN models to take advantage of the benefits of transfer learning, as well as the performance of a deep neural network architecture (24). The VGG16 architecture was implemented on the Keras machine learning platform (version 2.3.1) (https://keras.io/), running the TensorFlow (version 1.15.0) (https://www.tensorflow.org/) backend (25,26).

The VGG16 network was loaded with pretrained weights from ImageNet. The fully connected layers of the original network were replaced with a customized classifier block, consisting of two fully connected layers with 64 nodes each, and the rectified linear unit activation function, followed by a two-class or three-class output layer by using the softmax activation function. As the size of our dataset was small and prone to overfitting on a large network such as VGG16, the VGG16 architecture was simplified by removing the final block of three convolution layers. To further reduce overfitting, L2 regularization was applied to the network, and dropout layers were applied between the fully connected layers of the classifier block (Fig 1). The first convolution layer of the model was frozen, and the remaining layers were set as trainable. Finally, the training examples were loaded through Keras with real-time data augmentation implemented. Fourfold cross-validation was used to select the optimal hyperparameters for training. The models were trained on a high-performance compute cluster running an NVIDIA Tesla V100-SXM2 graphics processing unit (GPU) (NVIDIA) with 16 GB random-access memory (RAM).
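A minimal Keras sketch of this modified architecture follows. The dropout rate, L2 strength, and use of a Flatten layer ahead of the classifier block are assumptions, as the paper does not report them; the truncation at `block4_pool` implements the removal of the final block of three convolution layers.

```python
from tensorflow.keras import layers, models, regularizers
from tensorflow.keras.applications import VGG16

def build_model(n_classes=2, input_shape=(512, 1024, 3),
                weights="imagenet", l2=1e-3, dropout=0.5):
    # VGG16 without its fully connected head; truncate at block4_pool,
    # which drops the fifth (final) block of three convolution layers.
    base = VGG16(weights=weights, include_top=False, input_shape=input_shape)
    x = base.get_layer("block4_pool").output

    # Custom classifier block: two 64-node fully connected layers (ReLU)
    # with dropout between them and L2 regularization, then softmax output.
    x = layers.Flatten()(x)
    x = layers.Dense(64, activation="relu",
                     kernel_regularizer=regularizers.l2(l2))(x)
    x = layers.Dropout(dropout)(x)
    x = layers.Dense(64, activation="relu",
                     kernel_regularizer=regularizers.l2(l2))(x)
    x = layers.Dropout(dropout)(x)
    out = layers.Dense(n_classes, activation="softmax")(x)

    model = models.Model(base.input, out)
    # Freeze the first convolution layer; all remaining layers stay trainable.
    model.get_layer("block1_conv1").trainable = False
    return model
```

With `n_classes=2` the model matches the benign-versus-malignant task; `n_classes=3` would yield the benign-mass classifier.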


Figure 1: Summary of modified VGG16 architecture. The fifth block of convolution (Conv) layers was removed. The fully connected layers of the original model were replaced with a custom classifier block consisting of two fully connected layers with 64 nodes (rectified linear unit activation) and a two- or three-node output layer (softmax activation). To reduce overfitting, dropout was applied between the fully connected layers, and L2 regularization was added to the network.

Statistical Analysis

The performance metrics monitored during training of the CNN model for classifying benign and malignant masses included validation loss, accuracy, and area under the receiver operating characteristic curve (AUC). The model with the highest validation AUC across the four folds of cross-validation was selected for evaluation on the holdout test set. Overall performance of the model was measured by the accuracy, recall, precision, and AUC of the selected model on the holdout test set. As AUC is the primary performance metric being optimized for hyperparameter and model selection, the reported model accuracy, recall, and precision are based on the default threshold of 0.5.

The performance metrics monitored during training of the CNN model for classifying lipoma, benign nerve sheath tumor, and vascular malformation included validation loss, accuracy, and the weighted average of F1 scores. The model with the highest weighted-average F1 score across the four folds of cross-validation was selected for evaluation on the holdout test set. Overall performance of the model was measured by the accuracy, recall, precision, and F1 scores of the selected model on the holdout test set.
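The two selection metrics can be illustrated with scikit-learn on toy predictions (the labels and probabilities below are illustrative only, not study data):

```python
import numpy as np
from sklearn.metrics import f1_score, roc_auc_score

# Toy validation labels and predicted probabilities (illustrative values only).
y_true = np.array([0, 0, 1, 1, 1, 0])
y_prob = np.array([0.2, 0.4, 0.9, 0.6, 0.3, 0.1])

# AUC is the metric optimized during model selection for the binary task.
auc = roc_auc_score(y_true, y_prob)

# Accuracy, recall, and precision are then reported at the default 0.5 threshold.
y_pred = (y_prob >= 0.5).astype(int)
accuracy = (y_pred == y_true).mean()

# For the three-class task, the weighted average of per-class F1 scores is
# used instead (shown here on the same toy binary labels).
f1_weighted = f1_score(y_true, y_pred, average="weighted")
```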

In addition, we compared the performance of the CNN model for classifying benign and malignant masses against two experienced musculoskeletal radiologists with 30 and 7 years of experience (R.S.A. and C.B., respectively). For this reader study, each radiologist, blinded to clinical history and patient demographics, evaluated the grayscale and Doppler image pairs in the test set and labeled the mass either benign or suspicious for malignancy on the basis of imaging appearance alone. Accuracy, sensitivity, and specificity for identifying malignant masses were calculated for each reader and compared with the CNN model on a receiver operating characteristic (ROC) curve. The performance of the model and the radiologists was compared by using the McNemar test (27). The 95% CIs for performance metrics were estimated by using the Clopper-Pearson method for accuracy, sensitivity (recall), specificity, and precision, and the DeLong method for AUC (28–30). A P value of less than .05 was considered indicative of a statistically significant difference. Statistical analysis was performed by using RStudio, version 1.2.5042 (https://rstudio.com/).
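Two of these tests can be sketched in Python with SciPy (the study used RStudio; this translation is an assumption, the function names are illustrative, and the DeLong interval for AUC is omitted here):

```python
from scipy.stats import beta, binom

def clopper_pearson(k, n, alpha=0.05):
    """Exact (Clopper-Pearson) two-sided 95% CI for a binomial proportion k/n."""
    lo = beta.ppf(alpha / 2, k, n - k + 1) if k > 0 else 0.0
    hi = beta.ppf(1 - alpha / 2, k + 1, n - k) if k < n else 1.0
    return lo, hi

def mcnemar_exact(b, c):
    """Exact McNemar p value from the two discordant-pair counts b and c."""
    n = b + c
    return min(1.0, 2 * binom.cdf(min(b, c), n, 0.5))

# Example: model accuracy on the holdout test set was 54 of 68
# (reported 95% CI: 68, 88).
lo, hi = clopper_pearson(54, 68)
```

The exact McNemar test is appropriate here because the discordant-pair counts in a 68-case test set are small, where the chi-square approximation is unreliable.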

Results

Benign versus Malignant Classifier Performance Metrics

The CNN trained to distinguish between benign and malignant masses demonstrated a mean validation AUC of 0.85 ± 0.11 and validation accuracy of 78% ± 9 across the four folds of cross-validation. The best-performing model was tested on the holdout test set of 46 masses (68 images) and achieved an accuracy of 79% (54 of 68); a recall (sensitivity) of 90% (19 of 21) and 74% (35 of 47) for classifying malignant and benign masses, respectively; and a precision (positive predictive value) of 61% (19 of 31) and 95% (35 of 37) for classifying malignant and benign masses, respectively. The confusion matrix of the model and its performance metrics with 95% CIs are summarized in Figure 2.
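As a check on the arithmetic, the fractions reported above correspond to the following confusion-matrix counts, from which each metric can be recomputed:

```python
# Confusion-matrix counts from the reported fractions (holdout test set, Fig 2).
tp, fn = 19, 2    # malignant masses: correctly vs. incorrectly classified
tn, fp = 35, 12   # benign masses: correctly vs. incorrectly classified

n = tp + fn + tn + fp                  # 68 test images
accuracy = (tp + tn) / n               # 54/68, about 79%
recall_malignant = tp / (tp + fn)      # 19/21, about 90%
precision_malignant = tp / (tp + fp)   # 19/31, about 61%
recall_benign = tn / (tn + fp)         # 35/47, about 74%
precision_benign = tn / (tn + fn)      # 35/37, about 95%
```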


Figure 2: Confusion matrix and performance metrics of the convolutional neural network model for classification of benign and malignant masses on the holdout test set. Model performance metrics are shown, with 95% CIs in parentheses.

The performance of the CNN on the test dataset was plotted on an ROC curve, which demonstrated an AUC of 0.91 (95% CI: 0.84, 0.98). The two experienced musculoskeletal radiologists classified the same test set with an accuracy of 82% (56 of 68) and 68% (46 of 68), sensitivity of 76% (16 of 21) and 90% (19 of 21), and specificity of 85% (40 of 47) and 57% (27 of 47), respectively. Comparison of the model’s performance with those of the two radiologists is shown on an ROC curve in Figure 3.


Figure 3: Receiver operating characteristic (ROC) curve of convolutional neural network (CNN) model compared with that of the two musculoskeletal radiologists. The CNN model demonstrated an area under the receiver operating characteristic curve (AUC) of 0.91.

The sensitivity, specificity, and accuracy of the two readers were compared with those of the model by using the McNemar test, which demonstrated no difference in performance between the model and the readers. The results are summarized in Table 3.

Table 3: Comparison of Accuracy, Sensitivity, and Specificity of the Convolutional Neural Network Model for Predicting Malignant Masses in the Holdout Test Dataset with Two Expert Radiologists


Four sample images of masses in the test dataset and the model’s corresponding predictions are shown in Figure 4. The selected images depict one example each of true-negative, true-positive, false-positive, and false-negative predictions by the model. In all four examples, the model prediction matched the predictions of both musculoskeletal radiologists.


Figure 4: Selected examples of masses on US images and the corresponding convolutional neural network model prediction. A, True-negative example of a lipoma that was correctly predicted to be a benign mass. B, True-positive example of a B-cell lymphoma that was correctly predicted to be a malignant mass. C, False-positive example of a schwannoma that was incorrectly predicted to be a malignant mass. D, False-negative example of an undifferentiated pleomorphic sarcoma that was incorrectly predicted to be a benign mass. In all these cases, the model prediction matched the predictions of the two musculoskeletal radiologists.

Benign Mass Classifier Performance Metrics

The CNN trained to distinguish among lipomas, benign nerve sheath tumors, and vascular malformations demonstrated a mean weighted-average F1 score of 0.764 ± 0.069 and mean validation accuracy of 77% ± 7 across the four folds of cross-validation. The best-performing model across the four folds was tested on the holdout test set and achieved a weighted-average F1 score of 0.698 and accuracy of 71% (67 of 94). Precision and recall of the model were 78% (39 of 50) and 93% (39 of 42), respectively, for classification of lipomas; 71% (10 of 14) and 42% (10 of 24), respectively, for classification of benign nerve sheath tumors; and 60% (18 of 30) and 64% (18 of 28), respectively, for classification of vascular malformations. The confusion matrix of the model and its performance metrics are summarized in Figure 5.


Figure 5: Confusion matrix and performance metrics of the convolutional neural network model for classification of lipoma, benign nerve sheath tumor, and vascular malformation. Model performance metrics are shown, with 95% CIs in parentheses.

Examples of masses correctly classified by the CNN model are provided in Figure 6. In each example, the mass presented to the model demonstrates classic US features of lipomas (Fig 6, A), benign nerve sheath tumors (Fig 6, B), and vascular malformations (Fig 6, C). Examples of incorrectly classified masses are shown in Figure 7. Two images of benign nerve sheath tumors were misclassified as a lipoma (Fig 7, A) and vascular malformation (Fig 7, B). An image of a hemangioma was misclassified as a lipoma (Fig 7, C). All masses shown in Figures 6 and 7 were diagnosed histologically by the results of a biopsy or surgical excision.


Figure 6: Examples of masses correctly classified by the convolutional neural network model. US images demonstrate the typical US appearance of, A, lipoma, B, benign nerve sheath tumor, and, C, vascular malformation. All examples were diagnosed histologically by the results of a biopsy or surgical excision.


Figure 7: Examples of masses on US images incorrectly classified by the convolutional neural network model. Two examples of benign peripheral nerve sheath tumors misclassified as, A, lipoma and, B, vascular malformation. C, One example of a hemangioma misclassified as a lipoma.

Discussion

In this study, we sought to train two CNN models to distinguish between benign and malignant masses, as well as among lipomas, benign nerve sheath tumors, and vascular malformations at US. The CNN model trained to classify benign and malignant masses demonstrated an accuracy of 79% on the test dataset, with a high sensitivity of 90% for identifying malignant masses. This was corroborated by a high positive predictive value of 95% for identifying benign masses. While the model’s specificity of 74% for predicting malignant masses was lower, its classification characteristics would be well suited to a screening environment in which masses being examined are triaged for further work-up. Compared with the findings of two expert musculoskeletal radiologists, there was no difference in accuracy (82% and 68%), sensitivity (76% and 90%), and specificity (85% and 57%). Plotting the performance of the model on an ROC curve demonstrated a high degree of discrimination between malignant and benign lesions, with an AUC of 0.91.

The CNN model trained to classify lipomas, benign nerve sheath tumors, and vascular malformations exhibited lower performance and demonstrated an accuracy of 71% and weighted-average F1 score of 0.698 on the test set. The model performed well with classification of lipomas, with a recall of 93% and precision of 78%. Poorer performance was seen with classifying benign nerve sheath tumors and vascular malformations, with a recall and precision of 42% and 71%, respectively, for benign nerve sheath tumors and 64% and 60%, respectively, for vascular malformations.

Lower performance of the model classifying benign masses can be partly attributed to the introduction of additional subgroups (three classes) when compared with the model classifying benign and malignant masses (two classes). With more classes for the CNN model to differentiate, the limitations of the small datasets used for training and evaluation are expected to become more apparent. Additionally, lower performance of the model may be secondary to overlapping US characteristics of the three subgroups of masses. Both benign nerve sheath tumors and vascular malformations can appear as focal hypoechoic masses, with high vascularity on Doppler images. Furthermore, many vascular malformations may demonstrate substantial fat content and low vascularity, resulting in an overlap with findings of a lipoma.

The main limitation of this study was the small size of the datasets used to train and evaluate the CNN models. While some of this shortfall was overcome with the use of transfer learning, data augmentation, and regularization, it is likely that we encountered the limits of what a small dataset can achieve when training a CNN. Also, owing to the retrospective nature of the study, we were unable to control for scanning parameters, such as Doppler sensitivity, in the extracted images. In addition, it was not possible to obtain coregistered grayscale and Doppler images.

Some selection bias may have been present in the dataset of masses used to train the CNN in distinguishing between benign and malignant masses. As the dataset included only masses with histologic diagnoses (by the results of a biopsy or surgical resection), there is likely a disproportionate fraction of malignant masses and masses with indeterminate imaging features when compared with that of the general population. This disproportion may result in the model favoring malignant predictions when implemented in a clinical setting. However, in practice, the clinical use case of such an algorithm would be to assist radiologists with screening and triage of suspicious or indeterminate masses, for which having an increased sensitivity for malignancy may be desired. Additional investigation of the performance of the CNN model on a prospectively acquired cohort of patients is needed to establish its value in a clinical setting.

Another limitation of the study relates to the design of the reader study with two expert radiologists. In the reader study, the radiologists were blinded to clinical history and were interpreting selected static grayscale and Doppler images of a mass. This contrasts with a typical clinical setting in which a radiologist has the ability both to obtain a good clinical history and to see and interpret the images in real time. Therefore, it is likely that the reader study underestimates the performance of expert radiologists practicing in a normal clinical setting.

Qualitative evaluation of cases of malignant and benign lesions incorrectly classified by the CNN model demonstrates that, in many cases, the human readers made the same classification error as the model. This result highlights the inherent limitations of using grayscale and Doppler features alone for characterization of soft-tissue masses. In many cases, benign and malignant masses may have similar grayscale features and Doppler characteristics. Larger scale multi-institution studies with inclusion of a larger database of masses and different US equipment are likely to improve the performance and generalizability of the CNN models. Furthermore, innovations in US technology, including shear-wave elastography, US contrast agents, and back-scatter data, will likely further improve the diagnostic capabilities of CNN models. Future directions of this study will involve incorporating prospectively acquired US images of soft-tissue masses with registered grayscale, Doppler, shear-wave elastography, and back-scatter data to train a CNN model. Further refinement of the model can also be achieved through implementation and training of a region selection algorithm, such as a region-based CNN, to automatically identify masses on a full–field-of-view US image, obviating the need for manual image cropping. Additionally, comparison of the performance of a CNN model with prospective interpretations of expert radiologists working in a clinical setting is needed to better establish the true relative performance of such models.

The applications of CNN models for characterization of masses depicted on US images have been demonstrated in the setting of thyroid nodule classification (13,14,31) and axillary node and breast mass classification in breast cancer (16,17). The applicability of CNN models in musculoskeletal US has also been shown in skeletal muscle characterization (21), myotendinous junction segmentation (22), and biceps tendinopathy grading (23). This study expands the use of CNNs in musculoskeletal US to the classification of soft-tissue masses.

The results of this preliminary study demonstrate the feasibility of training CNN models to classify soft-tissue masses depicted on US images. In particular, the model trained to differentiate between benign and malignant masses achieved a high level of diagnostic performance, which likely can be further improved with additional training data and with the incorporation of new US technologies. Given its high sensitivity for predicting malignant masses, the trained CNN would be most suitable as a screening application to assist radiologists in the triage of masses for further clinical work-up. In conclusion, by using a small dataset, a CNN was trained to differentiate between benign and malignant soft-tissue masses depicted on US images, with performance matching that of two experienced musculoskeletal radiologists.

Disclosures of Conflicts of Interest: B.W. disclosed no relevant relationships. L.P. Activities related to the present article: grant from the French Agence régionale de santé for a volunteer research internship at NYU Langone. Activities not related to the present article: disclosed no relevant relationships. Other relationships: disclosed no relevant relationships. C.B. disclosed no relevant relationships. R.S.A. disclosed no relevant relationships.

Author Contributions

Author contributions: Guarantors of integrity of entire study, B.W., R.S.A.; study concepts/study design or data acquisition or data analysis/interpretation, all authors; manuscript drafting or manuscript revision for important intellectual content, all authors; approval of final version of submitted manuscript, all authors; agrees to ensure any questions related to the work are appropriately resolved, all authors; literature research, B.W., L.P., C.B.; clinical studies, all authors; experimental studies, B.W.; statistical analysis, B.W., L.P.; and manuscript editing, B.W., C.B., R.S.A.

The authors declared no funding for this work.

References

  • 1. Carra BJ, Bui-Mansfield LT, O’Brien SD, Chen DC. Sonography of musculoskeletal soft-tissue masses: techniques, pearls, and pitfalls. AJR Am J Roentgenol 2014;202(6):1281–1290.
  • 2. McNally EG. The development and clinical applications of musculoskeletal ultrasound. Skeletal Radiol 2011;40(9):1223–1231.
  • 3. Lakkaraju A, Sinha R, Garikipati R, Edward S, Robinson P. Ultrasound for initial evaluation and triage of clinically suspicious soft-tissue masses. Clin Radiol 2009;64(6):615–621.
  • 4. Hwang S, Adler RS. Sonographic evaluation of the musculoskeletal soft tissue masses. Ultrasound Q 2005;21(4):259–270.
  • 5. Charnock M, Kotnis N, Fernando M, Wilkinson V. An assessment of ultrasound screening for soft tissue lumps referred from primary care. Clin Radiol 2018;73(12):1025–1032.
  • 6. Wagner JM, Lee KS, Rosas H, Kliewer MA. Accuracy of sonographic diagnosis of superficial masses. J Ultrasound Med 2013;32(8):1443–1450.
  • 7. Grimer RJ, Briggs TWR. Earlier diagnosis of bone and soft-tissue tumours. J Bone Joint Surg Br 2010;92(11):1489–1492.
  • 8. Krizhevsky A, Sutskever I, Hinton GE. ImageNet Classification with Deep Convolutional Neural Networks. In: Pereira F, Burges CJC, Bottou L, Weinberger KQ, eds. Adv Neural Inf Process Syst 25. Red Hook, NY: Curran Associates, 2012; 1097–1105.
  • 9. Esteva A, Kuprel B, Novoa RA, et al. Dermatologist-level classification of skin cancer with deep neural networks. Nature 2017;542(7639):115–118 [Published correction appears in Nature 2017;546(7660):686.].
  • 10. Ciompi F, Chung K, Van Riel SJ, et al. Towards automatic pulmonary nodule management in lung cancer screening with deep learning. Sci Rep 2017;7:46479 [Published correction appears in Sci Rep 2017;7:46878.].
  • 11. Lakhani P, Sundaram B. Deep Learning at Chest Radiography: Automated Classification of Pulmonary Tuberculosis by Using Convolutional Neural Networks. Radiology 2017;284(2):574–582.
  • 12. Rajpurkar P, Irvin J, Zhu K, et al. CheXNet: Radiologist-Level Pneumonia Detection on Chest X-Rays with Deep Learning. ArXiv 1711.05225 [preprint] https://arxiv.org/abs/1711.05225. Posted December 25, 2017. Version 1711.05225v3.
  • 13. Wang L, Yang S, Yang S, et al. Automatic thyroid nodule recognition and diagnosis in ultrasound imaging with the YOLOv2 neural network. World J Surg Oncol 2019;17(1):12.
  • 14. Nguyen DT, Kang JK, Pham TD, Batchuluun G, Park KR. Ultrasound Image-Based Diagnosis of Malignant Thyroid Nodule Using Artificial Intelligence. Sensors (Basel) 2020;20(7):E1822.
  • 15. Buda M, Wildman-Tobriner B, Castor K, Hoang JK, Mazurowski MA. Deep Learning-Based Segmentation of Nodules in Thyroid Ultrasound: Improving Performance by Utilizing Markers Present in the Images. Ultrasound Med Biol 2020;46(2):415–421.
  • 16. Sun Q, Lin X, Zhao Y, et al. Deep Learning vs. Radiomics for Predicting Axillary Lymph Node Metastasis of Breast Cancer Using Ultrasound Images: Don’t Forget the Peritumoral Region. Front Oncol 2020;10:53.
  • 17. Tanaka H, Chiu SW, Watanabe T, Kaoku S, Yamaguchi T. Computer-aided diagnosis system for breast ultrasound images using deep learning. Phys Med Biol 2019;64(23):235013.
  • 18. Hu SY, Xu H, Li Q, Telfer BA, Brattain LJ, Samir AE. Deep Learning-Based Automatic Endometrium Segmentation and Thickness Measurement for 2D Transvaginal Ultrasound. Annu Int Conf IEEE Eng Med Biol Soc 2019;2019:993–997.
  • 19. Liu D, Liu F, Xie X, et al. Accurate prediction of responses to transarterial chemoembolization for patients with hepatocellular carcinoma by using artificial intelligence in contrast-enhanced ultrasound. Eur Radiol 2020;30(4):2365–2376.
  • 20. Orlando N, Gillies DJ, Gyacskov I, Romagnoli C, D’Souza D, Fenster A. Automatic prostate segmentation using deep learning on clinically diverse 3D transrectal ultrasound images. Med Phys 2020;47(6):2413–2426.
  • 21. Cunningham RJ, Loram ID. Estimation of absolute states of human skeletal muscle via standard B-mode ultrasound imaging and deep convolutional neural networks. J R Soc Interface 2020;17(162):20190715.
  • 22. Zhou GQ, Huo EZ, Yuan M, et al. A Single-Shot Region-Adaptive Network for Myotendinous Junction Segmentation in Muscular Ultrasound Images. IEEE Trans Ultrason Ferroelectr Freq Control 2020;3010(c):1–1.
  • 23. Lin BS, Chen JL, Tu YH, et al. Using Deep Learning in Ultrasound Imaging of Bicipital Peritendinous Effusion to Grade Inflammation Severity. IEEE J Biomed Health Inform 2020;24(4):1037–1045.
  • 24. Simonyan K, Zisserman A. Very Deep Convolutional Networks for Large-Scale Image Recognition. ArXiv 1409.1556 [preprint] https://arxiv.org/abs/1409.1556. Posted April 10, 2015. Version 1409.1556v6.
  • 25. Chollet F. Keras. GitHub, 2015.
  • 26. Abadi M, Agarwal A, Barham P, et al. TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems. ArXiv 1603.04467 [preprint] https://arxiv.org/abs/1603.04467. Posted March 16, 2016. Version 1603.04467v2.
  • 27. McNemar Q. Note on the sampling error of the difference between correlated proportions or percentages. Psychometrika 1947;12(2):153–157.
  • 28. Clopper CJ, Pearson ES. The Use of Confidence or Fiducial Limits Illustrated in the Case of the Binomial. Biometrika 1934;26(4):404–413.
  • 29. DeLong ER, DeLong DM, Clarke-Pearson DL. Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach. Biometrics 1988;44(3):837–845.
  • 30. Sun X, Xu W. Fast Implementation of DeLong’s Algorithm for Comparing the Areas Under Correlated Receiver Operating Characteristic Curves. IEEE Signal Process Lett 2014;21(11):1389–1393.
  • 31. Nguyen DT, Pham TD, Batchuluun G, Yoon HS, Park KR. Artificial Intelligence-Based Thyroid Nodule Classification Using Information from Spatial and Frequency Domains. J Clin Med 2019;8(11):1976.

Article History

Received: May 27, 2020
Revision requested: July 27, 2020
Revision received: October 5, 2020
Accepted: October 28, 2020
Published online: December 2, 2020