Published Online: https://doi.org/10.1148/ryai.2021210015

Abstract

Purpose

To construct and evaluate the efficacy of a deep learning system to rapidly and automatically locate six vertebral landmarks, which are used to measure vertebral body heights, and to output spine angle measurements (lumbar lordosis angles [LLAs]) across multiple modalities.

Materials and Methods

In this retrospective study, MR (n = 1123), CT (n = 137), and radiographic (n = 484) images were used from a wide variety of patient populations, ages, disease stages, bone densities, and interventions (n = 1744 total patients, 64 years ± 8, 76.8% women; images acquired 2005–2020). Trained annotators assessed images and generated data necessary for deformity analysis and for model development. A neural network model was then trained to output vertebral body landmarks for vertebral height measurement. The network was trained and validated on 898 MR, 110 CT, and 387 radiographic images and was then evaluated or tested on the remaining images for measuring deformities and LLAs. The Pearson correlation coefficient was used in reporting LLA measurements.

Results

On the holdout testing dataset (225 MR, 27 CT, and 97 radiographic images), the network was able to measure vertebral heights (mean height percentage of error ± 1 standard deviation: MR images, 1.5% ± 0.3; CT scans, 1.9% ± 0.2; radiographs, 1.7% ± 0.4) and produce other measures such as the LLA (mean absolute error: MR images, 2.90°; CT scans, 2.26°; radiographs, 3.60°) in less than 1.7 seconds across MR, CT, and radiographic imaging studies.

Conclusion

The developed network was able to rapidly measure morphometric quantities in vertebral bodies and output LLAs across multiple modalities.

Keywords: Computer Aided Diagnosis (CAD), MRI, CT, Spine, Demineralization-Bone, Feature Detection

Supplemental material is available for this article.

© RSNA, 2021

Summary

The developed deep learning system was capable of automatically measuring vertebral heights across MRI, CT, and radiographic studies to aid in opportunistic screening for anatomical deformities.

Key Points

  • A deep learning system was developed to detect six landmarks on each vertebral body and to produce vertebral deformity diagnoses on individual vertebrae for MR, CT, and radiographic imaging studies.

  • The neural network achieved less than a 3% mean height measurement error and output other measures (lumbar lordosis angle with a mean error < 3.6°) consistently across all modalities and spine regions.

  • The network produced outputs in less than 1.7 seconds, on average, per imaging section by using publicly available, free computing power.

Introduction

Vertebral body deformities are a common outcome of bone diseases and vertebral fractures across the world. For example, osteoporosis, which affects 10 million individuals over the age of 50, causes nearly 50% of vertebral body deformities in the United States alone (1–3). These vertebral deformities reduce mobility, increase depression, and decrease quality of life among patients (4). However, only one-third of vertebral deformities are clinically diagnosed by using underlying radiologic studies (5). Analyzing deformities and the morphometric changes (ie, changes in the vertebral body shape) as a result of their incidence can help radiologists determine the existence of prevalent fractures and inform the future course of treatment.

Vertebral deformities can be measured by using two methods: the Genant semiquantitative (SQ) method or the quantitative morphometric (QM) method. The Genant SQ method is used to visually assess vertebral fracture grades (normal, mild, moderate, or severe) and types (wedge, biconcave, or crush). For QM, a reader must place six landmarks around the vertebral body (corresponding to the anterior, middle, and posterior of the vertebral body). From these, vertebral heights are calculated for use in fracture identification algorithms. The SQ and QM methods have been compared in the assessment of fractures, and whereas QM can detect fractures, a confirmatory SQ read is needed to reduce false-positive results (6). Additionally, the QM method has not been widely used in clinical practice because it is time-consuming to mark vertebral key points (7). However, relying on the SQ method alone is not ideal, as scans must be brought to the attention of a radiologist for review, potentially leading to misses among cases that do not display clinical symptoms of fracture. This issue potentially contributes to the high rate of underreporting of vertebral fractures among radiologists (as high as 66%–85%) (8–10). A method that automates the QM process could be used to opportunistically prioritize scans for SQ grading by a radiologist, potentially reducing the underreporting of vertebral deformities.
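For illustration, the QM height calculation described above can be sketched as follows. This is a minimal sketch, not the paper's implementation; the landmark ordering assumed here (superior anterior, middle, posterior, then the inferior counterparts) is a convention chosen for the example.

```python
import math


def vertebral_heights(landmarks):
    """Compute anterior (Ha), middle (Hm), and posterior (Hp) heights
    from six QM landmarks given as (x, y) coordinates.

    Assumed ordering (for illustration only):
    [sup_ant, sup_mid, sup_post, inf_ant, inf_mid, inf_post]
    """
    def dist(p, q):
        return math.hypot(p[0] - q[0], p[1] - q[1])

    sup_ant, sup_mid, sup_post, inf_ant, inf_mid, inf_post = landmarks
    return {
        "Ha": dist(sup_ant, inf_ant),
        "Hm": dist(sup_mid, inf_mid),
        "Hp": dist(sup_post, inf_post),
    }
```

The three heights are then fed into fracture identification rules, such as the height-reduction thresholds used for deformity grading.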

Artificial intelligence models offer potential solutions for the automation of QM, as such algorithms are highly accurate at localizing objects and marking landmarks on images (11). We describe the development and efficacy of an artificial intelligence–based deep learning system, a neural network called SpineToolKit, for performing vertebral height measurements on sagittal spine MR and CT imaging studies, as well as on lateral radiographs, with an evaluation time of less than 2 seconds per imaging section when run on publicly available computing power. Additionally, we show that the resulting outputs from the network can yield other clinically useful measurements, such as, but not limited to, L1–L5 lumbar lordosis. This network offers the ability to opportunistically screen for vertebral deformities, reducing the underreporting of vertebral deformities.

Materials and Methods

System Overview

The final deep learning system (neural network) accomplished our goals by detecting vertebral bodies and finding the locations of six landmarks for each vertebra. These landmarks are used to measure vertebral body heights and the lumbar lordosis angle (LLA). An overview of the capabilities and potential applications of the neural network is shown in Figure 1, and an example of how the network can be used to prioritize scans for review can be found in the Movie. We trained a version of the SpineToolKit network for MR, CT, and radiographic images. The final system receives, as input, an imaging study. Then it detects and outputs six landmarks for each vertebral body. Finally, it outputs anterior, middle, and posterior vertebral heights for vertebral deformity quantification. Details about the network structure can be found in Figure 2.

Figure 1: Basic overview of our proposed network in this study. Currently, physicians detect deformities and fractures by relying on manually measuring vertebral bodies or inspecting imaging studies by eye (without measurement aid). Overall, these methods have led to only 33% of vertebral deformities being clinically diagnosed and a number of deformities being underreported (8,9). Our method uses a neural network to detect six landmarks on each vertebral body (in pink) as well as heights (in green) that produce height measurements (eg, red text by top vertebral body) for prioritization of scans for review. PACS = picture archiving and communication system.

Movie: A scan prioritization scheme that is based on outputs of the SpineToolKit algorithm. Scans in the first column are put into a central directory, which SpineToolKit checks every 100 msec. Upon “seeing” a new scan in the directory, SpineToolKit then predicts keypoint locations for vertebral bodies in the scan (and selects the most appropriate central section per the programmatic algorithm established in Appendix E3). From the keypoint locations, wedge, biconcave, and crush deformity values are calculated and stratified into values 0%–19%, 20%–24%, 25%–39%, and 40%+. To calculate a priority score, the vertebrae with wedge/biconcave/crush values greater than 40% receive a weighting of 4, those with values between 25% and 39% get a weighting of 3, those with values between 20% and 24% get a weighting of 2, and those with values in the range 0%–19% get a weighting of 1. The weighted average is then calculated and reported as the priority score. This system prioritizes individuals with a high number of higher-grade deformities for review as intended.
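The weighting scheme described in the Movie caption can be sketched as follows. This is one reading of that scheme (an average of per-vertebra weights); the exact prioritization logic is the authors' and is not reproduced here.

```python
def priority_score(deformity_pcts):
    """Average the per-vertebra weights implied by the stratification
    above: 0%-19% -> 1, 20%-24% -> 2, 25%-39% -> 3, >= 40% -> 4.

    `deformity_pcts` holds each vertebra's worst wedge/biconcave/crush
    deformity value, in percent.
    """
    def weight(pct):
        if pct >= 40:
            return 4
        if pct >= 25:
            return 3
        if pct >= 20:
            return 2
        return 1

    weights = [weight(p) for p in deformity_pcts]
    return sum(weights) / len(weights)
```

Under this reading, a spine with many high-grade deformities receives a score near 4 and is reviewed first.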

Figure 2: Basic outline of neural network. The network is composed of three parts: a feature generation network (FGN), a region recognition network (RRN), and a landmark detection network. The FGN performs several convolution (conv) and downsampling steps (in stages 1–4) to create features that can be learned on (dimensions of images after transformations are applied are reported as the number of features and the height by the width in pixels). Intermediate outputs from the FGN were each used to train the RRN, which produces an “objectness” logit map (showing the probability of an approximate region containing an object) and anchor deltas (which are preliminary bounding boxes for desired objects). Then the objectness map and the preliminary bounding boxes were combined (along with intermediate features from the FGN), bounding boxes were refined, and final boxes were classified as vertebral body or background. Next, the network marks landmarks for each detected object through a landmark region–based convolutional neural network. Three networks were trained in this study—one for each modality (MRI, CT, and radiography). Network architectures were adapted from mask region–based convolutional neural networks, and the layer names were changed for clarity. feats = features, wrt = with respect to.

Dataset Description and Patient Population Characteristics

In this retrospective study, data used to train and test the network came from multiple centers. Institutional review board approval was obtained for downloading and analyzing de-identified imaging studies (Health Insurance Portability and Accountability Act approved). MRI studies (1153 patients) were obtained in men and women who underwent imaging (sagittal spine) at the Hospital of the University of Pennsylvania Radiology Center between January 1, 2005, and January 20, 2014, with indications of fracture, back pain, or osteoporosis. Radiographic imaging data (492 patient images; lateral spine–lumbar T12–T10 thoracic region) were also randomly selected from the Hospital of the University of Pennsylvania radiology archives (September 10, 2019, to June 15, 2020) for indications similar to those of the MRI studies. One hundred patient scans from the CT dataset (full spine) were obtained from an institutional collaborator. An additional 60 patient scans were obtained from a spine image segmentation challenge (VerSe 2019 [12,13]; public phase 1 and 2 data were used). Images (30 MR, 23 CT, and eight radiographic images) were excluded from the dataset for the following reasons: imaging study files were corrupt or images were of such low resolution that annotators could not locate vertebral body landmarks. Refer to Figure E1 (supplement) for a flow diagram of patient scans.

Image Acquisition

MRI.— T1- and T2-weighted images were acquired across more than 24 imagers. These imagers used a fast spin-echo sequence; magnetic field strength with a range of 1.5–3 T; a section thickness of 3–4 mm; a mean repetition time of 625.77 msec (range, 250–5000 msec); a mean echo time of 11.86 msec (range, 5.56–101.02 msec); a mean pixel size of 0.55 mm (range, 0.25–1.50 mm); and a matrix size with mean width, height, and row measurements of 487.56 (range, 256–1024).

CT.— CT examinations were performed with or without contrast enhancement; imaging data were acquired by using Philips Brilliance 64, iCT 256, and IQon scanners and Siemens Somatom Definition AS and AS+ scanners in helical mode, and a section thickness of 0.5–1 mm and a peak tube voltage of 120–140 kVp were used.

Radiography.— Spine radiographs were acquired by using numerous equipment types and imaging conditions and were resized to 3520 × 4280 pixels (0.19 mm per pixel).

Data Annotation and Processing

To generate training data for the neural network, each image was manually annotated with six landmarks for each visible vertebral body. Annotation was carried out by one of 24 trained research assistants (N.A., P.B., G.C., S.T., A.T., T.L., I.F., N.B., or H.C., with hundreds to thousands of hours of musculoskeletal annotation experience) and was reviewed for correctness by another trained research assistant (A.S., with >500 hours of musculoskeletal spine-specific annotation experience). For each vertebral body, six landmarks were annotated as per the QM method (refer to the right panels of Fig 3 for an example of points; refer also to Appendix E1 [supplement]). Only the imaging section with the highest number of visible vertebrae was selected for annotation (Appendix E2 [supplement]). Last, annotations were validated for accuracy against those of experienced radiologists (E.T. and B.J.K., with 26 and 39 years of musculoskeletal radiology experience, respectively; Appendix E1 [supplement]). On average, annotator landmarks differed from musculoskeletal annotator landmarks by 1.24 mm for MR images, 0.88 mm for CT scans, and 0.91 mm for radiographs, indicating that the research assistant annotations were consistent with those of experienced musculoskeletal radiologists. Training images and annotations were also augmented (replicated or randomly transformed) to increase the size of the training set (Appendixes E1, E3 [supplement]). Table 1 shows information about the number of sections in the training and testing sets and the number of vertebrae in each spine region in each dataset.

Figure 3: Landmark detection and deformity diagnosis. In the first column, relative errors are reported for ranges of vertebral bodies, ranging from L5 to C3; C1 and C2 were not included because it was impossible to detect Hm (see Fig E5 [supplement] for details on the Hm measurement). L6 was not annotated when present. Error percentages in the x and y direction are reported as a percentage of the vertebral body width and height, respectively. Box-and-whisker plots show the median, the 25th–75th percentile range, and 1.5 times the interquartile range. In the second column, the height measurement errors by deformity type are shown. The height measurement errors were stratified by the degree of calculated height reduction. In the third column, example outputs from the neural network are shown. Yellow dots are ground truth–annotated vertebral body landmarks, and magenta dots are landmarks predicted by the network. Text annotations next to vertebral bodies report the type of deformity and grade of deformity. Figure E3 (supplement) contains vertebra-by-vertebra relative landmark errors. Hm = middle height, QM = quantitative morphometry.

Table 1: Patient Data

Neural Network Training and Testing

The neural network was trained to detect vertebral bodies and produce six landmarks for each vertebral body. From these landmarks, calculations were carried out to determine the QM height measurements and LLAs. To train the network, cases were first assigned to a training or a holdout testing set. Training set cases were then split into five segments (for each modality) for use in fivefold cross-validation. The highest-performing fold (determined by using the vertebral body detection accuracy) was then evaluated on the held-out test set. Results from each of the training folds and the test set can be found in Table E1 (supplement). Network training was accomplished by using publicly available computing power (Google Colab), and a pretrained feature extraction network was used to shorten training time (see Appendixes E4–E7 [supplement] for details).
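A fivefold case-level split of the kind described above can be sketched as follows. This is a generic illustration; the paper's exact split procedure (per modality, after holdout assignment) is detailed in its supplement.

```python
import random


def five_fold_splits(case_ids, seed=0):
    """Shuffle training cases and partition them into five folds,
    returning (train, validation) case lists for each fold."""
    ids = list(case_ids)
    random.Random(seed).shuffle(ids)
    folds = [ids[i::5] for i in range(5)]

    splits = []
    for k in range(5):
        val = folds[k]
        train = [c for j, fold in enumerate(folds) if j != k for c in fold]
        splits.append((train, val))
    return splits
```

Splitting at the case level (rather than the image level) prevents sections from the same patient from leaking between training and validation.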

Statistical Analysis and Model Evaluation

Evaluation of the performance of the network was carried out by using multiple metrics. Further details can be found in Appendix E8 (supplement).

Network evaluation time.—We evaluated how long it took for the network to output landmarks for each modality by running the network on all testing data and measuring processing time for each input imaging section. We report the mean ± 1 standard deviation for these time measurements.

Section selection and vertebral body detection accuracy.—Because further analysis was predicated on the ability of the network to detect vertebral bodies, it was necessary to evaluate whether the network could select the correct section for measurement and then detect vertebral bodies.

To evaluate section selection accuracy, imaging studies from the test dataset (only MRI and CT studies) were input into their respective networks. On the basis of the output, an imaging section was automatically selected for evaluation based on a programmatic algorithm (see Appendix E3 [supplement]) and analyzed for correctness (ie, whether it chose the same section as the annotators). Accuracy and error distance (how far the automatically selected section was from the annotator's section) were reported. A one-sample t test value was calculated (null hypothesis: selected section error distance = 0) along with a P value (α = .05).
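The one-sample t statistic used above (null hypothesis: mean section error distance of 0) can be sketched with the standard library. This is a minimal sketch of the test statistic only; the paper presumably used a standard statistics package for the accompanying P value.

```python
import math
import statistics


def one_sample_t(errors, mu0=0.0):
    """One-sample t statistic for section-selection error distances:
    t = (mean - mu0) / (s / sqrt(n)), with s the sample SD."""
    n = len(errors)
    mean = statistics.fmean(errors)
    sd = statistics.stdev(errors)  # sample standard deviation (n - 1)
    return (mean - mu0) / (sd / math.sqrt(n))
```

A t statistic near zero (as when selection errors of −1 and +1 sections cancel) is consistent with failing to reject the null hypothesis, matching the nonsignificant differences reported in the Results.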

Each of the trained networks was also evaluated for its ability to detect vertebral bodies. A correct detection of a vertebral body was registered if the intersection over union (IoU) (Fig 4A) of the predicted versus ground truth bounding box was greater than or equal to 0.70. Additionally, a precision-recall curve was generated for each of the modalities (evaluates other IoU cutoffs; see Fig E4 [supplement]).

Figure 4: Object detection metrics. (A) The top left panel shows a graphic explaining the definition of intersection over union (IoU). For the remaining panels, we show data from the holdout test set. (B) We report a histogram for imaging section selections of the network in which bins correlate to the number of sections away from the ground truth–annotated section (negative numbers are to the left of the section, positive numbers are to the right, and zero is the exact same section as the manually picked section). (C) At an IoU threshold of 0.7 (chosen for its ability to produce the highest number of true-positive results while minimizing false-positive results), we also report individual accuracy levels for detecting vertebrae across modalities in different regions of the spine. The dataset did not contain any radiographs of the cervical region. (D) Last, we show some example outputs from the MRI, CT, and the radiographic networks, with vertebral bodies being outlined in white, green, and purple (images were brightened for print).

Landmark and deformity detection accuracy.—To evaluate the accuracy of the landmark location predictions from the neural network across all regions of the spine, we evaluated how close the network-predicted landmarks were to their manually annotated ground truth locations in the testing set. For each of the landmark locations, a percentage of error was determined in the x and y direction relative to the width and height of the vertebral body to account for different vertebral body sizes (Fig 3).

From these landmark locations, we then programmatically measured deformities as per formulas (14) defined in Figure E5 (supplement). On the basis of these calculations, we calculated the wedge, biconcave, and crush deformity values for each vertebral body (only the most severe deformities were noted). We then stratified network-predicted height measurement results by wedge, biconcave, and crush deformity value ranges (<20%, 20%–25%, 25%–40%, >40%) for interpretability.
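A common QM formulation of the wedge, biconcave, and crush values can be sketched as follows. The paper's exact formulas are defined in its Figure E5 (supplement) and are not reproduced here; this sketch uses the conventional height-reduction ratios.

```python
def deformity_values(Ha, Hm, Hp, Hp_adjacent=None):
    """Percentage height reductions commonly used to grade deformities:
    wedge from anterior vs posterior height, biconcave from middle vs
    posterior height, and crush from posterior height vs an adjacent
    vertebra's posterior height (when available)."""
    wedge = 100 * (Hp - Ha) / Hp
    biconcave = 100 * (Hp - Hm) / Hp
    crush = None
    if Hp_adjacent is not None:
        crush = 100 * (Hp_adjacent - Hp) / Hp_adjacent
    return {"wedge": wedge, "biconcave": biconcave, "crush": crush}
```

Stratifying these values into the ranges above (<20%, 20%–25%, 25%–40%, >40%) then yields the deformity grades used for reporting.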

LLA measurement.—As a measure of the additional clinical applicability of our network, we also evaluated how the network outputs can be used to measure LLA (see left panel of Fig 5). These measurements can provide additional information regarding general spine health status and other bone diseases (15). By using the detected landmarks for the L1 bottom corners and the L5 top corners, we calculated LLA from the network (by using a Python function that calculates the arccosine of the L1–L5 vector pair dot product) and compared it with the ground truth LLA. We then determined whether there was correlation (Pearson correlation coefficient) between the predicted LLA and the ground truth LLA and whether it was statistically significant (α = .05, null hypothesis: no correlation). Note that most LLA calculations are performed in the standing position; we reported the LLAs from supine images (CT, MRI).
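The arccosine-of-dot-product calculation described above can be sketched as follows. The endplate corner-point convention shown here is an assumption for illustration; the paper uses the detected L1 bottom corners and L5 top corners.

```python
import math


def lumbar_lordosis_angle(l1_bottom, l5_top):
    """Angle in degrees between the L1 inferior endplate and the L5
    superior endplate, each given as a pair of corner points
    ((x, y), (x, y))."""
    def vec(p, q):
        return (q[0] - p[0], q[1] - p[1])

    v1 = vec(*l1_bottom)
    v2 = vec(*l5_top)
    dot = v1[0] * v2[0] + v1[1] * v2[1]
    norm = math.hypot(*v1) * math.hypot(*v2)
    # Clamp to [-1, 1] to guard against floating-point drift in acos.
    return math.degrees(math.acos(max(-1.0, min(1.0, dot / norm))))
```

Parallel endplates give an angle of 0°, and the angle grows with lordotic curvature of the lumbar spine.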

Figure 5: Lumbar lordosis angle (LLA) calculations. Graphs using datapoints from the test set are shown (r values from other folds are shown in Table E1 [supplement]). The left panel shows a schematic of how the L1–L5 LLA was calculated (red angle = LLA). Highlighted magenta landmarks were relevant for calculation, and translucent landmarks were detected by the network but were not used in the calculation. The right three panels are graphs that show the correlation between the network-predicted LLA and the ground truth LLA. Points correspond to individual measurements in the test set for each modality.

Results

Patients in the MRI dataset were mostly women (988 of 1123; 88%) and had an average age of 67 years ± 11 (T scores, 0.36 ± 2.1). Approximately 17% of these individuals had osteopenia (−2.5 < T score < −1), and 9% had osteoporosis (T score ≤ −2.5). Patients in the CT dataset were mostly women (92 of 137; 67%) aged 65 years ± 5. Patients in the radiography dataset were nearly evenly split between men and women (261 of 484; 54% women) aged 57 years ± 17 (incomplete T score data for radiographs and CT scans). Refer to Table 1 for patient characteristics.

Network Evaluation Time

Evaluation time (measured as the number of seconds it took for the network to evaluate an image from the testing dataset) reached an average of 1.432 seconds ± 0.234 across the models (MR images, 1.311 seconds ± 0.145; CT scans, 1.394 seconds ± 0.189; radiographs, 1.492 seconds ± 0.121).

Vertebral Body Detection Accuracy

Image section selection accuracy was evaluated across all folds for both cross-sectional modalities. The average accuracy across all folds for MR images was 91.5% on the test set. For CT images, the average across all folds was 92.6%. At maximum, the network selected a section that was one section away from that selected by the human annotator for MR and CT imaging studies. For all folds, nonsignificant differences in the section selection number compared with that of the human annotator (all P > .05) were reported, indicating that the automated section selection procedure was not significantly different compared with the section selected by human annotators. A histogram of section selection results from the test set is reported in Figure 4B, and individual fold results are reported in Table E1 (supplement).

For all subsequent evaluation procedures, the network was evaluated only on the same section as the human annotators. We note that this decision may affect the interpretation of the final accuracy statistics. Accuracy of the vertebral body object detector was evaluated at an IoU of 0.70 (accuracy equals the number of detected vertebrae with an IoU ≥ 0.70 divided by the number of ground truth vertebrae in the image). Across all folds, modalities, and regions of the spine, the networks achieved an average of 97.0% accuracy for detecting vertebral bodies (range, 95%–98%; see Table E1 [supplement]). Accuracy varied across the three regions of the spine (as shown in Fig 4C) for each of the networks but exceeded 90% accuracy in each of the regions across all folds.

To analyze where the network failed to detect vertebrae, manual review was undertaken on the test set. This analysis examined the vertebrae that did not reach the IoU threshold and grouped these missed vertebrae into four categories (Table 2). For the MRI landmark network, most vertebrae that were not classified as correct detections were actually detected but fell below the IoU threshold. For the CT network, the majority of missed detections resulted from imaging artifacts. For the radiography network, most missed detections resulted from obstructive devices (such as implanted screws and leads) still visible on the image that occluded vertebral bodies. Examples of some of the artifacts are shown in Figure E6 (supplement).

Table 2: Common Misses by Landmark Models

Landmark and Deformity Detection Accuracy

One of the primary capabilities of SpineToolKit is its ability to detect six landmarks on each vertebral body that correspond to the landmarks necessary to calculate heights for deformity diagnosis. For the MRI landmark network, the median x relative error was 4.1% (interquartile range [IQR], 3.6%–4.6%), and the median y relative error was 4.4% (IQR, 3.8%–5.1%) over all regions of the spine. For the CT landmark network, the median x error was 4.1% (IQR, 3.3%–4.8%), and the median y error was 4.5% (IQR, 3.9%–4.8%) over all regions of the spine. Last, for the radiographic landmark network, the median x error was 3.3% (IQR, 3.0%–3.7%), and the median y error was 3.6% (IQR, 3.1%–4.0%) over all regions of the spine (see Table E1 [supplement] and the left column of Fig 3 for individual fold data).

The overall average relative height measurement errors were as follows: MRI landmark, 1.5% ± 0.3; CT landmark, 1.9% ± 0.2; and radiography landmark, 1.7% ± 0.4. For each modality, the relative height measurement error generally increased as the degree of the deformity increased; however, they remained low (all below 3%). Figure 3 shows (for the test set) the relative landmark errors in the x and y direction for all regions, the relative height measurement errors stratified by deformity types and degrees of deformity, and sample outputs from the network. Table E1 (supplement) shows individual data for the test data and each cross-validation fold as well.

LLA Measurement

A linear correlation between the ground truth LLA and network-predicted LLA was calculated for evaluation. The landmark networks achieved a Pearson correlation coefficient (r value) of 0.95 for MR images, 0.97 for CT scans, and 0.95 for radiographs. For all networks, the null hypothesis (ie, no correlation) was rejected (P < .001, Pearson correlation test). For the MRI landmark detection network, the average mean absolute error (MAE; average of the absolute difference between the predicted angle and the ground truth angle) across all folds was 2.90°, and for values outside of the normal LLA range (20°–45°), the average MAE across all folds was 3.02°. For the CT landmark detection network, the average MAE across all folds was 2.26°, and the average MAE for values outside the normal LLA range was 1.88°. For the radiographic landmark detection network, the average MAE across all folds was 3.60°, and the average MAE for values outside the normal LLA range was 3.71°. Figure 5 shows plots of predicted versus ground truth–derived LLAs for data from the testing set. LLA data can be found in Table E1 (supplement).
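The two agreement metrics reported above (Pearson r and MAE) can be sketched with the standard library; the paper presumably used a standard statistics package.

```python
import statistics


def pearson_r(xs, ys):
    """Pearson correlation coefficient between predicted and
    ground truth angle lists."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


def mean_absolute_error(pred, truth):
    """Average absolute difference between predicted and ground truth angles."""
    return statistics.fmean(abs(p - t) for p, t in zip(pred, truth))
```

Reporting both is informative: r captures how well predictions track the ground truth across patients, whereas MAE captures the typical per-measurement angular error in degrees.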

Discussion

In this study, we developed a highly accurate neural network that works in multiple modalities (MRI, CT, radiography) to detect landmarks necessary for QM assessment and LLA calculations on vertebral bodies across all regions of the spine. In the landmark detection networks, we found that vertebral body landmark localization errors in the x and y direction were below 10% and that average height measurement errors were less than 3%. Additionally, L1–L5 LLA measurements yielded a high correlation coefficient (r ≥ 0.95 for all modalities) and a low mean absolute angle measurement error compared with the ground truth angle (<3.6°), indicating that the landmarks the network outputs can be used for measuring other metrics in a highly accurate manner.

In comparison with prior literature, our deep learning system either matches or improves on other neural networks, machine learning architectures, and clinical workflow tools in the aforementioned tasks. In the field of landmark localization within vertebral bodies, Ebrahimi et al (16) trained a random forest classifier to detect vertebral body corners on 109 sagittal spine radiographs. Their network was trained on 49 patients, aged 6–69 years, who were either asymptomatic or had diagnosed scoliosis. For lumbar scans, their corner position mean localization root mean squared error was 1.7 mm. We calculated that our mean localization root mean squared error was 0.73 mm on the same task. In addition to having a lower landmark detection accuracy, their approach requires that separate classifiers be trained on different regions of the spine, whereas ours is region agnostic.

In the field of automated QM, the SpineAnalyzer clinical workflow tool has been used to partially automate QM reads of radiographs, dual-energy x-ray absorptiometry images, or CT scans. However, this system requires a reader to start the process by manually placing points at the center of each vertebral body. SpineAnalyzer-measured heights were 2.2%–3.6% larger than those measured by using manual annotation (7). In comparison, our method is fully automated and achieved an average height measurement error of less than 2% across all modalities. Furthermore, SpineAnalyzer was validated only for T4–L4, whereas our method has shown low height percentages of error regardless of the spine region. Additionally, SpineAnalyzer's analysis time averaged 5.4 minutes ± 1.7 per patient, whereas our tool could automatically analyze scans in 2 seconds or less (17).

In the field of automated angle measurements, Masad et al (18) created a decision tree classifier to measure the LLA on T2-weighted, sagittal, lumbar-region spine MR images. Their classifier was trained on five images (a mix of men and women who were either healthy or had diagnosed hypolordosis or hyperlordosis) and was evaluated on 27 images. They reported a correlation coefficient with the ground truth LLA of 0.932, whereas our MRI network reached a correlation coefficient of 0.95. Schwartz et al (19) proposed a solution for automating measurement of the LLA on radiographs by using a U-Net architecture that was trained on 652 lumbar sagittal radiographs (adult patients 18 years or older with closed vertebral endplate physes) and was evaluated on 164 patients (images had tumors or severe artifacts). Measuring from the top endplate of L1 to the top endplate of S1, their model achieved a mean angle error of 4.30°. Our model achieved a mean angle error of 3.35° on radiographs in the lumbar region (although we measured to the bottom of L5).

Our developed model has important potential impacts in clinical medicine, as it can evaluate images quickly (<2 seconds per section) for deformities on inexpensive hardware. From the outputs of the network, it is possible to prioritize images for further review on the basis of the magnitude of vertebral height measurements (as shown in the Movie). Opportunistic screening for other bone conditions can be done by using outputs from the network if those bone diseases result in vertebral deformities or LLA measurement changes. Last, given that our network could be trained in less than 30 minutes on publicly available compute power, we are confident that this network design can be extended to report measurements for other imaging modalities.

Limitations of the network primarily arose from the training data itself. From a dataset-specific perspective, information was not available on the clinical indications for undergoing scanning in parts of the CT dataset, limiting potential interpretability for some patient populations. For the radiographic dataset, no patient images were available for the cervical region, potentially limiting applicability of the network for images focusing on that region (although the network can be retrained with such data). Additionally, we reported that missed objects came from fused vertebrae, imaging artifacts, or obstructive devices. A future study could rectify this issue by training the network on a dataset that contains more artifacts. Last, the section selection process may not work well for cases of severe scoliosis, as no single section would capture all vertebral bodies. These cases would require human intervention to manually select other sections for landmark detection and deformity evaluation. At a higher level, our datasets skewed toward women and an older patient population. Although these attributes are common among those with diagnosed osteoporosis (who are more likely to exhibit vertebral deformities), results from the network may not apply to individuals from populations different from the ones used to train the networks.

Overall, the developed SpineToolKit model can detect vertebral bodies, locate landmarks on vertebral bodies, measure vertebral heights and deformities, and measure the LLA. Furthermore, it can do these tasks rapidly and accurately across MR images, CT scans, and radiographs. In the development of this network, we also provide a proof of concept for the clinical applications of deep learning–based object and landmark detectors. Last, we note that the architecture we use can also be repurposed and retrained for other tasks in radiologic image analysis. Future directions for this study include adapting the network to produce segmentations (for even more useful measurements) and examining its ability to opportunistically screen for vertebral deformities in a clinical care setting.

Disclosures of conflicts of interest: A.S. No relevant relationships. B.C.J. Trainee editorial board member of Radiology: Artificial Intelligence. G.N. No relevant relationships. N.A. No relevant relationships. P.B. No relevant relationships. A.D. No relevant relationships. G.C. No relevant relationships. S.T. No relevant relationships. A.T. No relevant relationships. T.L. No relevant relationships. I.F. No relevant relationships. N.B. Consulting fee or honorarium from University of Pennsylvania; patent with University of Pennsylvania. H.C. No relevant relationships. E.T. No relevant relationships. B.J.K. No relevant relationships. C.S.R. Consulting fee or honorarium from University of Pennsylvania.

Author Contributions

Author contributions: Guarantors of integrity of entire study, A.S., P.B., C.S.R.; study concepts/study design or data acquisition or data analysis/interpretation, all authors; manuscript drafting or manuscript revision for important intellectual content, all authors; approval of final version of submitted manuscript, all authors; agrees to ensure any questions related to the work are appropriately resolved, all authors; literature research, A.S., B.C.J., G.N., T.L., E.T., C.S.R.; clinical studies, E.T., C.S.R.; statistical analysis, A.S., G.N., A.D., I.F., N.B., C.S.R.; and manuscript editing, A.S., B.C.J., N.A., G.C., I.F., C.S.R.

Authors declared no funding for this work.

References

  • 1. Office of the U.S. Surgeon General . Bone health and osteoporosis: a report of the surgeon general. Washington, DC: Office of the Surgeon General, 2004.
  • 2. Wong CC, McGirt MJ. Vertebral compression fractures: a review of current management and multimodal therapy. J Multidiscip Healthc 2013;6:205–214.
  • 3. Weaver J, Sajjan S, Lewiecki EM, Harris ST, Marvos P. Prevalence and cost of subsequent fractures among U.S. patients with an incident fracture. J Manag Care Spec Pharm 2017;23(4):461–471.
  • 4. Silverman SL, Minshall ME, Shen W, Harper KD, Xie S; Health-Related Quality of Life Subgroup of the Multiple Outcomes of Raloxifene Evaluation Study. The relationship of health-related quality of life to prevalent and incident vertebral fractures in postmenopausal women with osteoporosis: results from the Multiple Outcomes of Raloxifene Evaluation Study. Arthritis Rheum 2001;44(11):2611–2619.
  • 5. Fink HA, Milavetz DL, Palermo L, et al. What proportion of incident radiographic vertebral deformities is clinically diagnosed and vice versa? J Bone Miner Res 2005;20(7):1216–1222.
  • 6. Wu CY, Li J, Jergas M, Genant HK. Comparison of semiquantitative and quantitative techniques for the assessment of prevalent and incident vertebral fractures. Osteoporos Int 1995;5(5):354–370.
  • 7. Engelke K, Stampa B, Steiger P, Fuerst T, Genant HK. Automated quantitative morphometry of vertebral heights on spinal radiographs: comparison of a clinical workflow tool with standard 6-point morphometry. Arch Osteoporos 2019;14(1):18.
  • 8. Salaffi F, Cimmino MA, Malavolta N, et al. The burden of prevalent fractures on health-related quality of life in postmenopausal women with osteoporosis: the IMOF study. J Rheumatol 2007;34(7):1551–1560.
  • 9. Cooper C, Atkinson EJ, O'Fallon WM, Melton LJ 3rd. Incidence of clinically diagnosed vertebral fractures: a population-based study in Rochester, Minnesota, 1985-1989. J Bone Miner Res 1992;7(2):221–227.
  • 10. Bartalena T, Giannelli G, Rinaldi MF, et al. Prevalence of thoracolumbar vertebral fractures on multidetector CT: underreporting by radiologists. Eur J Radiol 2009;69(3):555–559.
  • 11. He K, Gkioxari G, Dollár P, Girshick R. Mask R-CNN. ArXiv 1703.06870 [preprint] https://arxiv.org/abs/1703.06870. Posted March 20, 2017. Accessed January 5, 2021.
  • 12. Sekuboyina A, Husseini ME, Bayat A, et al. VerSe: a vertebrae labelling and segmentation benchmark. ArXiv 2001.09193 [preprint] https://arxiv.org/abs/2001.09193. Posted January 24, 2020. Accessed January 5, 2021.
  • 13. Löffler MT, Sekuboyina A, Jacob A, et al. A vertebral segmentation dataset with fracture grading. Radiol Artif Intell 2020;2(4):e190138.
  • 14. Rajapakse CS, Phillips EA, Sun W, et al. Vertebral deformities and fractures are associated with MRI and pQCT measures obtained at the distal tibia and radius of postmenopausal women. Osteoporos Int 2014;25(3):973–982.
  • 15. Cunha-Henriques S, Costa-Paiva L, Pinto-Neto AM, Fonsechi-Carvesan G, Nanni L, Morais SS. Postmenopausal women with osteoporosis and musculoskeletal status: a comparative cross-sectional study. J Clin Med Res 2011;3(4):168–176.
  • 16. Ebrahimi S, Gajny L, Skalli W, Angelini E. Vertebral corners detection on sagittal x-rays based on shape modelling, random forest classifiers and dedicated visual features. Comput Methods Biomech Biomed Eng Imaging Vis 2019;7(2):132–144.
  • 17. Kim YM, Demissie S, Eisenberg R, Samelson EJ, Kiel DP, Bouxsein ML. Intra-and inter-reader reliability of semi-automated quantitative morphometry measurements and vertebral fracture assessment using lateral scout views from computed tomography. Osteoporos Int 2011;22(10):2677–2688.
  • 18. Masad IS, Al-Fahoum A, Abu-Qasmieh I. Automated measurements of lumbar lordosis in T2-MR images using decision tree classifier and morphological image processing. Eng Sci Technol Int J 2019;22(4):1027–1034.
  • 19. Schwartz JT, Cho BH, Tang P, et al. Deep learning automates measurement of spinopelvic parameters on lateral lumbar radiographs. Spine (Phila Pa 1976) 2021;46(12):E671–E678.

Article History

Received: Jan 12 2021
Revision requested: Feb 19 2021
Revision received: Sept 12 2021
Accepted: Oct 22 2021
Published online: Nov 10 2021