Original Research

Deep Learning for Automatic Calcium Scoring in CT: Validation Using Multiple Cardiac CT and Chest CT Protocols

Published Online: https://doi.org/10.1148/radiol.2020191621

Abstract

Background

Although several deep learning (DL) calcium scoring methods have achieved excellent performance for specific CT protocols, their performance in a range of CT examination types is unknown.

Purpose

To evaluate the performance of a DL method for automatic calcium scoring across a wide range of CT examination types and to investigate whether the method can adapt to different types of CT examinations when representative images are added to the existing training data set.

Materials and Methods

The study included 7240 participants who underwent various types of nonenhanced CT examinations that included the heart: coronary artery calcium (CAC) scoring CT, diagnostic CT of the chest, PET attenuation correction CT, radiation therapy treatment planning CT, CAC screening CT, and low-dose CT of the chest. CAC and thoracic aorta calcification (TAC) were quantified using a convolutional neural network trained with (a) 1181 low-dose chest CT examinations (baseline), (b) a small set of examinations of the respective type supplemented to the baseline (data specific), and (c) a combination of examinations of all available types (combined). Supplemental training sets contained 199–568 CT images depending on the calcium burden of each population. The DL algorithm performance was evaluated with intraclass correlation coefficients (ICCs) between DL and manual (Agatston) CAC and (volume) TAC scoring and with linearly weighted κ values for cardiovascular risk categories (Agatston score; cardiovascular disease risk categories: 0, 1–10, 11–100, 101–400, >400).

Results

At baseline, the DL algorithm yielded ICCs of 0.79–0.97 for CAC and 0.66–0.98 for TAC across the range of different types of CT examinations. ICCs improved to 0.84–0.99 (CAC) and 0.92–0.99 (TAC) for CT protocol–specific training and to 0.85–0.99 (CAC) and 0.96–0.99 (TAC) for combined training. For assignment of cardiovascular disease risk category, the κ value for all test CT scans was 0.90 (95% confidence interval [CI]: 0.89, 0.91) for the baseline training. It increased to 0.92 (95% CI: 0.91, 0.93) for both data-specific and combined training.

Conclusion

A deep learning calcium scoring algorithm for quantification of coronary and thoracic calcium was robust, despite substantial differences in CT protocol and variations in subject population. Augmenting the algorithm training with CT protocol–specific images further improved algorithm performance.

© RSNA, 2020

See also the editorial by Vannier in this issue.

Summary

A deep learning calcium scoring method can accurately determine cardiovascular risk from a range of CT scans performed with different protocols.

Key Results

  • A deep learning algorithm for coronary calcium scoring showed good agreement with manual scoring (intraclass correlation coefficient [ICC], 0.79–0.97), even in evaluation of non–cardiac-dedicated CT examinations (eg, low-dose CT, diagnostic CT of the chest, PET attenuation correction CT, radiation therapy planning CT).

  • Additional training with protocol-specific CT scans further improved performance of the deep learning algorithm for calcium scoring (ICC, 0.84–0.99) compared with manual scoring.

Introduction

Coronary artery calcium (CAC) scoring in dedicated CT examinations is frequently performed to measure coronary atherosclerotic plaque burden and predict cardiovascular disease (CVD) risk. The CAC score may have increasing applications, as noted in the 2018–2019 American Heart Association and American College of Cardiology Guidelines for Cholesterol and Prevention, whereby the CAC score is a tool to refine the 10-year risk of atherosclerotic CVD when the CVD risk may be uncertain (1). CAC scoring is performed by using nonenhanced electrocardiographically synchronized cardiac CT; however, multiple reports have indicated that application of coronary calcium scoring on CT scans obtained with nondedicated protocols enables CVD risk prediction (2). Accordingly, in addition to automatic methods for calcium scoring in nonenhanced electrocardiographically synchronized cardiac CT (3–6), methods were developed for automatic scoring in low-dose chest CT for lung cancer screening (7–10), CT for planning of radiation therapy treatment (11,12), and PET attenuation correction (ACPET) CT (13). These CT examinations all include the heart; however, they vary widely in resolution, field of view, reconstruction kernels, noise characteristics, and the presence of electrocardiographic synchronization (Fig 1). Moreover, thoracic aorta calcification (TAC) is a predictor of CVD (14); therefore, several methods for automatic quantification of TAC in CT have also been developed (10,15). Although a number of these methods achieved a performance level that was close to human performance in the type of CT for which they were developed, their performance in different types of CT has not been established.

Figure 1: CT images are examples of lung screening CT from the National Lung Screening Trial (NLST), coronary artery calcium scoring CT (CAC-CT), PET attenuation correction (ACPET) CT, diagnostic CT of the chest, radiation therapy treatment planning (RadTherapy) CT, and CT examinations from the Jackson Heart Study (JHS). Calcifications are indicated in the left anterior descending artery (black arrow), left circumflex artery (white arrowhead), right coronary artery (black arrowhead), and aorta (white arrows).

Despite many efforts supporting the development of robust deep learning (DL) software (16,17), an established limitation of DL algorithms is that performance can rapidly degrade and fail with what appears to humans as only slight variations in the input data (18). Thus, the purpose of this study was to evaluate the performance of a DL method for automatic calcium scoring (10) across a wide range of CT examinations that depict the heart and thoracic aorta. Furthermore, we investigated whether addition of a low number of representative examples to the baseline training data was sufficient for the DL method to adapt to different types of CT examinations.

Materials and Methods

Image Data Sets

We included six existing data sets with 7240 individuals who underwent nonenhanced CT depicting the heart (Table 1). These included 902 consecutive electrocardiographically triggered cardiac CT scans for calcium scoring (CAC CT) (6) used in clinical practice, 399 cardiac ACPET CT scans from the Myocardial Ischaemia Detection by Circulating Biomarkers study (19) acquired at rest during radionuclide myocardial perfusion imaging, 1409 consecutive radiation therapy treatment planning (hereafter, RadTherapy) CT scans in patients with breast cancer from the Utrecht Cohort for Multiple Breast Cancer Intervention Studies and Long-Term Evaluation (11), 470 consecutive diagnostic chest CT scans, 2879 electrocardiographically gated CAC screening CT scans from the Jackson Heart Study (JHS) (20,21), and 1181 lung-screening low-dose chest CT scans from the National Lung Screening Trial (NLST) (22).

Table 1: Characteristics of Data Sets

The ACPET, JHS, RadTherapy, and NLST images were acquired as part of ethics committee–approved studies in which informed consent was obtained from all participants. The CAC CT and diagnostic chest CT examinations originated from University Medical Center Utrecht and were retrospectively and anonymously collected; therefore, the need for informed consent was waived by our institutional review board.

Reference Calcium Scores

CAC and TAC were semiautomatically labeled by four trained observers, each with prior experience (>500 examinations), without overlap between observers in the sets. The observers were supervised by an expert radiologist (P.A.d.J., 10 years of experience) who was consulted in case of doubt. In-house custom-built software for semiautomatic calcium scoring was used (6,12). In brief, the software for calcium scoring showed all regions of three or more adjacent voxels with attenuation above 130 HU with a colored overlay. The observer manually identified lesions with a mouse click and labeled them according to their anatomic location: left anterior descending artery (including the left main coronary artery), left circumflex artery, right coronary artery, and thoracic aorta. Subsequently, three-dimensional component labeling was performed to mark all connected voxels in the lesion as calcification. Calcifications in the diagonal and obtuse marginal branches were included with the left anterior descending artery and left circumflex artery, respectively. Excessive image noise and large metal implants interfered with semiautomatic segmentation of calcium in 16 of 291 (5%) diagnostic chest CT examinations. These examinations were excluded from the test set.

In the JHS, reference CAC scores were obtained by semiautomatically determining the artery trajectory and segmenting calcifications within this trajectory (21). TAC was not quantified in the JHS, and reference TAC scores were not available for this data set. As representative examples for training the automatic method, 400 examinations were reannotated by using the standard protocol.

In the NLST CT examinations, we used the previously published manual calcium annotations (10). In these examinations, there was poor signal-to-noise ratio, making semiautomatic segmentation infeasible.

Automatic Calcium Scoring Using DL

The DL network (Fig 2) for automatic calcium scoring (10) consisted of two consecutive convolutional neural networks that detected, quantified, and labeled calcifications according to their anatomic location (left anterior descending artery, right coronary artery, left circumflex artery, and thoracic aorta). Off-target calcifications in the heart, such as valvular calcification, were assigned anatomic locations but were not included in this study. The code is publicly available at https://github.com/sgmvanvelzen/calcium-scoring.

Figure 2: Images show architecture of the deep learning calcium scoring algorithm. Algorithm consists of two convolutional neural networks (CNNs). The first CNN has a large field of view and detects candidate calcifications (voxels) on the image and labels them according to their anatomic location. The second CNN has a smaller field of view and detects true calcified voxels among candidates detected by the first CNN. LAD = left anterior descending artery, LCX = left circumflex artery, RCA = right coronary artery, TAC = thoracic aorta calcification.

Two modifications were made to the previously published DL method (10). First, to standardize the field of view, examinations with a field of view larger than originally used in the NLST examinations (RadTherapy, ACPET, and diagnostic chest CT) were preprocessed with a convolutional neural network–based localization method (23). Second, to match the standard calcium scoring protocol in which calcifications were defined as lesions with attenuation above 130 HU (2), we postprocessed the results by identifying three-dimensional connected clusters of voxels with attenuation above 130 HU. We discarded lesions with fewer than 25% of the voxels labeled as calcium by the network, as these were likely false-positive findings. Moreover, lesions larger than 10 000 mm³ were discarded because these were likely connected to bone or foreign objects, such as pacemaker wires.
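The postprocessing step described above can be illustrated with a minimal sketch. It is not the authors' implementation (their code is in the repository cited earlier). The function names, the nested-list image layout indexed as [z][y][x], and the choice of 26-connectivity are assumptions made for illustration; only the 130 HU threshold, the 25% rule, and the 10 000 mm³ limit come from the text.

```python
from collections import deque

HU_THRESHOLD = 130            # "attenuation above 130 HU" (from the text)
MIN_LABELED_FRACTION = 0.25   # lesions below this labeled fraction are discarded
MAX_VOLUME_MM3 = 10_000       # lesions above this volume are discarded


def connected_lesions(hu):
    """Yield lists of (z, y, x) voxels forming 3D clusters above the threshold.

    Assumption: 26-connectivity; the source does not state the connectivity.
    """
    nz, ny, nx = len(hu), len(hu[0]), len(hu[0][0])
    offsets = [(dz, dy, dx)
               for dz in (-1, 0, 1) for dy in (-1, 0, 1) for dx in (-1, 0, 1)
               if (dz, dy, dx) != (0, 0, 0)]
    seen = set()
    for z in range(nz):
        for y in range(ny):
            for x in range(nx):
                if (z, y, x) in seen or hu[z][y][x] <= HU_THRESHOLD:
                    continue
                # Breadth-first flood fill from this seed voxel.
                lesion, queue = [], deque([(z, y, x)])
                seen.add((z, y, x))
                while queue:
                    cz, cy, cx = queue.popleft()
                    lesion.append((cz, cy, cx))
                    for dz, dy, dx in offsets:
                        n = (cz + dz, cy + dy, cx + dx)
                        if (0 <= n[0] < nz and 0 <= n[1] < ny and 0 <= n[2] < nx
                                and n not in seen
                                and hu[n[0]][n[1]][n[2]] > HU_THRESHOLD):
                            seen.add(n)
                            queue.append(n)
                yield lesion


def postprocess(hu, network_mask, voxel_volume_mm3):
    """Return the lesions that survive both discard rules."""
    kept = []
    for lesion in connected_lesions(hu):
        labeled = sum(1 for (z, y, x) in lesion if network_mask[z][y][x])
        if labeled / len(lesion) < MIN_LABELED_FRACTION:
            continue  # likely a false-positive finding
        if len(lesion) * voxel_volume_mm3 > MAX_VOLUME_MM3:
            continue  # likely connected to bone or a foreign object
        kept.append(lesion)
    return kept
```

Under these assumptions, a cluster in which the network labeled no voxels is dropped by the 25% rule, so only clusters that overlap the network output can contribute to the score.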

CAC and TAC were quantified by using the DL method trained with (a) 1181 low-dose chest CT images from the NLST (baseline), (b) a small set of images of the respective examination type supplemented to the baseline (data specific), and (c) a combination of all available examination types (combined) (Fig 3). (Although the architecture of the network remained unchanged, we refer to the networks with baseline training, with data-specific training, and with combined training as the baseline, data-specific, and combined networks, respectively.) The baseline training set was used for the original development of the DL algorithm for calcium scoring in NLST CT (10). For data-specific training, randomly selected examinations in each data set were reserved (Table 1). To ensure that sufficient CAC and TAC examples from the target CT type were available for training and validation, the number of reserved examinations was based on the calcium burden in the respective population (Table 1). For each data set, data-specific network training was performed by combining the baseline images with the training examinations from the respective data set. To evaluate whether one DL network could be applied to all types of CT, combined training was performed with all available training examinations, except the diagnostic chest examinations, because these often involved severe disease conditions that were not present in other examinations. To allow for direct comparison, the performance of the baseline, data-specific, and combined networks was evaluated by using the same test sets. There was no overlap between training and test data.

Figure 3: Illustration depicts training and evaluation of baseline, data-specific, and combined algorithms. Baseline algorithm was trained with National Lung Screening Trial (NLST) scans, and its performance was evaluated in each CT protocol type. Five data-specific algorithms were trained, one specifically for each CT protocol type, and evaluated in respective CT type. Combined algorithm was trained with a combination of all available CT protocol types (excluding diagnostic chest CT), and its performance was evaluated in all available CT protocol types. CT types used for training were NLST CT examinations, coronary artery calcium scoring CT (CAC-CT), PET attenuation correction (ACPET) CT, diagnostic chest CT, radiation therapy treatment planning (RTP) CT, and CT examinations from the Jackson Heart Study (JHS).

The network was trained with the parameters described by Lessmann et al (10). In brief, the first convolutional neural network was trained for 1000 iterations (batch size of 32) and the second was trained for 750 iterations (batch size of 64), after which training had converged in all cases. During training, batches of examples were randomly selected from the training examinations. Because of the very low calcium burden of women undergoing radiation therapy for breast cancer, we enforced 20% RadTherapy examples per batch during RadTherapy-specific training.
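One way to enforce a fixed share of RadTherapy examples per batch is sketched below. This is an illustration only: the function name, the rounding of the share to a whole number of examples, and sampling with replacement are assumptions, and the source does not specify what constitutes an "example" (for instance, a patch or a voxel neighborhood).

```python
import random


def sample_batch(baseline, radtherapy, batch_size=32, fraction=0.20, rng=random):
    """Draw one training batch with a fixed share of RadTherapy examples.

    Assumptions (not stated in the source): the share is rounded to the
    nearest whole example, and examples are drawn with replacement.
    """
    n_rad = round(batch_size * fraction)
    batch = [rng.choice(radtherapy) for _ in range(n_rad)]
    batch += [rng.choice(baseline) for _ in range(batch_size - n_rad)]
    rng.shuffle(batch)  # avoid a fixed ordering of sources within the batch
    return batch
```

With the batch sizes reported above, this rounding gives 6 of 32 examples for the first network and 13 of 64 for the second.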

Statistical Analyses

The volume of calcification (in cubic millimeters) was determined. The calcium score was computed by using the Agatston method (24). The Agatston score was used to stratify participants into five commonly used CVD risk groups: 0, 1–10, 11–100, 101–400, and greater than 400 (25).
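As a minimal sketch of the scoring and stratification described above: the Agatston method weights the area of each lesion in each axial slice by a factor derived from its peak attenuation (1 for 130–199 HU, 2 for 200–299 HU, 3 for 300–399 HU, 4 for 400 HU or higher) and sums the results. The input format (a list of per-slice lesion areas and peak values) and the handling of non-integer scores at category boundaries are assumptions; any normalization for slice thickness is omitted.

```python
def density_weight(peak_hu):
    """Agatston density factor for a lesion's peak attenuation in one slice."""
    if peak_hu < 130:
        return 0
    if peak_hu < 200:
        return 1
    if peak_hu < 300:
        return 2
    if peak_hu < 400:
        return 3
    return 4


def agatston_score(slice_lesions):
    """Sum area (mm^2) times density factor over (area, peak_hu) pairs.

    Each pair is one lesion in one axial slice (hypothetical input format).
    """
    return sum(area * density_weight(peak) for area, peak in slice_lesions)


def risk_category(score):
    """Map an Agatston score to the categories 0, 1-10, 11-100, 101-400, >400.

    Returns an index from 0 to 4. Assumption: non-integer scores are assigned
    by the upper bound of each category (for example, 10.4 maps to 11-100).
    """
    if score <= 0:
        return 0
    if score <= 10:
        return 1
    if score <= 100:
        return 2
    if score <= 400:
        return 3
    return 4
```

For example, `agatston_score([(2.0, 150), (3.0, 450)])` evaluates 2.0 × 1 + 3.0 × 4 and falls in the 11–100 category.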

For each network, average sensitivity and average false-positive calcium volume per CT type were calculated. Reliability of volume and Agatston scores was assessed with the intraclass correlation coefficient (ICC) between automatically and manually obtained scores, and their agreement was assessed by examining Bland-Altman plots with 95% limits of agreement. Because the errors tend to increase with increasing CAC, in the Bland-Altman plots, regression for nonuniform differences was used to model the variation of the absolute differences between manual and automatic scoring (26). To calculate the 95% limits of agreement, the predicted absolute differences were multiplied by 1.96 × (π/2)^0.5 because the absolute differences have a half-normal distribution. Reliability of the CVD risk categorization was assessed with the Cohen linearly weighted κ statistic, and agreement was assessed by using proportion of participants assigned to the same category manually and automatically. To test whether differences in absolute errors in quantified per-label calcification volume and Agatston score between algorithms were significant, a Wilcoxon signed-rank test for paired samples was used with a significance level of .05. Analyses were performed with statistical software (SPSS, version 23; IBM, Armonk, NY) and an online statistical tool (27).
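For readers unfamiliar with the linearly weighted κ statistic, the following is a generic sketch of its textbook definition, not the procedure run in SPSS or the online tool. Categories are assumed to be encoded as integers 0 to 4, and the function name is hypothetical. Disagreements are penalized in proportion to the distance between categories, so a one-category error costs a quarter of a four-category error.

```python
def linear_weighted_kappa(manual, automatic, n_categories=5):
    """Cohen's kappa with linear disagreement weights |i - j| / (k - 1).

    `manual` and `automatic` are equal-length sequences of category indices.
    Raises ZeroDivisionError if both raters use a single identical category
    throughout, because expected disagreement is then zero.
    """
    k, n = n_categories, len(manual)
    observed = [[0.0] * k for _ in range(k)]
    for a, b in zip(manual, automatic):
        observed[a][b] += 1.0 / n
    row = [sum(observed[i]) for i in range(k)]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]
    d_obs = d_exp = 0.0
    for i in range(k):
        for j in range(k):
            w = abs(i - j) / (k - 1)        # linear disagreement weight
            d_obs += w * observed[i][j]     # observed weighted disagreement
            d_exp += w * row[i] * col[j]    # disagreement expected by chance
    return 1.0 - d_obs / d_exp
```

Perfect agreement yields 1, and agreement no better than chance yields 0.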

Results

Demographics, Prevalence, and Extent of Coronary and Thoracic Calcifications

Median age in test sets ranged from 52 years for CAC CT to 67 years for ACPET CT (Table 1). The percentage of women in the test populations ranged from 28% for CAC CT to 100% for RadTherapy CT. The prevalence (CAC, 15%–84%; TAC, 33%–92%) and extent of CAC and TAC (CAC: median Agatston score, 18–671; TAC: median volume, 174–1522 mm³) differed substantially across the data sets.

DL Evaluation of Agatston Score and CAC Volume

For CAC volume, the baseline network yielded ICCs of 0.97 (95% confidence interval [CI]: 0.96, 0.97) for CAC CT, 0.84 (95% CI: 0.80, 0.88) for ACPET CT, 0.90 (95% CI: 0.88, 0.92) for diagnostic chest CT, and 0.85 (95% CI: 0.83, 0.87) for RadTherapy CT examinations (Table 2). ICCs using the data set–specific network improved to 0.98 (95% CI: 0.98, 0.99) for CAC CT, 0.97 (95% CI: 0.96, 0.98) for ACPET, 0.98 (95% CI: 0.97, 0.98) for diagnostic chest CT, and 0.92 (95% CI: 0.91, 0.93) for RadTherapy CT examinations. The combined network showed a detection performance that was similar to that of the data-specific network. In line with this, Bland-Altman plots in Figure 4 showed improved agreement of Agatston scores for the combined network with respect to the baseline network.

Table 2: Volume-wise Performance Evaluation of Baseline, Data-Specific, and Combined Training Networks

Figure 4: Bland-Altman plots of coronary artery calcium (CAC) Agatston scores with 95% limits of agreement (dashed lines) comparing manual scoring with automatic scoring in CAC CT, PET attenuation correction (ACPET) CT, diagnostic chest, radiation therapy treatment planning (RadTherapy), and CAC research CT from the Jackson Heart Study (JHS). Outliers are indicated by an arrow, with difference given, and 95% limits of agreement are represented by the formula: difference = ±1.96 · (π/2)^0.5 · (b + a · Mean^0.5). For the baseline algorithm, coefficients a and b were 4.6 and −7.1, respectively, for CAC CT; 18.2 and −178.0, respectively, for ACPET CT; 10.6 and −46.5, respectively, for diagnostic chest CT; 7.2 and −3.2, respectively, for RadTherapy; and 10.6 and −24.3, respectively, for JHS CT examinations. For the combined algorithm, coefficients a and b were 1.8 and −1.9, respectively, for CAC CT; 7.9 and −76.2, respectively, for ACPET CT; 3.4 and 2.7, respectively, for diagnostic chest CT; 4.8 and −2.4, respectively, for RadTherapy; and 6.6 and 10.8, respectively, for JHS examinations.
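The limits-of-agreement formula in the Figure 4 caption can be evaluated directly. The sketch below is illustrative: the function name is hypothetical, and clamping the predicted absolute difference at zero is an added assumption, because the fitted line b + a · √Mean can be negative at small means for some of the reported coefficients.

```python
import math


def limit_of_agreement(mean_score, a, b):
    """Half-width of the 95% limits of agreement at a given mean score.

    Implements 1.96 * sqrt(pi / 2) * (b + a * sqrt(mean)). The factor
    sqrt(pi / 2) converts the mean of a half-normal variable into the
    standard deviation of the underlying normal distribution.
    Assumption: negative predicted absolute differences are clamped to zero.
    """
    predicted_abs_diff = max(0.0, b + a * math.sqrt(mean_score))
    return 1.96 * math.sqrt(math.pi / 2) * predicted_abs_diff


# Combined algorithm, CAC CT (a = 1.8, b = -1.9), at a mean Agatston score of 100.
half_width = limit_of_agreement(100.0, a=1.8, b=-1.9)
print(f"limits of agreement: +/-{half_width:.1f} Agatston units")
```

With the baseline coefficients for the same CT type (a = 4.6, b = −7.1), the half-width at the same mean is larger, which matches the narrower limits reported for the combined network.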

On average, the total computation time ranged from 2 to 7 minutes depending on the examination type, image volume, and extent of calcium burden.

The agreement of the automatically detected TAC volumes improved in three of four data sets when representative data were added to the training. The baseline network had an ICC for TAC volume of 0.85 (95% CI: 0.83, 0.87) for CAC CT, 0.66 (95% CI: 0.57, 0.73) for ACPET CT, 0.96 (95% CI: 0.95, 0.97) for diagnostic chest CT, and 0.98 (95% CI: 0.97, 0.98) for RadTherapy CT examinations (Table 2). ICCs for CAC CT, ACPET CT, and RadTherapy CT increased to 0.96 (95% CI: 0.95, 0.96), 0.96 (95% CI: 0.94, 0.97), and 0.99 (95% CI: 0.98, 0.99), respectively, for the data-specific network and to similar ICCs for the combined network. The ICC for diagnostic chest CT was lower than the baseline value for the data-specific network but similar to the baseline value for the combined network, with 0.97 (95% CI: 0.96, 0.98).

Bland-Altman plots (Fig 5) show that the limits of agreement for TAC were narrower for the combined network with respect to the baseline for CAC CT, RadTherapy CT, and ACPET CT examinations. However, the 95% limits of agreement for the combined training did not visibly differ from baseline for diagnostic chest CT examinations.

Figure 5: Bland-Altman plots of thoracic aorta calcification volumes (in cubic millimeters) with 95% limits of agreement (dashed lines) comparing manual scoring with automatic scoring in coronary artery calcium (CAC) CT, PET attenuation correction (ACPET) CT, diagnostic chest CT, and radiation therapy treatment planning (RadTherapy) CT. Outliers are indicated by an arrow, with difference given, and 95% limits of agreement are represented by the formula: difference = ±1.96 · (π/2)^0.5 · (b + a · Mean^0.5). For the baseline algorithm, coefficients a and b were 21.9 and −38.3, respectively, for CAC CT; 40.4 and −859.2, respectively, for ACPET CT; 11.9 and −18.0, respectively, for diagnostic chest CT; and 8.1 and 17.5, respectively, for RadTherapy examinations. For the combined algorithm, coefficients a and b were 11.5 and −20.5, respectively, for CAC CT; 17.2 and −226.1, respectively, for ACPET CT; 11.4 and −3.9, respectively, for diagnostic chest CT; and 6.7 and −27.1, respectively, for RadTherapy examinations.

Risk Category Assignment

Reliability and accuracy of the CVD risk categories were high (κ > 0.81) for all three networks (Table 3). The baseline network achieved linearly weighted κ values of 0.95, 0.88, 0.90, 0.85, and 0.85 for CAC CT, ACPET CT, diagnostic chest CT, RadTherapy CT, and JHS CT examinations, respectively. For all test sets combined, the κ value was 0.90 (95% CI: 0.89, 0.91). Reliability increased with the data-specific network to κ values of 0.98, 0.91, 0.89, and 0.90 for CAC CT, diagnostic chest CT, RadTherapy CT, and JHS CT examinations, respectively. These values are similar to the reported reliability in NLST (κ = 0.91) (10). However, reliability of the data-specific network in ACPET CT examinations was slightly lower than at baseline (κ = 0.84). For the combined network, reliability was excellent in all examination types, with a κ value of 0.97 for CAC CT, 0.92 for ACPET CT, 0.91 for diagnostic chest CT, 0.91 for RadTherapy CT, and 0.89 for JHS CT examinations. Overall, the κ value was 0.92 (95% CI: 0.91, 0.93) for all test sets combined for both the data-specific network and the combined network.

Table 3: Reliability of Continuous Agatston Scores and Risk Category Assignment

The majority of patients were assigned to the correct CVD risk category (range, 2194 of 2479 [89%] to 507 of 529 [96%]), and 14 of 529 (3%) to 172 of 2479 (7%) ended up in the neighboring risk group (Fig 6). Points along the axes present zero-score participants in whom the algorithm falsely detected CAC lesions or participants with a nonzero score in whom the algorithm missed a CAC lesion. These errors mostly involved lesions in the coronary ostia labeled as TAC or mitral valve calcifications labeled as left circumflex artery.

Figure 6: Graphs show Agatston scores calculated automatically with the combined algorithm plotted against manually calculated Agatston scores for scoring in coronary artery calcium (CAC) CT, PET attenuation correction (ACPET) CT, diagnostic CT of chest, radiation therapy treatment planning (RadTherapy), and CAC research CT from the Jackson Heart Study (JHS). Difference between risk categories (RCs) assigned by manual and automatic calcium scoring is indicated by colored blocks. Cardiovascular disease risk categories are as follows: 0, 1–10, 11–100, 101–400, >400. For JHS examinations, random selection of 500 examinations is shown for visualization purposes. Note that scale is log scale.

Detection of Presence versus Absence of CAC

Many studies have demonstrated the negative predictive value of a CAC score of zero (28,29). Accuracy of detection of zero-score participants was 263 of 273 (96%; 95% CI: 94%, 99%), 23 of 32 (72%; 95% CI: 56%, 87%), 95 of 113 (84%; 95% CI: 77%, 91%), 690 of 713 (97%; 95% CI: 95%, 98%), and 1068 of 1268 (84%; 95% CI: 82%, 86%) for the baseline algorithm in CAC CT, ACPET CT, diagnostic chest CT, RadTherapy CT, and JHS CT examinations, respectively (Table 4). For the data-specific algorithm, accuracy was 268 of 273 (98%; 95% CI: 97%, 100%), 12 of 32 (38%; 95% CI: 21%, 54%), 94 of 113 (83%; 95% CI: 76%, 90%), 691 of 713 (97%; 95% CI: 96%, 98%), and 1168 of 1268 (92%; 95% CI: 91%, 94%) in CAC CT, ACPET CT, diagnostic chest CT, RadTherapy CT, and JHS CT examinations, respectively. For the combined algorithm, accuracy was 269 of 273 (99%; 95% CI: 97%, 100%), 22 of 32 (69%; 95% CI: 53%, 85%), 98 of 113 (87%; 95% CI: 80%, 93%), 701 of 713 (98%; 95% CI: 97%, 99%), and 1149 of 1268 (91%; 95% CI: 89%, 92%) in CAC CT, ACPET CT, diagnostic chest CT, RadTherapy CT, and JHS CT examinations, respectively. Errors that assigned a participant with a zero score to the neighboring risk category, or vice versa, occurred mostly in voxels representing noise in the direct proximity of a coronary artery or in small lesions affected by motion. Larger errors that assigned a zero-score participant to a higher risk category were described earlier.
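The source does not state how the confidence intervals for these proportions were computed. As an illustration, the sketch below uses the normal-approximation (Wald) interval, which reproduces the rounded intervals for the baseline CAC CT and ACPET CT results; this agreement is an observation about those two cases, not a confirmation of the authors' method. The function name is hypothetical.

```python
import math


def proportion_ci(successes, total, z=1.96):
    """Return (proportion, lower, upper) using the Wald interval.

    Assumption: normal approximation with bounds clipped to [0, 1].
    """
    p = successes / total
    half_width = z * math.sqrt(p * (1.0 - p) / total)
    return p, max(0.0, p - half_width), min(1.0, p + half_width)


# Baseline algorithm: zero-score detection in CAC CT and ACPET CT.
for successes, total in [(263, 273), (23, 32)]:
    p, lower, upper = proportion_ci(successes, total)
    print(f"{successes} of {total}: {p:.0%} (95% CI: {lower:.0%}, {upper:.0%})")
```

For small samples such as the 32 zero-score ACPET CT participants, the Wald interval is known to be imprecise, which is one reason the wide intervals for that data set should be read cautiously.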

Table 4: Detection of Zero-Score CAC Scans

Discussion

Calcium scoring is commonly used for cardiovascular disease (CVD) risk prediction from nonenhanced electrocardiographically triggered cardiac CT. Alternatively, calcium scores can be derived from routine chest CT images obtained for other purposes, such as lung cancer screening. In this study, we demonstrated that a deep learning (DL) method for automatic calcium scoring is robust with regard to substantial differences in CT type (eg, PET attenuation correction [ACPET] vs low-dose CT of the chest) and patient population (eg, patients with breast cancer vs heavy smokers older than 55 years) without additional training. The DL method yielded an intraclass correlation coefficient (ICC) between automatic and manual reference scores of 0.79–0.97 for coronary artery calcium (CAC) score and 0.66–0.98 for thoracic aorta calcification (TAC) score with different CT protocols. In addition, augmentation of the baseline training data with a relatively small data-specific set improved the performance to the level achieved with the data for which the network was originally developed (ICC range, 0.84–0.99 for CAC score and 0.92–0.99 for TAC score). Furthermore, training one instance with a combination of all included image types resulted in a network with similar performance (ICC range, 0.85–0.99 for CAC score and 0.96–0.99 for TAC score).

Calcium scoring is a challenging task in several of the included image types. Whereas in the literature interobserver agreement is high with dedicated CAC CT (Agatston ICC = 0.99; risk category κ = 0.99) (6), it is lower with ACPET (κ = 0.94 for four risk categories) (30). Similarly, the performance of the baseline network varied among the data sets. We demonstrated that in various types of CT scans, the agreement of automatically and manually assigned CVD risk categories was high (κ range, 0.85–0.95), and it approached interobserver agreement (6,30,31).

After CT protocol-specific training, the DL method performed similarly to previous methods (κ = 0.84 vs κ = 0.85 for ACPET CT) (13) or outperformed them (κ = 0.98 vs κ = 0.91 for CAC CT, κ = 0.89 vs κ = 0.80 for RadTherapy, κ = 0.91 vs κ = 0.80 for chest CT) (6,12,32). Note that because different data sets were used, a direct comparison cannot be made except in the case of CAC CT, in which the same test set was used (6). Although the use of CT protocol-specific training resulted in improved risk category assessment for CAC CT, RadTherapy, and JHS examinations, for ACPET CT, the agreement of risk categories slightly decreased. Visual inspection of the results revealed that the ACPET CT-specific network segmented small false-positive lesions, mostly representing noise in the vicinity of the coronaries, more often than the baseline, resulting in incorrect CVD risk categorization in the lowest risk categories. The ACPET CT examinations were more affected by noise than the other types of evaluated examinations. Moreover, because for ACPET only 199 representative examinations were added to the training set, a possible explanation for degrading performance could be insufficient variation in the representative training examples. By combining different types of examinations into one training set, the network was able to score all included types of CT examinations with high reliability compared with manual scoring.

For assessment of CVD risk, it is relevant to distinguish presence and absence of CAC (29). Because the accuracy for detecting zero-score patients is excellent (268 of 273 [98%] for the data-specific algorithm) and the false-positive rate is low (one of 256 [0%]), the presented method would be suited to rule out presence of calcifications in patients with dedicated CAC CT. With other CT protocols, errors occurred more frequently (accuracy from 12 of 32 [38%] with ACPET CT to 691 of 713 [97%] with RadTherapy CT), but one should note that in these more challenging examinations such errors are not uncommon for human readers either.

Multiple strategies have been proposed for adapting DL networks to a different task or different input data. One well-known and well-performing approach is fine-tuning a pretrained network (33,34). In our study, the baseline training set was available; therefore, we retrained the networks from initialization with a combination of CT protocol-specific data and the baseline. This approach improved the performance of the network compared with that of the baseline. In situations in which the baseline data set is not available, fine-tuning could be an interesting alternative.

This study had several limitations. First, instead of using pixelwise annotation, as in the NLST examinations (10), we annotated lesions by identifying three-dimensional clusters of voxels above 130 HU matching standard calcium scoring. The training set consisted mostly of NLST scans, so this difference in training and test references might have influenced the performance of the networks. However, given that the automatically obtained segmentations are postprocessed by matching standard calcium scoring, we expect this influence to be marginal. Second, reference calcium scores in the JHS examinations were obtained with a different protocol (21). However, because both manual and automatic scoring methods identified three-dimensional clusters of voxels with attenuation above 130 HU, we expect the discrepancy to be negligible. Third, because most CT images were acquired for purposes other than calcium scoring, not all scans were acquired according to Society of Cardiovascular Computed Tomography and Society of Thoracic Radiology guidelines for calcium scoring in noncardiac chest CT (35). However, updated American Heart Association and American College of Cardiology guidelines (1) focus on presence versus absence and risk categorization. Relatively small shifts in absolute score are of uncertain clinical importance. Moreover, because the performance of the DL method is excellent, it can potentially help radiologists report CAC in nonenhanced chest CT images per Society of Cardiovascular Computed Tomography and Society of Thoracic Radiology guidelines advice. Last, we did not investigate how many representative examinations should be added to the training set to ensure optimal performance. Instead, we estimated the number of data-specific examinations based on the calcium burden in the relevant population.

In conclusion, this study presented the validation of a deep learning algorithm in large and diverse sets of CT examinations. The results show that the method adapted well to a previously unseen CT type when a few representative training examples were added to the large training set. One combined deep learning model trained with all CT types performed as well as specialized models, indicating potential for use in clinical practice.

Disclosures of Conflicts of Interest: S.G.M.v.V. disclosed no relevant relationships. N.L. disclosed no relevant relationships. B.K.V. disclosed no relevant relationships. I.E.M.B. disclosed no relevant relationships. D.H.J.G.v.d.B. disclosed no relevant relationships. T.L. Activities related to the present article: disclosed no relevant relationships. Activities not related to the present article: institution received grants from the Dutch Technology Foundation (P15-26, 12726), with participation of Pie Medical Imaging and Philips Healthcare, from the Netherlands Organisation for Health Research and Development, with participation of Pie Medical Imaging, and from Pie Medical Imaging; holds stock in Quantib-U; had expenses covered for presentations at a workshop at the University Medical Center (Lausanne, Switzerland) that was organized by Siemens Healthineers and the University of Lausanne and at the European Cancer Summit 2019; has patents pending (U.S. patent app. 16/379,248) and issued (U.S. patent no. 10,395,366) through Pie Medical Imaging and is planning to receive royalties. Other relationships: is the cofounder, scientific lead, and a shareholder at Quantib-U. P.A.d.J. Activities related to the present article: institution received a grant from Philips Healthcare. Activities not related to the present article: disclosed no relevant relationships. Other relationships: disclosed no relevant relationships. W.B.V. disclosed no relevant relationships. A.C. disclosed no relevant relationships. J.G.T. disclosed no relevant relationships. J.J.C. disclosed no relevant relationships. M.A.V. disclosed no relevant relationships. H.M.V. disclosed no relevant relationships. I.I. Activities related to the present article: disclosed no relevant relationships. Activities not related to the present article: institution received grants from the Dutch Technology Foundation (P15-26, 12726), with participation of Pie Medical Imaging and Philips Healthcare, from the Netherlands Organisation for Health Research and Development, with participation of Pie Medical Imaging, and from Pie Medical Imaging; holds stock in Quantib-U; had expenses covered for presentations at a workshop at the University Medical Center (Lausanne, Switzerland) that was organized by Siemens Healthineers and the University of Lausanne and at the European Cancer Summit 2019; has patents pending (U.S. patent app. 16/379,248) and issued (U.S. patent no. 10,395,366) through Pie Medical Imaging and is planning to receive royalties. Other relationships: is the cofounder, scientific lead, and a shareholder at Quantib-U.

Acknowledgments

We thank the Dutch Cancer Society for supporting this research and are grateful to the U.S. National Cancer Institute (NCI) for providing access to data collected in the National Lung Screening Trial. The statements contained in this article are solely ours and do not represent or imply concurrence or endorsement by the NCI. We also thank the staffs and participants of the Jackson Heart Study (JHS). The JHS is supported and conducted in collaboration with Jackson State University (HHSN268201800013I), Tougaloo College (HHSN268201800014I), the Mississippi State Department of Health (HHSN268201800015I/HHSN26800001), and the University of Mississippi Medical Center (HHSN268201800010I, HHSN268201800011I, and HHSN268201800012I) contracts from the National Heart, Lung, and Blood Institute (NHLBI) and the National Institute on Minority Health and Health Disparities. The views expressed in this manuscript are those of the authors and do not necessarily represent the views of the NHLBI, the National Institutes of Health, or the U.S. Department of Health and Human Services.

Author Contributions

Author contributions: Guarantors of integrity of entire study, S.G.M.v.V., D.H.J.G.v.d.B., A.C.; study concepts/study design or data acquisition or data analysis/interpretation, all authors; manuscript drafting or manuscript revision for important intellectual content, all authors; approval of final version of submitted manuscript, all authors; agrees to ensure any questions related to the work are appropriately resolved, all authors; literature research, S.G.M.v.V., T.L., I.I.; clinical studies, I.E.M.B., D.H.J.G.v.d.B., T.L., W.B.V., J.G.T., J.J.C.; statistical analysis, S.G.M.v.V., A.C., H.M.V.; and manuscript editing, S.G.M.v.V., N.L., B.K.V., I.E.M.B., D.H.J.G.v.d.B., T.L., W.B.V., A.C., J.G.T., J.J.C., M.A.V., H.M.V., I.I.

Supported by the Dutch Cancer Society (NCT03206333).

References

  • 1. Grundy SM, Stone NJ, Bailey AL, et al. 2018 AHA/ACC/AACVPR/AAPA/ABC/ACPM/ADA/AGS/APhA/ASPC/NLA/PCNA Guideline on the Management of Blood Cholesterol: Executive Summary: A Report of the American College of Cardiology/American Heart Association Task Force on Clinical Practice Guidelines. Circulation 2019;139(25):e1046–e1081.
  • 2. Hecht HS. Coronary artery calcium scanning: past, present, and future. JACC Cardiovasc Imaging 2015;8(5):579–596.
  • 3. Išgum I, Rutten A, Prokop M, van Ginneken B. Detection of coronary calcifications from computed tomography scans for automated risk assessment of coronary artery disease. Med Phys 2007;34(4):1450–1461.
  • 4. Kurkure U, Chittajallu DR, Brunner G, Le YH, Kakadiaris IA. A supervised classification-based method for coronary calcium detection in non-contrast CT. Int J Cardiovasc Imaging 2010;26(7):817–828.
  • 5. Shahzad R, van Walsum T, Schaap M, et al. Vessel specific coronary artery calcium scoring: an automatic system. Acad Radiol 2013;20(1):1–9.
  • 6. Wolterink JM, Leiner T, Takx RAP, Viergever MA, Išgum I. Automatic Coronary Calcium Scoring in Non-Contrast-Enhanced ECG-Triggered Cardiac CT With Ambiguity Detection. IEEE Trans Med Imaging 2015;34(9):1867–1878.
  • 7. Shemesh J, Henschke CI, Shaham D, et al. Ordinal scoring of coronary artery calcifications on low-dose CT scans of the chest is predictive of death from cardiovascular disease. Radiology 2010;257(2):541–548.
  • 8. Jacobs PC, Gondrie MJA, van der Graaf Y, et al. Coronary artery calcium can predict all-cause mortality and cardiovascular events on low-dose CT screening for lung cancer. AJR Am J Roentgenol 2012;198(3):505–511.
  • 9. Chiles C, Duan F, Gladish GW, et al. Association of Coronary Artery Calcification and Mortality in the National Lung Screening Trial: A Comparison of Three Scoring Methods. Radiology 2015;276(1):82–90.
  • 10. Lessmann N, van Ginneken B, Zreik M, et al. Automatic Calcium Scoring in Low-Dose Chest CT Using Deep Neural Networks With Dilated Convolutions. IEEE Trans Med Imaging 2018;37(2):615–625.
  • 11. Gernaat SAM, van Velzen SGM, Koh V, et al. Automatic quantification of calcifications in the coronary arteries and thoracic aorta on radiotherapy planning CT scans of Western and Asian breast cancer patients. Radiother Oncol 2018;127(3):487–492.
  • 12. Gernaat SAM, Išgum I, de Vos BD, et al. Automatic coronary artery calcium scoring on radiotherapy planning CT Scans of breast cancer patients: Reproducibility and association with traditional cardiovascular risk factors. PLoS One 2016;11(12):e0167925.
  • 13. Išgum I, de Vos BD, Wolterink JM, et al. Automatic determination of cardiovascular risk by CT attenuation correction maps in Rb-82 PET/CT. J Nucl Cardiol 2018;25(6):2133–2142 [Published correction appears in J Nucl Cardiol 2018;25(6):2143.] https://doi.org/10.1007/s12350-017-0866-3.
  • 14. Takasu J, Katz R, Nasir K, et al. Relationships of thoracic aortic wall calcification to cardiovascular risk factors: the Multi-Ethnic Study of Atherosclerosis (MESA). Am Heart J 2008;155(4):765–771.
  • 15. Išgum I, Rutten A, Prokop M, et al. Automated aortic calcium scoring on low-dose chest computed tomography. Med Phys 2010;37(2):714–723.
  • 16. Menze BH, Jakab A, Bauer S, et al. The Multimodal Brain Tumor Image Segmentation Benchmark (BRATS). IEEE Trans Med Imaging 2015;34(10):1993–2024.
  • 17. Setio AAA, Traverso A, de Bel T, et al. Validation, comparison, and combination of algorithms for automatic detection of pulmonary nodules in computed tomography images: The LUNA16 challenge. Med Image Anal 2017;42:1–13.
  • 18. Park SH, Han K. Methodologic Guide for Evaluating Clinical Performance and Effect of Artificial Intelligence Technology for Medical Diagnosis and Prediction. Radiology 2018;286(3):800–809.
  • 19. Bank IEM. Ischaemic Heart Disease: Early Recognition and Risk Disparities [dissertation]. Vol. Chapter 3. Utrecht, the Netherlands: University of Utrecht, 2017.
  • 20. Taylor HA Jr, Wilson JG, Jones DW, et al. Toward resolution of cardiovascular health disparities in African Americans: design and methods of the Jackson Heart Study. Ethn Dis 2005;15(4 Suppl 6):S6-4–S6-17.
  • 21. Sung JH, Yeboah J, Lee JE, et al. Diagnostic Value of Coronary Artery Calcium Score for Cardiovascular Disease in African Americans: The Jackson Heart Study. Br J Med Med Res 2016;11(2):BJMMR/2016/21449.
  • 22. National Lung Screening Trial Research Team, Aberle DR, Adams AM, et al. Reduced lung-cancer mortality with low-dose computed tomographic screening. N Engl J Med 2011;365(5):395–409.
  • 23. de Vos BD, Wolterink JM, de Jong PA, Leiner T, Viergever MA, Išgum I. ConvNet-Based Localization of Anatomical Structures in 3-D Medical Images. IEEE Trans Med Imaging 2017;36(7):1470–1481.
  • 24. Agatston AS, Janowitz WR, Hildner FJ, Zusmer NR, Viamonte M Jr, Detrano R. Quantification of coronary artery calcium using ultrafast computed tomography. J Am Coll Cardiol 1990;15(4):827–832.
  • 25. Budoff MJ, Shaw LJ, Liu ST, et al. Long-term prognosis associated with coronary calcification: observations from a registry of 25,253 patients. J Am Coll Cardiol 2007;49(18):1860–1870.
  • 26. Sevrukov AB, Bland JM, Kondos GT. Serial electron beam CT measurements of coronary artery calcium: Has your patient’s calcium score actually changed? AJR Am J Roentgenol 2005;185(6):1546–1553.
  • 27. Lowry R. Kappa as a Measure of Concordance in Categorical Sorting. http://vassarstats.net/kappa.html. Accessed September 19, 2019.
  • 28. Blaha MJ, Cainzos-Achirica M, Greenland P, et al. Role of Coronary Artery Calcium Score of Zero and Other Negative Risk Markers for Cardiovascular Disease: The Multi-Ethnic Study of Atherosclerosis (MESA). Circulation 2016;133(9):849–858.
  • 29. Carr JJ, Jacobs DR Jr, Terry JG, et al. Association of Coronary Artery Calcium in Adults Aged 32 to 46 Years With Incident Coronary Heart Disease and Death. JAMA Cardiol 2017;2(4):391–399.
  • 30. Mylonas I, Kazmi M, Fuller L, et al. Measuring coronary artery calcification using positron emission tomography-computed tomography attenuation correction images. Eur Heart J Cardiovasc Imaging 2012;13(9):786–792.
  • 31. Takx RAP, de Jong PA, Leiner T, et al. Automated coronary artery calcification scoring in non-gated chest CT: agreement and reliability. PLoS One 2014;9(3):e91239.
  • 32. Cano-Espinosa C, González G, Washko GR, Cazorla M, Estépar RSJ. Automated Agatston Score Computation in non-ECG Gated CT Scans Using Deep Learning. In: Angelini ED, Landman BA, eds. Proceedings of SPIE: medical imaging 2018—image processing. Vol 10574. Bellingham, Wash: International Society for Optics and Photonics, 2018; 105742K.
  • 33. Tajbakhsh N, Shin JY, Gurudu SR, et al. Convolutional Neural Networks for Medical Image Analysis: Full Training or Fine Tuning? IEEE Trans Med Imaging 2016;35(5):1299–1312.
  • 34. Shin HC, Roth HR, Gao M, et al. Deep Convolutional Neural Networks for Computer-Aided Detection: CNN Architectures, Dataset Characteristics and Transfer Learning. IEEE Trans Med Imaging 2016;35(5):1285–1298.
  • 35. Hecht HS, Cronin P, Blaha MJ, et al. 2016 SCCT/STR guidelines for coronary artery calcium scoring of noncontrast noncardiac chest CT scans: A report of the Society of Cardiovascular Computed Tomography and Society of Thoracic Radiology. J Cardiovasc Comput Tomogr 2017;11(1):74–84 [Published correction appears in J Cardiovasc Comput Tomogr 2017;11(2):170.] https://doi.org/10.1016/j.jcct.2016.11.003.

Article History

Received: July 22 2019
Revision requested: Sept 10 2019
Revision received: Nov 16 2019
Accepted: Dec 12 2019
Published online: Feb 11 2020
Published in print: Apr 2020