Original Research

Deep Learning Analysis of Chest Radiographs to Triage Patients with Acute Chest Pain Syndrome

Published Online: https://doi.org/10.1148/radiol.221926

Abstract

Background

Patients presenting to the emergency department (ED) with acute chest pain (ACP) syndrome undergo additional testing to exclude acute coronary syndrome (ACS), pulmonary embolism (PE), or aortic dissection (AD), often yielding negative results.

Purpose

To assess whether deep learning (DL) analysis of the initial chest radiograph may help triage patients with ACP syndrome more efficiently.

Materials and Methods

This retrospective study used electronic health records of patients with ACP syndrome at presentation who underwent a combination of chest radiography and additional cardiovascular or pulmonary imaging or stress tests at two hospitals (Massachusetts General Hospital [MGH], Brigham and Women’s Hospital [BWH]) between January 2005 and December 2015. A DL model was trained on 23 005 patients from MGH to predict a 30-day composite end point of ACS, PE, AD, and all-cause mortality based on chest radiographs. Area under the receiver operating characteristic curve (AUC) was used to compare performance between models (model 1: age + sex; model 2: model 1 + conventional troponin or d-dimer positivity; model 3: model 2 + DL predictions) in internal and external test sets from MGH and BWH, respectively.

Results

At MGH, 5750 patients (mean age, 59 years ± 17 [SD]; 3329 men, 2421 women) were evaluated. Model 3, which included DL predictions, significantly improved discrimination of those with the composite outcome compared with models 2 and 1 (AUC, 0.85 [95% CI: 0.84, 0.86] vs 0.76 [95% CI: 0.74, 0.77] vs 0.62 [95% CI: 0.60, 0.64], respectively; P < .001 for all). When using a sensitivity threshold of 99%, 14% (813 of 5750) of patients could be deferred from cardiovascular or pulmonary testing for differential diagnosis of ACP syndrome using model 3 compared with 2% (98 of 5750) of patients using model 2 (P < .001). Model 3 maintained its diagnostic performance in different age, sex, race, and ethnicity groups. In external validation at BWH (22 764 patients; mean age, 57 years ± 17; 11 470 women), trends were similar and improved after fine-tuning.

Conclusion

Deep learning analysis of chest radiographs may facilitate more efficient triage of patients with acute chest pain syndrome in the emergency department.

© RSNA, 2023

Supplemental material is available for this article.

See also the editorial by Goo in this issue.

Summary

Deep learning analysis of initial chest radiographs may facilitate more efficient triage of patients presenting to the emergency department with symptoms of acute chest pain syndrome.

Key Results

  • An open-source deep learning model was retrospectively developed from 23 005 initial chest radiographs of patients with acute chest pain syndrome to identify acute coronary syndrome, pulmonary embolism, aortic dissection, and all-cause mortality.

  • The model outperformed a model including age, sex, and troponin or d-dimer positivity on the internal test set (area under the receiver operating characteristic curve, 0.85 vs 0.76; P < .001).

  • The model deferred more patients (14% vs 2%, P < .001) from additional cardiovascular or pulmonary testing with 99% sensitivity.

Introduction

Acute chest pain (ACP) syndrome accounts for over 7 million emergency department (ED) visits annually in the United States, making it one of the most common reasons for a visit (5.5% of all ED visits) (1). A minority (range, 2.0%–7.5%) of these patients are diagnosed with one of the three major cardiovascular causes of ACP syndrome: acute coronary syndrome (ACS), pulmonary embolism (PE), or aortic dissection (2,3). However, the life-threatening nature of these conditions and the low specificity of clinical evaluation (eg, electrocardiograms) and blood tests (eg, d-dimer assay) lead to substantial use of cardiovascular and pulmonary diagnostic imaging (2–7), often yielding negative results (2,3). As EDs struggle with high patient numbers and a shortage of hospital beds (8,9), effectively triaging patients with very low risk of ACS, PE, or aortic dissection who would not benefit from additional cardiovascular or pulmonary testing could improve resource and cost efficiency.

Deep learning (DL) has been used to diagnose conditions such as pneumonia and pneumothorax from chest radiographs (10), potentially increasing the efficiency of care (11,12). DL has also been used to predict frailty and long-term prognosis from chest radiographs in asymptomatic individuals (13,14). Patients presenting to the ED with symptoms of ACP syndrome often undergo initial chest radiography and may benefit from such DL analysis (15). As a proof of concept, our primary aim was to assess whether DL analysis of initial chest radiographs of patients presenting to the ED with ACP syndrome can be used to predict a composite outcome of 30-day ACS, PE, aortic dissection, or all-cause death.

Materials and Methods

The institutional review board approved this study and waived the requirement for written informed consent due to its retrospective nature. All procedures were performed in accordance with the Health Insurance Portability and Accountability Act and local and federal regulations.

Study Patients

We used the Research Patient Data Registry containing electronic health record information from hospitals in the Mass General Brigham system (16) to obtain our study sample. We included ED visits from two tertiary care hospitals, Massachusetts General Hospital (MGH) and Brigham and Women’s Hospital (BWH). Because of the completeness of clinical and social security death index data, we queried for patient encounters between January 1, 2005, and December 31, 2015. We identified adults presenting to the ED with ACP syndrome using the following inclusion criteria: (a) age of more than 18 years; (b) conventional troponin I or T test, d-dimer test, or both; (c) additional cardiovascular or pulmonary imaging or stress testing; and (d) chest radiography. All procedures needed to be performed within 1 day of ED registration. When an individual had multiple ED visits, we used only the first encounter. The Logical Observation Identifiers Names and Codes used to identify conventional troponin I or T and d-dimer tests, as well as the Current Procedural Terminology codes used to identify additional cardiovascular or pulmonary imaging or stress testing, are listed in Appendix S1.

Age at ED visit and self-reported sex, race, and ethnicity were acquired from the Research Patient Data Registry system. Conventional troponin I and T and d-dimer positivity were determined using the assay-specific age and sex thresholds for identifying abnormal values. As patients may have undergone only one serum laboratory test, we defined serum biomarker positivity as having at least one abnormal troponin I or T or d-dimer test finding within 1 day of ED registration.
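
In practice, this positivity rule reduces to a grouped any-abnormal aggregation over laboratory results. The following Python (pandas) sketch illustrates the logic; it is not the authors' parseRPDR pipeline, and all column names (patient_id, ed_datetime, lab_datetime, lab_name, abnormal_flag) are hypothetical placeholders for the registry export.

```python
import pandas as pd

def biomarker_positivity(labs: pd.DataFrame, visits: pd.DataFrame) -> pd.Series:
    """Flag patients with >=1 abnormal troponin I/T or d-dimer result
    within 1 day of ED registration. Column names are illustrative."""
    merged = labs.merge(visits[["patient_id", "ed_datetime"]], on="patient_id")
    within_1d = (merged["lab_datetime"]
                 - merged["ed_datetime"]).dt.total_seconds().abs() <= 24 * 3600
    relevant = merged["lab_name"].isin(["troponin_i", "troponin_t", "d_dimer"])
    positive = (merged.loc[within_1d & relevant]
                .groupby("patient_id")["abnormal_flag"].any())
    # Patients with no qualifying test within 1 day default to negative.
    return positive.reindex(visits["patient_id"].unique(), fill_value=False)
```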

We used the earliest chest radiograph available within 1 day of ED registration to train our DL model. Anteroposterior and posteroanterior chest radiographs were obtained using portable and nonportable devices.

To assist in the analysis of approximately 1 terabyte of electronic health record data, we developed the parseRPDR (https://CRAN.R-project.org/package=parseRPDR) open-source R package that provides a standardized framework to analyze the Research Patient Data Registry and radiologic data.

Study Outcomes

Outcomes were identified using International Classification of Diseases codes (Appendix S1) registered within the electronic health record system. Our primary outcome was the composite end point of ACS, PE, aortic dissection, or all-cause mortality within 30 days after ED registration. Our secondary outcomes were the individual outcomes.

Training, Validation, and Test Sets

Unique individuals and ED visits were identified using the hospital system–wide unique patient and visit identifier. Data from MGH were considered the internal set, while data from BWH served as the external set. The MGH internal set was split into training (used to fit the model), validation (used to select the optimal parameters of the model), and test (data unseen during the training process) sets using a 60%-20%-20% stratified random split to ensure that prevalence of our primary outcome was similar within the sets. For our primary analyses, the entire BWH set was considered an external test set. In secondary analyses, we split the BWH set further to assist in fine-tuning the model (see Secondary analyses section).
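
For illustration, a stratified 60%-20%-20% split of this kind can be produced with two successive stratified splits. The sketch below (Python, scikit-learn) assumes a hypothetical one-row-per-patient DataFrame mgh with a binary outcome column; it is not the authors' published code.

```python
from sklearn.model_selection import train_test_split

# First split off 60% for training, preserving outcome prevalence.
train, rest = train_test_split(
    mgh, train_size=0.60, stratify=mgh["outcome"], random_state=42
)
# Split the remaining 40% in half, again stratified, yielding 20%/20%.
valid, test = train_test_split(
    rest, train_size=0.50, stratify=rest["outcome"], random_state=42
)
```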

DL Model Training

DL models were trained using the internal training set to identify the primary outcome, and the model providing the best discriminatory power on the validation set was used on the test sets. Models were trained in the Python environment (version 3.7.12) using the fastai library (version 2.5.3), which is built on top of PyTorch (version 1.10.0) (1719). DL models were calibrated using Platt scaling (20). All code used to train the models and the final trained best model are open source (https://github.com/martonkolossvary/DL_2D) (commit Hash: d7abd8). Detailed descriptions of the image processing, training, and calibration processes, as well as the models used, are presented in Appendix S1 and Table S1.
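
As a rough sketch of this setup (not a substitute for the open-source repository), training a DenseNet-121 classifier with fastai and calibrating its outputs with Platt scaling might look as follows. The DataFrame df, its columns, the input size, and the epoch count are illustrative assumptions.

```python
from fastai.vision.all import *             # fastai v2 high-level API
from torchvision.models import densenet121
from sklearn.linear_model import LogisticRegression

# df: hypothetical DataFrame with columns "path" (radiograph file),
# "outcome" (composite end point), and "is_valid" (validation flag).
dls = ImageDataLoaders.from_df(
    df, fn_col="path", label_col="outcome", valid_col="is_valid",
    item_tfms=Resize(320),                  # input size is an assumption
    batch_tfms=aug_transforms(),            # data augmentation, per the paper
)
learn = cnn_learner(dls, densenet121, metrics=RocAucBinary())
learn.fine_tune(10)                         # epoch count is illustrative

# Platt scaling (reference 20): fit a logistic regression that maps the
# raw validation-set outputs to calibrated probabilities.
probs, targets = learn.get_preds()          # validation-set predictions
scores = probs[:, 1].numpy().reshape(-1, 1)
platt = LogisticRegression().fit(scores, targets.numpy())
calibrated = platt.predict_proba(scores)[:, 1]
```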

Prediction models.—To evaluate the additive value of DL predictions, we built the following nested logistic regression models on the training set: model 1, age + sex; model 2, model 1 + biomarker positivity; model 3, model 2 + DL probability predictions from chest radiographs.
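
A minimal sketch of these nested models, assuming a training DataFrame with hypothetical columns age, sex, biomarker_pos (troponin or d-dimer positivity), and dl_prob (the calibrated DL probability):

```python
import statsmodels.formula.api as smf

# Nested logistic regressions fit on the training set; each model adds
# one block of predictors to the previous one.
model1 = smf.logit("outcome ~ age + sex", data=train).fit()
model2 = smf.logit("outcome ~ age + sex + biomarker_pos", data=train).fit()
model3 = smf.logit("outcome ~ age + sex + biomarker_pos + dl_prob",
                   data=train).fit()
```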

Secondary analyses.—First, as DL models typically do not generalize well to external sets, we wished to assess whether fine-tuning of our trained model using images from the BWH external set may improve model accuracy. Details are presented in Appendix S1.

Second, we evaluated the DL model performance for each outcome (ACS, PE, aortic dissection, and death) separately.

Third, we evaluated the predictive performance in the following subgroups: women and men; younger (age < median age of given cohort) and older (age ≥ median age of given cohort) patients; and self-designated Black, Hispanic, and non-Hispanic White patients.

Fourth, to simulate the clinical value of model use, we calculated reclassification tables comparing the performance of model 2 with that of model 3. We used a sensitivity of 99%, which is commonly considered the threshold of acceptable risk in the ED (21), to calculate the probability threshold on the training data for each model and classified the patients as cardiovascular or pulmonary testing deferrable (risk below threshold) and cardiovascular or pulmonary testing required for differential diagnosis of ACP syndrome (risk above the threshold).
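
The threshold derivation can be illustrated as follows: on the training set, choose the highest probability cutoff whose sensitivity is still at least 99%, then apply that cutoff to the test set. The sketch below uses scikit-learn's roc_curve; variable names are illustrative, and this is not the authors' exact code.

```python
import numpy as np
from sklearn.metrics import roc_curve

# y_train/p_train: outcomes and model probabilities on the training set;
# p_test: probabilities on a test set.
fpr, tpr, thresholds = roc_curve(y_train, p_train)
idx = np.argmax(tpr >= 0.99)       # first (highest) threshold reaching 99% sensitivity
threshold = thresholds[idx]

deferrable = p_test < threshold    # risk below threshold: testing deferrable
required = ~deferrable             # risk at or above threshold: testing required
```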

Gradient-weighted class activation maps.—To better understand which features were identified by the DL model, we created gradient-weighted class activation maps of the images. Details are presented in Appendix S1.
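
For readers unfamiliar with the technique, a minimal Grad-CAM implementation for a DenseNet-121 classifier in PyTorch is sketched below. It follows the standard method (gradients of the class logit weighting the last convolutional feature map) rather than reproducing the authors' exact Appendix S1 code.

```python
import torch
import torch.nn.functional as F
from torchvision.models import densenet121

model = densenet121(num_classes=2).eval()
acts, grads = {}, {}

# Capture the last convolutional feature map (output of features.norm5)
# and its gradient during the backward pass.
layer = model.features.norm5
layer.register_forward_hook(lambda m, i, o: acts.update(feat=o.detach()))
layer.register_full_backward_hook(lambda m, gi, go: grads.update(feat=go[0].detach()))

def grad_cam(x: torch.Tensor) -> torch.Tensor:
    """Return a normalized [H, W] activation map for the positive class."""
    logits = model(x)                                       # x: [1, 3, H, W]
    model.zero_grad()
    logits[0, 1].backward()                                 # positive-class logit
    weights = grads["feat"].mean(dim=(2, 3), keepdim=True)  # per-channel weights
    cam = F.relu((weights * acts["feat"]).sum(dim=1, keepdim=True))
    cam = F.interpolate(cam, size=x.shape[2:],
                        mode="bilinear", align_corners=False)
    return (cam / cam.max()).squeeze()                      # scale to [0, 1]
```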

Statistical analyses.—Continuous variables were compared using the Student t test or analysis of variance, as appropriate, while categorical variables were compared using the χ2 test. To evaluate the performance of our trained models, we used area under the receiver operating characteristic curve (AUC) values. AUC values for each model were compared using the DeLong method. We calculated the net reclassification improvement index to assess the degree of reclassification of the models. All DL model building and statistical calculations were performed (M.K., 6 years of experience) in the R environment using the pROC (version 1.18.0) and PredictABEL (version 1.2-4) packages (22,23). All results are reported in compliance with the Standards for Reporting of Diagnostic Accuracy, or STARD, guidelines (24). P < .05 indicated a significant difference. All code used for data processing and statistical analysis is available online (https://github.com/martonkolossvary/CXR_ED) (commit hash: 6bf2bdd).
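
The authors compared AUCs with the DeLong method in R (pROC). As an illustrative stand-in in Python, where no standard-library DeLong implementation exists, a paired patient-level bootstrap yields a comparable two-sided P value for the difference between two models' AUCs:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def bootstrap_auc_diff(y, p_a, p_b, n_boot=2000, seed=0):
    """Paired bootstrap of the AUC difference between two models scored
    on the same patients (y, p_a, p_b: numpy arrays)."""
    rng = np.random.default_rng(seed)
    diffs = []
    n = len(y)
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)         # resample patients with replacement
        if np.unique(y[idx]).size < 2:      # AUC needs both classes present
            continue
        diffs.append(roc_auc_score(y[idx], p_a[idx]) -
                     roc_auc_score(y[idx], p_b[idx]))
    diffs = np.asarray(diffs)
    # Two-sided P value: how often the resampled difference crosses zero.
    p_value = 2 * min((diffs <= 0).mean(), (diffs >= 0).mean())
    return diffs.mean(), p_value
```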

Results

Patient Characteristics

A total of 28 755 MGH and 22 764 BWH ED registrations fulfilling the inclusion criteria were identified. Age (mean, 59 years ± 17 [SD] vs 57 years ± 17; P < .001) and sex (12 459 women vs 11 470 women, P < .001) differed between the MGH and BWH cohorts, respectively. Further differences were found regarding race, ethnicity, chest radiograph characteristics, and troponin I or T and d-dimer positivity rates. The prevalence of the composite 30-day outcome was 16% (4542 of 28 755) and 17% (3835 of 22 764) (P = .001) for the internal and external sets, respectively. Detailed results are presented in Table 1.

Table 1: Patient Characteristics

Table 1:

Patient Allocation and Training of the Deep Learning Model

Unique individuals in the MGH set were divided into groups of 17 254 (60%), 5751 (20%), and 5750 (20%) patients for training, validation, and internal testing of the DL model (Table S2). The BWH set (n = 22 764) was used for external testing of the model.

The DenseNet-121 model with increased levels of data augmentation yielded the best performance on the validation set and was therefore used for all further analyses on the test sets. Detailed results for all DL models are reported in Table S3. The DL model was well calibrated, with the closest agreement between predicted and observed event percentages in patients with lower predicted risk (Fig S1).

Performance on Internal and External Test Sets

On the internal test set, model 1 (age + sex) had an AUC of 0.62 (95% CI: 0.60, 0.64), which was significantly improved by adding serum biomarker information (model 2, age + sex + biomarkers; AUC, 0.76; 95% CI: 0.74, 0.77; P < .001). Adding DL predictions further improved the discriminatory power (model 3, age + sex + biomarkers + DL predictions; AUC = 0.85; 95% CI: 0.84, 0.86; P < .001).

On the external test set, models 1 and 2 maintained their performance (AUC = 0.64 [95% CI: 0.64, 0.65] vs 0.76 [95% CI: 0.75, 0.76], respectively; P < .001). However, the DL model had an AUC of only 0.67 (95% CI: 0.66, 0.68), resulting in a modest improvement in discriminatory power of model 3 (AUC = 0.77 [95% CI: 0.77, 0.78]) compared with model 2 (P < .001). Receiver operating characteristic curves are presented in Figure 1.

Figure 1: Diagnostic performance of the models to predict the composite end point within 30 days of emergency department (ED) registration. Model 1, age + sex; model 2, model 1 + serum biomarker positivity; model 3, model 2 + deep learning (DL) predictions from chest radiographs; model DL, DL predictions from chest radiographs. All statistical comparisons between area under the receiver operating characteristic curve values of models 1–3 were significant (P < .001). Data in brackets are 95% CIs. Composite end point was defined as follows: acute coronary syndrome or pulmonary embolism or aortic dissection or all-cause mortality within 30 days of ED registration.

Model Fine-tuning on the External Set

Patient characteristics of the external training, validation, and testing sets are presented in Table S4. After fine-tuning of the DL model, diagnostic performance in the test set improved compared with the non–fine-tuned model (AUC = 0.74 [95% CI: 0.72, 0.76] vs 0.67 [95% CI: 0.65, 0.69]; P < .001). Fine-tuning also resulted in model 3 having a higher AUC compared with the original model using DL predictions without fine-tuning (AUC = 0.81 [95% CI: 0.80, 0.83] vs 0.78 [95% CI: 0.76, 0.80]; P < .001). Furthermore, the discriminatory power of the fine-tuned model on the external test set was similar to that of the internal test set (AUC = 0.81 [95% CI: 0.80, 0.83] vs 0.85 [95% CI: 0.84, 0.86]).

Diagnostic Accuracy in Identifying Specific Clinical Outcomes

On the internal test set, our DL model trained to identify the composite end point had excellent discriminatory power to identify 30-day all-cause mortality (AUC = 0.87; 95% CI: 0.85, 0.89) and aortic dissection (AUC = 0.86; 95% CI: 0.82, 0.90), good performance in identifying ACS (AUC = 0.78; 95% CI: 0.76, 0.80), and moderate performance in identifying PE (AUC = 0.66; 95% CI: 0.63, 0.69). Adding DL predictions from the chest radiographs to the clinical variables always improved AUC values (P < .001 for all models). Similarly, a decrease in discriminatory power of the DL model was observed in the external test set for all individual outcomes, which improved after fine-tuning (P < .001 for all). As a result, model 3 outperformed model 2 in the external test set for each end point except ACS (AUC = 0.87 [95% CI: 0.86, 0.89] vs 0.88 [95% CI: 0.87, 0.89]; P = .15). Detailed results are presented in Figure 2.

Figure 2: Diagnostic performance of the models in predicting specific 30-day clinical end points. Data are area under the receiver operating characteristic curve, with the 95% CI in brackets. Model 1, age + sex; model 2, model 1 + serum biomarker positivity; model 3, model 2 + deep learning (DL) predictions from chest radiographs; model DL, DL predictions from chest radiographs.

Diagnostic Accuracy in Patient Subgroups

The DL model maintained its discriminatory power in all age (Fig S2), sex (Fig S3), and race or ethnicity (Fig 3) subgroups.

Figure 3: Diagnostic performance of the models in predicting the 30-day composite end point stratified by race and ethnicity. Data are area under the receiver operating characteristic curve (AUC), with the 95% CI in brackets. Model 1, age + sex; model 2, model 1 + serum biomarker positivity; model 3, model 2 + deep learning (DL) predictions from chest radiographs; model DL, DL predictions from chest radiographs. All statistical comparisons between the AUC values of models 1–3 were significant (P < .001). Composite end point was defined as acute coronary syndrome, pulmonary embolism, aortic dissection, or all-cause mortality within 30 days of emergency department registration.

Clinical Use of DL to Defer Additional Cardiovascular Testing

Using a sensitivity threshold of 99% defined using the training set, model 2, which incorporates age, sex, and serum biomarker positivity, identified only 2% (98 of 5750) of patients in the internal test set in whom additional cardiovascular or pulmonary testing for differential diagnosis of ACP syndrome may be deferred. Model 3, which additionally incorporates DL, deferred additional cardiovascular testing in approximately seven times as many patients (14%; 813 of 5750; P < .001). On the external test set, model 2 was able to defer 3% of all patients (565 of 22 764), while model 3 deferred 6% of all patients (1269 of 22 764, P < .001). This ratio significantly improved after fine-tuning our DL model (2% [102 of 4552] for model 2 vs 8% [383 of 4552] for model 3; P < .001). Detailed results are presented in Table 2. Confusion matrices using the 99% sensitivity threshold on the different test sets for models 2 and 3 can be found in Tables S5 and S6, respectively.

Table 2: Number of Patients Deferrable from Additional Cardiovascular or Pulmonary Testing at a 99% Sensitivity Rate

Table 2:

Activation Heat Maps from Chest Radiographs

Gradient-weighted class activation maps show that our model was activated mostly by areas of the heart and lungs. Activations were not observed over the left- or right-side markers or other text markings on the radiograph. Fine-tuning resulted in more relevant areas contributing to model predictions. Representative examples are shown in Figures 4 and S4.

Figure 4: Gradient-weighted class activation maps of representative chest radiographs in (A) an 85-year-old man with acute coronary syndrome (ACS), (B) a 77-year-old man with aortic dissection (AD), (C) a 39-year-old healthy man, and (D) a 27-year-old healthy woman. The maps show which parts of the images influenced deep learning (DL) model predictions for the composite outcome. The color gradient shows the level of activation from a given area, where red indicates the highest activation, blue indicates the lowest activation, and no color indicates no activation. Areas of the heart and lungs contributed most to model predictions. Fine-tuning improved the diagnostic accuracy of our DL model and resulted in more relevant areas contributing to predictions. ACS = acute coronary syndrome, CTA = coronary CT angiography, ICA = invasive coronary angiography, SPECT = single-photon emission CT.

Discussion

In our proof-of-concept study, we developed an open-source deep learning (DL) model to identify patients with acute chest pain syndrome at risk for 30-day acute coronary syndrome, pulmonary embolism, aortic dissection, or all-cause mortality based on a chest radiograph. The DL tool improved prediction of these adverse outcomes beyond age, sex, and conventional troponin or d-dimer positivity (internal test set: AUC = 0.85 vs 0.76; P < .001). Furthermore, our DL model maintained its diagnostic accuracy across age, sex, ethnicity, and racial groups. By using a 99% sensitivity threshold, our DL model was able to defer additional cardiovascular or pulmonary testing in 14% of individuals as compared with 2% of individuals (P < .001) when using a model incorporating only age, sex, and biomarker data. All code and the fitted model are publicly available (https://github.com/martonkolossvary/DL_2D) (commit hash: d7abd8).

DL analysis of chest radiographs can identify clinically relevant abnormalities, mimicking the work of clinicians (10–12), predict long-term mortality of patients (13), identify smokers at high risk who would benefit from lung screening CT (25), and predict the biologic age of individuals (14). Chest radiography is very common in patients presenting to the emergency department with ACP syndrome (15), offering an opportunity for artificial intelligence–based technologies to aid clinical decision making.

Our results show that DL can help triage patients in the ED by providing a short-term risk estimate of adverse clinical outcomes. Furthermore, as opposed to commonly used chest pain scores (26–28), which provide an estimate only for ACS, our approach considers all three life-threatening conditions, therefore providing a more comprehensive risk assessment. Compared with other chest radiography–based DL models, which identify specific abnormalities, our model provides individualized risk assessment for ACP syndrome, which may allow deferral of additional cardiovascular or pulmonary tests in four to eight times as many individuals, potentially aiding rapid triage of patients in the ED.

Importantly, our results emphasize the difficulty of creating a generalizable DL model. The performance of our DL model was significantly lower when applied to data from a different hospital. This phenomenon is well known in DL research (29) and precludes the safe deployment of developed DL algorithms without extensive validation on new data sets (30). Fine-tuning our model improved diagnostic accuracy on the remaining images of the external set. These results reinforce that the development of one DL algorithm that works equally well in all settings is challenging. Prior to implementation at a new hospital, any DL model should be tested in that setting to confirm that the performance meets expectations and, if necessary, should be fine-tuned for data at that hospital. Further research is needed to define the minimal number of images needed for effective fine-tuning of DL models, as labeling several thousand new images is not practical (31).

Analysis of individual outcomes showed that the DL model had the best diagnostic accuracy in identifying all-cause mortality. This is in line with findings of previous studies showing the ability of DL to predict long-term mortality (13). The model had the least additive value in predicting ACS. This may be because the International Classification of Diseases coding of ACS is largely based on troponin positivity and models incorporating age, sex, and troponin have been shown to have excellent diagnostic accuracy to identify myocardial infarction (32). Interestingly, all models had the worst performance for PE detection and could not identify these patients with high specificity.

Several concerns have been raised regarding the bias of DL models toward specific age, sex, ethnic, and racial groups (33). These biases are mostly a result of the limited electronic health record data available from underrepresented patient groups and the differences in disease prevalence between groups. However, our secondary analyses show that our DL model had similar diagnostic accuracy across different age, sex, racial, and ethnic groups.

Importantly, gradient-weighted class activation maps showed that the most important parts of the chest radiograph for predicting the composite outcome were the lungs and heart. While many patients who present with symptoms of ACP syndrome will have external devices (eg, electrocardiography electrodes and wires), the model seemed to instead extract information from regions (heart and lungs) that correlate with our outcomes. Furthermore, more relevant clinical areas on the chest radiographs contributed to predictions after we fine-tuned our model, which may explain the improvement in discriminatory power. Nevertheless, regarding explainability, DL models are still a black box. While we might be able to identify which areas are used to generate the predictions, we still do not understand what information the model is using to make its decisions. Future research on model interpretability is crucial, as the lack of explainability may hinder clinical adoption.

Our study had several limitations. First, our study was a retrospective analysis of patient data available from electronic health records from university teaching hospitals, which introduces a selection bias. Second, as the leading symptom and initial triage information were not registered in the electronic health records, we identified patients using complex inclusion criteria focusing on individuals in whom additional cardiovascular or pulmonary tests may be deferred. However, this may have included individuals who did not have ACP or whose symptoms were definitive of diagnosis. It also precluded us from incorporating information regarding symptoms into our model, which could have potentially altered our findings. Third, most of these patients were non-Hispanic White individuals. Fourth, we did not have detailed patient-level care data; therefore, further clinical benefit analyses (eg, length of hospital stay) were not possible. Fifth, our results cover the era before the introduction of high-sensitivity troponin testing at these hospitals. Sixth, we analyzed only individuals who underwent additional cardiovascular or pulmonary testing; therefore, the results reflect the subgroup of individuals with ACP in whom there was sufficient suspicion to order an additional test. Performance in triaging ACP in real clinical practice may vary.

In conclusion, our proof-of-concept study shows that deep learning evaluation of initial emergency department chest radiographs can help identify patients with acute chest pain (ACP) syndrome who are at risk for adverse outcomes. Our model and future deep learning models may enable rapid triage of patients with ACP syndrome based on their initial chest radiographs to identify those in whom additional imaging may be deferred.

Disclosures of conflicts of interest: M.K. No relevant relationships. V.K.R. No relevant relationships. J.T.N. No relevant relationships. U.H. No relevant relationships. M.T.L. Institution received funding from AstraZeneca, Kowa, and Johnson & Johnson Innovation.

Acknowledgments

Computational resources were provided by the Massachusetts General Hospital Cardiovascular Imaging Research Center and the Massachusetts Life Sciences Center.

Author Contributions

Author contributions: Guarantor of integrity of entire study, M.K.; study concepts/study design or data acquisition or data analysis/interpretation, all authors; manuscript drafting or manuscript revision for important intellectual content, all authors; approval of final version of submitted manuscript, all authors; agrees to ensure any questions related to the work are appropriately resolved, all authors; literature research, M.K., J.T.N., U.H., M.T.L.; clinical studies, M.K., M.T.L.; statistical analysis, M.K., V.K.R., U.H.; and manuscript editing, all authors.

M.K. supported by the National Institutes of Health (T32HL076136) and the Massachusetts General Hospital Ralph Schlaeger Fellowship. V.K.R. supported by the American Heart Association (career development award 935176). M.T.L. supported by the American Heart Association (award 810966).

References

  • 1. Cairns C, Kang K, Santo L. National Hospital Ambulatory Medical Care Survey: 2018 Emergency Department Summary Tables. https://www.cdc.gov/nchs/data/nhamcs/web_tables/2018-ed-web-tables-508.pdf Published 2018. Accessed June 25, 2022.
  • 2. Kohn MA, Kwan E, Gupta M, Tabas JA. Prevalence of acute myocardial infarction and other serious diagnoses in patients presenting to an urban emergency department with chest pain. J Emerg Med 2005;29(4):383–390.
  • 3. Lindsell CJ, Anantharaman V, Diercks D, et al. The Internet Tracking Registry of Acute Coronary Syndromes (i*trACS): a multicenter registry of patients with suspicion of acute coronary syndromes reported using the standardized reporting guidelines for emergency department chest pain studies. Ann Emerg Med 2006;48(6):666–677, 677.e1–677.e9.
  • 4. Hoffmann U, Truong QA, Schoenfeld DA, et al; ROMICAT-II Investigators. Coronary CT angiography versus standard evaluation in acute chest pain. N Engl J Med 2012;367(4):299–308.
  • 5. Litt HI, Gatsonis C, Snyder B, et al. CT angiography for safe discharge of patients with possible acute coronary syndromes. N Engl J Med 2012;366(15):1393–1403.
  • 6. Goldstein JA, Chinnaiyan KM, Abidov A, et al; CT-STAT Investigators. The CT-STAT (Coronary Computed Tomographic Angiography for Systematic Triage of Acute Chest Pain Patients to Treatment) trial. J Am Coll Cardiol 2011;58(14):1414–1422.
  • 7. Smulders MW, Kietselaer BL, Schalla S, et al. Acute chest pain in the high-sensitivity cardiac troponin era: A changing role for noninvasive imaging? Am Heart J 2016;177:102–111.
  • 8. Hoot NR, Aronsky D. Systematic review of emergency department crowding: causes, effects, and solutions. Ann Emerg Med 2008;52(2):126–136.
  • 9. Morley C, Unwin M, Peterson GM, Stankovich J, Kinsman L. Emergency department crowding: A systematic review of causes, consequences and solutions. PLoS One 2018;13(8):e0203316.
  • 10. Niehues SM, Adams LC, Gaudin RA, et al. Deep-Learning-Based Diagnosis of Bedside Chest X-ray in Intensive Care and Emergency Medicine. Invest Radiol 2021;56(8):525–534.
  • 11. Hwang EJ, Nam JG, Lim WH, et al. Deep Learning for Chest Radiograph Diagnosis in the Emergency Department. Radiology 2019;293(3):573–580.
  • 12. Kim JH, Han SG, Cho A, Shin HJ, Baek SE. Effect of deep learning-based assistive technology use on chest radiograph interpretation by emergency department physicians: a prospective interventional simulation-based study. BMC Med Inform Decis Mak 2021;21(1):311.
  • 13. Lu MT, Ivanov A, Mayrhofer T, Hosny A, Aerts HJWL, Hoffmann U. Deep Learning to Assess Long-term Mortality From Chest Radiographs. JAMA Netw Open 2019;2(7):e197416.
  • 14. Raghu VK, Weiss J, Hoffmann U, Aerts HJWL, Lu MT. Deep Learning to Estimate Biological Age From Chest Radiographs. JACC Cardiovasc Imaging 2021;14(11):2226–2236.
  • 15. Chung JH, Duszak R Jr, Hemingway J, Hughes DR, Rosenkrantz AB. Increasing Utilization of Chest Imaging in US Emergency Departments From 1994 to 2015. J Am Coll Radiol 2019;16(5):674–682.
  • 16. Nalichowski R, Keogh D, Chueh HC, Murphy SN. Calculating the benefits of a Research Patient Data Repository. AMIA Annu Symp Proc 2006;2006:1044.
  • 17. Van Rossum G, Drake FL Jr. Python reference manual. Centrum voor Wiskunde en Informatica Amsterdam; 1995.
  • 18. Howard J, Gugger S. Fastai: a layered API for deep learning. Information (Basel) 2020;11(2):108.
  • 19. Paszke A, Gross S, Massa F, et al. PyTorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems 2019; 32. https://papers.nips.cc/paper/2019/hash/bdbca288fee7f92f2bfa9f7012727740-Abstract.html.
  • 20. Platt J. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Adv Large Margin Classifiers 1999;10(3):61–74.
  • 21. Than M, Herbert M, Flaws D, et al. What is an acceptable risk of major adverse cardiac event in chest pain patients soon after discharge from the Emergency Department?: a clinical survey. Int J Cardiol 2013;166(3):752–754.
  • 22. R Core Team. R: A language and environment for statistical computing. 4th ed. R Foundation for Statistical Computing; 2019.
  • 23. Robin X, Turck N, Hainard A, et al. pROC: an open-source package for R and S+ to analyze and compare ROC curves. BMC Bioinformatics 2011;12(1):77.
  • 24. Cohen JF, Korevaar DA, Altman DG, et al. STARD 2015 guidelines for reporting diagnostic accuracy studies: explanation and elaboration. BMJ Open 2016;6(11):e012799.
  • 25. Lu MT, Raghu VK, Mayrhofer T, Aerts HJWL, Hoffmann U. Deep Learning Using Chest Radiographs to Identify High-Risk Smokers for Lung Cancer Screening Computed Tomography: Development and Validation of a Prediction Model. Ann Intern Med 2020;173(9):704–713.
  • 26. Challen K, Goodacre SW. Predictive scoring in non-trauma emergency patients: a scoping review. Emerg Med J 2011;28(10):827–837.
  • 27. Gibbs J, deFilippi C, Peacock F, et al. The utility of risk scores when evaluating for acute myocardial infarction using high-sensitivity cardiac troponin I. Am Heart J 2020;227:1–8.
  • 28. Laureano-Phillips J, Robinson RD, Aryal S, et al. HEART Score Risk Stratification of Low-Risk Chest Pain Patients in the Emergency Department: A Systematic Review and Meta-Analysis. Ann Emerg Med 2019;74(2):187–203.
  • 29. Zech JR, Badgeley MA, Liu M, Costa AB, Titano JJ, Oermann EK. Variable generalization performance of a deep learning model to detect pneumonia in chest radiographs: A cross-sectional study. PLoS Med 2018;15(11):e1002683.
  • 30. Oakden-Rayner L, Gale W, Bonham TA, et al. Validation and algorithmic audit of a deep learning system for the detection of proximal femoral fractures in patients in the emergency department: a diagnostic accuracy study. Lancet Digit Health 2022;4(5):e351–e358.
  • 31. Soekhoe D, van der Putten P, Plaat A. On the Impact of Data Set Size in Transfer Learning Using Deep Neural Networks. In: Boström H, Knobbe A, Soares C, Papapetrou P, eds. Advances in Intelligent Data Analysis XV. IDA 2016. Lecture Notes in Computer Science, vol 9897. Springer; 2016; 50–60.
  • 32. Doudesis D, Lee KK, Yang J, et al; High-STEACS Investigators. Validation of the myocardial-ischaemic-injury-index machine learning algorithm to guide the diagnosis of myocardial infarction in a heterogenous population: a prespecified exploratory analysis. Lancet Digit Health 2022;4(5):e300–e308.
  • 33. Gianfrancesco MA, Tamang S, Yazdany J, Schmajuk G. Potential Biases in Machine Learning Algorithms Using Electronic Health Record Data. JAMA Intern Med 2018;178(11):1544–1547.

Article History

Received: Aug 1 2022
Revision requested: Sept 28 2022
Revision received: Oct 23 2022
Accepted: Nov 14 2022
Published online: Jan 17 2023