The Treasure Trove Hidden in Plain Sight: The Utility of GPT-4 in Chest Radiograph Evaluation
Abstract
OpenAI’s GPT-4 with Advanced Data Analysis autonomously performed basic statistical analyses and built machine learning–based models to evaluate a large chest radiography dataset; these models performed similarly to their manually crafted counterparts.
Background
Limited statistical knowledge can slow radiologists’ critical engagement with, and adoption of, artificial intelligence (AI) tools. Large language models (LLMs) such as OpenAI’s GPT-4, particularly with its Advanced Data Analysis (ADA) extension, may improve the adoption of AI in radiology.
Purpose
To validate GPT-4 ADA outputs when autonomously conducting analyses of varying complexity on a multisource clinical dataset.
Materials and Methods
In this retrospective study, unique itemized radiologic reports of bedside chest radiographs, associated demographic data, and laboratory markers of inflammation from patients in intensive care from January 2009 to December 2019 were evaluated. GPT-4 ADA, accessed between December 2023 and January 2024, was tasked with autonomously analyzing this dataset by plotting radiography usage rates, providing descriptive statistical measures, quantifying factors associated with pulmonary opacities, and setting up machine learning (ML) models to predict their presence. Three scientists with 6–10 years of ML experience validated the outputs by verifying the methodology, assessing coding quality, re-executing the provided code, and comparing ML models head-to-head with their human-developed counterparts (based on the area under the receiver operating characteristic curve [AUC], accuracy, sensitivity, and specificity). Statistical significance was evaluated using bootstrapping.
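The abstract does not specify the exact bootstrapping procedure, so the following is only a minimal sketch of one common approach: a paired bootstrap of the AUC difference between two models scored on the same test cases. The function names (`auc`, `paired_bootstrap_auc`), the resample count, and the synthetic data are illustrative assumptions and are not taken from the study.

```python
import random

def auc(labels, scores):
    """Rank-based AUC: probability that a positive case outscores a negative one (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

def paired_bootstrap_auc(labels, scores_a, scores_b, n_boot=300, seed=0):
    """Resample cases with replacement and compare both models on the same resample."""
    rng = random.Random(seed)
    n = len(labels)
    observed = auc(labels, scores_a) - auc(labels, scores_b)
    diffs = []
    while len(diffs) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        y = [labels[i] for i in idx]
        if len(set(y)) < 2:  # skip degenerate resamples with a single class
            continue
        a = [scores_a[i] for i in idx]
        b = [scores_b[i] for i in idx]
        diffs.append(auc(y, a) - auc(y, b))
    diffs.sort()
    ci = (diffs[int(0.025 * n_boot)], diffs[int(0.975 * n_boot) - 1])
    # Two-sided P value: twice the smaller share of resampled differences on either side of zero
    frac_le = sum(d <= 0 for d in diffs) / n_boot
    frac_ge = sum(d >= 0 for d in diffs) / n_boot
    p_value = min(1.0, 2 * min(frac_le, frac_ge))
    return observed, ci, p_value

# Synthetic demonstration data (hypothetical, not from the study)
demo_rng = random.Random(42)
labels = [demo_rng.randint(0, 1) for _ in range(120)]
scores_a = [y + demo_rng.gauss(0, 1) for y in labels]
scores_b = [y + demo_rng.gauss(0, 1) for y in labels]

observed, ci, p_value = paired_bootstrap_auc(labels, scores_a, scores_b)
print(f"AUC difference {observed:.3f}, 95% CI {ci[0]:.3f} to {ci[1]:.3f}, P = {p_value:.2f}")
```

Resampling the same case indexes for both models keeps the comparison paired, which is why a single index list is drawn per iteration rather than one per model.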
Results
A total of 43 788 radiograph reports, with their laboratory values, from 43 788 patients (mean age, 66 years ± 15 [SD]; 26 804 male) at University Hospital RWTH Aachen were evaluated. While GPT-4 ADA provided largely appropriate visualizations, descriptive statistical measures, quantitative statistical associations based on logistic regression, and gradient boosting machines for the predictive task (AUC, 0.75), some statistical errors and inaccuracies were encountered. ML strategies were valid and based on consistent coding routines, resulting in valid outputs on par with human specialist–developed reference models (AUC, 0.80 [95% CI: 0.80, 0.81] vs 0.80 [95% CI: 0.80, 0.81]; P = .51) (accuracy, 79% [6910 of 8758 patients] vs 78% [6875 of 8758 patients], respectively; P = .27).
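As a small worked check, the reported accuracy percentages can be recomputed from the counts given above. The sketch below also shows how accuracy, sensitivity, and specificity follow from a confusion matrix. The helper name `classification_metrics` and the confusion-matrix counts in the second example are hypothetical; the abstract reports only the accuracy counts.

```python
def classification_metrics(tp, fp, tn, fn):
    """Accuracy, sensitivity, and specificity from confusion-matrix counts."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),  # true-positive rate
        "specificity": tn / (tn + fp),  # true-negative rate
    }

# Accuracy counts as reported in the Results, in the order given there
reported = [(6910, 8758), (6875, 8758)]
accuracy_percent = [round(100 * correct / total) for correct, total in reported]
print(accuracy_percent)  # [79, 78]

# Hypothetical confusion matrix, for illustration only (not study data)
example = classification_metrics(tp=40, fp=10, tn=45, fn=5)
print({name: round(value, 3) for name, value in example.items()})
```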
Conclusion
LLMs may facilitate data analysis in radiology, from basic statistics to advanced ML-based predictive modeling.
© RSNA, 2024
Article History
Received: Dec 28 2023
Revision requested: Jan 18 2024
Revision received: Sept 6 2024
Accepted: Sept 9 2024
Published online: Nov 12 2024