GPT-4 in Radiology: Improvements in Advanced Reasoning
Abstract
Supplemental material is available for this article.
See also the article by Bhayana et al and the editorial by Lourenco et al in this issue.
Introduction
ChatGPT is a powerful neural network model that belongs to the generative pretrained transformer (GPT) family of large language models (LLMs). Despite being created primarily for humanlike conversations, ChatGPT has shown remarkable versatility and has the potential to revolutionize many industries. It was recently named the fastest-growing application in history (1). ChatGPT based on GPT-3.5 nearly passed a text-based radiology examination, performing well on knowledge recall but struggling with higher-order thinking (2). OpenAI’s latest LLM, GPT-4, was released in March 2023 in limited form to paid users alongside claims of enhanced advanced reasoning capabilities (3). GPT-4 demonstrated substantial improvements over GPT-3.5 on professional and academic benchmarks, including the Uniform Bar Examination (90th vs 10th percentile) and the U.S. Medical Licensing Examination (>30% improvement) (4,5).
Despite improved performance on general professional benchmarks, it remains uncertain whether GPT-4’s enhanced advanced reasoning capabilities translate to improved performance in radiology, where context-specific technical language is crucial. The purpose of this exploratory study was to evaluate the performance of GPT-4 on a radiology board–style examination without images and to compare it with that of GPT-3.5.
Materials and Methods
In this prospective study, the performance of GPT-4 was assessed on the same 150 multiple-choice text-based questions used to benchmark GPT-3.5, with the selection process and categorization described previously (2). Questions matched the style, content, and difficulty of the Canadian Royal College and American Board of Radiology examinations. GPT-4 performance was assessed overall, by question type, and by topic. GPT-4’s performance was compared with that previously reported for GPT-3.5 using the χ2 test (2). Confidence of language in responses was assessed using a Likert scale (1 = no confidence, 5 = high confidence) as described previously (2) and compared with that of GPT-3.5 using the Mann-Whitney U test.
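For readers who wish to reproduce this type of analysis, the sketch below illustrates the two comparisons described above. It is a minimal sketch only, assuming SciPy is used (the original software is not specified) and assuming no continuity correction; the counts and Likert ratings shown are hypothetical placeholders, not the study data.

```python
# Minimal sketch of the statistical comparisons described above, assuming SciPy.
# All counts and ratings below are hypothetical placeholders, not the study data.
from scipy.stats import chi2_contingency, mannwhitneyu

# Chi-square test comparing proportions of correctly answered questions:
# rows = model (GPT-4, GPT-3.5); columns = (correct, incorrect) counts.
accuracy_table = [
    [120, 30],  # hypothetical GPT-4 counts
    [100, 50],  # hypothetical GPT-3.5 counts
]
chi2, p_accuracy, dof, expected = chi2_contingency(accuracy_table, correction=False)

# Mann-Whitney U test comparing Likert confidence ratings
# (1 = no confidence, 5 = high confidence) assigned to each model's responses.
gpt4_confidence = [5, 5, 4, 5, 5]    # hypothetical ratings
gpt35_confidence = [4, 5, 4, 4, 5]   # hypothetical ratings
u_stat, p_confidence = mannwhitneyu(gpt4_confidence, gpt35_confidence, alternative="two-sided")

print(f"Accuracy: chi-square = {chi2:.2f}, P = {p_accuracy:.3f}")
print(f"Confidence: U = {u_stat:.1f}, P = {p_confidence:.3f}")
```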
Results
GPT-4 answered 81% of questions correctly (121 of 150), exceeding the passing threshold of 70% and outperforming GPT-3.5, which answered 69% of questions correctly (104 of 150) (P = .02). The Table shows the performance of GPT-4 and GPT-3.5, stratified by question type and topic.
GPT-4 performed better than GPT-3.5 on higher-order thinking questions (81% vs 60% [72 vs 53 of 89], respectively; P = .002), more specifically those involving description of imaging findings (85% vs 61% [39 vs 28 of 46], P = .009) and application of concepts (90% vs 30% [nine vs three of 10], P = .006) (Figure). GPT-4 showed no improvement over GPT-3.5 on lower-order questions (80% vs 84% [49 vs 51 of 61], respectively; P = .64). GPT-4 performed better than GPT-3.5 on physics (87% vs 40% [13 vs six of 15 questions]; P = .008). GPT-4 incorrectly responded to 12 questions that GPT-3.5 answered correctly. Nine of those 12 questions (75%) were lower-order questions. Example questions and responses from GPT-4 and GPT-3.5 are given in Figures S1–S5.
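As a rough consistency check, the overall and higher-order comparisons can be reconstructed from the counts reported above using 2 × 2 chi-square tests. The sketch below is illustrative only; it assumes a Pearson chi-square without continuity correction, since the exact computation is not specified.

```python
# Reconstructing the reported accuracy comparisons from the stated counts.
# Assumes a Pearson chi-square without continuity correction (not specified in Methods).
from scipy.stats import chi2_contingency

# Overall: GPT-4 answered 121 of 150 questions correctly vs 104 of 150 for GPT-3.5.
overall = [[121, 150 - 121], [104, 150 - 104]]
_, p_overall, _, _ = chi2_contingency(overall, correction=False)

# Higher-order thinking questions: 72 of 89 vs 53 of 89 correct.
higher_order = [[72, 89 - 72], [53, 89 - 53]]
_, p_higher_order, _, _ = chi2_contingency(higher_order, correction=False)

print(f"Overall: P = {p_overall:.3f}")            # approximately .02, consistent with the reported value
print(f"Higher-order: P = {p_higher_order:.3f}")  # approximately .002, consistent with the reported value
```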
GPT-4 expressed confidence or high confidence in all responses, including incorrect answers (100%, 29 of 29 questions). Confidence expressed in incorrect answers did not differ between GPT-4 and GPT-3.5 (4.9 vs 4.8, respectively; P = .40).
Discussion
Our study demonstrates an impressive improvement in the performance of ChatGPT in radiology over a short time period, highlighting the growing potential of LLMs in this context. GPT-4 passed a text-based radiology board–style examination, outperforming GPT-3.5 and exceeding the passing threshold by more than 10 percentage points, in line with improvements reported in other domains (4–6).
GPT-4’s marked improvement over GPT-3.5 on higher-order questions supports the claim that its enhanced advanced reasoning capabilities translate to a radiology context. Improved performance on questions involving descriptions of imaging findings and application of concepts suggests better contextual understanding of radiology-specific terminology, which is crucial for more advanced downstream applications in radiology such as generating accurate differential diagnoses.
GPT-4’s lack of improvement on lower-order questions, and its incorrect responses to lower-order questions that GPT-3.5 answered correctly, raise questions about GPT-4’s reliability for information gathering. GPT-4 still phrases inaccurate responses confidently. ChatGPT’s dangerous tendency to produce inaccurate responses, termed “hallucinations,” is less frequent in GPT-4 but still limits usability in medical education and practice at present.
Overall, the rapid advancement of these models is exciting. Applications built on GPT-4 with radiology-specific fine-tuning should be explored further.
Author Contributions
Author contributions: Guarantor of integrity of entire study, R.B.; study concepts/study design or data acquisition or data analysis/interpretation, all authors; manuscript drafting or manuscript revision for important intellectual content, all authors; approval of final version of submitted manuscript, all authors; agrees to ensure any questions related to the work are appropriately resolved, all authors; literature research, all authors; experimental studies, R.B., R.R.B.; statistical analysis, R.B.; and manuscript editing, all authors.
References
- 1. ChatGPT sets record for fastest-growing user base: analyst note. Reuters. https://www.reuters.com/technology/chatgpt-sets-record-fastest-growing-user-base-analyst-note-2023-02-01/. Published February 2, 2023. Accessed April 4, 2023.
- 2. Performance of ChatGPT on a radiology board-style examination: insights into current strengths and limitations. Radiology 2023;307(5):e230582.
- 3. GPT-4. https://openai.com/research/gpt-4. Accessed April 4, 2023.
- 4. Capabilities of GPT-4 on Medical Challenge Problems. https://www.microsoft.com/en-us/research/publication/2023/03/GPT-4_medical_benchmarks-641a308e45ba9.pdf. Published March 20, 2023. Accessed April 4, 2023.
- 5. GPT-4 Technical Report. http://arxiv.org/abs/2303.08774. Published March 15, 2023. Accessed April 4, 2023.
- 6. Artificial intelligence–based chatbot patient information on common retinal diseases using ChatGPT. Acta Ophthalmol 2023. https://doi.org/10.1111/aos.15661. Published online March 13, 2023.
Article History
Received: Apr 17 2023
Revision requested: Apr 24 2023
Revision received: Apr 24 2023
Accepted: Apr 26 2023
Published online: May 16 2023