Original Research

GPT-4 in Radiology: Improvements in Advanced Reasoning

Published Online: https://doi.org/10.1148/radiol.230987

Abstract

Supplemental material is available for this article.

See also the article by Bhayana et al and the editorial by Lourenco et al in this issue.

Introduction

ChatGPT is a powerful neural network model that belongs to the generative pretrained transformer (GPT) family of large language models (LLMs). Despite being created primarily for humanlike conversation, ChatGPT has shown remarkable versatility and has the potential to revolutionize many industries. It was recently named the fastest-growing application in history (1). ChatGPT based on GPT-3.5 nearly passed a text-based radiology examination, performing well on knowledge recall but struggling with higher-order thinking (2). OpenAI’s latest LLM, GPT-4, was released in March 2023 in limited form to paid users, alongside claims of enhanced advanced reasoning capabilities (3). GPT-4 demonstrated remarkable improvements over GPT-3.5 on professional and academic benchmarks, including the Uniform Bar Examination (90th vs 10th percentile) and the U.S. Medical Licensing Examination (>30% improvement) (4,5).

Despite improved performance on various general professional benchmarks, whether GPT-4’s enhanced advanced reasoning capabilities translate to improved performance in radiology, where the context of specific technical language is crucial, remains uncertain. The purpose of this exploratory study was to evaluate the performance of GPT-4 on a radiology board–style examination without images and compare it with that of GPT-3.5.

Materials and Methods

In this prospective study, the performance of GPT-4 was assessed on the same 150 multiple-choice text-based questions used to benchmark GPT-3.5, with the selection process and categorization described previously (2). Questions matched the style, content, and difficulty of the Canadian Royal College and American Board of Radiology examinations. GPT-4 performance was assessed overall, by question type, and by topic. GPT-4’s performance was compared with that previously reported for GPT-3.5 using the χ2 test (2). Confidence of language in responses was assessed using a Likert scale (1 = no confidence, 5 = high confidence) as described previously (2) and compared with that of GPT-3.5 using the Mann-Whitney U test.
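To make the statistical comparison concrete, the minimal Python sketch below reproduces the overall χ2 comparison using the counts reported in the Results (121 of 150 correct for GPT-4 vs 104 of 150 for GPT-3.5). The use of scipy.stats, the variable names, and the omission of the Yates continuity correction are illustrative assumptions; the software used for the original analysis is not specified here.

    # Sketch of the chi-square comparison of overall accuracy (illustrative only).
    from scipy.stats import chi2_contingency

    gpt4_correct, gpt4_total = 121, 150    # GPT-4 results reported below
    gpt35_correct, gpt35_total = 104, 150  # GPT-3.5 results reported previously (2)

    # 2 x 2 contingency table: rows = model, columns = correct vs incorrect
    table = [
        [gpt4_correct, gpt4_total - gpt4_correct],
        [gpt35_correct, gpt35_total - gpt35_correct],
    ]

    # correction=False omits the Yates continuity correction (an assumption);
    # the resulting P value is approximately .02, consistent with the reported value.
    chi2, p_value, dof, expected = chi2_contingency(table, correction=False)
    print(f"chi2 = {chi2:.2f}, P = {p_value:.3f}")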

Results

GPT-4 answered 81% of questions correctly (121 of 150), exceeding the passing threshold of 70% and outperforming GPT-3.5, which answered 69% of questions correctly (104 of 150) (P = .02). The Table shows the performance of GPT-4 and GPT-3.5, stratified by question type and topic.

Table: Performance of GPT-4 and GPT-3.5 on Radiology Board–style Multiple-Choice Questions without Images, Stratified by Question Type and Topic

GPT-4 performed better than GPT-3.5 on higher-order thinking questions (81% vs 60% [72 vs 53 of 89], respectively; P = .002), specifically those involving description of imaging findings (85% vs 61% [39 vs 28 of 46], P = .009) and application of concepts (90% vs 30% [9 vs 3 of 10], P = .006) (Figure). GPT-4 showed no improvement over GPT-3.5 on lower-order questions (80% vs 84% [49 vs 51 of 61], respectively; P = .64). GPT-4 performed better than GPT-3.5 on physics questions (87% vs 40% [13 vs 6 of 15]; P = .008). GPT-4 responded incorrectly to 12 questions that GPT-3.5 answered correctly; nine of these 12 (75%) were lower-order questions. Example questions and responses from GPT-4 and GPT-3.5 are given in Figures S1–S5.

Figure: Examples of ChatGPT’s response to a higher-order thinking question (top) involving calculation of absolute washout in an adrenal nodule. The correct answer is 70% (option D). GPT-3.5 (green icon, middle) included an inaccurate absolute washout formula; the subsequent calculation and answer (option A) were incorrect. GPT-4 (black icon, bottom) included the correct formula and calculated the absolute washout correctly as 70% (option D).
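For readers unfamiliar with the calculation shown in the figure, absolute washout on adrenal CT is conventionally computed as (enhanced HU − delayed HU) / (enhanced HU − unenhanced HU) × 100. The short Python sketch below applies this standard formula; the attenuation values are hypothetical, chosen only so that the result matches the 70% answer in the example, and are not taken from the examination question.

    # Standard absolute adrenal washout formula (not the examination question's actual values).
    def absolute_washout(unenhanced_hu: float, enhanced_hu: float, delayed_hu: float) -> float:
        """Absolute washout (%) = (enhanced - delayed) / (enhanced - unenhanced) * 100."""
        return (enhanced_hu - delayed_hu) / (enhanced_hu - unenhanced_hu) * 100

    # Hypothetical attenuation values chosen to reproduce the 70% answer (option D).
    print(absolute_washout(unenhanced_hu=10, enhanced_hu=80, delayed_hu=31))  # 70.0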

GPT-4 expressed confidence or high confidence in all responses, including incorrect answers (100%, 29 of 29 questions). Confidence expressed in incorrect answers did not differ between GPT-4 and GPT-3.5 (4.9 vs 4.8 on the 5-point Likert scale, respectively; P = .40).

Discussion

Our study demonstrates an impressive improvement in the performance of ChatGPT in radiology over a short period, highlighting the growing potential of LLMs in this context. GPT-4 passed a text-based radiology board–style examination, outperforming GPT-3.5 and exceeding the passing threshold by more than 10%, in line with its improvements in other domains (4–6).

GPT-4’s marked improvement over GPT-3.5 on higher-order questions supports the claim that its enhanced advanced reasoning capabilities translate to a radiology context. Improved performance on questions involving description of imaging findings and application of concepts suggests better contextual understanding of radiology-specific terminology, which is crucial for more advanced downstream applications in radiology, such as generating accurate differential diagnoses.

GPT-4’s lack of improvement on lower-order questions, and its incorrect responses to lower-order questions that GPT-3.5 answered correctly, raise concerns about its reliability for information gathering. GPT-4 still phrases inaccurate responses with confident language. ChatGPT’s dangerous tendency to produce inaccurate responses, termed “hallucinations,” is less frequent in GPT-4 but still limits usability in medical education and practice at present.

Overall, the rapid advancement of these models is exciting. Applications built on GPT-4 with radiology-specific fine-tuning should be explored further.

Disclosures of conflicts of interest: R.B. No relevant relationships. R.R.B. No relevant relationships. S.K. No relevant relationships.

Author Contributions

Author contributions: Guarantor of integrity of entire study, R.B.; study concepts/study design or data acquisition or data analysis/interpretation, all authors; manuscript drafting or manuscript revision for important intellectual content, all authors; approval of final version of submitted manuscript, all authors; agrees to ensure any questions related to the work are appropriately resolved, all authors; literature research, all authors; experimental studies, R.B., R.R.B.; statistical analysis, R.B.; and manuscript editing, all authors.

References

Article History

Received: Apr 17 2023
Revision requested: Apr 24 2023
Revision received: Apr 24 2023
Accepted: Apr 26 2023
Published online: May 16 2023