Original Research

Diagnostic Performance of ChatGPT from Patient History and Imaging Findings on the Diagnosis Please Quizzes

Published Online:https://doi.org/10.1148/radiol.231040

Introduction

Advances in artificial intelligence, particularly in large language models such as ChatGPT, which is based on the GPT-4 architecture (1), have shown remarkable capabilities in diverse areas (2,3). However, the potential of ChatGPT in radiology remains underexplored. This study investigated the ability of ChatGPT to solve diagnostic quizzes from the journal Radiology to assess its potential as a diagnostic aid and decision support system.

Materials and Methods

This study evaluated the diagnostic ability of GPT-4–based ChatGPT (4) using patient history and imaging findings from the educational Diagnosis Please quizzes in Radiology. In each quiz, the clinical history and images are given in the case presentation, and the imaging findings and diagnosis are provided in the subsequent case commentary. Because ChatGPT cannot process images, it was given the written imaging findings instead. Since this study relied on published articles, no ethical approval was needed. The study was designed in accordance with the Standards for Reporting of Diagnostic Accuracy Studies guidelines (5). A study overview is presented in the Figure.

Study overview. This study evaluates the usefulness of GPT-4–based ChatGPT in radiology using patient history and image findings in Diagnosis Please quizzes from 1998 to 2023. With a 54% (170 of 313) overall accuracy, ChatGPT shows potential as a valuable diagnostic tool in radiology.

We consecutively collected 313 Diagnosis Please cases published from April 1998 to April 2023 and extracted the patient history and imaging findings. ChatGPT was asked to list differential diagnoses, first based only on the patient history, then based only on the imaging findings, and then based on both; it was then asked to provide a final diagnosis. Two radiologists (D.H., Y. Mitsuyama; 7 and 4 years of experience, respectively) compared the generated diagnoses with the published diagnoses and determined whether they coincided. The same radiologists also rated the legitimacy of the differential diagnoses generated by ChatGPT on a five-point Likert scale ranging from 1 (highly illegitimate) to 5 (highly legitimate). In addition, the specialty of each case was determined. Any discrepancy was resolved by a third senior radiologist (H. Takita, 8 years of experience).
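The exact prompts and interface used by the authors are not reproduced in the text. For illustration only, a minimal Python sketch of the three-step querying workflow is given below; it assumes the openai Python client (version 1.x) and a hypothetical case dictionary containing the extracted "history" and "findings" text, and should not be taken as the authors' actual implementation.

# Minimal sketch of the querying workflow described above; the authors'
# actual prompts and interface are not given in the text.
# Assumes the openai Python client (>=1.0) with OPENAI_API_KEY set.
from openai import OpenAI

client = OpenAI()

def ask(prompt: str) -> str:
    # Send a single prompt to GPT-4 and return the text of the reply.
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

def diagnose(case: dict) -> dict:
    # case is a hypothetical dictionary with "history" and "findings" text.
    history, findings = case["history"], case["findings"]
    return {
        "ddx_history": ask(f"Patient history:\n{history}\nList the differential diagnoses."),
        "ddx_findings": ask(f"Imaging findings:\n{findings}\nList the differential diagnoses."),
        "ddx_both": ask(
            f"Patient history:\n{history}\nImaging findings:\n{findings}\n"
            "List the differential diagnoses."
        ),
        "final": ask(
            f"Patient history:\n{history}\nImaging findings:\n{findings}\n"
            "Provide the single most likely final diagnosis."
        ),
    }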

Given that Radiology articles published through the September 2021 training cutoff (1) may have been included in the GPT-4 training data, accuracies for cases published before and after this cutoff were compared using the Fisher exact test in R software (version 4.0.0; https://www.r-project.org/).
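The authors performed this comparison in R (fisher.test); for illustration, an equivalent two-sided Fisher exact test is sketched below in Python using scipy. The 2 × 2 counts are placeholders, not the study data.

# Illustrative two-sided Fisher exact test comparing accuracy between the
# two publication periods; the counts below are placeholders, NOT the study
# data (the authors used R's fisher.test).
from scipy.stats import fisher_exact

correct_pre, incorrect_pre = 50, 40    # placeholder counts, cases through September 2021
correct_post, incorrect_post = 30, 25  # placeholder counts, cases after the cutoff

table = [[correct_pre, incorrect_pre],
         [correct_post, incorrect_post]]
odds_ratio, p_value = fisher_exact(table, alternative="two-sided")
print(f"Fisher exact test: odds ratio = {odds_ratio:.2f}, P = {p_value:.3f}")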

Results

The accuracy of ChatGPT for differential diagnoses was 22% (68 of 313) from the submitter-provided patient history alone, 57% (177 of 313) from the submitter-identified imaging findings alone, and 61% (191 of 313) from both history and imaging findings; accuracy for the final diagnosis was 54% (170 of 313). Accuracy did not differ significantly between cases published before and after the training cutoff (P values ranged from .124 to >.99). For the final diagnosis, accuracy was highest in cardiovascular radiology (79%; 23 of 29) and lowest in musculoskeletal radiology (42%; 28 of 66). These results are shown in the Table. The differential diagnoses generated by ChatGPT received a median legitimacy score of 5.0 (lower quartile, 3.0; upper quartile, 5.0).
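As a quick arithmetic check, the reported percentages follow directly from the stated counts; the short Python sketch below recomputes them (the counts are those given above).

# Recompute the reported accuracies from the counts given in the text.
counts = {
    "differential diagnoses, history only": (68, 313),
    "differential diagnoses, imaging findings only": (177, 313),
    "differential diagnoses, history and findings": (191, 313),
    "final diagnosis, overall": (170, 313),
    "final diagnosis, cardiovascular": (23, 29),
    "final diagnosis, musculoskeletal": (28, 66),
}
for label, (correct, total) in counts.items():
    print(f"{label}: {correct}/{total} = {100 * correct / total:.0f}%")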

Table: Overall Results by Time Period with P Values

Discussion

The diagnostic accuracy of ChatGPT for Diagnosis Please cases was 54% when given both the patient history and the imaging findings, and the median legitimacy score of 5.0 indicates modestly reliable output. Because performance varied by specialty, users should keep this variability in mind when applying the model.

Identifying differential diagnoses and settling on a final diagnosis can be challenging for radiologists. ChatGPT may lighten their workload by providing immediate diagnostic suggestions (6). This tool could prove particularly beneficial in settings where radiologists are in short supply (7).

However, this study had limitations. First, ChatGPT cannot evaluate images directly. Second, Diagnosis Please quizzes represent a controlled setting that may not reflect real-world complexity. Third, the use of post hoc, author-written imaging findings might have contributed to the relatively high performance. Finally, cases published through September 2021 might have been included in the ChatGPT training data.

In conclusion, this study indicates the potential of ChatGPT as a decision support system in radiology.

Disclosures of conflicts of interest: D.U. No relevant relationships. Y.M. No relevant relationships. H. Takita grant from JSPS KAKENHI (23K14899) and Bayer Yakuhin (Bayer Academic Support BASJ20220408012). D.H. No relevant relationships. S.L.W. No relevant relationships. H. Tatekawa No relevant relationships. Y.M. No relevant relationships.

Acknowledgment

We acknowledge that parts of this article were drafted with GPT-4–based ChatGPT (OpenAI; http://openai.com); the output was reviewed and verified by the authors.

Author Contributions

Author contributions: Guarantors of integrity of entire study, D.U., Y. Mitsuyama; study concepts/study design or data acquisition or data analysis/interpretation, all authors; manuscript drafting or manuscript revision for important intellectual content, all authors; approval of final version of submitted manuscript, all authors; agrees to ensure any questions related to the work are appropriately resolved, all authors; literature research, D.U., Y. Mitsuyama; clinical studies, Y. Mitsuyama, D.H.; experimental studies, D.U.; statistical analysis, D.U., H. Tatekawa; and manuscript editing, D.U., Y. Mitsuyama, H. Takita, S.L.W., H. Tatekawa, Y. Miki.

Supported by Iida Group Holdings.

References

  • 1. OpenAI. GPT-4 Technical Report. arXiv preprint arXiv:2303.08774. https://arxiv.org/abs/2303.08774. Posted March 15, 2023. Accessed March 16, 2023.
  • 2. Eloundou T, Manning S, Mishkin P, Rock D. GPTs are GPTs: An early look at the labor market impact potential of large language models. arXiv preprint arXiv:2303.10130. https://arxiv.org/abs/2303.10130. Posted March 17, 2023. Accessed March 18, 2023.
  • 3. Ueda D, Walston SL, Matsumoto T, Deguchi R, Tatekawa H, Miki Y. Evaluating GPT-4-based ChatGPT’s Clinical Potential on the NEJM Quiz. medRxiv 2023.05.04.23289493. Posted May 5, 2023. Accessed May 10, 2023.
  • 4. GPT-4. OpenAI. https://openai.com/gpt-4. Accessed April 24, 2023.
  • 5. Bossuyt PM, Reitsma JB, Bruns DE, et al. STARD 2015: An Updated List of Essential Items for Reporting Diagnostic Accuracy Studies. Radiology 2015;277(3):826–832.
  • 6. Juluru K, Shih HH, Keshava Murthy KN, et al. Integrating AI Algorithms into the Clinical Workflow. Radiol Artif Intell 2021;3(6):e210013.
  • 7. Mollura DJ, Culp MP, Pollack E, et al. Artificial Intelligence in Low- and Middle-Income Countries: Innovating Global Health Radiology. Radiology 2020;297(3):513–520.

Article History

Received: Apr 25 2023
Revision requested: May 12 2023
Revision received: June 23 2023
Accepted: June 29 2023
Published online: July 18 2023