Performance of ChatGPT on the Brazilian Radiology and Diagnostic Imaging and Mammography Board Examinations

Published Online: https://doi.org/10.1148/ryai.230103

“Just Accepted” papers have undergone full peer review and have been accepted for publication in Radiology: Artificial Intelligence. This article will undergo copyediting, layout, and proof review before it is published in its final version. Please note that during production of the final copyedited article, errors may be discovered which could affect the content.

This prospective exploratory study, conducted from January 2023 through May 2023, evaluated the ability of ChatGPT to answer questions from Brazilian Radiology Board examinations and explored how different prompt strategies influence the performance of GPT-3.5 and GPT-4. Three multiple-choice board examinations that did not include image-based questions were evaluated: Radiology and Diagnostic Imaging, Mammography, and Neuroradiology. Five styles of zero-shot prompting were tested: Raw Question, Brief Instruction (BI), Long Instruction (LI), Chain-of-Thought (CoT), and Question-Specific Automatic Prompt Generation (QAPG). The QAPG and BI prompt strategies performed best on all examinations (P < .05), with both versions of ChatGPT obtaining passing scores (≥ 60%) on the Radiology and Diagnostic Imaging examination. The QAPG style achieved a score of 60% on the Mammography examination using GPT-3.5 and 76% using GPT-4. GPT-4 achieved a score of up to 65% on the Neuroradiology examination. The LI style consistently underperformed, suggesting that overly detailed instructions may harm performance. GPT-4's scores were less sensitive than GPT-3.5's to changes in prompt style. The QAPG prompt style produced a high proportion of answers selecting option "A," but no statistically significant difference suggesting bias was found. Overall, GPT-4 passed all 3 radiology board examinations, and GPT-3.5 passed 2 of 3 examinations when an optimal prompt style was used.
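
The abstract does not reproduce the exact prompt wordings or the workflow used to query the models, so the following is only a minimal, hypothetical sketch of how the five zero-shot prompt styles could be issued through the OpenAI Python client. The prompt templates, the helper functions qapg_prompt and answer, and the temperature setting are illustrative assumptions, not the authors' code or prompts.

# Hypothetical sketch of the five prompt styles, assuming the openai Python
# package (v1.x). The prompt wordings below are assumptions, not the study's.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Illustrative zero-shot templates for the non-generated prompt styles.
PROMPT_STYLES = {
    "raw_question": "{question}",
    "brief_instruction": "Answer the following multiple-choice question.\n\n{question}",
    "long_instruction": (
        "You are a board-certified radiologist taking a Brazilian radiology "
        "board examination. Read the question carefully, weigh each "
        "alternative, and reply with the single best answer.\n\n{question}"
    ),
    "chain_of_thought": "{question}\n\nLet's think step by step before giving the final answer.",
}

def qapg_prompt(question: str, model: str = "gpt-4") -> str:
    """Question-Specific Automatic Prompt Generation (illustrative):
    ask the model to draft an instruction tailored to this question,
    then prepend it to the question itself."""
    generated = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": "Write a short instruction that would help an AI answer "
                       "this multiple-choice question correctly:\n\n" + question,
        }],
    ).choices[0].message.content
    return generated + "\n\n" + question

def answer(question: str, style: str, model: str = "gpt-3.5-turbo") -> str:
    """Send one exam question to the chosen model using the chosen prompt style."""
    if style == "qapg":
        prompt = qapg_prompt(question, model=model)
    else:
        prompt = PROMPT_STYLES[style].format(question=question)
    response = client.chat.completions.create(
        model=model,
        temperature=0,  # deterministic output for grading; an assumption, not from the study
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content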

©RSNA, 2023

Article History

Published online: Nov 08 2023