Performance of ChatGPT on the Brazilian Radiology and Diagnostic Imaging and Mammography Board Examinations
Abstract
This prospective exploratory study, conducted from January 2023 through May 2023, evaluated the ability of ChatGPT to answer questions from Brazilian radiology board examinations and explored how different prompt strategies influence the performance of GPT-3.5 and GPT-4. Three multiple-choice board examinations that did not include image-based questions were evaluated: (a) radiology and diagnostic imaging, (b) mammography, and (c) neuroradiology. Five styles of zero-shot prompting were tested: (a) raw question, (b) brief instruction, (c) long instruction, (d) chain-of-thought, and (e) question-specific automatic prompt generation (QAPG). The QAPG and brief instruction prompt strategies performed best on all examinations (P < .05), obtaining passing scores (≥60%) on the radiology and diagnostic imaging examination with both versions of ChatGPT. The QAPG style achieved a score of 60% on the mammography examination using GPT-3.5 and 76% using GPT-4. GPT-4 achieved a score of up to 65% on the neuroradiology examination. The long instruction style consistently underperformed, implying that excessive detail might harm performance. GPT-4’s scores were less sensitive to changes in prompt style. The QAPG prompt style chose option “A” frequently, but no statistically significant difference suggesting option bias was found. GPT-4 passed all three radiology board examinations, and GPT-3.5 passed two of the three examinations when an optimal prompt style was used.
Keywords: ChatGPT, Artificial Intelligence, Board Examinations, Radiology and Diagnostic Imaging, Mammography, Neuroradiology
© RSNA, 2023
See also the commentary by Trivedi and Gichoya in this issue.
Summary
ChatGPT passed the radiology and diagnostic imaging, neuroradiology, and mammography board examinations, with GPT-4 showing substantial improvement over GPT-3.5. Performance was affected by prompt style.
Key Points
■ ChatGPT using GPT-4 passed all three Brazilian College of Radiology board examinations (radiology and diagnostic imaging, neuroradiology, and mammography) included in this study, whereas GPT-3.5 passed only two examinations.
■ Question-specific automatic prompt generation and brief instruction prompt styles led to the best performance in question-answering, passing all three specialty board tests.
■ GPT-4 showed a performance increase of 21.3% on average (P < .01) compared with the previous version.
Introduction
Artificial intelligence in health care has been attracting much attention in recent years. It has been shown to be a valuable asset in addressing numerous challenges in health care, such as accurately identifying findings in medical imaging (1,2), improving health screening programs (3), improving the patient experience, and reducing medical errors (4).
Large language models (LLMs) based on transformers are the state-of-the-art deep learning architecture for most natural language processing tasks (5). The performance of these models scales with the amount of training data and model size (6), which are expected to increase rapidly, more than doubling every 6 months (7). Recently, LLMs have demonstrated the ability to simulate other neural networks internally, making them excellent in-context learners (8,9).
LLMs receive a text fragment (prompt) as input and return a completion or response. When prompting, one can provide one or more examples of the expected response, which is called “few-shot prompting,” or provide no examples, which is called “zero-shot prompting” (10). “Chain-of-thought prompting” improves LLMs’ performance on mathematical and reasoning tasks (11). More recently, it was shown that automatic prompt engineering, which uses optimized prompts, can achieve human-level performance in various tasks (12).
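For illustration, the difference between these prompting modes can be sketched as follows (the question text is a hypothetical placeholder, not an item from the examinations studied here):

```python
# Zero-shot: only the task is given, with no worked examples.
zero_shot = "Q: Which imaging modality is first line for suspected cholecystitis? A:"

# Few-shot: one or more worked examples precede the new question.
few_shot = (
    "Q: Which imaging modality is first line for suspected nephrolithiasis? A: Noncontrast CT.\n"
    "Q: Which imaging modality is first line for suspected cholecystitis? A:"
)

# Chain-of-thought: an added instruction nudges the model to reason step by step.
chain_of_thought = zero_shot + " Let us think about this step by step."
```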
Medical board examinations are an effective and objective method for assessing LLMs’ knowledge in the medical field. Several medical examination benchmarks and datasets, like MedQA (13) and MedMCQA (14), are intended to objectively measure LLM scores in answering multiple-choice questions from real-life medical board examinations. Transformer-based LLMs trained on large amounts of data currently have the best scores in these benchmarks (15).
ChatGPT’s capabilities for the health care industry are already being evaluated (16,17), a crucial step before using it in real-world clinical tools (18). It holds great potential for improving processes and disrupting health care. However, ChatGPT’s performance on radiology medical board examinations remains unknown.
In this study, we evaluated ChatGPT’s ability to answer multiple-choice questions from the Brazilian radiology and medical imaging board examinations offered by the official licensing entity, the Brazilian College of Radiology (CBR), excluding the tests containing image-based assessment. Moreover, we explored how different prompt strategies can influence the model’s score and shed light on the opportunities of using ChatGPT’s knowledge in health care and radiology domains.
Materials and Methods
This prospective exploratory study was conducted from January 2023 through May 2023. Because it did not include human subjects or patient data, institutional review board approval was not required.
Data
The data used in this study were obtained from the CBR’s website. Three 2022 board examinations were included: radiology and diagnostic imaging, mammography, and neuroradiology, containing 60, 50, and 80 questions, respectively. These board examinations consist of a practical and a theoretical phase; this study focuses on the latter. Board examinations for other radiology specialties were excluded from this study because they contained image-based questions. The examinations were administered on July 3, 2022, and were subsequently made available online. Because the training data for ChatGPT include only information up to September 2021, these examinations could not have been included in the model’s training data.
The raw unstructured data were transformed into a four-column table (question number, raw question, options, and correct answer) in which each row corresponds to one question. Because each question has five answer options but only one correct answer, randomly guessing across the whole test would yield an expected score of 20%.
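A minimal sketch of this table structure, assuming a pandas DataFrame representation (the question text and answer below are placeholders, not items from the actual examinations):

```python
import pandas as pd

# Hypothetical illustration of the four-column question table described above;
# the question text and correct answer are placeholders.
questions = pd.DataFrame(
    [
        {
            "question_number": 1,
            "raw_question": "Regarding screening mammography, choose the correct option.",
            "options": "A) ... B) ... C) ... D) ... E) ...",
            "correct_answer": "C",
        },
        # ... one row per question (60, 50, and 80 rows for the three examinations)
    ]
)

# With five options and a single correct answer per question, random guessing
# has an expected score of 1/5 = 20%.
expected_random_score = 1 / 5
```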
The minimum threshold for passing these examinations is a score of 60%. Questions were categorized by the authors in knowledge retrieval (n = 69), interpretation of radiologic findings (n = 56), clinical decision-making (n = 42), or therapeutic management (n = 23). Per-category performance was obtained to detect patterns in incorrect answers.
Evaluated Models
OpenAI’s GPT-4 and GPT-3.5 large language models were included in this study. The models were accessed through the official chat completion application programming interface, with the maximum number of tokens set to 2048 and the temperature set to 0.5. For simplicity, we refer to both models as ChatGPT.
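For illustration, a request with the parameters reported above might look like the following sketch. The study accessed the application programming interface in early 2023; the client syntax shown here follows the current openai Python package and is an assumption, not the authors’ code:

```python
from openai import OpenAI

client = OpenAI()  # reads the API key from the OPENAI_API_KEY environment variable


def ask_chatgpt(prompt: str, model: str = "gpt-4") -> str:
    """Send a single prompt to the chat completion API and return the text reply."""
    response = client.chat.completions.create(
        model=model,            # e.g., "gpt-4" or "gpt-3.5-turbo"
        messages=[{"role": "user", "content": prompt}],
        max_tokens=2048,        # maximum tokens, as reported in the study
        temperature=0.5,        # temperature, as reported in the study
    )
    return response.choices[0].message.content
```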
Prompts
This study tested five different styles of zero-shot prompting, including raw, brief instruction (BI), long instruction (LI), chain-of-thought, and question-specific automatic prompt generation (QAPG) styles (Fig 1).
Figure 1: Examples of the five prompt styles evaluated in this study. Styles two, three, and four consist of appending an instruction at the beginning or end of each raw question. In question-specific automatic prompt generation, we first instructed ChatGPT to create a prompt and then passed the question and the generated text as a new prompt. The blue dotted lines indicate the text that was added to the raw questions in each style. CSF = cerebrospinal fluid, LLM = large language model.
Raw style.—Most questions already contained an instruction such as "choose the correct/incorrect option" between the question text and the options. In this style, the raw, unedited question and its options served as the input.
BI style.—A few words instructing the model that this is a question-answering problem were prepended to the raw question: “Choose only the letter that corresponds to the most likely correct answer. [raw question].”
LI style.—A more detailed instruction text was prepended: “You are a highly skilled physician in Brazil; you possess a thorough understanding of the best-practice guidelines for diagnosing and treating diseases. You are now doing a radiology medical board examination. Choose a single, correct answer for this question: [raw question].”
Chain-of-thought style.—This style consisted of adding the following text after each raw question: “[raw question] Let us think about this step by step.”
QAPG style.—This style was similar to the automatic prompt engineering approach of Zhou et al (12). However, in this study, we instructed ChatGPT to create a prompt intended to make LLMs answer each question correctly; the question and the generated prompt were then submitted together as a new prompt (Fig 1).
This study did not use other strategies, such as few-shot learning or contextualization.
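A minimal sketch of how the five styles could be assembled programmatically, based on the descriptions above; the QAPG meta-instruction wording is an assumption, as the exact text used to generate prompts is not reproduced here:

```python
BI_PREFIX = "Choose only the letter that corresponds to the most likely correct answer. "
LI_PREFIX = (
    "You are a highly skilled physician in Brazil; you possess a thorough understanding "
    "of the best-practice guidelines for diagnosing and treating diseases. You are now "
    "doing a radiology medical board examination. Choose a single, correct answer for "
    "this question: "
)
COT_SUFFIX = " Let us think about this step by step."


def build_prompt(raw_question: str, style: str) -> str:
    """Wrap a raw question according to one of the zero-shot prompt styles."""
    if style == "raw":
        return raw_question
    if style == "brief_instruction":
        return BI_PREFIX + raw_question
    if style == "long_instruction":
        return LI_PREFIX + raw_question
    if style == "chain_of_thought":
        return raw_question + COT_SUFFIX
    raise ValueError(f"Unknown style: {style}")


def build_qapg_prompt(raw_question: str) -> str:
    """Two-step QAPG: ask the model to write a prompt, then combine it with the question."""
    # Hypothetical meta-instruction; the study's exact wording is not reproduced here.
    meta_instruction = (
        "Write a prompt that would make a large language model answer the following "
        f"multiple-choice question correctly:\n\n{raw_question}"
    )
    generated_prompt = ask_chatgpt(meta_instruction)  # ask_chatgpt as sketched above
    return f"{generated_prompt}\n\n{raw_question}"
```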
Statistical Analysis
To reduce the impact of the known randomness in ChatGPT responses, each combination of prompt style and ChatGPT version completed each examination five times. The definitive examination score for each model and prompt style is reported as the median of the five repetition scores. The raw output was recorded in one column, and the final answer was extracted using regular expressions and stored in a final answer column. The authors corrected the extracted final answers when they did not match ChatGPT’s actual response.
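A simple pattern of this kind could extract the chosen letter from the raw output; the expression below is a hypothetical example, not the one used in the study:

```python
import re
from typing import Optional


def extract_answer(raw_output: str) -> Optional[str]:
    """Extract the chosen option letter (A-E) from a free-text model response."""
    # Hypothetical, deliberately simple pattern: an option letter followed by a
    # closing parenthesis, period, or colon (e.g., "B)", "Answer: C."). Outputs
    # that do not match are returned as None and flagged for manual review,
    # mirroring the manual correction step described above.
    match = re.search(r"\b([A-E])[\).:]", raw_output)
    return match.group(1) if match else None
```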
The Wilcoxon signed rank test was used to compare examination scores between ChatGPT versions. The Friedman test was applied to assess prompt style score distributions, and the Nemenyi test was used for pairwise comparisons. The repetition with the median score was used for question-level statistics (observed agreement and option bias); if multiple repetitions reached the median, the first one was used. The observed agreement of option choices was calculated to explore the similarity of the choices made by the different combinations of ChatGPT version and prompt style. The Spearman rank correlation coefficient was used to examine answer option bias for each prompt style. P values less than .05 were considered statistically significant. Statistical analysis was performed with SciPy version 1.11.1 for Python (Python Software Foundation).
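As a rough sketch of how these analyses map onto Python calls, assuming the Nemenyi post hoc test is taken from the scikit-posthocs package (it is not part of SciPy); all numeric values are illustrative placeholders, not study results:

```python
import numpy as np
from scipy import stats
import scikit_posthocs as sp  # assumption: Nemenyi post hoc test is not provided by SciPy

# Placeholder score matrices: one row per examination, one column per prompt style
# (raw, BI, LI, chain-of-thought, QAPG). Values are illustrative only.
scores_gpt35 = np.array([[55.0, 61.7, 50.0, 56.7, 63.3],
                         [52.0, 58.0, 48.0, 54.0, 60.0],
                         [45.0, 50.0, 42.5, 46.3, 53.8]])
scores_gpt4 = np.array([[70.0, 75.0, 68.3, 71.7, 80.0],
                        [72.0, 74.0, 66.0, 70.0, 76.0],
                        [60.0, 62.5, 57.5, 61.3, 65.0]])

# Wilcoxon signed rank test: paired comparison of scores across ChatGPT versions.
w_stat, w_p = stats.wilcoxon(scores_gpt35.ravel(), scores_gpt4.ravel())

# Friedman test on the prompt style score distributions (columns are the groups).
f_stat, f_p = stats.friedmanchisquare(*scores_gpt35.T)

# Nemenyi post hoc test for pairwise prompt style comparisons.
pairwise_p = sp.posthoc_nemenyi_friedman(scores_gpt35)

# Observed agreement between two prompt styles: fraction of identical option choices.
answers_qapg = np.array(["A", "C", "B", "D", "E"])
answers_li = np.array(["A", "B", "B", "D", "C"])
observed_agreement = np.mean(answers_qapg == answers_li)

# Spearman rank correlation, e.g., option position vs. how often it was chosen.
rho, s_p = stats.spearmanr([1, 2, 3, 4, 5], [55, 40, 38, 30, 27])
```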
Results
Accuracy
When using GPT-3.5, the QAPG and BI prompt strategies showed the best performance on the three examinations tested, achieving passing scores of 38 of 60 (63.3%) and 37 of 60 (61.7%), respectively, on the radiology and diagnostic imaging examination. The QAPG style also achieved a score of 30 of 50 (60%) on the mammography board examination, thus also reaching a passing score in that radiology specialty. GPT-3.5 could not achieve a passing score on the neuroradiology examination with any prompting style.
GPT-4 demonstrated an average improvement of 21.3% (P < .01) over its predecessor, achieving a passing score in all three board examinations for all of the prompting styles (P < .05). While only small differences in performance were found when comparing the different prompting styles, the QAPG prompting method demonstrated the highest scores in all three radiology specialties (P < .05 vs all other styles, except for BI in the mammography and radiology examinations on GPT-3.5). These findings are summarized in the Table.
In both ChatGPT versions, the LI prompt style consistently performed worse than the other styles, suggesting that overly detailed instructions may confuse the model. The results for the chain-of-thought prompting style were inconsistent and showed only minimal differences from the raw prompt, indicating that this strategy may not be an effective approach for this task. Every score in this study surpassed the 20% mark, which corresponds to randomly guessing the correct answer.
When analyzing the questions by type, it was observed that GPT-4 performed better in questions that involved interpreting described radiologic findings and clinical decision-making, with overall scores of 47 of 56 (83.9%) and 35 of 42 (83.3%), respectively, compared with scores of 39 of 56 (69.6%) and 30 of 42 (71.4%) for GPT-3.5. The QAPG prompting style showed an advantage primarily in questions related to knowledge retrieval, scoring 56 of 69 (81.2%) compared with 47 of 69 (68.1%) in the raw prompting style.
Observed Agreement
The observed agreement between the different prompt styles was measured (Fig 2). The lowest agreement was 52%, observed between the lowest performing prompting style (LI) and the highest performing style (QAPG). High agreement was found among the three lowest performing prompting styles, suggesting a possible answering bias when the model cannot reach a correct answer. GPT-4 showed more similar answers across prompt styles, with a minimum agreement of 76% between the QAPG and LI styles.
Figure 2: Matrices of the rate of answer agreement between the prompt styles for each ChatGPT version. In the GPT-3.5 matrix, there is a strong correlation between raw, long, and brief instruction styles, whereas there is a noticeable disagreement between the question-specific automatic prompt generation (QAPG) and all other styles. However, the same pattern cannot be confirmed for GPT-4.
Answering Bias
The distribution of correct options was balanced, close to the expected 20% for each option. All prompting styles had many "letter A" choices, which could indicate bias toward the letter closest to the prompt, but no statistically significant difference was found, suggesting an absence of letter proximity bias (Fig 3).
Figure 3: Graphs show the distribution of options chosen as correct by each prompt style and the distribution of correct options (in black) for all three examinations. The number of correct answers was well balanced across the five options. No statistically significant difference was found between the option distributions of the prompt styles for either version of ChatGPT. QAPG = question-specific automatic prompt generation.
Discussion
This study evaluated ChatGPT’s question-answering capabilities in the radiology field. GPT-4 was significantly more proficient than its predecessor, successfully passing all three board examinations. Of the five prompt styles tested, QAPG was the most successful, achieving a passing score on the radiology and diagnostic imaging and mammography board examinations with both versions of ChatGPT. A brief instruction given to the model before the question also showed promising results. We noticed a high frequency of option A choices, which we hypothesized could be caused by option proximity bias, but no statistically significant difference in option choice distribution was found.
Other authors reported that GPT-3.5 achieved accuracies between 42% and 64% on medical board examinations, similar to the scores this model achieved on the radiology board examinations in our study (17). GPT-4 showed significantly higher scores, between 63% and 80.6%.
Although medical board examinations are already a benchmark for LLMs, this is, to our knowledge, the first evaluation of a language model on a corpus of radiology questions and the first in Brazilian Portuguese. The CBR conducts annual board examinations that serve as a comprehensive assessment of candidates’ knowledge of radiology and its related fields of practice. Comparable to the American Board of Radiology’s qualifying examination, this test represents the first step toward certification in radiology. The examination comprises questions of varying complexity to evaluate candidates’ expertise, ability to interpret findings, and clinical decision-making capabilities.
Although these examinations encompass a broad range of radiology knowledge through their diverse questions, achieving a passing score does not equate to radiology certification for human candidates or ChatGPT. It was observed that ChatGPT possesses adequate proficiency in answering questions related to radiology, mammography, and neuroradiology. However, it is crucial to acknowledge that this represents merely a narrow aspect of the requisite expertise for practicing radiologists. These medical professionals engage in a multitude of responsibilities extending beyond the scope of such examinations, encompassing the interpretation of medical images, execution of interventional procedures, collaboration within multidisciplinary teams, and provision of direct patient care. These essential tasks continue to surpass the present capabilities of LLMs.
In conclusion, ChatGPT using GPT-3.5 passed two of three examinations consisting of multiple-choice questions from the Brazilian radiology and medical imaging board examinations, provided that the prompt style was optimized. ChatGPT using GPT-4 passed all three board examinations with every prompt style, although its accuracy could be further enhanced by specific prompt strategies. The prompt style is a crucial factor determining response accuracy, highlighting the importance of prompt engineering. Future research could comprehensively evaluate multimodal LLMs, such as Med-PaLM M, using examinations that incorporate both visual and textual elements.
Author Contributions
Author contributions: Guarantors of integrity of entire study, L.C.A., P.E.A.K., F.C.K.; study concepts/study design or data acquisition or data analysis/interpretation, all authors; manuscript drafting or manuscript revision for important intellectual content, all authors; approval of final version of submitted manuscript, all authors; agrees to ensure any questions related to the work are appropriately resolved, all authors; literature research, L.C.A., E.M.J.M.F., N.A., F.C.K.; experimental studies, L.C.A., N.A.; statistical analysis, L.C.A., N.A.; and manuscript editing, all authors.
* N.A. and F.C.K. are co-senior authors.
The authors declared no funding for this work.
References
- 1. Left Atrial Volume as a Biomarker of Atrial Fibrillation at Routine Chest CT: Deep Learning Approach. Radiol Cardiothorac Imaging 2019;1(5):e190057.
- 2. Visual Transformers and Convolutional Neural Networks for Disease Classification on Radiographs: A Comparison of Performance, Sample Efficiency, and Hidden Stratification. Radiol Artif Intell 2022;4(6):e220012.
- 3. AI Improves Nodule Detection on Chest Radiographs in a Health Screening Population: A Randomized Controlled Trial. Radiology 2023;307(2):e221894.
- 4. High-performance medicine: the convergence of human and artificial intelligence. Nat Med 2019;25(1):44–56.
- 5. An Extensive Study on Pretrained Models for Natural Language Processing Based on Transformers. In: 2022 International Conference on Electronics and Renewable Systems (ICEARS), Tuticorin, India, March 16–18, 2022. IEEE, 2022; 382–389.
- 6. Scaling Laws for Neural Language Models. arXiv 2001.08361 [preprint] https://arxiv.org/abs/2001.08361. Published January 23, 2020. Accessed March 16, 2023.
- 7. Compute Trends Across Three Eras of Machine Learning. arXiv 2202.05924 [preprint] https://arxiv.org/abs/2202.05924. Published February 11, 2022. Accessed March 16, 2023.
- 8. What learning algorithm is in-context learning? Investigations with linear models. arXiv 2211.15661 [preprint] https://arxiv.org/abs/2211.15661. Published November 28, 2022. Accessed March 16, 2023.
- 9. Why Can GPT Learn In-Context? Language Models Implicitly Perform Gradient Descent as Meta-Optimizers. arXiv 2212.10559 [preprint] https://arxiv.org/abs/2212.10559. Published December 20, 2022. Accessed March 16, 2023.
- 10. Language Models are Few-Shot Learners. arXiv 2005.14165 [preprint] https://arxiv.org/abs/2005.14165. Published May 28, 2020. Accessed March 16, 2023.
- 11. Large Language Models are Zero-Shot Reasoners. arXiv 2205.11916 [preprint] https://arxiv.org/abs/2205.11916. Published May 24, 2022. Accessed March 16, 2023.
- 12. Large Language Models Are Human-Level Prompt Engineers. In: NeurIPS 2022 Foundation Models for Decision Making Workshop. https://openreview.net/forum?id=YdqwNaCLCx.
- 13. What Disease does this Patient Have? A Large-scale Open Domain Question Answering Dataset from Medical Exams. arXiv 2009.13081 [preprint] https://arxiv.org/abs/2009.13081. Published September 28, 2020. Accessed March 16, 2023.
- 14. MedMCQA: A Large-scale Multi-Subject Multi-Choice Dataset for Medical domain Question Answering. In: Flores G, Chen GH, Pollard T, Ho JC, Naumann T, eds. Proceedings of the Conference on Health, Inference, and Learning. PMLR, 2022; 248–260. https://proceedings.mlr.press/v174/pal22a.html.
- 15. Large Language Models Encode Clinical Knowledge. arXiv 2212.13138 [preprint] https://arxiv.org/abs/2212.13138. Published December 26, 2022. Accessed March 16, 2023.
- 16. ChatGPT: can artificial intelligence language models be of value for cardiovascular nurses and allied health professionals. Eur J Cardiovasc Nurs 2023;22(7):e55–e59.
- 17. How Does ChatGPT Perform on the United States Medical Licensing Examination? The Implications of Large Language Models for Medical Education and Knowledge Assessment. JMIR Med Educ 2023;9:e45312.
- 18. ChatGPT and Other Large Language Models Are Double-edged Swords. Radiology 2023;307(2):e230163.
Article History
Received: Apr 1 2023
Revision requested: May 17 2023
Revision received: Sept 6 2023
Accepted: Oct 23 2023
Published online: Nov 08 2023