The LLM Will See You Now: Performance of ChatGPT on the Brazilian Radiology and Diagnostic Imaging and Mammography Board Examinations

Published Online: https://doi.org/10.1148/ryai.230568

See also the article by Almeida et al in this issue.

Hari Trivedi, MD, is an assistant professor at Emory University in emergency radiology, biomedical informatics, and emergency medicine. He co-directs the HITI (Healthcare AI Innovation and Translational Informatics) lab with a focus on pipelines for data extraction, de-identification, and curation for development of large-scale deep learning and multimodal foundational models. He works closely with industry partners on model validation, fine-tuning, and regulatory clearance, with a primary focus on breast imaging. Dr Trivedi also serves as the chair of the RSNA Radiology Artificial Intelligence Data Standards (RAIDS) Subcommittee.

Judy Wawira Gichoya, MD, MS, is an associate professor at Emory University in interventional radiology and informatics, co-directing the HITI (Healthcare AI Innovation and Translational Informatics) lab. Her career focus is on validating machine learning models for health in real clinical settings, exploring explainability and fairness, with a specific focus on how algorithms fail. She is heavily invested in training the next generation of data scientists through multiple high school programs, serving as the program director for the Radiology: Artificial Intelligence trainee editorial board and the medical students machine learning elective.

Historically, becoming a physician has been considered a higher calling, requiring both knowledge of the human body and the compassion to care for the whole person. In most cases, access is limited to only the brightest and most resilient individuals who are able to endure years of rigorous education and training. The barriers to entry are numerous, including the ability to perform well on standardized testing; the Scholastic Aptitude Test (SAT), Medical College Admission Test (MCAT), and U.S. Medical Licensing Examination (USMLE) all require critical thinking combined with a strong capacity to memorize and recall information.

In the past 20 years, however, the need to memorize and recall information has decreased substantially with the advent of the internet and smart devices, which allow us to access the world’s information in an instant. While standardized testing has never been a fair or perfect measure of aptitude or intelligence, its relevance to modern-day practice and to the ability to take care of a patient continues to dwindle. Especially in the era of artificial intelligence, statements from leading technology innovators like Geoffrey Hinton have challenged even the need for doctors, with claims of total replacement by technology (1).

Such advances in technology raise many questions about how physicians should be trained and tested. What if a program can answer questions and pass examinations as well as a physician? What if, eventually, it generates care plans that meet or exceed those of physicians or panels of experts? These questions are not new, but recent advancements in large language models (LLMs) have accelerated the urgency with which we must consider them (2).

In “Performance of ChatGPT on the Brazilian Radiology and Diagnostic Imaging and Mammography Board Examinations” (3), Almeida and colleagues test OpenAI’s ChatGPT-3.5 and ChatGPT-4 on three Brazilian board examinations written in Brazilian Portuguese: radiology and diagnostic imaging, neuroradiology, and mammography. These examinations were selected specifically due to their lack of images within the question stems. Various prompting styles, which have been described in the literature (4), resulted in differences in performance for GPT-3.5 but showed no substantial difference for GPT-4, demonstrating the latter’s greater ability to generalize and understand language. In their testing, GPT-4 was able to pass all three board examinations. GPT-3.5 was able to pass only two, and overall, GPT-4 outscored GPT-3.5 by approximately 20% in most cases.

Impressive, Yet Unsurprising

The results are impressive, although perhaps unsurprising given the seemingly endless stream of similar results across multiple health care and non–health care domains. However, we must step back and consider what actually was accomplished, to avoid conflating passing an examination with the ability to diagnose or treat patients. By design, standardized examinations have highly structured questions with a single correct answer. The question stem must contain multiple unambiguous clues that guide the examinee to the correct answer, provided that they have the underlying knowledge. Questions must be limited to topics for which the literature supports definitive answers in diagnosis, treatment, or management. Of course, this is not how radiology or any other subspecialty in medicine is practiced. Symptoms are vague, histories are inaccurate, differentials are broad, and further testing and evaluation are required in many cases. Therefore, an LLM’s ability to interpret question stems and draw upon a vast fund of training data should not be equated with intelligence or the ability to care for patients.

How, then, should we interpret these results? The performance of GPT-4 in Almeida et al’s experiment highlights two key achievements: (a) the ability to tokenize human language and map it onto shared concepts across languages (ie, understand the question) and (b) the ability to recall relevant training data to generate a pertinent answer. Each of these tasks is immensely impressive and would have been hard to conceive of just 1 or 2 years ago.
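
To make the first achievement concrete, the brief sketch below (ours, not from the study) uses OpenAI’s open-source tiktoken tokenizer to show that an English question and a roughly equivalent Portuguese question map to entirely different token sequences. Any cross-lingual “understanding” therefore lives in the model’s learned embedding space rather than in the tokenizer itself; the example questions are invented for illustration.

```python
# A minimal sketch (not from the study): how a GPT-style tokenizer splits
# English and Portuguese question stems into subword tokens. Requires the
# open-source tiktoken package (pip install tiktoken).
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4")

english = "Which finding is most suspicious for malignancy on mammography?"
portuguese = "Qual achado é mais suspeito de malignidade na mamografia?"

for text in (english, portuguese):
    tokens = enc.encode(text)
    print(f"{len(tokens):2d} tokens: {tokens}")
# The two questions yield different token sequences of different lengths;
# the shared medical concepts are recovered only downstream, in the
# model's learned embedding space.
```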

There is currently very little evidence to show how these models would perform when ingesting real-world multimodal data to generate accurate and actionable outcomes for patient care. One example is a non–peer-reviewed article by an emergency physician who entered de-identified histories from 40 patients seen during a single shift into ChatGPT with the prompt “What are the differential diagnoses for this patient presenting to the emergency department?” (5). He found that as long as the history was highly detailed and precise (200–600 words), the model performed very well. However, there was at least one dangerous failure: for a 21-year-old woman with right lower quadrant pain, ChatGPT suggested appendicitis and ovarian cyst but did not suggest ectopic pregnancy, which was the patient’s ultimate diagnosis and which, if missed, could have had catastrophic results.
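
For readers who wish to probe a model in the same spirit, a minimal sketch using OpenAI’s Python client is shown below. The patient history is fabricated for illustration and is not taken from the cited report; only the quoted prompt wording comes from the anecdote above.

```python
# Sketch of the kind of free-text query described above, using OpenAI's
# Python client (pip install openai; an API key must be set in the
# OPENAI_API_KEY environment variable). The history is fabricated.
from openai import OpenAI

client = OpenAI()

history = (
    "21-year-old woman with 12 hours of right lower quadrant pain, "
    "nausea, and mild tachycardia; last menstrual period 6 weeks ago; "
    "focal tenderness without rebound."
)

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{
        "role": "user",
        "content": "What are the differential diagnoses for this patient "
                   f"presenting to the emergency department?\n\n{history}",
    }],
)
print(response.choices[0].message.content)
# Per the anecdote, sparse histories performed poorly; details such as
# the last menstrual period are the kind of cue needed before ectopic
# pregnancy reliably appears in the differential.
```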

Biased Training Data

Like all machine learning models, LLMs are biased toward their training data. For example, Zack et al (6) repeatedly asked ChatGPT to generate a clinical vignette for a patient presenting with a certain disease, for example, sarcoidosis. They then measured the frequency of each race and sex described in the generated vignettes and compared it with the actual distribution of that disease in the patient population. Their results showed large discrepancies between the demographics the model generated and the true epidemiology. For example, approximately 30% of patients with sarcoidosis are Black and 60% are female, yet the generated vignettes described a Black female patient nearly 100% of the time. Similar trends were observed for HIV, essential hypertension, and cancers, suggesting that the differential diagnosis provided for a patient can very likely be influenced by race or sex. Because the sources of training data for current LLMs are unknown, it is very difficult to account for or counteract these biases. A similar challenge arises if we try to use general models to provide medical guidelines that vary between countries: a model trained predominantly on data and literature from the United States is likely to provide results that are inaccurate elsewhere. To generate models that are both accurate and equitable in crucial settings such as health care, training data must be known, and efforts must be made to identify and minimize bias whenever possible.
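
The measurement itself is straightforward to approximate. The sketch below is our simplified reconstruction of the Zack et al (6) procedure; the prompt wording, sample size, and keyword-based tallying are our own stand-ins, not the authors’ exact protocol.

```python
# Simplified reconstruction (ours) of the bias measurement in Zack et al
# (6): repeatedly ask the model for a disease vignette, then tally the
# demographics it chooses and compare against real epidemiology.
from collections import Counter
from openai import OpenAI

client = OpenAI()
RACES = ("black", "white", "asian", "hispanic")

def generate_vignette(disease: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{
            "role": "user",
            "content": f"Write a one-paragraph clinical vignette of a "
                       f"patient presenting with {disease}. Include the "
                       f"patient's race and sex.",
        }],
        temperature=1.0,  # sample, so repeated calls vary
    )
    return resp.choices[0].message.content.lower()

counts = Counter()
for _ in range(50):  # the actual study sampled far more vignettes
    text = generate_vignette("sarcoidosis")
    counts.update(race for race in RACES if race in text)

# Compare counts against known prevalence (eg, ~30% of patients with
# sarcoidosis are Black), not against a uniform distribution.
print(counts)
```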

Perfection Is the Enemy of Progress

Knowing the potential shortcomings and limitations of LLMs in health care, near-term use cases should leverage their known strengths and minimize their weaknesses. For example, inputting a specific clinical finding and asking what the current management guidelines support is a simple yet concrete way to provide both value and time savings to physicians. Similarly, summarizing the past medical history of a patient in the emergency department would save tremendous time and potentially prevent elements of the patient’s history from being missed. How to choose the correct prompt, given that LLMs can return different results each time, remains unknown and is an area of active research. This is one of the strengths of Almeida et al’s article, which tested the performance of four prompting strategies, as illustrated in the sketch below. Surprisingly, the long instruction style led to poor performance, possibly related to language: the examination was in Brazilian Portuguese, which likely required internal translation before the model could retrieve the correct answer.
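
As a concrete illustration of what comparing prompting strategies involves, the sketch below contrasts a bare zero-shot prompt with a longer instruction-style prompt on a multiple-choice question. Both templates are generic stand-ins written by us, not the exact prompts tested by Almeida et al.

```python
# Sketch of a prompting-strategy comparison in the spirit of Almeida et
# al; the two templates are generic stand-ins, not the study's prompts.
from openai import OpenAI

client = OpenAI()

PROMPTS = {
    "zero_shot": "{question}\n\nAnswer with a single letter (A-D).",
    "long_instruction": (
        "You are a board-certified radiologist taking a certification "
        "examination. Read the question carefully, reason step by step, "
        "and then give the single best answer as one letter (A-D).\n\n"
        "{question}"
    ),
}

def ask(style: str, question: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user",
                   "content": PROMPTS[style].format(question=question)}],
        temperature=0,  # reduces, but does not eliminate, run-to-run variation
    )
    return resp.choices[0].message.content.strip()

# Accuracy per style = fraction of questions whose extracted letter
# matches the answer key; repeated runs gauge variability.
```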

Multimodality Image and Text Models

Some LLMs, such as GPT-4, are now able to process images and text simultaneously. The majority of these models are trained on natural image and caption pairs scraped from the internet, but a natural extension is the use of these models to interpret radiologic images. Much of the existing literature focuses on different methods for training (for example, contrastive learning [7]), with chest radiographs and reports as the use case; these works, however, are largely proofs of concept rather than real-world deployments. Much work is underway to train large foundational models in both academic (8–10) and commercial settings, with the goal of using a rich embedding space to substantially decrease the volume of data needed to train deep learning models, thereby rapidly accelerating artificial intelligence model development. However, these foundational models still do not address a major limitation in the current framework of development and regulatory clearance: existing models and their clearances cover relatively narrow use cases. A new framework for regulatory clearance would be required to allow testing and deployment of multimodal LLMs that are accurate enough for patient care in an end-to-end fashion.
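
Stepping back to the contrastive training recipe cited above (7), the sketch below shows its core in a few lines: a symmetric loss that pulls matched image-report embeddings together and pushes mismatched pairs apart. The encoders are omitted, and the batch size and embedding dimension are illustrative.

```python
# Minimal sketch of CLIP-style contrastive pretraining on paired chest
# radiographs and report text (see reference 7). Encoders are omitted;
# random tensors stand in for their outputs.
import torch
import torch.nn.functional as F

def contrastive_loss(img_emb: torch.Tensor,
                     txt_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over a batch of N matched image-report pairs."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature  # N x N similarities
    targets = torch.arange(len(logits))           # image i matches report i
    # Cross-entropy in both directions: image-to-text and text-to-image.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

# Example with random stand-in embeddings (batch of 8, 512-dim):
loss = contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
```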

We remain somewhat skeptical that such solutions could cover all potential use cases in medical imaging. A more likely outcome in the next 5–10 years is an explosion of new artificial intelligence models built upon foundational models, with a significantly lower barrier to entry for newcomers. This may reduce the resources required for model development and allow companies to focus more on user experience, which is still sorely lacking in most deployments.

Conclusion

In summary, the successful passing of non-English radiology board examinations is a powerful signpost for the future of artificial intelligence in medicine. However, it is important to recognize the chasm between passing a test and handling complex clinical scenarios. The development of multimodal LLMs specifically for radiology is already underway, with the potential for significant gains in efficiency and accuracy for radiologists. However, this progress brings new challenges such as addressing biases and limitations in training data. As we integrate these technologies into health care, our focus must remain on enhancing patient care, ensuring accuracy, and maintaining ethical standards. The path forward involves not just technologic innovation but also a deep understanding of its implications in the nuanced landscape of medicine.

Disclosures of conflicts of interest: H.T. Grants from Lunit, DeepLook Medical, Kheiron, Clairity, and GE; royalties from Clairity; consulting fees from Sirona Medical, PMX, Flatiron Health, and Arterys; patent for Automatic Segmentation of Breast Arterial Calcification; chair of the RSNA Radiology Artificial Intelligence Data Standards Subcommittee. J.W.G. Supported by the 2022 Robert Wood Johnson Foundation Harold Amos Medical Development Program, the RSNA Health Disparities grant (#EIHD2204), Lacuna Fund (#67), Gordon and Betty Moore Foundation, and National Institutes of Health MIDRC grants (grant nos. 75N92020C00008 and 75N92020C00021); grants from Clairity, Lunit, and DeepLook Medical; board member for HL7 and the Society of Imaging Informatics in Medicine; associate editor and trainee editorial board lead for Radiology: Artificial Intelligence.

Authors declared no funding for this work.

References

  • 1. Hinton G. What’s Next? The Research Frontier - Presented by CIFAR. Presented at: 2016 Machine Learning and Market for Intelligence Conference; October 27, 2016; Toronto, Canada.
  • 2. Kung TH, Cheatham M, Medenilla A, et al. Performance of ChatGPT on USMLE: Potential for AI-assisted medical education using large language models. PLOS Digit Health 2023;2(2):e0000198.
  • 3. Almeida LC, Farina EMJM, Kuriki PEA, Abdala N, Kitamura FC. Performance of ChatGPT on the Brazilian Radiology and Diagnostic Imaging and Mammography Board Examinations. Radiol Artif Intell 2024;6(1):e230103.
  • 4. Zhou Y, Muresanu AI, Han Z, et al. Large Language Models Are Human-Level Prompt Engineers. arXiv 2211.01910 [preprint] https://arxiv.org/abs/2211.01910. Published November 3, 2022. Updated March 10, 2023. Accessed December 4, 2023.
  • 5. Tamayo-Sarver J; Inflect Health. I’m an ER doctor: Here’s what I found when I asked ChatGPT to diagnose my patients. Medium. https://inflecthealth.medium.com/im-an-er-doctor-here-s-what-i-found-when-i-asked-chatgpt-to-diagnose-my-patients-7829c375a9da. Published April 5, 2023. Accessed December 4, 2023.
  • 6. Zack T, Lehman E, Suzgun M, et al. Coding Inequity: Assessing GPT-4’s Potential for Perpetuating Racial and Gender Biases in Healthcare. medRxiv 2023.07.13.23292577 [preprint] https://www.medrxiv.org/content/10.1101/2023.07.13.23292577v2. Published July 17, 2023. Accessed December 4, 2023.
  • 7. Wang Z, Wu Z, Agarwal D, Sun J. MedCLIP: Contrastive Learning from Unpaired Medical Images and Text. arXiv 2210.10163 [preprint] https://arxiv.org/abs/2210.10163. Published October 18, 2022. Accessed December 4, 2023.
  • 8. Zhang X, Wu C, Zhang Y, Xie W, Wang Y. Knowledge-enhanced visual-language pre-training on chest radiology images. Nat Commun 2023;14(1):4542.
  • 9. Wu C, Zhang X, Zhang Y, Wang Y, Xie W. Towards generalist foundation model for radiology. arXiv 2308.02463 [preprint] https://arxiv.org/abs/2308.02463. Published August 4, 2023. Updated November 16, 2023. Accessed December 4, 2023.
  • 10. Wiggins WF, Tejani AS. On the opportunities and risks of foundation models for natural language processing in radiology. Radiol Artif Intell 2022;4(4):e220119.

Article History

Received: Dec 4 2023
Revision requested: Dec 8 2023
Revision received: Dec 12 2023
Accepted: Dec 18 2023
Published online: Jan 31 2024