Impact of Multimodal Prompt Elements on Diagnostic Performance of GPT-4V in Challenging Brain MRI Cases
Abstract
Textual descriptions of radiologic image findings play a critical role in GPT-4 with vision–based differential diagnosis, underscoring the importance of radiologist expertise even when multimodal large language models are used.
Background
Studies have explored the application of multimodal large language models (LLMs) in radiologic differential diagnosis. Yet, how different multimodal input combinations affect diagnostic performance is not well understood.
Purpose
To evaluate the impact of varying multimodal input elements on the accuracy of OpenAI’s GPT-4 with vision (GPT-4V)–based brain MRI differential diagnosis.
Materials and Methods
Sixty brain MRI cases with a challenging yet verified diagnosis were selected. Seven prompt groups with varying combinations of four input elements (image without modifiers [I], annotation [A], medical history [H], and image description [D]) were defined. For each MRI case and prompt group, three identical queries were performed using an LLM-based search engine (Perplexity AI, powered by GPT-4V). The accuracy of LLM-generated differential diagnoses was rated using a binary and a numeric scoring system and analyzed using a χ2 test and a Kruskal-Wallis test. Results were corrected for false discovery rate with use of the Benjamini-Hochberg procedure. Regression analyses were performed to determine the contribution of each input element to diagnostic performance.
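For illustration, the statistical workflow described above can be sketched in a few lines of Python. This is a minimal sketch on simulated data, assuming a response-level table; the column names (group, correct, score), the simulated effect sizes, and the choice of scipy/statsmodels routines are assumptions for demonstration and do not reproduce the study's actual analysis code or data.

```python
# Minimal sketch of the described analysis; data are simulated and all
# column names and effect sizes are illustrative assumptions.
from itertools import combinations

import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency, kruskal
from statsmodels.stats.multitest import multipletests
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)

# One row per LLM response: 7 prompt groups x 60 cases x 3 queries = 180
# responses per group. Group labels encode the input elements present.
groups = ["I", "IA", "IH", "ID", "IAH", "IAD", "IAHD"]
df = pd.DataFrame({"group": np.repeat(groups, 180)})
df["has_D"] = df["group"].str.contains("D").astype(int)
df["has_H"] = df["group"].str.contains("H").astype(int)

# Simulated binary accuracy with a strong D effect and a moderate H effect.
p_correct = (0.02 + 0.55 * df["has_D"] + 0.10 * df["has_H"]).clip(0, 1)
df["correct"] = rng.binomial(1, p_correct)

# Chi-square test of binary accuracy across all prompt groups.
chi2, p_chi2, _, _ = chi2_contingency(pd.crosstab(df["group"], df["correct"]))

# Kruskal-Wallis test on a numeric accuracy score (here simulated, 0-3).
df["score"] = df["correct"] * rng.integers(1, 4, size=len(df))
h_stat, p_kw = kruskal(*[g["score"].to_numpy() for _, g in df.groupby("group")])

# Pairwise group comparisons, corrected for false discovery rate with the
# Benjamini-Hochberg procedure.
pair_pvals = []
for g1, g2 in combinations(groups, 2):
    sub = df[df["group"].isin([g1, g2])]
    tab = pd.crosstab(sub["group"], sub["correct"])
    pair_pvals.append(chi2_contingency(tab)[1])
rejected, p_adj, _, _ = multipletests(pair_pvals, method="fdr_bh")

# Logistic regression of accuracy on input elements; exponentiated
# coefficients are odds ratios, analogous to those reported in the Results.
fit = smf.logit("correct ~ has_D + has_H", data=df).fit(disp=False)
print(np.exp(fit.params))
```

In a model of this form, the exponentiated coefficient of each input-element indicator is the odds ratio for that element, which is how effect sizes such as the OR for the image description (D) are reported in the Results below.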
Results
The prompt group containing I, A, H, and D as input exhibited the highest diagnostic accuracy (124 of 180 responses [69%]). Significant differences were observed between prompt groups that contained D among their inputs and those that did not. Prompts with unannotated images alone (I) (four of 180 responses [2.2%]) or annotated images alone (I and A) (two of 180 responses [1.1%]) yielded very low diagnostic accuracy. Regression analyses confirmed a large positive effect of D on diagnostic accuracy (odds ratio [OR], 68.03; P < .001), as well as a moderate positive effect of H (OR, 4.18; P < .001).
Conclusion
The textual description of radiologic image findings was identified as the strongest contributor to the performance of GPT-4V in brain MRI differential diagnosis, followed by the medical history; unannotated or annotated images alone yielded very low diagnostic performance.
© RSNA, 2025
Article History
Received: Mar 9 2024
Revision requested: Apr 25 2024
Revision received: Oct 28 2024
Accepted: Nov 22 2024
Published online: Jan 21 2025