Accuracy of Large Language Model–based Automatic Calculation of Ovarian-Adnexal Reporting and Data System MRI Scores from Pelvic MRI Reports

Figure 1: Study flowchart.

Figure 2: Large language model (LLM)–based strategies. To automatically calculate Ovarian-Adnexal Reporting and Data System (O-RADS) MRI scores from report descriptions, two different LLM-based strategies were evaluated. The first involved using GPT-4 (OpenAI) prompted with O-RADS MRI system rules and two examples (few-shot learning) to assign a score to each lesion (LLM only). The second involved leveraging GPT-4 with few-shot learning to extract and classify key descriptions (ie, septations and solid enhancing tissue), and output a structured JSON object for each lesion (hybrid). These features were then automatically passed into a deterministic formula to apply complex system rules and calculate the final O-RADS MRI score. O-RADS MRI categorizes risk into one of five scores: 1 (normal), 2 (almost certainly benign; positive predictive value [PPV] of malignancy, <0.5%), 3 (low risk; PPV, approximately 5%), 4 (intermediate risk; PPV, approximately 50%), and 5 (high risk; PPV, approximately 90%).
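The hybrid strategy described above can be illustrated with a minimal sketch: a JSON object of extracted features is parsed and passed to a deterministic scoring function. The caption does not give the study's JSON schema or its formula, so every field name here (`has_lesion`, `enhancing_solid_tissue`, `septations`, `curve_risk`) is hypothetical, and the rules are a deliberately reduced toy subset of O-RADS MRI, not the full lexicon and not suitable for clinical use. The LLM call itself is omitted; the input string stands in for its output.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class LesionFeatures:
    # Hypothetical schema: the study's actual JSON fields are not given in the caption.
    has_lesion: bool
    enhancing_solid_tissue: bool
    septations: bool
    curve_risk: Optional[str] = None  # assumed values: "low", "intermediate", "high"


def parse_features(llm_json: str) -> LesionFeatures:
    """Parse the structured JSON object that the LLM step is assumed to return."""
    return LesionFeatures(**json.loads(llm_json))


def toy_orads_score(f: LesionFeatures) -> int:
    """Deterministic scoring over a toy rule subset (illustrative only)."""
    if not f.has_lesion:
        return 1
    if f.enhancing_solid_tissue:
        # Toy rule: solid tissue is graded by its enhancement curve risk.
        return {"low": 3, "intermediate": 4, "high": 5}.get(f.curve_risk, 4)
    if f.septations:
        # Toy rule: septations without enhancing solid tissue.
        return 3
    return 2


# Stand-in for LLM output: septations present, no enhancing solid tissue.
example = '{"has_lesion": true, "enhancing_solid_tissue": false, "septations": true}'
print(toy_orads_score(parse_features(example)))
```

Keeping the rule application outside the LLM means the scoring step is reproducible and auditable: the same feature object always yields the same score, and errors can be traced either to feature extraction or to the formula.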

Figure 3: An 82-year-old woman presented with a right adnexal cystic mass that required further evaluation at MRI. This right adnexal mass was described in the original MRI report as containing enhancing septations but no papillary projections or nodularity. The reference standard category for this lesion was Ovarian-Adnexal Reporting and Data System (O-RADS) MRI 3. The original MRI report assigned an O-RADS MRI score of 4, which was incorrect. Both the large language model (LLM)–only (GPT-4 with in-context learning; OpenAI) and the hybrid application (GPT-4 for feature classification followed by deterministic formula to calculate the O-RADS score) correctly classified the lesion as O-RADS MRI 3. O-RADS MRI categorizes risk into one of five scores: 1 (normal), 2 (almost certainly benign; positive predictive value [PPV] of malignancy, <0.5%), 3 (low risk; PPV, approximately 5%), 4 (intermediate risk; PPV, approximately 50%), and 5 (high risk; PPV, approximately 90%).

Figure 4: Accuracy of Ovarian-Adnexal Reporting and Data System (O-RADS) MRI scores compared with the reference standard review for lesions with an O-RADS score in the original report (n = 158). The accuracy of the hybrid model (GPT-4 for feature classification [OpenAI] followed by deterministic formula to calculate O-RADS score; 97%; 153 of 158) was higher than that of the original report scores (88%; 139 of 158; P = .004) and that of the large language model (LLM)–only strategy (GPT-4 with in-context learning [OpenAI]; 89%; 141 of 158; P = .01). O-RADS MRI categorizes risk into one of five scores: 1 (normal), 2 (almost certainly benign; positive predictive value [PPV] of malignancy, <0.5%), 3 (low risk; PPV, approximately 5%), 4 (intermediate risk; PPV, approximately 50%), and 5 (high risk; PPV, approximately 90%).
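As a quick check of the arithmetic, the reported percentages follow directly from the counts of correctly scored lesions out of 158. The sketch below only recomputes those accuracies; the helper name `accuracy_pct` is illustrative, and the P values are not reproduced because they depend on paired per-lesion data that the caption does not provide.

```python
TOTAL_LESIONS = 158

# Correctly scored lesions per method, as reported in the caption.
correct_counts = {
    "original report": 139,
    "LLM only": 141,
    "hybrid": 153,
}


def accuracy_pct(correct: int, total: int) -> int:
    """Accuracy as a percentage rounded to the nearest whole number."""
    return round(100 * correct / total)


for method, correct in correct_counts.items():
    print(f"{method}: {correct} of {TOTAL_LESIONS} = {accuracy_pct(correct, TOTAL_LESIONS)}%")
```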