Congress: ECR25
Poster Number: C-11239
Type: Poster: EPOS Radiologist (scientific)
Authorblock: S. H. Kim, S. Schramm, L. C. Adams, R. Braren, K. Bressem, M. Keicher, C. Zimmer, D. M. Hedderich, B. Wiestler; München/DE
Disclosures:
Su Hwan Kim: Nothing to disclose
Severin Schramm: Nothing to disclose
Lisa C. Adams: Nothing to disclose
Rickmer Braren: Nothing to disclose
Keno Bressem: Nothing to disclose
Matthias Keicher: Nothing to disclose
Claus Zimmer: Nothing to disclose
Dennis M Hedderich: Nothing to disclose
Benedikt Wiestler: Nothing to disclose
Keywords: Artificial Intelligence, CT, MR, Technology assessment, Pathology
Purpose
Recent studies have demonstrated the potential of large language models (LLMs), artificial intelligence-based systems capable of processing and generating natural language, to perform radiological differential diagnosis [1, 2]. Yet, the LLMs primarily used in previous studies are proprietary, closed-source models such as GPT-4 or Gemini [3]. Access to these models typically necessitates the transfer of data to third-party servers, thereby increasing the risk of unauthorized access to sensitive health information. Open-source models offer a viable alternative, enabling healthcare institutions to retain...
Methods and materials
We evaluated the diagnostic performance of fifteen state-of-the-art open-source LLMs and one closed-source LLM (GPT-4o) using clinical and imaging descriptions from 1,933 case reports in the Eurorad library. Cases spanned all radiological subspecialties; cases that explicitly mentioned the correct diagnosis in the case description were excluded. Responses were considered correct if the true diagnosis was included in the top three suggestions. Llama-3-70B evaluated responses, with its accuracy validated against radiologist ratings in a case subset (n = 140). Confidence...
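The abstract does not publish the evaluation code, but the protocol above maps to a simple loop: query a model for its top three differential diagnoses, then have an LLM judge decide whether the true diagnosis is among them. The sketch below assumes an OpenAI-compatible chat endpoint (as exposed by local inference servers such as vLLM or Ollama); the prompts, model name, and single-word judge rubric are illustrative assumptions, not the authors' exact implementation.

```python
# Minimal sketch of the top-3 evaluation protocol described above.
# Assumes a local OpenAI-compatible server; prompts and model names
# are illustrative, not the study's actual implementation.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

CANDIDATE_PROMPT = (
    "You are a radiologist. Based on the following clinical and imaging "
    "description, list the three most likely diagnoses, most likely first.\n\n{case}"
)

JUDGE_PROMPT = (
    "True diagnosis: {truth}\n"
    "Model suggestions:\n{suggestions}\n"
    "Answer 'correct' if the true diagnosis appears among the suggestions, "
    "otherwise 'incorrect'. Answer with one word."
)

def top3_suggestions(case_text: str, model: str = "llama-3-70b-instruct") -> str:
    # Ask the candidate model for its three most likely diagnoses.
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": CANDIDATE_PROMPT.format(case=case_text)}],
        temperature=0.0,
    )
    return resp.choices[0].message.content

def judge(truth: str, suggestions: str, judge_model: str = "llama-3-70b-instruct") -> bool:
    # LLM-as-judge: mark the response correct if the true diagnosis
    # appears among the three suggestions.
    resp = client.chat.completions.create(
        model=judge_model,
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(truth=truth, suggestions=suggestions)}],
        temperature=0.0,
    )
    return resp.choices[0].message.content.strip().lower().startswith("correct")
```

Restricting the judge to a one-word verdict keeps parsing trivial; the study additionally validated this automated judge against radiologist ratings, as noted above.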
Results
The initial dataset retrieved from the Eurorad library consisted of 4,827 case reports. Using the Llama-3-70B model, we identified and excluded 2,894 cases where the diagnosis was explicitly stated within the case description. The dataset was primarily composed of cases from neuroradiology (21.4%), abdominal imaging (18.1%), and musculoskeletal imaging (14.6%), whereas breast imaging (3.4%) and interventional radiology (1.4%) were underrepresented. Llama-3-70B exhibited a high accuracy of 87.8% in classifying LLM responses as "correct" or "incorrect" (LLM judge), compared to human expert...
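The 87.8% judge-accuracy figure corresponds to percent agreement between the LLM judge and human raters on the rated subset. A minimal sketch, assuming parallel lists of boolean "correct" labels (variable names are hypothetical):

```python
# Percent agreement between LLM-judge labels and radiologist labels
# on the human-rated subset (n = 140 in the study). Names are
# illustrative; the abstract reports 87.8% agreement for Llama-3-70B.
def judge_agreement(judge_labels: list[bool], expert_labels: list[bool]) -> float:
    assert len(judge_labels) == len(expert_labels), "subsets must align"
    matches = sum(j == e for j, e in zip(judge_labels, expert_labels))
    return matches / len(judge_labels)

# Example: an agreement of 0.878 corresponds to ~123 of 140 matching labels.
```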
Conclusion
Our findings highlight the potential of open-source LLMs as decision support tools for radiological differential diagnosis in challenging real-world cases. The top-performing open-source model, Llama-3, delivered results nearly on par with human experts and GPT-4o, demonstrating that open-source models are rapidly narrowing the gap with proprietary counterparts.
References
[1] Sonoda, Y. et al. Diagnostic performances of GPT-4o, Claude 3 Opus, and Gemini 1.5 Pro in “Diagnosis Please” cases. Jpn J Radiol 1–5 (2024). doi:10.1007/s11604-024-01619-y
[2] Schramm, S. et al. Impact of Multimodal Prompt Elements on Diagnostic Performance of GPT-4(V) in Challenging Brain MRI Cases. medRxiv 2024.03.05.24303767 (2024). doi:10.1101/2024.03.05.24303767
[3] Suh, P. S. et al. Comparing Diagnostic Accuracy of Radiologists versus GPT-4V and Gemini Pro Vision Using Image Inputs from Diagnosis Please Cases. Radiology 312, e240273 (2024).