Congress: ECR25
Poster Number: C-25294
Type: Poster: EPOS Radiologist (scientific)
Authorblock: J. F. Ojeda Esparza1, D. Botta1, A. Fitisiori1, C. Santarosa1, M. Pucci1, C. Meinzer2, Y-C. Yun2, K-O. Loevblad1, F. T. Kurz1; 1Geneva/CH, 2Heidelberg/DE
Disclosures:
Jose Federico Ojeda Esparza: Nothing to disclose
Daniele Botta: Nothing to disclose
Aikaterini Fitisiori: Nothing to disclose
Corrado Santarosa: Nothing to disclose
Marcella Pucci: Nothing to disclose
Clara Meinzer: Nothing to disclose
Yeong-Chul Yun: Nothing to disclose
Karl-Olof Loevblad: Nothing to disclose
Felix T Kurz: Nothing to disclose
Keywords: Artificial Intelligence, CNS, Catheter arteriography, CT, MR, Computer Applications-General, Diagnostic procedure, Technology assessment, Education and training, Image verification
Methods and materials

A total of 104 image-based multiple-choice questions (MCQs) from Radiopaedia.org were analyzed, covering a range of neuroradiological imaging modalities, including CT, MRI, DSA, and conventional radiography (Fig. 1). Each question was presented five times to each of two large language models (LLMs) to assess the consistency of their responses. Additionally, global response data from Radiopaedia.org were collected as a comparative benchmark [8].
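For illustration, the repeated-presentation protocol could be scripted roughly as follows. This is a minimal sketch only: the `query_llm` helper, the question dictionary fields, and the constant names are hypothetical placeholders standing in for the multimodal model APIs that were actually used, not the study's pipeline.

```python
from collections import Counter

# Minimal sketch of the repeated-presentation protocol (hypothetical names).
N_REPETITIONS = 5  # each MCQ is posed five times per model


def query_llm(model_name: str, image_path: str, stem: str, options: list[str]) -> str:
    """Placeholder for a call to a multimodal LLM (e.g. GPT-4 or Gemini 1.5):
    sends the image and question text, returns the chosen option letter."""
    raise NotImplementedError


def collect_responses(model_name: str, questions: list[dict]) -> dict[str, list[str]]:
    """Pose every question N_REPETITIONS times and record all answers."""
    responses: dict[str, list[str]] = {}
    for q in questions:
        responses[q["id"]] = [
            query_llm(model_name, q["image_path"], q["stem"], q["options"])
            for _ in range(N_REPETITIONS)
        ]
    return responses


def most_frequent_answer(answers: list[str]) -> str:
    """Modal answer across the five repetitions, one way to summarize consistency."""
    return Counter(answers).most_common(1)[0][0]
```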

The study included both human participants and artificial intelligence models. The human participants were three expert radiologists (experienced neuroradiologists from a university hospital with over three years of clinical practice) and two trainee radiologists from a general radiology residency program. The large language models assessed were GPT-4 (OpenAI), a multimodal system capable of interpreting both text and images [3], and Gemini 1.5 (Google), another advanced multimodal model with similar capabilities [4]. The LLMs were evaluated for factual accuracy and were also tasked with classifying the difficulty of each question on a Likert scale (Fig. 3).
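A possible prompt format for eliciting both the answer and the Likert difficulty rating is sketched below; the wording, the example placeholders, and the template name are assumptions made for illustration, not the prompt actually used in the study.

```python
# Hypothetical prompt template (illustrative wording, not the study's prompt).
PROMPT_TEMPLATE = (
    "You are shown a neuroradiology image and a multiple-choice question.\n"
    "Question: {stem}\n"
    "Options: {options}\n"
    "1) Answer with the single best option letter.\n"
    "2) Rate the difficulty of this question on a 5-point Likert scale "
    "(1 = very easy, 5 = very difficult)."
)

# Example instantiation with placeholder content, for illustration only.
prompt = PROMPT_TEMPLATE.format(
    stem="Which finding best explains the imaging appearance?",
    options="A / B / C / D",
)
```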

Performance metrics focused primarily on accuracy, defined as the percentage of correct answers, computed for both the LLMs and the human participants. The difficulty classification of each question was also analyzed by comparing the LLMs' ratings with Likert scale classifications derived from global response data. Aggregated global response data from Radiopaedia.org users served as an additional reference for understanding performance variability across a broader cohort [8].
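As a sketch of the analysis, accuracy and the difficulty comparison could be computed as below, assuming per-question answer keys and that the global correct-response rate from Radiopaedia.org is binned onto a five-point Likert scale; the bin thresholds and function names are illustrative assumptions rather than the study's definitions.

```python
def accuracy(answers: dict[str, str], answer_key: dict[str, str]) -> float:
    """Percentage of questions answered correctly (applies to LLMs and humans alike)."""
    n_correct = sum(answers.get(qid) == correct for qid, correct in answer_key.items())
    return 100.0 * n_correct / len(answer_key)


def likert_from_global_rate(pct_correct_globally: float) -> int:
    """Map the global correct-response rate onto a 1 (easy) to 5 (hard) rating.
    The cut-off values are illustrative assumptions only."""
    for level, cutoff in enumerate([80, 60, 40, 20], start=1):
        if pct_correct_globally >= cutoff:
            return level
    return 5


# Example: compare an LLM's difficulty rating with the rating implied by
# global response data for one hypothetical question.
llm_rating = 3
benchmark_rating = likert_from_global_rate(55.0)  # 55 % correct globally -> level 3
ratings_agree = llm_rating == benchmark_rating
```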

 
