A total of 104 image-based multiple-choice questions (MCQs) from Radiopaedia.org were analyzed, covering various neuroradiological imaging modalities, including CT, MRI, DSA, and conventional radiography (Fig. 1). Each question was presented five times to each of the two large language models (LLMs) to assess the consistency of their responses. Additionally, global response data from Radiopaedia.org were collected as a comparative benchmark [8].
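As an illustration only, the following minimal sketch shows how such a five-repetition protocol could be run programmatically against the OpenAI API. The actual querying setup used in the study (web interface versus API), the model snapshot, the prompt wording, and the file names are not specified above and are assumptions here.

```python
import base64
from collections import Counter

from openai import OpenAI  # assumes the OpenAI Python SDK is installed

client = OpenAI()

def ask_mcq(image_path: str, question_text: str, model: str = "gpt-4-turbo") -> str:
    """Send one image-based MCQ to the model and return its raw answer."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()
    response = client.chat.completions.create(
        model=model,  # placeholder model name; the exact GPT-4 variant is an assumption
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": question_text},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content.strip()

# Present the same question five times and record every answer,
# so response consistency across repetitions can be examined.
answers = [
    ask_mcq("case_001.png",  # hypothetical file name
            "Which is the most likely diagnosis? A) ... B) ... C) ... D) ...")
    for _ in range(5)
]
print(Counter(answers))
```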
The study included both human participants and artificial intelligence models. The human participants comprised three expert radiologists (experienced neuroradiologists at a university hospital, each with more than three years of clinical practice) and two trainee radiologists from a general radiology residency program. The large language models assessed were GPT-4 (OpenAI), a multimodal system capable of interpreting both text and images [3], and Gemini 1.5 (Google), another advanced multimodal model with similar capabilities [4]. The LLMs were evaluated for factual accuracy and were additionally tasked with classifying the difficulty of each question on a Likert scale (Fig. 3).
Performance metrics focused primarily on accuracy, defined as the percentage of correctly answered questions, which was calculated separately for the LLMs and for the human participants. The difficulty classification of each question was also analyzed by comparing the LLMs' ratings with Likert-scale classifications derived from global response data. Aggregated global response data from Radiopaedia.org users served as an additional reference for understanding performance variability across a broader cohort [8].
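The following minimal sketch illustrates how accuracy and the agreement between model-assigned difficulty ratings and ratings derived from global response data could be computed. The per-question data structure, the example values, and the bin edges mapping global correct-response rates onto a 1-5 Likert scale are illustrative assumptions; the study's actual mapping is not specified above.

```python
from statistics import mean

# Hypothetical per-question records: whether the LLM answered correctly,
# its Likert difficulty rating (1 = easy, 5 = hard), and the share of
# Radiopaedia.org users who answered correctly (global response data).
questions = [
    {"llm_correct": True,  "llm_difficulty": 2, "global_correct_rate": 0.81},
    {"llm_correct": False, "llm_difficulty": 4, "global_correct_rate": 0.35},
    # ... one record per MCQ
]

# Accuracy: percentage of correctly answered questions.
accuracy = 100 * mean(q["llm_correct"] for q in questions)

def difficulty_from_global_rate(rate: float) -> int:
    """Map a global correct-response rate onto a 1-5 Likert difficulty,
    treating lower rates as harder questions (assumed cut-offs)."""
    bins = [0.80, 0.60, 0.40, 0.20]  # assumed bin edges
    return 1 + sum(rate < edge for edge in bins)

# Agreement between the LLM's rating and the rating derived from global data.
agreement = 100 * mean(
    q["llm_difficulty"] == difficulty_from_global_rate(q["global_correct_rate"])
    for q in questions
)
print(f"accuracy {accuracy:.1f}%, difficulty agreement {agreement:.1f}%")
```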