Congress: ECR25
Poster Number: C-25294
Type: Poster: EPOS Radiologist (scientific)
Authorblock: J. F. Ojeda Esparza1, D. Botta1, A. Fitisiori1, C. Santarosa1, M. Pucci1, C. Meinzer2, Y-C. Yun2, K-O. Loevblad1, F. T. Kurz1; 1Geneva/CH, 2Heidelberg/DE
Disclosures:
Jose Federico Ojeda Esparza: Nothing to disclose
Daniele Botta: Nothing to disclose
Aikaterini Fitisiori: Nothing to disclose
Corrado Santarosa: Nothing to disclose
Marcella Pucci: Nothing to disclose
Clara Meinzer: Nothing to disclose
Yeong-Chul Yun: Nothing to disclose
Karl-Olof Loevblad: Nothing to disclose
Felix T Kurz: Nothing to disclose
Keywords: Artificial Intelligence, CNS, Catheter arteriography, CT, MR, Computer Applications-General, Diagnostic procedure, Technology assessment, Education and training, Image verification
Results

Accuracies were calculated for each question across all groups: neuroradiologists, trainee radiologists, large language models (LLMs), and global response data (Fig. 5). Neuroradiologists achieved the highest accuracy (0.911 ± 0.02), significantly outperforming all other groups. In contrast, LLM-GAG showed the lowest accuracy (0.50 ± 0.03), close to chance level, while LLM-GPT (0.64 ± 0.05) and the global response data (0.69 ± 0.18) performed at comparable, intermediate levels. Trainee radiologists achieved moderate accuracy (0.57 ± 0.04), but with notably higher variability than the expert radiologists (Fig. 5).
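For illustration only, a minimal Python sketch of how per-group accuracy and its spread can be summarized is shown below; the group labels, simulated correctness flags, and use of the standard error are assumptions for this example and do not reproduce the study's actual analysis code or data.

    # Illustrative sketch with simulated correctness flags (0 = wrong, 1 = correct);
    # the real analysis used the recorded responses per question and responder group.
    import numpy as np

    rng = np.random.default_rng(0)
    responses = {
        "neuroradiologists": rng.binomial(1, 0.91, size=100),
        "trainee radiologists": rng.binomial(1, 0.57, size=100),
        "LLM-GPT": rng.binomial(1, 0.64, size=100),
        "LLM-GAG": rng.binomial(1, 0.50, size=100),
    }

    for group, flags in responses.items():
        mean_acc = flags.mean()                        # proportion of correct answers
        sem = flags.std(ddof=1) / np.sqrt(len(flags))  # standard error of the mean
        print(f"{group}: {mean_acc:.3f} ± {sem:.3f}")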

The global response data represent the success rate based on the total number of attempts recorded for each question included in our study; this information was retrieved from Radiopaedia.org at the time of data collection. Global responses were characterized by high variability, with an average of 1640 ± 1285 attempts per question. This heterogeneity complicated direct comparisons with the other groups: because the number of attempts varied widely across questions, the resulting accuracy estimates had wide confidence intervals and reduced precision, which could introduce bias and limits their interpretability [8].
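The precision argument can be made concrete with a short sketch (hypothetical attempt counts and a normal-approximation interval, not the study's data): the confidence interval around a question's global success rate narrows as the number of attempts grows, so questions with few attempts contribute noisier estimates.

    # Sketch with hypothetical counts: a 95% normal-approximation confidence
    # interval for a question's global success rate narrows as attempts increase.
    import math

    def accuracy_ci(correct, attempts, z=1.96):
        p = correct / attempts
        half_width = z * math.sqrt(p * (1 - p) / attempts)
        return p, (p - half_width, p + half_width)

    for correct, attempts in [(70, 100), (700, 1000), (2100, 3000)]:
        p, (lo, hi) = accuracy_ci(correct, attempts)
        print(f"{attempts} attempts: accuracy {p:.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")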

Because the data were not normally distributed, we used the Kruskal-Wallis test, which yielded an H-statistic of 355.48 and a p-value of 1.15 × 10⁻⁷⁵, confirming a significant difference between groups (p < 0.005). Post-hoc analysis with Holm’s correction showed that neuroradiologists performed significantly better than all other groups (p < 0.001). LLM-GPT and the global response data did not differ significantly from each other, while both significantly outperformed LLM-GAG (p < 0.001). Trainee radiologists also differed significantly in accuracy from all other groups (p < 0.001), except for LLM-GPT, where no significant difference was found (Fig. 6).
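For reference, this type of analysis can be reproduced with standard libraries. The sketch below is an assumption-laden illustration: it uses simulated correctness flags and pairwise Mann-Whitney U tests as the post-hoc comparison (the study does not specify the pairwise test), followed by Holm correction of the resulting p-values.

    # Sketch: Kruskal-Wallis omnibus test followed by Holm-corrected pairwise
    # comparisons (here Mann-Whitney U tests on simulated correctness flags).
    from itertools import combinations
    import numpy as np
    from scipy.stats import kruskal, mannwhitneyu
    from statsmodels.stats.multitest import multipletests

    rng = np.random.default_rng(0)
    groups = {
        "neuroradiologists": rng.binomial(1, 0.91, size=100),
        "trainees": rng.binomial(1, 0.57, size=100),
        "LLM-GPT": rng.binomial(1, 0.64, size=100),
        "LLM-GAG": rng.binomial(1, 0.50, size=100),
    }

    # Omnibus test across all groups
    h_stat, p_omnibus = kruskal(*groups.values())
    print(f"Kruskal-Wallis: H = {h_stat:.2f}, p = {p_omnibus:.2e}")

    # Pairwise comparisons, then Holm correction for multiple testing
    pairs, raw_p = [], []
    for (name_a, a), (name_b, b) in combinations(groups.items(), 2):
        _, p = mannwhitneyu(a, b, alternative="two-sided")
        pairs.append((name_a, name_b))
        raw_p.append(p)

    reject, p_adj, _, _ = multipletests(raw_p, alpha=0.05, method="holm")
    for (a, b), p, sig in zip(pairs, p_adj, reject):
        print(f"{a} vs {b}: Holm-adjusted p = {p:.3g}, significant = {sig}")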

In terms of difficulty classification, the average Likert-scale ratings for question difficulty were 2.86 ± 0.75 for LLM-GPT and 3.25 ± 0.53 for LLM-GAG. The classification derived from the global accuracy data showed an average score of 2.71 after mapping global response accuracy rates to corresponding values on a Likert scale (Fig. 7). Spearman's rank correlation coefficient (Spearman's rho) was used to determine the correlation between the Likert-scale ratings of LLM-GPT and LLM-GAG and the mapped values of the global responses. The analysis revealed weak and non-significant relationships between the LLMs’ perceived difficulty and the actual global accuracy rate (Spearman’s rho for GPT = -0.048, p = 0.629; for Gemini = -0.016, p = 0.872). This suggests that although the LLMs tended to rate questions as more difficult on average, their assessments did not strongly align with the empirical difficulty derived from the global response data [8].
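As a sketch of this correlation step only: the simulated arrays below and the equal-width binning used to map accuracy onto a 1-5 difficulty scale are assumptions for illustration, since the study's exact mapping is not detailed here.

    # Sketch: map per-question global accuracy to a 1-5 Likert-style difficulty
    # score (high accuracy -> low difficulty) and correlate it with LLM ratings.
    import numpy as np
    from scipy.stats import spearmanr

    rng = np.random.default_rng(0)
    global_accuracy = rng.uniform(0.2, 0.95, size=100)   # per-question success rate
    llm_difficulty = rng.integers(1, 6, size=100)        # LLM Likert rating (1-5)

    # Equal-width bins: accuracy >= 0.8 -> difficulty 1, accuracy < 0.2 -> difficulty 5.
    mapped_difficulty = 5 - np.digitize(global_accuracy, [0.2, 0.4, 0.6, 0.8])

    rho, p_value = spearmanr(llm_difficulty, mapped_difficulty)
    print(f"Spearman's rho = {rho:.3f}, p = {p_value:.3f}")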

 
