Congress: ECR25
Poster Number: C-17761
Type: Poster: EPOS Radiologist (scientific)
DOI: 10.26044/ecr2025/C-17761
Authorblock: N. Vrancsik, V. Shabani, M. Allaria, P-A. A. Poletti, K-O. Loevblad, F. Kurz, M. Scheffler; Geneva/CH
Disclosures:
Nóra Vrancsik: Nothing to disclose
Venera Shabani: Nothing to disclose
Marie Allaria: Nothing to disclose
Pierre-Alexandre Aloïs Poletti: Nothing to disclose
Karl-Olof Loevblad: Nothing to disclose
Felix Kurz: Nothing to disclose
Max Scheffler: Nothing to disclose
Keywords: eHealth, Experimental, Computer Applications-Detection, diagnosis, Workforce
Methods and materials

Study design

Fig 1: Study design

A total of 54 multiple-choice questions (MCQs) of the “single select” type were retrieved from Radiopaedia.org’s quiz section, with permission from the Editor. The dataset comprised equal numbers of questions published before April 2023 and from April 2023 onwards (2 × 27 questions; the cutoff coincides with the end of GPT’s training phase). The questions were chosen across different subspecialties, including gastroenterology, oncology, and gynaecology/paediatrics. Each question provided clinical context and 4-5 response options.
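The split around the training cutoff can be sketched as a simple partition by publication date. This is only an illustration: the exact cutoff day (1 April 2023), the record layout and the question identifiers are assumptions, not details given in the poster.

```python
from datetime import date

# Assumption: "before April 2023" is read as strictly before 1 April 2023.
CUTOFF = date(2023, 4, 1)

def split_by_cutoff(questions):
    """Split (question_id, publication_date) pairs into pre- and post-cutoff groups."""
    before = [qid for qid, published in questions if published < CUTOFF]
    after = [qid for qid, published in questions if published >= CUTOFF]
    return before, after

# Hypothetical records for illustration only; these are not the study's questions.
example = [
    ("q01", date(2022, 11, 3)),
    ("q02", date(2023, 3, 31)),
    ("q03", date(2023, 4, 1)),
    ("q04", date(2024, 1, 15)),
]
before, after = split_by_cutoff(example)
print("pre-cutoff:", before)
print("post-cutoff:", after)
```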

Table 1: Quiz questions characteristics

Four readers answered the questions: two junior residents and two board-certified radiologists. Readers were blinded to the others’ answers and responded independently within a 90-second time limit per question, without access to external resources.

Two LLMs with image-processing capabilities were tested:
- GPT-4o (OpenAI, San Francisco, CA, USA), with and without internet access in June 2024, and with internet access in September 2024
- Claude 3.5 Sonnet (Anthropic, San Francisco, CA, USA), an LLM that inherently does not have internet access, in September 2024

LLM test protocol:
1. Initial assessment (July 2024): We evaluated the online and offline versions of GPT-4o on Radiopaedia questions published before April 2023 and on questions posted thereafter, to assess whether exposure to the questions during the model's training phase might have influenced performance. An identical prompt template was used for all 54 questions, requesting an answer while transmitting the question text and the relevant image.
2. Comprehensive assessment (September 2024), AI vs. AI: We compared the then-updated GPT-4o with Claude 3.5 Sonnet on the same list of quiz questions. In this second phase we developed two further templates with additional questions, to gain insight into the mechanisms of the AI’s decision-making. In addition to resubmitting the original template, we added (1) a template for isolated image analysis, asking the model to name the modality, anatomical structures and pathology on the image; and (2) a template submitting the question text only, asking for an answer based only on the text part of the MCQ.
All AI evaluations were performed with zero-shot prompting, opening a new chat session for each question to minimize memory retention bias.
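The three templates (text+image, image-only, text-only) could be organised roughly as in the sketch below. The poster does not reproduce the actual prompt wording, so every prompt string, the `Question` class and the `build_request` helper are hypothetical. No model API is called here; in practice each returned request would be submitted in a fresh chat session.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Question:
    text: str          # clinical context and question stem
    options: tuple     # 4-5 answer options
    image_path: str    # path to the accompanying image

def build_request(question, template):
    """Return (prompt, image_path or None) for one of the three templates.

    All prompt wording below is an assumption made for illustration.
    """
    option_lines = "\n".join(
        f"{chr(ord('A') + i)}. {opt}" for i, opt in enumerate(question.options)
    )
    if template == "text+image":
        prompt = f"{question.text}\n{option_lines}\nAnswer with a single letter."
        return prompt, question.image_path
    if template == "image-only":
        prompt = ("Name the imaging modality, the anatomical structures "
                  "and the pathology shown in the image.")
        return prompt, question.image_path
    if template == "text-only":
        prompt = (f"{question.text}\n{option_lines}\n"
                  "Answer with a single letter, based on the text alone.")
        return prompt, None
    raise ValueError(f"unknown template: {template}")

# Hypothetical question, not taken from the study dataset.
q = Question(
    text="A 45-year-old presents with right upper quadrant pain. What is the most likely diagnosis?",
    options=("Option one", "Option two", "Option three", "Option four"),
    image_path="case_001.png",
)
for name in ("text+image", "image-only", "text-only"):
    prompt, image = build_request(q, name)
    print(name, "| image attached:", image is not None)
```

Keeping the three variants behind one function makes it easy to check that the text-only variant never attaches the image and that the image-only variant never leaks the answer options.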

Statistical analyses:
- ANOVA was used to test for significant differences between answers to the same question across the three templates (text+image, image-only, text-only).
- The chi-squared test was applied to contingency tables to determine whether there was a significant association between variables.
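As an illustration of the contingency-table analysis, the sketch below computes Pearson's chi-squared statistic for a table of counts using only the Python standard library. The poster does not state which software or corrections were used, so this is not the authors' procedure; no continuity correction is applied, and the counts in the example are invented and are not study results.

```python
import math

def chi_squared(table):
    """Pearson chi-squared statistic and degrees of freedom for an r x c table.

    Assumes every expected count is positive (no empty row or column).
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof

def p_value_1dof(stat):
    """Upper-tail p-value of the chi-squared distribution with 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2.0))

# Hypothetical 2 x 2 table: rows = question period (pre-/post-cutoff),
# columns = (correct, incorrect). Invented counts, 27 questions per row.
table = [[20, 7],
         [15, 12]]
stat, dof = chi_squared(table)
p = p_value_1dof(stat)
print(f"chi2 = {stat:.3f}, dof = {dof}, p = {p:.3f}")
```

The closed-form p-value holds only for one degree of freedom (a 2 x 2 table); larger tables would need the general chi-squared survival function from a statistics library.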
