Congress: ECR25

Poster Number: C-17761

Type: Poster: EPOS Radiologist (scientific)

DOI: 10.26044/ecr2025/C-17761

Authorblock: N. Vrancsik, V. Shabani, M. Allaria, P-A. A. Poletti, K-O. Loevblad, F. Kurz, M. Scheffler; Geneva/CH

Disclosures:

Nóra Vrancsik: Nothing to disclose

Venera Shabani: Nothing to disclose

Marie Allaria: Nothing to disclose

Pierre-Alexandre Aloïs Poletti: Nothing to disclose

Karl-Olof Loevblad: Nothing to disclose

Felix Kurz: Nothing to disclose

Max Scheffler: Nothing to disclose

Keywords: eHealth, Experimental, Computer Applications-Detection, diagnosis, Workforce

Methods and materials

Study design

Fig 1: Study design

A total of 54 multiple-choice questions (MCQ) of the “single select” type were retrieved from Radiopaedia.org’s quiz section, with permission from the Editor. The dataset comprised equal numbers of questions published before April 2023 and from April 2023 onwards (2 x 27 questions); this cutoff coincides with the end of GPT’s training phase. The questions covered different subspecialties, including gastroenterology, oncology, and gynaecology/paediatrics. Each question provided clinical context and 4-5 response options.

Table 1: Quiz questions characteristics

Four readers answered the questions: two junior residents and two board-certified radiologists. Readers were blinded to the others’ answers and responded independently within a 90-second time limit per question, without access to external resources.

Two LLMs with image processing capabilities were tested:
- GPT-4o (OpenAI, San Francisco, CA, USA), with and without internet access in June 2024, and with internet access in September 2024
- Claude 3.5 Sonnet (Anthropic, San Francisco, CA, USA), an LLM that inherently does not have internet access, in September 2024

LLM test protocol:
1. Initial assessment (July 2024): The online and offline versions of GPT-4o were evaluated on Radiopaedia questions published before April 2023 and on questions posted thereafter, to assess whether exposure to the older questions during the model’s training phase might have influenced performance. An identical prompt template was used for all 54 questions, requesting an answer whilst transmitting the question text and the relevant image.
2. Comprehensive assessment (September 2024), AI vs. AI: The then-updated GPT-4o was compared with Claude 3.5 Sonnet on the same list of quiz questions. In this second phase we developed two templates with additional questions, to gain further insight into the mechanisms of the AI’s decision-making. In addition to resubmitting the original template, we added (1) a template for isolated image analysis, asking the model to name the modality, anatomical structures and pathology on the image; and (2) a template submitting the question text only, asking for an answer based only on the text part of the MCQ.

All AI evaluations were performed with zero-shot prompting, with a new chat session opened for each question to minimize memory retention bias.
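The three prompt variants described in the protocol can be illustrated with a short Python sketch. This is only an assumption about how such templates might be assembled: the exact wording of the templates is not reproduced in this poster, so the prompt strings, the class names (QuizQuestion, Prompt) and the function build_prompts are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class QuizQuestion:
    """One single-select MCQ with its accompanying image (hypothetical structure)."""
    text: str
    options: Tuple[str, ...]
    image_path: str


@dataclass(frozen=True)
class Prompt:
    """What would be sent to the model in ONE fresh chat session."""
    template: str               # "text+image", "image-only" or "text-only"
    text: Optional[str]         # prompt text
    image_path: Optional[str]   # None when no image is transmitted


def build_prompts(q: QuizQuestion) -> list:
    """Assemble the three variants; all prompt wording here is illustrative only."""
    option_lines = "\n".join(
        f"{chr(65 + i)}. {opt}" for i, opt in enumerate(q.options)
    )
    mcq = f"{q.text}\n{option_lines}"
    return [
        # Original template: question text plus the relevant image.
        Prompt("text+image", mcq + "\nAnswer with a single option letter.", q.image_path),
        # Isolated image analysis: modality, anatomy and pathology, no question text.
        Prompt(
            "image-only",
            "Name the modality, the anatomical structures and the pathology on this image.",
            q.image_path,
        ),
        # Text part of the MCQ only, no image.
        Prompt("text-only", mcq + "\nAnswer based on the text alone.", None),
    ]
```

Each Prompt is meant to be submitted in its own new chat session, matching the zero-shot set-up described above; the call to the model itself is deliberately left out of the sketch.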

Statistical analyses:
- ANOVA was used to search for significant differences between answers to the same question when using the three templates (text+image, image-only, text-only).
- The chi-squared test was applied to contingency tables to determine whether there was a significant association between variables.
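As an illustration of the chi-squared test on a contingency table, the following minimal Python sketch computes Pearson's statistic without continuity correction. The counts are placeholders chosen for the arithmetic and are not results from this study; the function names are hypothetical, and the statistical software actually used is not stated in this poster.

```python
import math


def chi_squared(table):
    """Pearson chi-squared statistic and degrees of freedom for an r x c table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof


def p_value_one_dof(stat):
    """Upper-tail p-value, valid only for 1 degree of freedom (2 x 2 tables)."""
    return math.erfc(math.sqrt(stat / 2.0))


# Placeholder counts (NOT study data): rows = two responders,
# columns = correct / incorrect answers.
table = [[10, 20], [20, 10]]
stat, dof = chi_squared(table)
p = p_value_one_dof(stat)
print(f"chi2 = {stat:.3f}, dof = {dof}, p = {p:.4f}")
```

For larger tables the p-value would need the general chi-squared distribution, which a statistics library provides; the closed form used here holds only for one degree of freedom.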
