Congress: ECR25
Poster Number: C-28099
Type: Poster: EPOS Radiologist (scientific)
Authorblock: D. Pisarcik, M. Kissling, J. Heimer, R. Kubik-Huch, A. Euler; Baden/CH
Disclosures:
Dusan Pisarcik: Other: Sirius Medical
Marc Kissling: Nothing to disclose
Jakob Heimer: Nothing to disclose
Rahel Kubik-Huch: Nothing to disclose
Andre Euler: Speaker: Siemens
Keywords: Artificial Intelligence, Mammography, Ultrasound, Biopsy, Cancer, Neoplasia
Results

Pairwise comparisons derived from the model across question categories showed that the odds of receiving a higher rating were significantly greater for ChatGPT-4 and ChatGPT-4o than for Google Gemini (p < .001 for all comparisons).

Fig 2: In BI-RADS, ChatGPT-4 still outperformed Google Gemini (p = 0.001), while its difference from ChatGPT-4o was not statistically significant (p = 0.526).
Table 1: Pairwise comparisons across BI-RADS categories revealed that ChatGPT-4 was rated significantly higher than both ChatGPT-4o and Google Gemini (p < .001 for all contrasts).

The Plackett-Luce model indicated that ChatGPT-4o had the highest probability of being preferred (probability = 0.48), followed by ChatGPT-4 (0.37); Google Gemini had the lowest preference probability (0.15). Most participants ranked ChatGPT-4o or ChatGPT-4 first, whereas Google Gemini was rarely ranked highest.
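For context, a minimal sketch of the formula behind these figures: in a Plackett-Luce model each system has a worth parameter w_i, and the probability of being ranked first is its worth divided by the sum of all worths. Assuming the reported preference probabilities correspond to the normalized worths (an assumption for illustration, not stated in the poster), the implied head-to-head preference of ChatGPT-4o over Google Gemini would be roughly:

P(\text{ranked first}) = \frac{w_i}{w_{\text{GPT-4}} + w_{\text{GPT-4o}} + w_{\text{Gemini}}}, \qquad P(\text{ChatGPT-4o} \succ \text{Gemini}) = \frac{0.48}{0.48 + 0.15} \approx 0.76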

Fig 3: Pairwise comparisons showed that Google Gemini was significantly less preferred than both ChatGPT-4 (p < .001) and ChatGPT-4o (p < .001), while the difference between ChatGPT-4 and ChatGPT-4o was not significant (p = 0.14). Rank distribution analysis showed that ChatGPT-4o was most frequently assigned the highest rank, followed by ChatGPT-4, and Google Gemini was predominantly ranked lower across all categories.

Our findings demonstrate a strong preference for the AI-translated texts over the original radiologist reports.

Fig 4: Aggregated over all BI-RADS categories (i.e., 120 responses per AI per question category), the two most favorable options in the Procedure category (“better” and “much better”) were chosen 34 and 73 times, respectively, for ChatGPT-4 (a total of 107 favorable responses, about 89%), 47 and 64 times for ChatGPT-4o (111 favorable responses, approximately 93%), and 62 and 34 times for Google Gemini (96 favorable responses, or 80%).
Table 2: Distribution of selected options by question category and AI, on a 5-point grading scale: 1 = significantly worse, 2 = worse, 3 = about the same, 4 = better, and 5 = significantly better.
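As a quick cross-check of the shares quoted in the Fig 4 caption, the following is a minimal sketch (not part of the poster's analysis) that recomputes the favorable-response percentages from the counts given above, assuming 120 responses per AI in the Procedure category as stated in the caption:

```python
# Counts of "better" / "much better" responses in the Procedure category,
# taken from the Fig 4 caption above.
counts = {
    "ChatGPT-4":     {"better": 34, "much better": 73},
    "ChatGPT-4o":    {"better": 47, "much better": 64},
    "Google Gemini": {"better": 62, "much better": 34},
}
TOTAL = 120  # responses per AI per question category (from the caption)

for model, c in counts.items():
    favorable = c["better"] + c["much better"]
    # e.g. ChatGPT-4: 107/120 favorable responses (89.2%)
    print(f"{model}: {favorable}/{TOTAL} favorable responses ({favorable / TOTAL:.1%})")
```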

Analysis of secondary covariates showed that participants in age groups 3 and 4 (50–69 years and older than 70 years, respectively) had significantly higher odds of assigning higher ratings to the AI-translated reports than participants in age group 1 (18–29 years).

Fig 5: Significantly higher odds of assigning higher ratings to the AI translations were also seen in participants with no prior mammography experience. Participants with secondary or tertiary education (based on the Swiss school system) showed significantly lower odds of assigning higher ratings than those with primary education. No statistically significant effect was observed for age group 2 (30–49 years).
Table 3: Demographic Characteristics of the Study Population (N = 40).

On the last page of the questionnaire, the following question was asked: “In your opinion, could AI-based text versions of radiological findings be pursued further for patients?” Of the 40 study participants:

  • 27 answered "yes, definitely"
  • 11 answered "rather yes"
  • 1 participant was undecided
  • 0 participants were against it
  • 1 participant did not answer this question
