A total of 104 image-based multiple-choice questions (MCQs) from Radiopaedia.org were analyzed, covering various neuroradiological imaging modalities, including CT, MRI, DSA, and conventional radiography (Fig. 1). Each question was presented five times to each of the two large language models (LLMs) to assess the consistency of their responses. Additionally, global response data from Radiopaedia.org were collected as a comparative benchmark [8].
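As an illustration only, the following minimal sketch shows how such a five-repetition protocol could be run programmatically against the OpenAI API. The actual querying setup used in the study (web interface versus API), the model snapshot, the prompt wording, and the file names are not specified above and are assumptions here.

```python
import base64
from collections import Counter

from openai import OpenAI  # assumes the OpenAI Python SDK is installed

client = OpenAI()

def ask_mcq(image_path: str, question_text: str, model: str = "gpt-4-turbo") -> str:
    """Send one image-based MCQ to the model and return its raw answer."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()
    response = client.chat.completions.create(
        model=model,  # placeholder model name; the exact GPT-4 variant is an assumption
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": question_text},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content.strip()

# Present the same question five times and record every answer,
# so response consistency across repetitions can be examined.
answers = [
    ask_mcq("case_001.png",  # hypothetical file name
            "Which is the most likely diagnosis? A) ... B) ... C) ... D) ...")
    for _ in range(5)
]
print(Counter(answers))
```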
The study included both human participants and artificial intelligence models. The human participants comprised three expert radiologists (experienced neuroradiologists at a university hospital, each with more than three years of clinical practice) and two trainee radiologists from a general radiology residency program. The large language models assessed were GPT-4 (OpenAI), a multimodal system capable of interpreting both text and images [3], and Gemini 1.5 (Google), another advanced multimodal model with similar capabilities [4]. The LLMs were evaluated for factual accuracy and were additionally tasked with classifying the difficulty of each question on a Likert scale (Fig. 3).
Performance metrics focused primarily on accuracy, defined as the percentage of correctly answered questions, which was calculated separately for the LLMs and for the human participants. The difficulty classification of each question was also analyzed by comparing the LLMs' ratings with Likert-scale classifications derived from global response data. Aggregated global response data from Radiopaedia.org users served as an additional reference for understanding performance variability across a broader cohort [8].
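The following minimal sketch illustrates how accuracy and the agreement between model-assigned difficulty ratings and ratings derived from global response data could be computed. The per-question data structure, the example values, and the bin edges mapping global correct-response rates onto a 1-5 Likert scale are illustrative assumptions; the study's actual mapping is not specified above.

```python
from statistics import mean

# Hypothetical per-question records: whether the LLM answered correctly,
# its Likert difficulty rating (1 = easy, 5 = hard), and the share of
# Radiopaedia.org users who answered correctly (global response data).
questions = [
    {"llm_correct": True,  "llm_difficulty": 2, "global_correct_rate": 0.81},
    {"llm_correct": False, "llm_difficulty": 4, "global_correct_rate": 0.35},
    # ... one record per MCQ
]

# Accuracy: percentage of correctly answered questions.
accuracy = 100 * mean(q["llm_correct"] for q in questions)

def difficulty_from_global_rate(rate: float) -> int:
    """Map a global correct-response rate onto a 1-5 Likert difficulty,
    treating lower rates as harder questions (assumed cut-offs)."""
    bins = [0.80, 0.60, 0.40, 0.20]  # assumed bin edges
    return 1 + sum(rate < edge for edge in bins)

# Agreement between the LLM's rating and the rating derived from global data.
agreement = 100 * mean(
    q["llm_difficulty"] == difficulty_from_global_rate(q["global_correct_rate"])
    for q in questions
)
print(f"accuracy {accuracy:.1f}%, difficulty agreement {agreement:.1f}%")
```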