Expert neuroradiologists outperformed all other groups, although the global (pooled) responses appeared more accurate than those of the residents, possibly owing to variability in response volume. LLM-GPT achieved higher accuracy than the residents, whereas LLM-GAG showed the lowest performance, underscoring the need for further improvement of AI tools. These findings are consistent with previous studies evaluating the accuracy of large language models (LLMs) in medical question-answering tasks [5–7]. In this study, the inclusion of images in the questions did not appear to affect the models' success rates, either positively or negatively.
The performance difference between LLM-GPT and LLM-GAG may be attributable to LLM-GPT's longer development period; LLM-GAG is a more recent release, and this difference likely affects the models' maturity and optimization.
It should be noted that the small number of expert and resident participants may limit the generalizability of these results.