Congress: ECR25
Poster Number: C-24137
Type: Poster: EPOS Radiologist (scientific)
Authorblock: D. Männle, M. Langhals, N. Santhanam, C. G. Cho, H. Wenz, C. Groden, F. Siegel, M. E. Maros; Mannheim/DE
Disclosures:
David Männle: Nothing to disclose
Martina Langhals: Nothing to disclose
Nandhini Santhanam: Nothing to disclose
Chang Gyu Cho: Nothing to disclose
Holger Wenz: Nothing to disclose
Christoph Groden: Nothing to disclose
Fabian Siegel: Nothing to disclose
Máté Elöd Maros: Consultant (non-related consultancy): EppData GmbH; Siemens Healthineers AG
Keywords: Artificial Intelligence, Computer applications, Neuroradiology brain, CT, CT-Angiography, RIS, Computer Applications-General, Technology assessment, Ischaemia / Infarction
Purpose
Large language models (LLMs) are widely recognized for their ability to encode clinical knowledge and to summarize medical texts effectively. Despite these capabilities, the optimal strategies for prompting or fine-tuning these models for specific tasks remain poorly understood, particularly for non-English corpora. To address this gap, we conducted a systematic investigation of various in-context learning (ICL) strategies. Our study focused on evaluating a broad and diverse set of state-of-the-art open-source large...
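For illustration, the sketch below shows how a k-shot ICL prompt can be assembled: a role/task instruction followed by k solved examples and the query report. All templates, field names, and the summarization framing are assumptions for this sketch; the abstract does not show the study's actual prompts.

```python
# Minimal sketch of k-shot in-context learning (ICL) prompt construction.
# Templates and dict keys are illustrative assumptions, not the study's prompts.

def build_icl_prompt(task_instruction: str, examples: list[dict], query_report: str) -> str:
    """Assemble a k-shot prompt: instruction, k solved examples, then the query."""
    parts = [task_instruction]
    for ex in examples:  # k = 0, 1, 5, or 10 examples, as in the study design
        parts.append(f"Report:\n{ex['report']}\nSummary:\n{ex['summary']}")
    parts.append(f"Report:\n{query_report}\nSummary:")
    return "\n\n".join(parts)

# Hypothetical usage with a single solved example (1-shot):
prompt = build_icl_prompt(
    task_instruction="You are a neuroradiologist. Summarize the findings of the CT report.",
    examples=[{"report": "Non-contrast cranial CT ...", "summary": "No acute infarct demarcation ..."}],
    query_report="CT and CT angiography of the head ...",
)
print(prompt)
```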
Methods and materials
For this study, a random cohort of 206 German stroke CT reports was retrieved from the local RIS/PACS (01/2015-12/2023). A stratified random split (90%/10%) using ASPECT score [y|n], CTA/CTP [y|n], and sex [F|M] was performed to create training-validation (n=185) and test (n=21) sets. The former was further stratified into training (n=160) and validation (n=25) sets. ICL approaches were compared using 10 runs each of 0-shot (validation fit/run only), 1-, 5-, and 10-shot prompts with disjunct training examples (n_runs=1+30; n_sum_examples=160). State-of-the-art OS-LLMs (n=8) including gemma2[:2b;9b], llama3.1[:8b;70b]|-3.2[3b],...
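A minimal sketch of the stratified splitting step described above, assuming the reports live in a pandas DataFrame; the file name and column names are hypothetical, and scikit-learn's train_test_split stands in for whatever tooling the authors actually used.

```python
# Sketch of the stratified 90%/10% split into training-validation (n=185)
# and test (n=21) sets, then training (n=160) vs. validation (n=25).
# Column names ("aspects", "cta_ctp", "sex") are assumptions.
import pandas as pd
from sklearn.model_selection import train_test_split

reports = pd.read_csv("stroke_ct_reports.csv")  # hypothetical input file

# Combine the three binary stratification variables into one joint stratum
# label (assumes each joint stratum has enough members to split).
strata = (
    reports["aspects"].astype(str)
    + "_" + reports["cta_ctp"].astype(str)
    + "_" + reports["sex"].astype(str)
)

# 90% training-validation (n=185), 10% test (n=21).
trainval, test = train_test_split(reports, test_size=0.10, stratify=strata, random_state=42)

# Further stratify training-validation into training (n=160) and validation (n=25).
train, val = train_test_split(trainval, test_size=25, stratify=strata.loc[trainval.index], random_state=42)
```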
Results
Of 206 cases, 99 were female (48.0%; median age: F=79.7 vs. M=73.2 yrs; range=21.7-95.9 yrs; p=7.5x10^-4). Non-contrast cranial CT alone was performed in 155 (75.2%); the remaining cases additionally received CTA (n=47; 22.8%) and/or CTP (n=21; 10.2%). The median word counts of findings and impressions were 142 (range=5-473) and 30 (range=5-184), respectively, corresponding to ~600-800 tokens per report and thus limiting the maximal context to ~10-12 reports for compatibility with older-generation LLMs. Overall, 12,400 (8x2x31x25) configurations of LLM-ICL strategies were validated. The best overall test performance was achieved by llama3.1[:70b] and mixtral[8x7b].
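The stated ~10-12-report limit follows from simple token arithmetic. The sketch below makes the numbers explicit under an assumed ~8,192-token context window for older-generation OS-LLMs (the abstract does not name a specific limit) and also checks the reported configuration count, reading 8x2x31x25 as models x prompt languages x runs x validation reports (a plausible interpretation, not stated explicitly).

```python
# Back-of-envelope check of the context-length constraint stated above.
# The 8,192-token window and 500-token instruction overhead are assumptions;
# the abstract only gives ~600-800 tokens per report.
CONTEXT_WINDOW = 8192
PROMPT_OVERHEAD = 500

for tokens_per_report in (600, 800):
    max_reports = (CONTEXT_WINDOW - PROMPT_OVERHEAD) // tokens_per_report
    print(f"{tokens_per_report} tokens/report -> ~{max_reports} reports fit in context")
# 600 -> 12, 800 -> 9: consistent with the stated ~10-12 report maximum.

# Configuration count: 8 models x 2 prompt languages (EN/DE, assumed)
# x 31 runs (1 zero-shot + 30 few-shot) x 25 validation reports.
assert 8 * 2 * 31 * 25 == 12_400
```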
Conclusion
The performance of open-source large language models (OS-LLMs) improved clearly and consistently with increasing model size and with more recent model versions. Providing a greater number of in-context learning examples, particularly in the 10-shot setting, further enhanced performance. Moreover, role and task definitions formulated in English consistently outperformed those presented in German.
References
Brown, T. B., et al.: "Language Models Are Few-Shot Learners." arXiv, 28 May 2020, https://doi.org/10.48550/arXiv.2005.14165.
Li, C., et al.: "Large Language Models Understand and Can Be Enhanced by Emotional Stimuli." arXiv, 14 July 2023, https://doi.org/10.48550/arXiv.2307.11760.
Basyal, L.: "Text Summarization Using Large Language Models: A Comparative Study of MPT-7b-instruct, Falcon-7b-instruct, and OpenAI Chat-GPT Models." arXiv, 17 October 2023, https://doi.org/10.48550/arXiv.2310.10449.
López-Úbeda, P., et al.: "Evaluation of large language models performance against humans for summarizing MRI knee radiology reports: A feasibility study." International Journal of Medical Informatics...