Background:
The Generative Pre-trained Transformer (GPT), popularized through ChatGPT, is an artificial intelligence (AI) language model developed by OpenAI, USA. It is built on the Transformer architecture, conceptualized and introduced by Vaswani et al. in 2017: a type of neural network capable of processing sequential text data. Training such a network involves exposing it to extremely large volumes of text, which renders it capable of generating natural language. However, there always remains scope for further training and enhanced capability.
OpenAI's GPT-4.0 and its predecessor GPT-3.5 have both been explored fairly extensively for translating reports into regional and simplified language, with the aim of improving patient comprehension. Interestingly, a recent RSNA publication reported that GPT-4.0 performed well on imaging-independent interpretations but poorly on imaging-dependent ones. Google's Gemini 1.0 is another popular LLM, reportedly capable of radiology tasks. However, the capability of LLMs to propose a final radiological diagnosis and direct further management from the standard structured text of radiology reports remains a relatively underexplored area, with only a handful of white papers on this aspect.
The "temperature" (T) of an LLM is a parameter that determines the nature of its output: whether it will be constrained or liberal. Setting the temperature too low may limit outputs to very conservative answers, while setting it too high may result in "confabulation". Notably, the ideal operating temperature for a given LLM has not yet been standardized by experts in the field; identifying it would therefore be relevant, especially for radiological applications. An extensive internet search did not yield any publications reporting the effect of temperature variations on LLM performance in radiology and imaging applications.
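Mechanistically, temperature rescales the model's token logits before the softmax step: probabilities are computed as softmax(logits / T), so a low T sharpens the distribution toward the most likely token while a high T flattens it toward alternatives. A minimal illustration (the logit values are arbitrary examples, not taken from any model):

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert raw logits to probabilities, scaled by temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Arbitrary example logits for three candidate tokens
logits = [2.0, 1.0, 0.5]

low = softmax_with_temperature(logits, 0.11)   # near-deterministic output
high = softmax_with_temperature(logits, 0.91)  # probability spread more evenly

# At T = 0.11 almost all probability mass sits on the top token;
# at T = 0.91 the alternatives retain substantial probability.
```

This is why the study's low setting (T = 0.11) should yield consistent, conservative answers and the high setting (T = 0.91) more variable ones.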
The aims of the present study were to investigate and compare the accuracy of GPT-3.5-Turbo, GPT-4.0-Turbo and Google's Gemini 1.0 in interpreting the standard structured text of HRCT-thorax reports, in terms of providing both a final radiological diagnosis and management advice, while applying three different temperature settings (T = 0.11, T = 0.51 and T = 0.91) to both tasks. The end-point of this endeavor was not only to identify the best-performing LLM currently available, but also to determine the ideal temperature setting(s) for its operation.
We believe our study is probably among the pioneering endeavors exploring two relatively novel aspects of LLM applications in radiology: first, the accuracy of providing a radiological diagnosis and related management advice; and second, whether variable temperature settings significantly affect performance, especially in the domain of radiology and imaging.
Methods:
This IRB-approved, prospective study was performed on 119 PACS-retrieved, anonymized, standard descriptive-text HRCT reports from the preceding year. Only the descriptive text was exposed to the prompt, with all remaining sections and patient identity carefully masked.
The LLMs were prompted as follows: "You are an expert Radiologist, therefore provide: task-1, radiological diagnosis and task-2, management advice; from the provided standard (structured) text of radiology reports." The performance of all three LLMs was assessed at T = 0.11, T = 0.51 and T = 0.91.
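The prompting workflow can be sketched as follows. Only the prompt assembly and the model-by-temperature loop are shown concretely; `query_llm` is a hypothetical placeholder for the vendor-specific API call (the OpenAI or Google client code would go there), not the study's actual implementation:

```python
SYSTEM_PROMPT = (
    "You are an expert Radiologist, therefore provide: "
    "task-1, radiological diagnosis and task-2, management advice; "
    "from the provided standard (structured) text of radiology reports."
)

TEMPERATURES = [0.11, 0.51, 0.91]

def build_prompt(report_text):
    """Combine the fixed instruction with one anonymized HRCT report."""
    return f"{SYSTEM_PROMPT}\n\nReport:\n{report_text}"

def run_study(reports, models, query_llm):
    """Query every model at every temperature for every report.

    `query_llm(model, prompt, temperature)` is a stand-in for the
    vendor-specific API call; it should return the model's text output.
    """
    outputs = {}
    for model in models:
        for t in TEMPERATURES:
            for i, report in enumerate(reports):
                outputs[(model, t, i)] = query_llm(model, build_prompt(report), t)
    return outputs
```

With 3 models, 3 temperatures and 119 reports, this loop yields 1,071 outputs per task for scoring.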
The performance outputs were scored by an intra-blinded team of four radiologists, comprising a senior faculty member, a junior faculty member, a fellow and an R2 resident. The radiology team acted as surrogate board examiners and scored the performance of each LLM from 0 to 100 at each of the three enumerated temperatures. For task-1, the team assigned, by consensus, Radiological Diagnosis Scores (RDS) as follows: an accurate diagnosis scored 100; a relatively relevant differential, 80; an irrelevant diagnosis, 50; and a relatively wrong diagnosis, between 0 and 10. Similarly, for task-2, the consensus Management Advice Score (MAS) was based on how accurate or relevant the specialist consultations and investigations advised by each LLM were.
The scores from each of the four radiologists were averaged separately for each task and each temperature setting. The average scores for both tasks, for each of the three LLMs at each of the three temperatures, were compiled and analyzed.
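The averaging step can be sketched as follows; the rater labels and score values are illustrative placeholders, not study data:

```python
def average_scores(scores_by_rater):
    """Arithmetic mean of the four radiologists' scores for one output.

    `scores_by_rater` maps rater label -> score on the 0-100 scale.
    """
    return sum(scores_by_rater.values()) / len(scores_by_rater)

# Illustrative RDS entries for a single report at one temperature
example = {"senior": 100, "junior": 80, "fellow": 100, "resident": 80}
avg = average_scores(example)  # -> 90.0
```

One such averaged score per report, per task, per temperature and per LLM forms the unit of the subsequent descriptive analysis.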
Statistical methods: The averaged score for each report was calculated as the arithmetic mean of the four investigators' scores. The averaged scores at the three temperatures for each LLM were subjected to standard descriptive quantitative analysis. The mean, standard error of the mean, 95% confidence interval, standard deviation, median and mode (measures of central tendency) were calculated, along with the range, interquartile range, minimum, maximum, skewness, kurtosis, Shapiro-Wilk W, Shapiro-Wilk p, and the 25th/50th/75th percentiles. Data were collected in Microsoft Excel spreadsheets and analyzed in IBM SPSS.
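A minimal sketch of the descriptive analysis, using only the Python standard library. The 95% CI here uses a normal approximation (z = 1.96), whereas SPSS uses the t distribution, so values will differ slightly for small samples; the Shapiro-Wilk W and p would be obtained separately (e.g. via scipy.stats.shapiro), as that test is not in the standard library:

```python
import math
import statistics as stats

def describe(values):
    """Descriptive statistics analogous to those reported in the study."""
    n = len(values)
    mean = stats.fmean(values)
    sd = stats.stdev(values)                   # sample standard deviation
    sem = sd / math.sqrt(n)                    # standard error of the mean
    q1, q2, q3 = stats.quantiles(values, n=4)  # 25th/50th/75th percentiles
    # Moment-based sample skewness and excess kurtosis
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n
    return {
        "n": n, "mean": mean, "sem": sem,
        "ci95": (mean - 1.96 * sem, mean + 1.96 * sem),
        "sd": sd, "median": stats.median(values), "mode": stats.mode(values),
        "min": min(values), "max": max(values),
        "range": max(values) - min(values), "iqr": q3 - q1,
        "p25": q1, "p50": q2, "p75": q3,
        "skewness": m3 / m2 ** 1.5, "kurtosis": m4 / m2 ** 2 - 3,
    }

# Illustrative averaged RDS values for one LLM at one temperature
summary = describe([80, 90, 100, 80, 50])
```

Running `describe` once per LLM per temperature per task reproduces the structure of the study's descriptive tables.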