Congress: ECR25
Poster Number: C-16080
Type: Poster: EPOS Radiologist (scientific)
Authorblock: B. J. Van Der Zwart1, H. C. Ruitenbeek1, F. J. Bruun2, A. Lenskjold2, M. Boesen2, K. Ziegeler3, K-G. A. Hermann3, E. Oei1, J. J. Visser1; 1Rotterdam/NL, 2Copenhagen/DK, 3Berlin/DE
Disclosures:
Bastiaan Johannes Van Der Zwart: Nothing to disclose
Huibert C Ruitenbeek: Nothing to disclose
Frederik J. Bruun: Nothing to disclose
Anders Lenskjold: Nothing to disclose
Mikael Boesen: Advisory Board: Radiobotics
Katharina Ziegeler: Nothing to disclose
Kay-Geert A. Hermann: Nothing to disclose
Edwin Oei: Nothing to disclose
Jacob Johannes Visser: Nothing to disclose
Keywords: Artificial Intelligence, Musculoskeletal bone, Conventional radiography, Comparative studies, Computer Applications-Detection, diagnosis, Trauma
Methods and materials

Background: Artificial intelligence (AI) models for fracture detection have shown promise in improving diagnostic efficiency and accuracy in radiology workflows. However, the generalizability of these models across different healthcare systems remains uncertain, as performance may vary with regional differences in imaging protocols and clinical workflows.

Data collection: Data were collected retrospectively at three university hospitals: Copenhagen University Hospital Bispebjerg and Frederiksberg, Denmark (BFH); Charité Universitätsmedizin in Berlin, Germany (CUB); and Erasmus Medical Center in Rotterdam, The Netherlands (EMC). Each institution compiled a dataset of 500 consecutively sampled radiographic examinations, including all available projections as well as the corresponding clinical referral notes and radiology reports. If follow-up MRI or CT imaging was performed within two weeks of the initial radiograph, the resulting reports were also collected. Patients eligible for inclusion were those aged 21 years or older with a clinical suspicion of a traumatic fracture who were referred for imaging of the appendicular skeleton and/or pelvis. This encompassed the shoulder, upper arm, forearm, elbow, wrist, hand, fingers, hip, pelvis, upper leg, lower leg, knee, ankle, foot, and toes. The exclusion criteria were follow-up imaging referrals, the presence of surgical hardware or a cast near the suspected fracture site, and radiographs deemed unsuitable for clinical interpretation. Reasons for unsuitability included errors in patient positioning, improper collimation or X-ray exposure, motion artifacts, omission of anatomical side markers, and other image quality issues.

AI tool: The backbone of the AI tool used in this study (RBfracture™; Radiobotics, Copenhagen, Denmark) is a deep convolutional neural network based on the Detectron2 framework, developed as a support tool for detecting fractures in the appendicular skeleton of patients aged 2 years and older. It is CE-marked as an MDR class IIa medical device and is intended for use in a clinical setting as a decision support tool. The input is a radiograph in Digital Imaging and Communications in Medicine (DICOM) format, and the output is a secondary capture image of the radiograph in which each detected fracture is marked with a bounding box. The AI tool was developed using a dataset of more than 320,000 radiographs collected from more than 100 radiology departments across multiple continents. This dataset was split into a training set (80%), a tuning set (10%), and an internal validation set (10%). The operating point of the AI tool was fixed prior to this study, and no calibration was performed on the local datasets.
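
The poster does not state how the 80/10/10 split was carried out; purely as an illustration, the sketch below shows one common way to produce such a split with scikit-learn (the exam identifiers, split level, and random seed are assumptions, not the vendor's actual pipeline).

```python
# Minimal sketch of an 80/10/10 train/tune/validation split.
# Exam identifiers, split level, and seed are illustrative assumptions.
from sklearn.model_selection import train_test_split

exam_ids = [f"exam_{i:06d}" for i in range(320_000)]  # placeholder IDs

# First hold out 20% of the exams ...
train_ids, holdout_ids = train_test_split(exam_ids, test_size=0.20, random_state=42)
# ... then split the holdout in half: 10% tuning, 10% internal validation.
tune_ids, val_ids = train_test_split(holdout_ids, test_size=0.50, random_state=42)

print(len(train_ids), len(tune_ids), len(val_ids))  # 256000 32000 32000
```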

Reference standard: All 1500 exams, along with the corresponding clinical referral notes and radiology reports, were reviewed by a musculoskeletal (MSK) radiologist with 10 years of experience (GG). Using the online annotation platform Darwin (V7 Labs, London, England), every visible fracture was annotated with a bounding box. When available, follow-up MRI or CT reports were also consulted. Fractures classified as acute, subacute, or healing were marked as positive, while chronic or completely healed fractures requiring no clinical attention were excluded from annotation. To account for multiple fractures within the exams, each fracture was assigned a unique identifier across the projections. Additionally, one reference reader from each hospital (HR, FJB, DD), trained in medical image analysis, independently annotated the 500 cases from their respective institutions using the same method. These readers were blinded to the MSK radiologist’s annotations. If the bounding boxes created by two reference readers overlapped with an intersection over union of at least 25%, the reference standard was defined as the smallest bounding box that encompassed both. Any disagreements between the two reference readers were resolved by a radiologist with a minimum of 5 years of experience (EHGO, PH, KZ). These adjudicated annotations were used to establish the reference standard. In cases where the adjudicating radiologist could not reach a definitive conclusion, the patient was excluded from the analysis.
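
To make the merging rule concrete, the following is a minimal sketch (not the study's actual code) of how two readers' boxes can be compared by intersection over union and, when the IoU is at least 0.25, merged into the smallest box enclosing both; the box format and function names are illustrative assumptions.

```python
# Minimal sketch of the IoU-based merging rule; boxes are (x1, y1, x2, y2)
# in pixel coordinates. Box format and function names are assumptions.

def iou(a, b):
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def merge_if_agreeing(box_reader1, box_reader2, threshold=0.25):
    """Return the smallest box enclosing both annotations if IoU >= threshold,
    otherwise None (the disagreement would go to adjudication)."""
    if iou(box_reader1, box_reader2) < threshold:
        return None
    return (min(box_reader1[0], box_reader2[0]),
            min(box_reader1[1], box_reader2[1]),
            max(box_reader1[2], box_reader2[2]),
            max(box_reader1[3], box_reader2[3]))

print(merge_if_agreeing((10, 10, 50, 60), (15, 12, 55, 58)))  # (10, 10, 55, 60)
```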

Statistical analysis: The statistical analysis was performed at both the patient and the fracture level, and performance metrics were calculated for each hospital. Patient-wise, specificity (SPEPW) was defined as the proportion of patients with no fracture detected by the AI tool among patients without any fracture according to the reference standard, and sensitivity (SEPW) was defined as the proportion of patients in whom all fractures were detected (each unique fracture in at least one radiograph). To evaluate the utility of the AI tool for patient triage, we also defined an alternative measure of sensitivity (SEPW-triage) as the proportion of patients in whom at least one fracture was detected. The fracture-wise sensitivity (SEFW) was defined as the proportion of unique fractures correctly identified in at least one projection among all fractures, with multiple fractures per patient analyzed individually where appropriate. The average number of false-positive fractures per patient (FPPPFW) was defined as the total number of AI markings placed outside any fracture divided by the number of patients. A fracture was considered detected if there was any overlap between the reference standard and the bounding box indicated by the AI tool.
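
For clarity, these patient-wise and fracture-wise metrics can be written as follows; the notation is ours and only restates the definitions above.

```latex
% Restatement of the metric definitions above; the symbols are ours.
\begin{align*}
\mathrm{SP_{PW}} &= \frac{\text{patients without fracture and without AI detection}}{\text{patients without fracture}} \\
\mathrm{SE_{PW}} &= \frac{\text{patients in whom every fracture was detected}}{\text{patients with at least one fracture}} \\
\mathrm{SE_{PW\text{-}triage}} &= \frac{\text{patients with at least one detected fracture}}{\text{patients with at least one fracture}} \\
\mathrm{SE_{FW}} &= \frac{\text{unique fractures detected in at least one projection}}{\text{all unique fractures}} \\
\mathrm{FPPP_{FW}} &= \frac{\text{AI markings placed outside any fracture}}{\text{number of patients}}
\end{align*}
```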

Receiver operating characteristic (ROC) curves were drawn by artificially varying the operating point of the AI tool, and the area under the ROC curve (AUC) was calculated.
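
As an illustration of this step, a minimal sketch follows, assuming a per-patient maximum fracture confidence score from the AI tool and a binary patient-level reference label (both arrays are assumptions, not the study's variables); it uses scikit-learn to sweep the operating point and compute the AUC.

```python
# Minimal sketch of the ROC/AUC computation; y_true and ai_scores are
# assumed arrays (patient-level reference labels and per-patient maximum
# AI confidence scores), not the study's actual data.
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

y_true = np.array([0, 1, 1, 0, 1, 0, 0, 1])            # 1 = fracture per reference standard
ai_scores = np.array([0.1, 0.8, 0.4, 0.3, 0.9, 0.2, 0.6, 0.7])

fpr, tpr, thresholds = roc_curve(y_true, ai_scores)    # sweep the operating point
auc = roc_auc_score(y_true, ai_scores)
print(f"AUC = {auc:.2f}")
```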

The AI tool’s performance was compared between the three hospitals by means of the AUC-ROC, using DeLong's method [1], with p-values <0.05 considered significant.
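
Because the hospital cohorts are independent, an unpaired DeLong comparison reduces to a z-test on the difference of two AUCs with DeLong variance estimates. The sketch below is our own minimal implementation of that idea, not the study's code.

```python
# Minimal sketch of an unpaired DeLong comparison of two AUCs from
# independent cohorts; our own implementation, not the study's code.
import numpy as np
from scipy.stats import norm

def auc_and_delong_variance(y_true, scores):
    """AUC with its DeLong variance, computed from placement values."""
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    # Placement value of each positive: fraction of negatives it outranks
    # (ties count as one half); analogously for each negative.
    v10 = np.array([np.mean((p > neg) + 0.5 * (p == neg)) for p in pos])
    v01 = np.array([np.mean((pos > n) + 0.5 * (pos == n)) for n in neg])
    auc = v10.mean()
    var = v10.var(ddof=1) / len(pos) + v01.var(ddof=1) / len(neg)
    return auc, var

def compare_unpaired_aucs(y1, s1, y2, s2):
    """Two-sided z-test for the difference between two independent AUCs."""
    auc1, var1 = auc_and_delong_variance(y1, s1)
    auc2, var2 = auc_and_delong_variance(y2, s2)
    z = (auc1 - auc2) / np.sqrt(var1 + var2)
    return auc1, auc2, 2 * norm.sf(abs(z))
```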
