Study Design and Data Collection
This study was conducted retrospectively and evaluated Milvue Suite v2.1 (Milvue, Paris, France), a CE-marked artificial intelligence (AI) algorithm. The algorithm is intended to detect pathologies on frontal and lateral chest radiographs, irrespective of patient positioning; five of its common target findings were evaluated in this study: pneumothorax, fracture, pleural effusion, non-nodular pulmonary opacity, and pulmonary nodule.
Ground Truth Establishment
A total of 202 consecutive radiograph sets from outpatients who underwent chest radiography at the emergency department of a large university hospital (CHU Rennes, France) were included in the study. The radiograph sets, in DICOM format and accompanied by a brief clinical context, were anonymized and uploaded to a dedicated interface for evaluation by three senior thoracic radiologists from our institution (ML, PAL, SL). These radiologists independently reviewed each set to identify the presence or absence of the five specified findings, which were categorized as either critical (pneumothorax, fracture, pleural effusion) or relevant (non-nodular pulmonary opacity, pulmonary nodule). The radiologists were not provided with specific localization information for the findings during the consensus process. Radiograph sets for which no consensus was reached were excluded, yielding a final study dataset of 119 radiograph sets.
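The consensus filtering described above can be sketched as follows. This is an illustrative assumption, not a detail reported in the study: it supposes that "consensus" meant all three readers assigned identical labels to every finding, and the data structures are hypothetical.

```python
# Hypothetical sketch of the consensus filter: a set is kept only when all
# three independent reads agree on every finding (assumed definition).
FINDINGS = ["pneumothorax", "fracture", "pleural_effusion",
            "non_nodular_opacity", "nodule"]

def has_consensus(reads):
    """reads: list of three dicts mapping finding -> bool (present/absent)."""
    return all(
        len({read[f] for read in reads}) == 1  # identical label from all readers
        for f in FINDINGS
    )

def build_ground_truth(all_sets):
    """Keep only unanimous sets; the ground truth is the agreed label."""
    kept = {}
    for set_id, reads in all_sets.items():
        if has_consensus(reads):
            kept[set_id] = {f: reads[0][f] for f in FINDINGS}
    return kept
```

Under this unanimity assumption, any disagreement on even a single finding excludes the whole radiograph set, which is consistent with the drop from 202 to 119 sets.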
AI Model Evaluation
Following the establishment of the ground truth, the Milvue Suite AI model evaluated each radiograph set, classifying each of the five targeted findings as present, doubtful, or absent. For the purposes of statistical analysis, findings marked as "doubt" by the AI were classified as positive. A radiograph set was deemed abnormal if the AI identified at least one finding as positive.
Statistical Analysis
The performance of the AI model was evaluated on a per-radiograph-set basis, as well as pooled across and individually for each of the five findings. The analysis included the overall standalone performance of the AI model, using accuracy, sensitivity (Se), specificity (Sp), positive predictive value (PPV), and negative predictive value (NPV). These metrics were derived by comparing the AI model's assessments against the consensus ground truth established by the three expert radiologists in a triage mode (the localization of the AI finding(s) was not reviewed).
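The metrics listed above follow directly from the 2x2 confusion table of AI calls versus the consensus ground truth. A minimal sketch, with hypothetical function names and no ties to the study's actual analysis code:

```python
def confusion_counts(ai_pos, truth_pos):
    """Tally TP/FP/FN/TN from paired binary AI calls and ground-truth labels."""
    tp = sum(a and t for a, t in zip(ai_pos, truth_pos))
    fp = sum(a and not t for a, t in zip(ai_pos, truth_pos))
    fn = sum(not a and t for a, t in zip(ai_pos, truth_pos))
    tn = sum(not a and not t for a, t in zip(ai_pos, truth_pos))
    return tp, fp, fn, tn

def metrics(tp, fp, fn, tn):
    """Standard standalone-performance metrics from confusion counts."""
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "sensitivity": tp / (tp + fn),   # Se: positives correctly flagged
        "specificity": tn / (tn + fp),   # Sp: negatives correctly cleared
        "ppv": tp / (tp + fp),           # precision of a positive AI call
        "npv": tn / (tn + fn),           # reliability of a negative AI call
    }
```

In a triage setting, NPV is the metric of greatest interest, since it quantifies how safely a negative AI result could deprioritize a radiograph set.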