Congress: ECR24
Poster Number: C-19422
Type: EPOS Radiologist (scientific)
Authorblock: T. Santner1, C. Ruppert2, S. Gianolini3, J-G. Stalheim4, S. Frei5, M. Hondl Adametz6, V. Fröhlich7, S. Hofvind8, G. Widmann1; 1Innsbruck/AT, 2Zürich/CH, 3Glattpark/CH, 4Bergen/NO, 5Lausanne/CH, 6Vienna/AT, 7Wiener Neustadt/AT, 8Oslo/NO
Disclosures:
Tina Santner: Nothing to disclose
Carlotta Ruppert: Employee: b-rayZ AG
Stefano Gianolini: Nothing to disclose
Johanne-Gro Stalheim: Nothing to disclose
Stephanie Frei: Nothing to disclose
Michaela Hondl Adametz: Nothing to disclose
Vanessa Fröhlich: Nothing to disclose
Solveig Hofvind: Nothing to disclose
Gerlig Widmann: Nothing to disclose
Keywords: Artificial Intelligence, Breast, Mammography, Screening, Quality assurance
Results

Significant inter-reader variability was observed among the human readers, with poor to moderate agreement (κ = -0.018 to κ = 0.41); some readers interpreted the quality features and overall quality more homogeneously than others. Interestingly, the differences in ratings showed no evident correlation with individual experience or professional background. In comparison, the AI software demonstrated higher consistency with fewer outliers (both positive and negative), highlighting its generalization capability.

For a comprehensive evaluation of overall image quality, we analysed pairwise agreement across all readers, as illustrated in Figure 1. The highest accuracy, 61%, was observed between reader 1 and reader 2; the lowest, 26%, occurred between reader 1 and reader 5 as well as between the b-box software and reader 5.

Fig 1: Comparison of measured accuracies for overall image quality according to PGMI between all human readers and the b-box software
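The pairwise accuracy for overall image quality is simple percent agreement on the assigned PGMI category. The following is a minimal sketch of such a comparison, not the study's actual code; the rater names and ratings are hypothetical placeholders:

# Minimal sketch (hypothetical data, not the study code): pairwise percent
# agreement on overall PGMI quality between raters, including the software.
from itertools import combinations

# PGMI category assigned by each rater to the same set of mammograms.
ratings = {
    "reader_1": ["P", "G", "M", "I", "G"],
    "reader_2": ["P", "G", "G", "I", "G"],
    "software": ["P", "M", "G", "I", "G"],
}

def pairwise_accuracy(a, b):
    # Proportion of images on which two raters assign the same PGMI category.
    return sum(x == y for x, y in zip(a, b)) / len(a)

for r1, r2 in combinations(ratings, 2):
    print(f"{r1} vs {r2}: {pairwise_accuracy(ratings[r1], ratings[r2]):.0%}")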

Figure 2 visualizes the disparities in Cohen’s kappa obtained by subtracting the mean human inter-reader agreement from the mean agreement between the software and the human readers, highlighting where the software outperforms or falls behind human performance. The software surpassed human inter-reader agreement in detecting medial glandular tissue cuts (mean Cohen’s kappa: 0.43 (software) vs. 0.36 (experts)), mammilla deviation (0.26 vs. 0.24), pectoral muscle detection (0.72 vs. 0.70), and pectoral angle measurement (0.49 vs. 0.42). For the remaining features, the software performed comparably to human assessment, with the largest difference of 0.18 observed for the PNL PGMI feature.

Fig 2: Visualizing disparities in Cohen’s kappa between software and human inter-reader agreement
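The disparity shown in Figure 2 can in principle be computed per quality feature as the mean Cohen’s kappa of the software against each human reader minus the mean Cohen’s kappa over all human reader pairs. The sketch below illustrates this calculation for a single binary feature; it is not the study's code, and the ratings and rater names are hypothetical:

# Minimal sketch (hypothetical data, not the study code) of the Fig. 2 metric:
# mean software-vs-reader Cohen's kappa minus mean human inter-reader kappa
# for one binary quality feature (e.g. pectoral muscle detected yes/no).
from itertools import combinations
from statistics import mean
from sklearn.metrics import cohen_kappa_score

human = {
    "reader_1": [1, 0, 1, 1, 0, 1],
    "reader_2": [1, 0, 1, 0, 0, 1],
    "reader_3": [1, 1, 1, 1, 0, 1],
}
software = [1, 0, 1, 1, 0, 0]

# Mean agreement over all human reader pairs.
human_kappa = mean(
    cohen_kappa_score(human[a], human[b]) for a, b in combinations(human, 2)
)
# Mean agreement of the software with each human reader.
software_kappa = mean(cohen_kappa_score(software, r) for r in human.values())

# Positive values: the software agrees with the readers more than the readers
# agree with each other; negative values: it falls behind.
print(f"kappa disparity = {software_kappa - human_kappa:+.2f}")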
