Congress: ECR24
Poster Number: C-19422
Type: EPOS Radiologist (scientific)
Authorblock: T. Santner1, C. Ruppert2, S. Gianolini3, J-G. Stalheim4, S. Frei5, M. Hondl Adametz6, V. Fröhlich7, S. Hofvind8, G. Widmann1; 1Innsbruck/AT, 2Zürich/CH, 3Glattpark/CH, 4Bergen/NO, 5Lausanne/CH, 6Vienna/AT, 7Wiener Neustadt/AT, 8Oslo/NO
Disclosures:
Tina Santner: Nothing to disclose
Carlotta Ruppert: Employee: b-rayZ AG
Stefano Gianolini: Nothing to disclose
Johanne-Gro Stalheim: Nothing to disclose
Stephanie Frei: Nothing to disclose
Michaela Hondl Adametz: Nothing to disclose
Vanessa Fröhlich: Nothing to disclose
Solveig Hofvind: Nothing to disclose
Gerlig Widmann: Nothing to disclose
Keywords: Artificial Intelligence, Breast, Mammography, Screening, Quality assurance
Results

Significant inter-reader variability was observed among the human readers, with poor to moderate agreement (κ = -0.018 to κ = 0.41); some readers interpreted the quality features and overall quality more homogeneously than others. Interestingly, the differences in ratings showed no evident correlation with individual experience or professional background. In comparison, the AI software demonstrated higher consistency with fewer outliers (both positive and negative), highlighting its generalization capability.

For a comprehensive evaluation of overall image quality, we analysed pairwise agreement across all readers, as illustrated in Figure 1. The highest accuracy, 61%, was observed between reader 1 and reader 2; the lowest, 26%, occurred between reader 1 and reader 5 as well as between the b-box software and reader 5.

Fig 1: Comparison of measured accuracies for overall image quality according to PGMI between all human readers and the b-box software
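The pairwise accuracy for overall image quality is simple percent agreement on the assigned PGMI category. The following is a minimal sketch of such a comparison, not the study's actual code; the rater names and ratings are hypothetical placeholders:

# Minimal sketch (hypothetical data, not the study code): pairwise percent
# agreement on overall PGMI quality between raters, including the software.
from itertools import combinations

# PGMI category assigned by each rater to the same set of mammograms.
ratings = {
    "reader_1": ["P", "G", "M", "I", "G"],
    "reader_2": ["P", "G", "G", "I", "G"],
    "software": ["P", "M", "G", "I", "G"],
}

def pairwise_accuracy(a, b):
    # Proportion of images on which two raters assign the same PGMI category.
    return sum(x == y for x, y in zip(a, b)) / len(a)

for r1, r2 in combinations(ratings, 2):
    print(f"{r1} vs {r2}: {pairwise_accuracy(ratings[r1], ratings[r2]):.0%}")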

Figure 2 visualizes the disparities in Cohen’s kappa obtained by subtracting the mean human inter-reader agreement from the mean agreement between the software and the human readers, highlighting where the software outperforms or falls behind human performance. The software surpassed human inter-reader agreement in detecting medial glandular tissue cuts (mean Cohen’s kappa: 0.43 (software) vs. 0.36 (experts)), mammilla deviation (0.26 vs. 0.24), pectoral muscle detection (0.72 vs. 0.70), and pectoral angle measurement (0.49 vs. 0.42). For the remaining features, the software performed comparably to human assessment, with the largest difference of 0.18 observed for the PNL PGMI feature.

Fig 2: Visualizing disparities in Cohen’s kappa between software and human inter-reader agreement
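The disparity shown in Figure 2 can in principle be computed per quality feature as the mean Cohen’s kappa of the software against each human reader minus the mean Cohen’s kappa over all human reader pairs. The sketch below illustrates this calculation for a single binary feature; it is not the study's code, and the ratings and rater names are hypothetical:

# Minimal sketch (hypothetical data, not the study code) of the Fig. 2 metric:
# mean software-vs-reader Cohen's kappa minus mean human inter-reader kappa
# for one binary quality feature (e.g. pectoral muscle detected yes/no).
from itertools import combinations
from statistics import mean
from sklearn.metrics import cohen_kappa_score

human = {
    "reader_1": [1, 0, 1, 1, 0, 1],
    "reader_2": [1, 0, 1, 0, 0, 1],
    "reader_3": [1, 1, 1, 1, 0, 1],
}
software = [1, 0, 1, 1, 0, 0]

# Mean agreement over all human reader pairs.
human_kappa = mean(
    cohen_kappa_score(human[a], human[b]) for a, b in combinations(human, 2)
)
# Mean agreement of the software with each human reader.
software_kappa = mean(cohen_kappa_score(software, r) for r in human.values())

# Positive values: the software agrees with the readers more than the readers
# agree with each other; negative values: it falls behind.
print(f"kappa disparity = {software_kappa - human_kappa:+.2f}")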
