In this study, we evaluated the performance of a bone mineral density (BMD) classification model using CXRs obtained from three distinct clinical settings: a tertiary university hospital (Hospital A), a secondary healthcare facility (Hospital B), and a facility serving veterans (Hospital C). The internal dataset from Hospital A comprised 15.98% male patients with a mean age of 58.7±6.76 years, while the external dataset from the same institution comprised 11.02% male patients with a mean age of 59.01±6.65 years. Hospital B’s dataset comprised 44.46% male patients with a mean age of 59.38±7.31 years, and Hospital C’s dataset comprised 56.2% male patients with a mean age of 73.64±6.74 years.
As shown in Figure 2, the prevalence of BMD categories varied across datasets. In Hospital A, normal, osteopenia, and osteoporosis cases accounted for 34.1%, 31.8%, and 31.8% of the internal dataset and 36.5%, 35.4%, and 28.0% of the external dataset, respectively. Hospital B had the highest prevalence of normal BMD (53.5%), with osteopenia and osteoporosis at 33.2% and 13.3%, respectively. In contrast, Hospital C had the highest prevalence of osteopenia (42.5%), with normal and osteoporosis cases at 26.7% and 30.8%, respectively.
Overall, our model demonstrated robust performance across these diverse settings in classifying BMD into normal, osteopenia, and osteoporosis categories. The AUC values for the normal, osteopenia, and osteoporosis categories, respectively, were 0.93, 0.76, and 0.91 for Hospital A’s internal dataset (mean AUC = 0.867); 0.94, 0.75, and 0.88 for Hospital A’s external dataset (mean AUC = 0.857); 0.86, 0.71, and 0.90 for Hospital B (mean AUC = 0.823); and 0.82, 0.63, and 0.78 for Hospital C (mean AUC = 0.743). Mean sensitivity and specificity across the three BMD categories followed the same ordering: 0.70 and 0.85 for Hospital A’s internal dataset, 0.68 and 0.84 for Hospital A’s external dataset, 0.65 and 0.81 for Hospital B, and 0.56 and 0.76 for Hospital C.
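As a worked check of the arithmetic, the mean AUC for each dataset can be recomputed from the per-category AUCs reported above. The short sketch below assumes that the mean AUC is the unweighted (macro) average of the three category AUCs; the averaging scheme is not stated explicitly in the text, and the names `reported_auc` and `macro_mean` are illustrative only.

```python
# Per-category AUCs (normal, osteopenia, osteoporosis) as reported in the text.
reported_auc = {
    "Hospital A (internal)": (0.93, 0.76, 0.91),
    "Hospital A (external)": (0.94, 0.75, 0.88),
    "Hospital B": (0.86, 0.71, 0.90),
    "Hospital C": (0.82, 0.63, 0.78),
}


def macro_mean(values):
    """Unweighted mean of per-category values (assumed averaging scheme)."""
    return sum(values) / len(values)


# Mean AUC per dataset, rounded to three decimals as in the text.
mean_auc = {
    site: round(macro_mean(aucs), 3) for site, aucs in reported_auc.items()
}

for site, value in mean_auc.items():
    print(f"{site}: mean AUC = {value:.3f}")
```

Because the inputs are per-category AUCs already rounded to two decimals, the recomputed means may differ in the third decimal from means computed on unrounded values.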
In general, the robust performance observed in Hospital A’s internal (mean AUC = 0.867) and external (mean AUC = 0.857) datasets, along with the model's adaptability to other clinical settings (mean AUC = 0.823 for Hospital B and 0.743 for Hospital C), supports its generalizability. Performance at Hospital C was lower, likely due to variations in imaging protocols and an older patient demographic, but the results indicate that the model remains effective across diverse clinical environments. While the model excels at distinguishing normal BMD from osteoporosis, the consistently lower AUC for osteopenia across all datasets (0.63 to 0.76) highlights the inherent challenge of accurately classifying this intermediate category. Moreover, mean specificity remained relatively high across institutions (0.76 to 0.85), whereas mean sensitivity was lower and more variable (0.56 to 0.70). This pattern suggests a conservative detection approach, which may require further optimization for broader clinical applicability.