Securing Collaborative Medical AI by Using Differential Privacy: Domain Transfer for Classification of Chest Radiographs
Abstract
Purpose
To investigate the integration of differential privacy (DP) into the training of chest radiograph classification models and to analyze its impact on cross-institutional (domain transfer) performance as compared with models trained without DP.
Materials and Methods
Leveraging more than 590 000 chest radiographs from five institutions, including VinDr-CXR from Vietnam, ChestX-ray14 and CheXpert from the United States, UKA-CXR from Germany, and PadChest from Spain, the authors evaluated the efficacy of DP-enhanced domain transfer (DP-DT) in classifying cardiomegaly, pleural effusion, pneumonia, atelectasis, and healthy individuals. Diagnostic performance and sex-specific and age-specific demographic fairness of DP-DT and of non–DP-DT models were compared using the area under the receiver operating characteristic curve (AUC) as the main metric, as well as accuracy, sensitivity, and specificity as secondary metrics, and evaluated for statistical significance using paired Student t tests.
Results
Even with high privacy levels (ε ≈ 1), the decrease in AUC from single-institutional to cross-institutional performance showed no evidence of a difference between DP-DT and non–DP-DT (VinDr-CXR: 0.07 vs 0.07, P = .96; ChestX-ray14: 0.07 vs 0.06, P = .12; CheXpert: 0.07 vs 0.07, P = .18; UKA-CXR: 0.18 vs 0.18, P = .90; and PadChest: 0.07 vs 0.07, P = .35). Furthermore, AUC differences between DP-DT and non–DP-DT models were less than 1% for all sex subgroups (P > .33 for female and P > .22 for male patients, across all domains) and nearly all age subgroups (P > .16 for younger participants, P > .33 for adults, and P > .27 for older adults, across nearly all domains).
Conclusion
Cross-institutional performance of artificial intelligence models was not affected by DP.
Keywords: Convolutional Neural Network (CNN), Transfer Learning, Supervised Learning, Diagnosis, Forensics, Computer Applications–General, Image Postprocessing, Informatics, Neural Networks, Thorax, Computer-Aided Diagnosis, Deep Learning, Domain Transfer, Differential Privacy, Privacy-Preserving AI, Chest Radiograph
Supplemental material is available for this article.
© RSNA, 2023
See also the commentary by Suri and Summers in this issue.
Summary
Domain transfer performance of artificial intelligence chest radiograph models trained using differential privacy (DP) was comparable with the performance of non-DP models.
Key Points
■ The cross-institutional performance of chest radiograph classification models trained with differential privacy (DP) (ie, DP–enhanced domain transfer [DP-DT]) was comparable with the cross-institutional performance of models trained without DP (ie, non–DP-DT) (P > .12 across all domains).
■ DP-DT did not increase sex-specific (P > .22 across all domains) or age-specific (P > .16 across all domains) performance differences compared with non–DP-DT.
Introduction
As the amount of available patient data grows, employing artificial intelligence (AI) models becomes increasingly appealing, because the performance and capabilities of these models presumably scale with the amount of training data. The prevailing strategy for increasing the amount of training data is collaboration between institutions: models trained on large multi-institutional datasets perform better on new external datasets (1–4). We refer to this paradigm as domain transfer (DT) performance, that is, the performance of trained models when tested on samples from a different distribution.
However, strict data sharing policies rightfully prevent unconditional access to patient data by external institutions for the training of deep learning models. Therefore, privacy-preserving methods to train AI models are needed. Federated learning (5–9) offers a potential solution because it does not require the data provider to transfer the data. However, it has consistently been shown that federated learning is not truly privacy preserving, as network parameters and gradients are vulnerable to information breach through membership inference and reconstruction attacks (10–12) (Fig 1).

Figure 1: The problem of conventional collaboration between institutions for sharing artificial intelligence (AI) models. (A) Center 1 conventionally trains an AI model for supine chest radiographs by using its own data of patients in intensive care. It aims to share the AI model with another hospital for diagnosis of radiographs acquired in the upright position. However, an eavesdropper gains access to this model through an information breach and reconstructs the center 1 images that were used to train the model. (B) Diagram shows a similar scenario, with the difference that center 1 shares a differential privacy–trained model of supine chest radiographs from patients in intensive care with center 2. In this case, while center 2 can successfully perform diagnosis using this AI model on its upright radiographs, neither center 2 nor an eavesdropper can reconstruct the original training images of center 1, because of the privacy-preserving nature of the model.
A key method for guaranteeing patient privacy is differential privacy (DP) (13), which has recently gained considerable attention in the AI community (11,14). DP is a framework for formally quantifying privacy guarantees (eg, by using the parameters ε and δ) and obtaining insights from sensitive datasets while protecting the individual data points within them (11,13). In the setting of deep learning, differentially private stochastic gradient descent (15) is a training paradigm that adds calibrated noise to the gradients during training and thereby limits the amount of information that gradient updates carry. However, higher privacy levels may come with a privacy-performance trade-off and a privacy-fairness trade-off (16,17). Previous research (11) empirically examined DP training of diagnostic AI models by using a large chest radiograph cohort for single-institutional application and observed a small yet significant drop in diagnostic performance, while maintaining guaranteed privacy.
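To make the mechanism concrete, the following is a minimal sketch of one differentially private stochastic gradient descent update, under stated assumptions: `model`, `loss_fn`, and all hyperparameter values are illustrative placeholders, and practical training (as in this study) would instead use a vetted library such as Opacus (see Code Availability).

```python
# Minimal DP-SGD sketch: clip each per-sample gradient, sum, add calibrated
# Gaussian noise, then take an averaged step. Illustrative only; the noise
# multiplier and clipping bound here are placeholders, not tuned values.
import torch

def dp_sgd_step(model, loss_fn, batch, lr=1e-4, max_grad_norm=1.5,
                noise_multiplier=1.0):
    """One DP-SGD update over `batch`, a sequence of (input, target) pairs."""
    params = [p for p in model.parameters() if p.requires_grad]
    summed = [torch.zeros_like(p) for p in params]
    for x, y in batch:  # loop over samples to obtain true per-sample gradients
        loss = loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0))
        grads = torch.autograd.grad(loss, params)
        # clip the per-sample gradient norm to bound each sample's influence
        norm = torch.sqrt(sum(g.pow(2).sum() for g in grads))
        scale = (max_grad_norm / (norm + 1e-6)).clamp(max=1.0)
        for s, g in zip(summed, grads):
            s.add_(g * scale)
    n = len(batch)
    with torch.no_grad():
        for p, s in zip(params, summed):
            # Gaussian noise scaled to the clipping bound limits the
            # information any single gradient update can leak
            noise = torch.randn_like(s) * noise_multiplier * max_grad_norm
            p.add_(-(lr / n) * (s + noise))
```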
In this study, we focused on the domain transferability of models trained with DP and assessed whether training with DP impacted DT performance. It is important to note that we examined only the transfer of AI models that are guaranteed to be privacy preserving to external partners. Consequently, we performed a large-scale analysis of DP-enhanced DT (DP-DT) by using more than 590 000 radiographs from five institutions, covering a variety of imaging settings.
To the best of our knowledge, this is the first analysis of DP in external domains for medical AI models. In particular, we compared the performance of diagnostic AI models in external domains trained with and without DP. We hypothesized that DP maintains—and potentially even increases—cross-institutional diagnostic performance because of the regularization effect of DP and because the formal guarantees of DP do not impact generalization if sufficient data are available. We also performed a detailed investigation to determine if DP leads to a reduction in performance for underrepresented groups, as this is a concern that has been hypothesized in the literature (18) and is particularly accentuated in medical AI models because of the potential impact on patients’ health (19).
Materials and Methods
This retrospective study was performed in accordance with relevant local and national guidelines and regulations and approved by the Ethics Committee of the Medical Faculty of RWTH Aachen University (reference no. EK 028/19). The requirement to obtain individual informed consent was waived.
Patient Cohorts
A total of 591 682 frontal chest radiographs from multiple institutions were included, comprising the VinDr-CXR (20) dataset with 18 000, the ChestX-ray14 (21) dataset with 112 120, the CheXpert (22) dataset with 157 676, the UKA-CXR (9,11,23) dataset with 193 361, and the PadChest (24) dataset with 110 525 radiographs. As one patient might have multiple radiographs, we calculated privacy values per image.
Experimental Design
For each dataset, two distinct networks were trained: one with DP and one without (non-DP). Subsequently, testing was performed on a held-out test set of the same dataset (single institutional) and on the corresponding test sets of the remaining datasets (cross-institutional), separately for both networks, yielding single-institutional and cross-institutional performance for both the DP-DT and non–DP-DT scenarios. The held-out test sets, used for both single-institutional and cross-institutional testing, comprised 3000 (VinDr-CXR), 25 596 (ChestX-ray14), 29 320 (CheXpert), 39 824 (UKA-CXR), and 22 045 (PadChest) chest radiographs (Table 1). Official test sets were employed for the VinDr-CXR and ChestX-ray14 datasets. As no official test sets were available for the CheXpert, UKA-CXR, and PadChest datasets, their images were randomly divided into 80% training and 20% test sets. This division was patient-centric, ensuring that all radiographs from one patient were grouped together, safeguarding patient-specific integrity and reducing potential underestimation of variance; a sketch of such a split is shown below. The same training and test sets were used for both DP and non-DP scenarios, making our comparisons intrinsically paired. It should be noted that we used a multilabel classification approach, optimizing for average performance across all labels, and did not perform a detailed comparison for individual diseases.
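As an illustration of the patient-centric division, the following sketch groups radiographs by patient before splitting 80%/20%; the DataFrame column `patient_id` and the seed are hypothetical names chosen for the example, not the study's actual implementation.

```python
# Patient-level 80/20 split sketch: all radiographs of a patient end up in
# the same partition, matching the patient-centric division described above.
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

def patient_level_split(df: pd.DataFrame, test_size: float = 0.2, seed: int = 42):
    splitter = GroupShuffleSplit(n_splits=1, test_size=test_size, random_state=seed)
    train_idx, test_idx = next(splitter.split(df, groups=df["patient_id"]))
    return df.iloc[train_idx], df.iloc[test_idx]
```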
Harmonization of Labeling Systems
In this study, the target labels for diagnosis were cardiomegaly, pleural effusion, pneumonia, and atelectasis. Additionally, we introduced a “healthy” label for individuals who were not diagnosed with any pathologic condition as recognized by the original datasets. A binary multilabel classification system was employed, meaning that each image could be diagnosed as either positive or negative for every disease. As a result, labels in datasets with nonbinary labeling systems were converted to binary ones. Specifically, for datasets with certainty levels in labels (CheXpert), the “certain negative” and “uncertain” classes were considered negative labels, while only the “certain positive” class was counted as a positive label. For datasets with severity levels in labels (UKA-CXR), the threshold for differentiating between negative and positive labels was chosen as the middle of the severity levels. Last, in datasets with individual labels for each side of the body (UKA-CXR), both right and left labels for each disease were merged to form a single label per disease, meaning that the presence of a disease in at least one side was counted as positive.
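The following sketch illustrates these three harmonization rules; the encodings (−1 for "uncertain," a 0–4 severity scale, and separate right/left labels) are assumptions chosen for the example rather than the datasets' exact schemas.

```python
# Illustrative label harmonization into binary positives/negatives.

def binarize_certainty(label: int) -> int:
    # CheXpert-style certainty levels: 1 = certain positive, 0 = certain
    # negative, -1 = uncertain; only certain positives count as positive
    return 1 if label == 1 else 0

def binarize_severity(level: int, num_levels: int = 5) -> int:
    # UKA-CXR-style severity scale (assumed 0 .. num_levels - 1): positive
    # above the middle of the scale
    return 1 if level > (num_levels - 1) / 2 else 0

def merge_sides(right: int, left: int) -> int:
    # disease present on at least one side counts as positive
    return 1 if (right == 1 or left == 1) else 0
```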
Privacy-Performance and Privacy-Fairness Trade-offs
The privacy-performance trade-off was measured by analyzing each model's diagnostic performance, using the area under the receiver operating characteristic curve (AUC) as the primary evaluation metric and accuracy, sensitivity, and specificity as supporting metrics. The privacy-fairness trade-off was assessed by considering different demographic subgroups within each dataset. We considered the network fair if introducing DP did not disadvantage any patient subgroup, that is, if the network diagnosed each subgroup with the same performance with and without DP. In addition to comparing the diagnostic performance in terms of AUC among different subgroups, the statistical parity difference (25) was used for demographic fairness analysis, as sketched below. Demographic subgroups considered in this study included female patients, male patients, and patients within the age ranges of 0 to less than 40 years, 40 to less than 70 years, and 70 to 100 years.
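For reference, the statistical parity difference between two subgroups is simply the difference in their positive prediction rates; a minimal sketch with illustrative array names follows.

```python
# Statistical parity difference (25):
# P(y_hat = 1 | group A) - P(y_hat = 1 | group B).
import numpy as np

def statistical_parity_difference(preds: np.ndarray, groups: np.ndarray,
                                  group_a: str, group_b: str) -> float:
    rate_a = preds[groups == group_a].mean()  # positive rate in group A
    rate_b = preds[groups == group_b].mean()  # positive rate in group B
    return float(rate_a - rate_b)

# Example (hypothetical arrays): statistical_parity_difference(preds, sex, "F", "M")
```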
Image Preprocessing
As described in previous studies (9,11), a unified image preprocessing strategy was applied to all datasets, which included the following steps: (a) resizing all images to a size of 512 × 512 pixels, (b) performing min-max normalization as proposed by Johnson et al (26), and (c) performing histogram equalization.
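A sketch of this pipeline is shown below, assuming an 8-bit grayscale image read with OpenCV; the authors' exact implementation is in their public repository.

```python
# Preprocessing sketch: (a) resize to 512 x 512, (b) min-max normalization,
# (c) histogram equalization (applied on the 8-bit representation).
import cv2
import numpy as np

def preprocess(img: np.ndarray) -> np.ndarray:
    img = cv2.resize(img, (512, 512))                          # (a)
    img = (img - img.min()) / (img.max() - img.min() + 1e-8)   # (b)
    img8 = (img * 255).astype(np.uint8)
    img8 = cv2.equalizeHist(img8)                              # (c)
    return img8.astype(np.float32) / 255.0
```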
Deep Learning Network Architecture and Training
To ensure compatibility with DP training, a modified ResNet9 architecture was used, incorporating modifications proposed by Klause et al (27) and by He et al (28). Group normalization (29) with groups of 32 was used instead of batch normalization (30). Mish (31) was selected as the activation function. The inputs to the network were three-channel images, with the output of the first layer having 64 channels. Finally, a fully connected layer reduced the 512 features to five. The logistic sigmoid function was employed for converting output predictions to class probabilities.
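A hedged sketch of such a DP-compatible ResNet9-style network is given below; the layer sizes follow the description above, while the exact block layout is an assumption (the authors' precise architecture is available in their repository).

```python
# ResNet9-style sketch: group normalization (32 groups) replaces batch
# normalization (which mixes samples and is thus incompatible with DP
# per-sample gradients), and Mish is the activation function.
import torch.nn as nn

def conv_block(in_ch: int, out_ch: int, pool: bool = False) -> nn.Sequential:
    layers = [
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1, bias=False),
        nn.GroupNorm(32, out_ch),  # batch-independent normalization
        nn.Mish(),
    ]
    if pool:
        layers.append(nn.MaxPool2d(2))
    return nn.Sequential(*layers)

class ResNet9(nn.Module):
    def __init__(self, num_classes: int = 5):
        super().__init__()
        self.stem = conv_block(3, 64)       # 3-channel input -> 64 channels
        self.layer1 = conv_block(64, 128, pool=True)
        self.res1 = nn.Sequential(conv_block(128, 128), conv_block(128, 128))
        self.layer2 = conv_block(128, 256, pool=True)
        self.layer3 = conv_block(256, 512, pool=True)
        self.res2 = nn.Sequential(conv_block(512, 512), conv_block(512, 512))
        self.head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                  nn.Linear(512, num_classes))  # 512 -> 5 labels

    def forward(self, x):
        x = self.stem(x)
        x = self.layer1(x)
        x = self.res1(x) + x    # residual connection
        x = self.layer2(x)
        x = self.layer3(x)
        x = self.res2(x) + x
        return self.head(x)     # logits; sigmoid applied per multilabel setup
```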
Previous work (11) demonstrated that appropriate pretraining is essential for the convergence of medical DP models. Consequently, our network was pretrained on the publicly available MIMIC-CXR (26) dataset, consisting of 210 652 frontal chest radiographs. All models were optimized using the NAdam optimizer, with learning rates ranging from 1 × 10−4 to 5 × 10−4 for optimal convergence and no weight decay. Binary cross-entropy was chosen as the loss function. Data augmentation during non-DP training consisted of random rotation within the range of −8° to 8°, inclusive, and flipping (9). In contrast, no data augmentation was performed during DP training because of its reported negative effect (11). The maximum allowed gradient norm was found to be an influential factor in DP network convergence, with an optimal value of 1.5 observed across all training runs. Each point in the DP training batches was sampled with a probability of 128 divided by the sample size of each dataset, while a batch size of 128 was used in non-DP training. A DP accountant employing Rényi DP (32) was chosen. This mechanism tracks the "privacy budget" (represented by ε and δ) and ensures its adherence to predetermined bounds. A δ value of 6 × 10−6 was selected for all datasets (11). The value of ε depends on the introduced noise, the set δ, and factors such as the number of training steps and the batch size. Given the dataset diversity, each neural network's convergence step determined the reported ε value.
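As a concrete illustration of this setup, the following sketch wires the reported hyperparameters (maximum gradient norm of 1.5, δ = 6 × 10−6, expected batch size of 128 under Poisson sampling, NAdam without weight decay) into Opacus; the stand-in model, dummy data, noise multiplier, and epoch count are placeholders, not the study's configuration.

```python
# DP training setup sketch with Opacus and a Rényi DP accountant (32).
import torch
from torch.utils.data import DataLoader, TensorDataset
from opacus import PrivacyEngine

# Tiny stand-ins so the sketch runs end to end; substitute the ResNet9
# sketch above and the real radiograph loader in practice.
model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 5))
images = torch.randn(256, 3, 32, 32)
labels = torch.randint(0, 2, (256, 5)).float()
train_loader = DataLoader(TensorDataset(images, labels), batch_size=128)

optimizer = torch.optim.NAdam(model.parameters(), lr=1e-4, weight_decay=0)
criterion = torch.nn.BCEWithLogitsLoss()  # binary cross-entropy loss

privacy_engine = PrivacyEngine(accountant="rdp")  # Rényi DP accountant
model, optimizer, train_loader = privacy_engine.make_private(
    module=model,
    optimizer=optimizer,
    data_loader=train_loader,   # Opacus switches to Poisson sampling, i.e.,
    noise_multiplier=1.0,       # sample rate = batch size / dataset size
    max_grad_norm=1.5,          # per-sample gradient clipping bound
)

for epoch in range(2):          # abbreviated training loop
    for x, y in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()

# The privacy budget spent up to the reached training step:
print("epsilon =", privacy_engine.get_epsilon(delta=6e-6))
```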
Evaluation Metrics and Statistical Analysis
Statistical analysis was carried out using Python (version 3.10) and the associated packages SciPy and NumPy. We employed the AUC as our primary evaluation metric; individual label results were averaged without weighting. Accuracy, sensitivity, and specificity across the demographic subgroups were calculated as secondary evaluation metrics. Bootstrapping (33) with 1000 redraws was applied to each test set to assess the statistical spread. In each redraw, radiographs were randomly sampled with replacement until the original test set size was matched. To assess statistical significance between evaluation metrics obtained with DP and those obtained without DP, a two-tailed paired Student t test was used. Multiplicity-adjusted P values were determined based on the false discovery rate to account for multiple comparisons, with the familywise α threshold set at .05.
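A sketch of this evaluation procedure follows; the array names are illustrative, `roc_auc_score` from scikit-learn stands in for the AUC computation, and the Benjamini-Hochberg helper shown in the trailing comments assumes SciPy 1.11 or later.

```python
# Bootstrap sketch: 1000 redraws with replacement, AUC recomputed per redraw.
import numpy as np
from scipy import stats
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

def bootstrap_auc(y_true: np.ndarray, y_score: np.ndarray,
                  n_redraws: int = 1000) -> np.ndarray:
    n = len(y_true)
    aucs = np.empty(n_redraws)
    for i in range(n_redraws):
        idx = rng.integers(0, n, size=n)   # resample with replacement, same size
        aucs[i] = roc_auc_score(y_true[idx], y_score[idx])
    return aucs

# Comparing DP and non-DP bootstrap distributions with a paired two-tailed
# t test (pairing holds when both models are evaluated on the same redraws):
# t, p = stats.ttest_rel(aucs_dp, aucs_nondp)
# Multiplicity adjustment via the false discovery rate:
# p_adj = stats.false_discovery_control(p_values)  # Benjamini-Hochberg
```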
Code Availability
All source code for training and evaluation of the deep neural networks, data augmentation, image analysis, and preprocessing is publicly available at https://github.com/tayebiarasteh/privacydomain. All code for the experiments was developed in Python version 3.10 using the PyTorch version 1.13 framework. DP training was implemented using Opacus version 1.3.0.
Data Availability
The accessibility of the data used in this study is as follows: the ChestX-ray14 and PadChest datasets are publicly available via https://www.v7labs.com/open-datasets/chestx-ray14 and https://bimcv.cipf.es/bimcv-projects/padchest/, respectively. The VinDr-CXR and MIMIC-CXR datasets are restricted-access resources that can be accessed from PhysioNet by agreeing to its data protection requirements at https://physionet.org/content/vindr-cxr/1.0.0/ and https://physionet.org/content/mimic-cxr-jpg/2.0.0/, respectively. CheXpert data can be requested from Stanford University at https://stanfordmlgroup.github.io/competitions/chexpert/. The UKA-CXR data are not publicly accessible, as they are internal data of patients of University Hospital RWTH Aachen in Aachen, Germany. Data access can be granted upon reasonable request to the corresponding author.
Hardware
Experiments were run on Intel CPUs with 18 cores and 32 GB of RAM and an NVIDIA RTX 6000 GPU with 24 GB of memory.
Results
Dataset Characteristics
The median and mean ages of all patients were 61 years and 59 years ± 18 [SD], respectively, with a range from 1 to 111 years. Table 1 reports the statistics of each dataset, including labeling systems, age and sex distributions, and label distributions. Additionally, Figures S1–S6 provide further information on the distribution of sample sizes per label and demographic subgroup for each of the datasets.
Comparison of DP-DT and Non–DP-DT Diagnostic Performance
Figure 2 shows the diagnostic performance of DP networks across varying privacy budgets in different external domains, averaged over all labels, including cardiomegaly, pleural effusion, pneumonia, atelectasis, and healthy. The results demonstrate no evidence of differences in performance between the different ε budgets, ranging from ε ≈ 1 to ε = ∞ (non-DP). Table 2 presents a more detailed comparison, showing the differences in AUC between two scenarios: (a) DP-DT with a comparatively high privacy level of ε ≈ 1 (VinDr-CXR: ε = 1.17, ChestX-ray14: ε = 1.01, CheXpert: ε = 0.98, UKA-CXR: ε = 0.98, and PadChest: ε = 0.72) and δ = 6 × 10−6 and (b) non–DP-DT (ε = ∞), as compared with conventional single-institutional training without privacy measures. Table S1 reports further evaluation metrics in terms of accuracy, sensitivity, and specificity, comparing DP-DT with non–DP-DT. For a better comparison, we measured the decrease in AUC of cross-institutional performance as compared with single-institutional performance for both setups. On average, these values were consistent between DP-DT and non–DP-DT, and no evidence of differences was found (VinDr-CXR: 0.07 vs 0.07, P = .96; ChestX-ray14: 0.07 vs 0.06, P = .12; CheXpert: 0.07 vs 0.07, P = .18; UKA-CXR: 0.18 vs 0.18, P = .90; and PadChest: 0.07 vs 0.07, P = .35).

Figure 2: Results of transferring differential privacy (DP) models with different ε values to different domains. The values correspond to average area under the receiver operating characteristic curve (AUC) results over all labels by using networks with (A) ε ≈ 1, (B) 2 < ε < 4, (C) 4 < ε < 9, and (D) ε = ∞ (non-DP). Each row corresponds to a training domain, and each column corresponds to a test domain. The privacy budgets of the DP networks corresponding to each dataset for (A–C) are as follows: VinDr-CXR (VDR): ε = 1.17, 3.24, and 4.29; ChestX-ray14 (C14): ε = 1.01, 3.37, and 7.83; CheXpert (CPT): ε = 0.98, 3.30, and 6.48; UKA-CXR (UKA): ε = 0.98, 3.46, and 8.81; and PadChest (PCH): ε = 0.72, 3.58, and 7.41, respectively.
Figures 3 and S7–S9 show the diagnostic performance of DP networks for individual diseases at different ε values in various external domains. Table S2 compares the cross-institutional performance of networks between DP-DT with ε ≈ 1 and non–DP-DT for individual labels. We observed no evidence of differences between DP-DT and non–DP-DT for any individual label (pleural effusion: 0.85 vs 0.86, P = .55; pneumonia: 0.74 vs 0.74, P = .82; atelectasis: 0.67 vs 0.67, P = .96; and healthy: 0.77 vs 0.77, P = .70), except for cardiomegaly (0.81 vs 0.83, P = .01).

Figure 3: Results of transferring differential privacy (DP) models with ε ≈ 1 to different domains for individual labels. The area under the receiver operating characteristic curve (AUC) values correspond to (A) the average over all labels, (B) cardiomegaly, (C) pleural effusion, (D) pneumonia, (E) atelectasis, and (F) healthy. Each row corresponds to a training domain, and each column corresponds to a test domain. The privacy budgets of the DP networks corresponding to each dataset are as follows: VinDr-CXR (VDR): ε = 1.17, ChestX-ray14 (C14): ε = 1.01, CheXpert (CPT): ε = 0.98, UKA-CXR (UKA): ε = 0.98, and PadChest (PCH): ε = 0.72, with δ = 6 × 10−6 for all datasets.
DP-DT Has No Effect on Sex-based Fairness
There has been concern that the application of DP can lead to decreased performance in groups that are underrepresented in the dataset (18).
To test this, we performed a subanalysis in male and female patients. Figure 4 and Table 3 demonstrate that on average, DP-DT resulted in a less than 1% AUC difference as compared with non–DP-DT for female and male subgroups in all datasets, and no evidence of differences was found (VinDr-CXR: P = .46 for female and P = .22 for male, ChestX-ray14: P = .45 for female and P = .37 for male, CheXpert: P = .39 for female and P = .29 for male, UKA-CXR: P = .40 for female and P = .33 for male, PadChest: P = .43 for female and P = .22 for male). A more detailed analysis of each sex subgroup, including further evaluation metrics in terms of accuracy, sensitivity, and specificity, comparing DP-DT with non–DP-DT is reported in Tables S3 and S4, and statistical parity difference values for individual sex subgroups are reported in Table S5.

Figure 4: Comparison between non–DP-DT (ε = ∞) and DP-DT (ε ≈ 1) in terms of average AUC for each sex subgroup. Each row corresponds to a training domain, and each column corresponds to a test domain. The AUC values correspond to (A) female subgroups according to DP-DT with ε ≈ 1, (B) male subgroups according to DP-DT with ε ≈ 1, (C) female subgroups according to non–DP-DT (ε = ∞), and (D) male subgroups according to non–DP-DT (ε = ∞). AUC = area under the receiver operating characteristic curve, C14 = ChestX-ray14, CPT = CheXpert, DP-DT = differential privacy–enhanced domain transfer, PCH = PadChest, UKA = UKA-CXR, VDR = VinDr-CXR.
DP-DT Has No Effect on Age-based Fairness
We repeated our experiments for different age subgroups to test whether age-specific bias might similarly be introduced. Figure 5 shows the average AUC values for the different age subgroups of every dataset evaluated on external domains, both for DP-DT and non–DP-DT. The differences between these two methods are reported in Table 4. Except for the VinDr-CXR dataset, with a maximum AUC difference of 0.04 (P = .03 for individuals aged 70 to 100 years) in a small test sample (n = 149), we again observed, on average, no evidence of differences, that is, a maximum AUC difference of only 1% when comparing DP-DT with non–DP-DT for all three age subgroups, including younger individuals (0 to <40 years), middle-aged individuals (40 to <70 years), and older individuals (70–100 years), in all datasets (P > .25 for all cases). In a manner akin to the sex subgroups, more detailed age subgroup analyses are provided in Tables S6–S9.

Figure 5: Comparison between non–DP-DT (ε = ∞) and DP-DT (ε ≈ 1) in terms of average AUC for each age subgroup. Each row corresponds to a training domain, and each column corresponds to a test domain. The AUC values correspond to (A) 0 to less than 40 years subgroups according to DP-DT with ε ≈ 1, (B) 40 to less than 70 years subgroups according to DP-DT with ε ≈ 1, (C) 70 to 100 years subgroups according to DP-DT with ε ≈ 1, (D) 0 to less than 40 years subgroups according to non–DP-DT (ε = ∞), (E) 40 to less than 70 years subgroups according to non–DP-DT (ε = ∞), and (F) 70 to 100 years subgroups according to non–DP-DT (ε = ∞). AUC = area under the receiver operating characteristic curve, C14 = ChestX-ray14, CPT = CheXpert, DP-DT = differential privacy–enhanced domain transfer, PCH = PadChest, UKA = UKA-CXR, VDR = VinDr-CXR.
Discussion
In this study, we investigated the domain transferability of highly privacy-preserving AI models in radiology for healthy patients and for patients diagnosed with cardiomegaly, pleural effusion, pneumonia, and atelectasis. Our analysis included a total of 591 682 frontal chest radiographs from five datasets from Vietnam, the United States, Germany, and Spain, encompassing various imaging and labeling domains, such as standard upright imaging and intensive care imaging. As a baseline, we compared the performance of networks trained with DP, the privacy-preserving technology used to protect the networks, in external domains with that of networks trained without DP. Along with the comparison of network performance against this baseline, we analyzed the effects of DP on the fairness of the AI models when applied to different demographic subgroups, a known issue when employing DP in deep learning models (11,16,34). Our analysis aimed to provide insight into the potential trade-offs between privacy preservation and fairness, as well as accuracy, in AI models for medical diagnosis (11,17). Our results indicate that at all privacy budgets, even with comparatively strict ε values around 1, all DP models trained on any of the included datasets performed similarly to their non-DP counterparts, and no evidence of differences was found in terms of average AUC when evaluated on external domains (P > .12 for all cases). However, discrepancies between DP-trained and non–DP-trained models were sometimes observed when testing for individual diseases, although no consistent trend in favor of either training paradigm emerged.
Previous work (11) demonstrated that increasing the ε value improves the diagnostic performance of a DP-trained AI model when tested on data from its own domain. Conversely, we show here that the cross-institutional performance of DP-trained models remains unaffected by increasing the ε value. This is important, as any AI model in clinical practice will effectively act as a cross-institutional model; our setup is therefore more reflective of the clinical situation. Previous research (11) has also emphasized the importance of large, curated training datasets for the successful generalization of DP-trained AI models. Interestingly, our findings suggest that this might not always be the paramount factor for cross-institutional applications. For instance, models trained on the VinDr-CXR dataset, comprising only 15 000 training images, performed comparably relative to their non-DP counterparts in cross-institutional applications, as did models trained on datasets of much larger sizes (ChestX-ray14: 86 524, CheXpert: 128 356, UKA-CXR: 153 537, and PadChest: 88 480 training images). We ascribe this observation to the inherent propensity of DP training to mitigate overfitting during the training process.
We further investigated this finding for individual diseases, discovering that, except for cardiomegaly, the same pattern held for all individual diseases, as well as for healthy patients (P > .51 for all cases). Even for cardiomegaly, with P = .01, the AUC decrease was a mere 2.41%. Additionally, this finding was consistent across individual sex groups. We found that employing DP did not introduce any change in demographic parity in cross-institutional performance of the networks for both female and male subsets compared with non–DP-DT in terms of average AUC across all labels (P > .22 for all cases).
Last, except for only one age group subset of the VinDr-CXR dataset (ie, 70–100 years), all age groups from all datasets followed the same trend, where no evidence of differences was found (P > .16 for all cases). A closer examination of the age group 70–100 years in the VinDr-CXR dataset revealed that this test subset was a small and underrepresented group with only 32 cardiomegaly, four pleural effusion, nine pneumonia, three atelectasis, and four healthy samples. Thus, statistical fluctuations as the reason for this outlier are likely.
We recognize that attaining training convergence in DP AI models presents a more challenging and computationally intensive endeavor (11,14,15). Nevertheless, by providing access to our comprehensive framework and recommended configurations, we aspire to expedite progress in this research area. In our experiments, with consistent computational resources, the DP training took on average 10 times longer to converge, in terms of total training time, than non-DP training, depending on the network architecture and dataset. A detailed computational efficiency analysis is provided in Appendix S1. To the best of our knowledge, our study is the first to demonstrate this practical fact using various small and large real-world datasets from different domains.
While our study covered a wide range of real-world datasets, an important study limitation was that the main findings focused on the interpretation of chest radiographs using only ResNet9. To broaden our results, Appendix S1 presents an ablation study featuring two additional network architectures (28,35) and more imaging findings. Here, a trend similar to our main results was evident. In future work, we intend to apply DP-DT across varied domains, like gigapixel imaging in pathologic conditions, three-dimensional volumetric medical imaging, and more complex tasks, such as segmentation.
In summary, we conducted a comprehensive analysis of the application of DP to the domain transferability of AI models based on chest radiographs, with our results demonstrating that employing DP in the training of diagnostic medical AI models does not impact the models' diagnostic performance or demographic fairness in external domains. Accordingly, we advocate for researchers and practitioners to place heightened priority on the integration of DP when training diagnostic medical AI models intended for collaborative applications. The present study affirms that even at extremely high privacy levels (ε ≈ 1), DP incurs not merely minor trade-offs but nearly no impact on network performance in external domains. We envisage that our observations will help streamline collaborations among institutions, foster the development of more precise diagnostic AI models, and, ultimately, enhance patient outcomes.
Acknowledgments
S.T.A., G.K., and D.T. designed the study. The manuscript was written by S.T.A. and reviewed and corrected by T.N., G.K., and D.T. The experiments were performed by S.T.A. The software was developed by S.T.A. Illustrations were designed by S.T.A., M.L., and T.N. The statistical analyses were performed by S.T.A., M.L., S.N., G.K., and D.T. S.T.A. preprocessed the data. M.J.S., P.I., C.K., S.N., D.T., and G.K. provided clinical expertise. All authors read the manuscript and agreed to the submission of this paper.
Author Contributions
Author contributions: Guarantors of integrity of entire study, S.T.A., D.T.; study concepts/study design or data acquisition or data analysis/interpretation, all authors; manuscript drafting (S.T.A. only) or manuscript revision for important intellectual content, all authors; approval of final version of submitted manuscript, all authors; agrees to ensure any questions related to the work are appropriately resolved, all authors; literature research, S.T.A., P.I., S.N., D.T.; clinical studies, C.K.; experimental studies, S.T.A., G.K., D.T.; statistical analysis, S.T.A., M.L., D.T.; and manuscript editing, S.T.A., M.L., M.J.S., P.I., C.K., S.N., G.K., D.T.
* G.K. and D.T. are co–senior authors.
This work was partially funded and supported by the Radiological Cooperative Network (RACOON) under the German Federal Ministry of Education and Research (BMBF) grant number 01KX2021 and has been funded by BMBF and the Bavarian State Ministry for Science and the Arts. The authors of this work take full responsibility for its content.
References
- 1. On the limits of cross-domain generalization in automated X-ray prediction. In: Proceedings of the Third Conference on Medical Imaging with Deep Learning. PMLR, 2020;136–155. https://proceedings.mlr.press/v121/cohen20a.html. Accessed May 19, 2023.
- 2. External validation of deep learning algorithms for radiologic diagnosis: a systematic review. Radiol Artif Intell 2022;4(3):e210064.
- 3. Automatic evaluation of chest radiographs – the data source matters, but how much exactly? Rofo 2023;195(S 01):S36.
- 4. Can we trust deep learning based diagnosis? The impact of domain shift in chest radiograph classification. In: Petersen J, San José Estépar R, Schmidt-Richberg A, et al, eds. Thoracic Image Analysis. TIA 2020. Lecture Notes in Computer Science, vol 12502. Cham, Switzerland: Springer, 2020;74–83.
- 5. Federated optimization: distributed machine learning for on-device intelligence. arXiv 1610.02527 [preprint]. https://arxiv.org/abs/1610.02527. Posted October 8, 2016. Accessed November 21, 2022.
- 6. Federated learning: strategies for improving communication efficiency. arXiv 1610.05492 [preprint]. https://arxiv.org/abs/1610.05492. Posted October 18, 2016. Accessed November 21, 2022.
- 7. Communication-efficient learning of deep networks from decentralized data. arXiv 1602.05629 [preprint]. https://arxiv.org/abs/1602.05629. Posted February 17, 2016. Accessed November 21, 2022.
- 8. Encrypted federated learning for secure decentralized collaboration in cancer image analysis. Med Image Anal 2024;92:103059. Previously published in medRxiv 2022.07.28.22277288 [preprint].
- 9. Collaborative training of medical artificial intelligence models with non-uniform labels. Sci Rep 2023;13(1):6046.
- 10. Adversarial interference and its mitigations in privacy-preserving collaborative machine learning. Nat Mach Intell 2021;3(9):749–758.
- 11. Private, fair and accurate: training large-scale, privacy-preserving AI models in medical imaging. arXiv 2302.01622 [preprint]. https://arxiv.org/abs/2302.01622. Posted February 3, 2023. Accessed February 6, 2023.
- 12. Reconstructing training data with informed adversaries. In: 2022 IEEE Symposium on Security and Privacy (SP), San Francisco, Calif. IEEE, 2022;1138–1156.
- 13. Differential privacy. In: Bugliesi M, Preneel B, Sassone V, Wegener I, eds. Automata, Languages and Programming. ICALP 2006. Lecture Notes in Computer Science, vol 4052. Berlin, Germany: Springer, 2006;1–12.
- 14. End-to-end privacy preserving deep learning on multi-institutional medical imaging. Nat Mach Intell 2021;3(6):473–484.
- 15. Deep learning with differential privacy. In: CCS '16: Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, Vienna, Austria. ACM, 2016;308–318.
- 16. Decision making with differential privacy under a fairness lens. In: Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence, Montreal, Canada. International Joint Conferences on Artificial Intelligence Organization, 2021;560–566.
- 17. Neither private nor fair: impact of data imbalance on utility and fairness in differential privacy. In: PPMLP '20: Proceedings of the 2020 Workshop on Privacy-Preserving Machine Learning in Practice, New York, NY. ACM, 2020;15–19.
- 18. Differential privacy has disparate impact on model accuracy. In: NIPS '19: Proceedings of the 33rd International Conference on Neural Information Processing Systems. ACM, 2019;15479–15488.
- 19. Addressing fairness in artificial intelligence for medical imaging. Nat Commun 2022;13(1):4581.
- 20. VinDr-CXR: an open dataset of chest X-rays with radiologist's annotations. Sci Data 2022;9(1):429.
- 21. ChestX-ray8: hospital-scale chest x-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2017;3462–3471.
- 22. CheXpert: a large chest radiograph dataset with uncertainty labels and expert comparison. Proc AAAI Conf Artif Intell 2019;33(01):590–597.
- 23. Artificial intelligence for clinical interpretation of bedside chest radiographs. Radiology 2023;307(1):e220510.
- 24. PadChest: a large chest x-ray image dataset with multi-label annotated reports. Med Image Anal 2020;66:101797.
- 25. Three naive Bayes approaches for discrimination-free classification. Data Min Knowl Discov 2010;21(2):277–292.
- 26. MIMIC-CXR, a de-identified publicly available database of chest radiographs with free-text reports. Sci Data 2019;6(1):317.
- 27. Differentially private training of residual networks with scale normalisation. arXiv 2203.00324 [preprint]. https://arxiv.org/abs/2203.00324. Posted March 1, 2022. Accessed January 3, 2023.
- 28. Deep residual learning for image recognition. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV. IEEE, 2016;770–778.
- 29. Group normalization. arXiv 1803.08494 [preprint]. https://arxiv.org/abs/1803.08494. Posted March 22, 2018. Accessed December 15, 2022.
- 30. Batch normalization: accelerating deep network training by reducing internal covariate shift. In: ICML '15: Proceedings of the 32nd International Conference on Machine Learning, Lille, France. ACM, 2015;448–456.
- 31. Mish: a self regularized non-monotonic activation function. In: Proceedings of the 31st British Machine Vision Conference (BMVC 2020).
- 32. Rényi differential privacy. In: 2017 IEEE 30th Computer Security Foundations Symposium (CSF), Santa Barbara, CA. IEEE, 2017;263–275.
- 33. Bootstrapping and permuting paired t-test type statistics. Stat Comput 2014;24(3):283–296.
- 34. On the compatibility of privacy and fairness. In: UMAP '19 Adjunct: Adjunct Publication of the 27th Conference on User Modeling, Adaptation and Personalization, Larnaca, Cyprus. ACM, 2019;309–315.
- 35. EfficientNet: rethinking model scaling for convolutional neural networks. In: Proceedings of the 36th International Conference on Machine Learning. PMLR, 2019;6105–6114.
Article History
Received: June 15, 2023
Revision requested: August 10, 2023
Revision received: October 31, 2023
Accepted: November 14, 2023
Published online: December 6, 2023