Validation of A Method to Compensate Multicenter Effects Affecting CT Radiomics.

Background Radiomics extracts features from medical images more precisely and more accurately than visual assessment. However, radiomics features are affected by CT scanner parameters such as reconstruction kernel or section thickness, thus obscuring underlying biologically important texture features. Purpose To investigate whether a compensation method could correct for the variations of radiomic feature values caused by using different CT protocols. Materials and Methods Phantom data involving 10 texture patterns and 74 patients in cohorts 1 (19 men; 42 patients; mean age, 60.4 years; September-October 2013) and 2 (16 men; 32 patients; mean age, 62.1 years; January-September 2007) scanned by using different CT protocols were retrospectively included. For any radiomic feature, the compensation approach identified a protocol-specific transformation to express all data in a common space that were devoid of protocol effects. The differences in statistical distributions between protocols were assessed by using Friedman tests before and after compensation. Principal component analyses were performed on the phantom data to evaluate the ability to distinguish between texture patterns after compensation. Results In the phantom data, the statistical distributions of features were different between protocols for all radiomic features and texture patterns (P < .05). After compensation, the protocol effect was no longer detectable (P > .05). Principal component analysis demonstrated that each texture pattern was no longer displayed as different clusters corresponding to different imaging protocols, unlike what was observed before compensation. The correction for scanner effect was confirmed in patient data with 100% (10 of 10 features for cohort 1) and 98% (87 of 89 features for cohort 2) of P values less than .05 before compensation, compared with 30% (three of 10) and 15% (13 of 89) after compensation. Conclusion Image compensation successfully realigned feature distributions computed from different CT imaging protocols and should facilitate multicenter radiomic studies. © RSNA, 2019 Online supplemental material is available for this article. See also the editorial by Steiger and Sood in this issue.

radiology.rsna.org n Radiology: Volume 291: Number 1-April 2019 of features was derived from cohort 1 (42 patients with lung cancer; mean age, 60.4 years; age range, 31-81 years; Table 1) between September and October 2013 (9), including 19 men. All patients underwent a CT scan with the same machine and protocol (Table E4 [online]), and CT images were reconstructed by using three algorithms: filtered back projection and Sinogram Affirmed Iterative Reconstruction (Siemens Healthcare, Forchheim, Germany) with a noise reduction strength of level 3 (hereafter, referred to as S3) and 5 (hereafter, referred to as S5). For each patient, the dominant tumor lesion was segmented manually three times: twice by a radiologist and once by a technologist. For each of the three volumes of interest per patient and each reconstruction, 15 radiomic features were calculated. Five geometric features (volume, diameter, surface, sphericity, and compactness) were excluded from the analysis (F.O.) because they mostly depend on the segmentation.
A second set of features was obtained from cohort 2 (32 patients; mean age, 62.1 years; age range, 29-82 years; Table 1) between January and September 2007 (16 men) with a lung cancer who underwent two CT scans (Table E4 [online]) within 15 minutes (10). This dataset was originally collected in a clinical trial (NCT00579852) to evaluate the reproducibility of tumor volume and diameter measurements and is part of the Reference Image Database to Evaluate Therapy Response project (19). The CT images were reconstructed by using six protocols that combined two reconstruction algorithms (lung and standard) and three section thicknesses (1.25 mm, 2.5 mm, and 5 mm) (10). For one lesion per patient (29 primary and three metastatic lesions), a tumor volume of interest was obtained from a consensus among the manual segmentations by three radiologists. After resampling the volume-of-interest voxels to 0.5 3 0.5 3 0.5 mm 3 by using a trilinear interpolation, 89 radiomic features were calculated for the six imaging protocols (two reconstructions 3 three section thicknesses) and for each of the 64 scans (32 patients with two scans each).

Compensation Method
To correct for differences in features caused by the various imaging protocols, we used the ComBat function (https://github. com/Jfortin1/ComBatHarmonization) compensation method (15). This method has been used for cortical thickness measurements from MR images (20) and for radiomic features from different PET protocols (16). It is a data-driven method Summary Nonbiological differences related to CT scanner type can be removed from radiomic feature values, allowing radiomics features to be combined in multicenter or multivendor studies.

Key points
n Radiomic feature values obtained by using different CT imaging protocols or scanners can be corrected for the protocol or scanner effect by using the proposed compensation method.
n The use of realigned features will enable multicentric studies to pool data from different sites to build reliable radiomic models based on large databases.
n The proposed compensation method is easily available, fast, and requires neither phantom acquisition nor feature recalculation.
transformation to express all data in a common space devoid of batch effects. It has been shown (16) to be effective in PET to realign the radiomic feature distributions between three different protocols for healthy liver tissue and breast lesions, without altering the biologic information. The purpose of our study was therefore to determine whether this compensation method could also correct for the CT protocol effect by using phantom and patient data.

Materials and Methods
All patient data were anonymized and are publicly available in the supplemental data of Kim et al (9) and Lu et al (10). All authors had control of the data and information submitted for publication.

Phantom Experiments
The phantom data used in our study have been produced by Mackin et al (4) and are publicly available in the associated supplemental data (4). The Credence Cartridge Radiomics phantom consists of 10 layers with different materials corresponding to different texture patterns. This phantom was scanned by using 17 different imaging protocols from four medical institutes involving various reconstruction kernels, scan types, section thickness, pixel spacing, spiral pitch factor, and effective milliamperage. Additional information on phantom and acquisition characteristics are provided in Tables E1 and E2 (online). For each layer, 16 nonoverlapping volumes of interest with an average cubic volume of 8 cm 3 (range, 7.6-9 cm 3 ; corresponding to 2708-14 332 voxels depending on the imaging protocols) are also made available in Dicom-RTstruct format. For each volume of interest and each imaging protocol, we (F.O. and C.N., with 7 years and 20 years of research experience in medical imaging, respectively) computed 40 radiomic features by using the LIFEx freeware (17) (Inserm, Orsay, France, www.lifexsoft.org; Table E3 [online]) with a fixed bin size (18) set to 10 HU between -1000 HU and 3000 HU without any spatial resampling. We performed the radiomic analysis for 16 of 17 imaging protocols because of a reading issue with acquisition Credence Cartridge Radiomics 1-GE2.

Patients
Publicly available radiomic features from two patient databases (cohort 1 and cohort 2) were used in our study. The first set Note.-Unless otherwise indicated, data are the number of patients and data in parentheses are percentages. * Data in parentheses are range.
55 and 10 texture patterns were less than .05 before compensation (Tables 2, E6 [online]). Only one P value for skewness was greater than .05 for pattern 7 (dense cork; P = .46). After compensation, all P values of Friedman tests were greater than .05, which indicated that the protocol effect was no longer detectable. These results were visually confirmed by using the projection of the data in the space spanned by the first two principal components of principal component analysis. Figure 1 shows an overlap between textural patterns before ComBat because of the large variability of radiomic feature values computed from 16 different CT protocols. For each textural pattern (each color), several clusters corresponding to different CT protocols could be identified. After ComBat, textural patterns could be clearly distinguished and were no longer composed of different clusters, demonstrating that the compensation method properly corrected for the scanner effect while retaining the specific characteristics of each texture pattern. The variance explained by the first two components was higher after ComBat (65.6% vs 53.2%), with approximately the same features contributing to the first two principal components before and after compensation (data not shown).
On the basis of three CT acquisitions (GE1, P2, and S2; Table  E2 [online]), Figure 2 shows that when data were pooled without realignment, the sensitivity for distinguishing cork from dense cork was 67% (32 of 48 volumes of interest) with a specificity of 98% (47 of 48 volumes of interest) by using the cutoff maximizing the Youden index. After ComBat, both sensitivity and specificity were 100% (48 of 48 volumes of interest). For unbalanced groups, Figure E1 (online) shows that the compensation method also yielded a perfect distinction between these two patterns.

Patient Data
For patient cohorts 1 and 2, 100% (10 of 10) and 98% (87 of 89), respectively, of Friedman tests had P values less than that identifies the protocol effect assuming that the value of each feature, y, measured in volume of interest, j, with imaging protocol, i, can be written as where a is the average value for feature y ij , g i is an additive protocol effect, and i E is a multiplicative protocol effect affected by an error term (´i j ). The compensation consists in estimating the model parameters a, g i, and i E by using a maximum likelihood approach on the basis of the set of available observations:

Statistical Analysis
Statistical analysis was performed by using software (R; R foundation for Statistical Computing). To determine whether the protocol setting (independent variable i in Equation [1]) affected the distributions of radiomic feature values (dependent variables y ij in Equation [1]), we (F.O. and F.F., with 30 years of experience) performed two-sided Friedman tests before and after ComBat compensation for each feature as summarized in Table E5 (online). The null hypothesis is that there is no difference between the distributions. Benjamini-Hochberg procedure was used to control the false discovery rate (21). P values less than .05 indicated statistical significance. Because the goal of the compensation is to realign the distributions in terms of mean and standard deviation, a P value of the Friedman test greater than .05 indicated that the realignment was successful.
For the phantom data, we also performed a principal component analysis of the 2560 samples (16 volume of interest 3 10 texture patterns 3 16 imaging protocols) described by 40 variables (radiomic features). Principal component analysis was performed before and after ComBat to view the effect of the compensation method on the distinction between patterns. We also studied whether two textural patterns could be distinguished when pooling data from the three imaging protocols before and after compensation and for balanced and unbalanced groups.

Results
Patient characteristics are shown in Table 1.

Phantom Experiments
In the phantom data, 399 of 400 P values of the Friedman tests performed for all features on the basis of 16 imaging protocols tributions when Friedman tests remained statistically significant after ComBat showed that the residual difference between protocols was always small and that the protocol effect was reduced (Fig E2 [online]), demonstrating the effectiveness of .05 between imaging protocols before ComBat (Tables 2, E7,  E8 [online]). After ComBat, 30% (three of 10) of P values for cohort 1 and 15% (13 of 89) of P values for cohort 2 were less than .05. Visual inspection of the radiomic feature value dis-   Table E2 [online]). When pooling all radiomic feature values, the optimal cutoff could not perfectly distinguish the patterns; a perfect distinction was observed after compensation of scanner effects. Se = sensitivity, Sp = specificity.

57
the plot shows a shift in distribution with greater homogeneity values for reconstruction algorithm S5 than for S3 and filtered back projection. This was expected because reconstruction S5 involved higher noise reduction. After compensation, the dis-the compensation. ComBat corrected the protocol effect with a realignment of feature values among the three protocols for cohort 1 and among the six protocols for cohort 2 (homogeneity feature in Fig E3 [online]). For example, before ComBat, and scanners, and the actual effect on diagnostic performance on clinical data needs to be demonstrated. Independent multicenter validation of radiomic models is also essential for them to become mainstream (1,25,26).
In conclusion, ComBat makes it possible to pool radiomic features from different CT protocols. This method appears promising to address the center effect in multicenter radiomic studies and to possibly raise the statistical power of those studies. ComBat is data driven, which means that the transformations identified by ComBat to set all data in a common space should be estimated for each study involving data from different centers and protocols. Our analysis was on the basis of less than 50 patients for each acquisition protocol, which demonstrated the efficiency of the method even for small patient cohorts. By using simulations in which we gradually removed patient data (results not shown), we found satisfactory results with as few as 20 patients per imaging protocol. The minimum number of patients required per imaging protocol to successfully apply ComBat remains to be comprehensively investigated. tributions between the three reconstructions better overlapped. Figure 3 shows three examples of realignment of features between different reconstruction algorithms, reconstruction kernels, and section thicknesses.

Discussion
Radiomic features are sensitive to the acquisition and reconstruction parameters of CT images. Feature values are therefore not directly comparable between different imaging protocols, which limits their use in multicenter studies. We demonstrated that the ComBat method can realign radiomic features computed from different CT imaging protocols. By using phantom data, we showed that ComBat removed the scanner and protocol effect while preserving the differences between texture patterns. The correction for the scanner effect was confirmed by using patient images reconstructed with different imaging protocols.
The use of this compensation method should facilitate multicentric radiomic analyses that are needed to demonstrate the practical usefulness of radiomic features for patient care. Data standardization is a subject of interest in the international imaging community, with increasing awareness of the need to reduce the variability in image quality between centers and machines (22,23). ComBat offers a solution to realign radiomic features with several advantages. ComBat is easily available to all and fast (a function available for free in R software; R Project for Statistical Computing). The transformations are estimated on the basis of the measured feature values, without the need to go back to images or to perform phantom experiments. No learning set is needed. ComBat does not change the feature definitions (6,11) and can therefore be used with all software and algorithms and with any radiomic features. This is illustrated in three data sets by using three different implementations for the radiomic feature calculation (Table E5 [online]). ComBat does not require spatial resampling of the CT images to a single pixel size and/or image filtering (5). It is applicable when radiomic features values are available while images are not available. ComBat can account for covariates of interest if the patients scanned with different imaging protocols do not have the same characteristics (eg, different age distributions). ComBat can model these covariates in the compensation process, as illustrated in PET imaging with different proportions of cancer subtypes in different departments (16), if enough patients with the same characteristics are available.
As demonstrated by using the phantom data (Table 2), before compensation the values of radiomic features were significantly different between imaging protocols for a given pattern. Ignoring this effect in multicentric studies might bias the findings and reduce statistical power. A recent study (24) showed that the four features selected in Aerts et al (13) to build the radiomic signature were correlated with tumor volume, which might explain why the model remained robust on data from different centers. The use of the ComBat method might help to determine whether radiomic features reflecting the lesion biologic heterogeneity but affected by the center effect more than the lesion volume also have some predictive value.
Our study had limitations. Our findings should be confirmed by using other cancer types for other imaging protocols