Extracranial Soft-Tissue Tumors: Repeatability of Apparent Diffusion Coefficient Estimates from Diffusion-weighted MR Imaging

Jessica M Winfield, PhD1,2, Nina Tunariu, MD1,2, Mihaela Rata, PhD1,2, Keiko Miyazaki, PhD1,2, Neil P Jerome, PhD1,2, Michael Germuska, PhD1,2,a, Matthew D Blackledge, PhD1,2, David J Collins, BA1,2, Johann S de Bono, MD, PhD3,4, Timothy A Yap, MD, PhD3,4, Nandita M deSouza, MD1,2, Simon J Doran, PhD1,2, Dow-Mu Koh, MD1,2, Martin O Leach, PhD1,2, Christina Messiou, MD1,2, and Matthew R Orton, PhD1,2
1Cancer Research UK Cancer Imaging Centre, Division of Radiotherapy and Imaging, The Institute of Cancer Research and Royal Marsden Hospital, 123 Old Brompton Road, London SW7 3RP, UK


Introduction
Body diffusion-weighted magnetic resonance imaging (DW-MRI) is well-established as a qualitative and quantitative technique in oncology (1). The simplest quantitative metric derived from DW-MRI is the apparent diffusion coefficient (ADC), which is estimated by fitting a mono-exponential curve to the measured signal at two or more diffusion weightings (b-values). Baseline ADC estimates or post-treatment changes in ADC have been shown to be indicative of response to chemotherapy and/or chemoradiation therapy in many tumor types, including rectal adenocarcinoma (2), hepatic metastases of colorectal (3) and gastric cancers (4), cervical cancer (5), breast cancer (6), head-and-neck squamous cell carcinoma (7), ovarian cancer (8), and non-small cell lung cancer (9).
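As a minimal sketch of the mono-exponential model mentioned above, S(b) = S0 exp(-b ADC), the following Python estimates ADC by linear least squares on the log-signal. The function name `fit_adc`, the b-values, and the signal values are illustrative assumptions; the fitting software used in any particular study may implement the fit differently (for example, with a non-linear fit).

```python
import math

def fit_adc(b_values, signals):
    """Log-linear least-squares fit of S(b) = S0 * exp(-b * ADC).

    b_values: diffusion weightings in s/mm^2 (at least two distinct values).
    signals:  measured signal at each b-value (must be positive).
    Returns (adc, s0) with adc in mm^2/s.
    """
    if len(b_values) != len(signals) or len(b_values) < 2:
        raise ValueError("need two or more matching b-values and signals")
    log_s = [math.log(s) for s in signals]
    n = len(b_values)
    mean_b = sum(b_values) / n
    mean_l = sum(log_s) / n
    sxx = sum((b - mean_b) ** 2 for b in b_values)
    sxy = sum((b - mean_b) * (l - mean_l) for b, l in zip(b_values, log_s))
    slope = sxy / sxx                  # slope of log-signal against b
    adc = -slope                       # ADC is minus the slope
    s0 = math.exp(mean_l - slope * mean_b)
    return adc, s0

# Hypothetical noise-free example: ADC = 1.2e-3 mm^2/s, S0 = 1000
b_example = [0, 100, 500, 900]
s_example = [1000.0 * math.exp(-b * 1.2e-3) for b in b_example]
adc_example, s0_example = fit_adc(b_example, s_example)
```

With noise-free synthetic data the fit recovers the ADC used to generate the signals; with real data the estimate depends on noise, the choice of b-values, and the fitting method.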
As with all quantitative metrics, the repeatability of ADC estimates determines the ability of the technique to detect treatment-induced changes, thereby influencing the number of patients required for clinical trials and determining the size of post-treatment changes that can be detected in individual patients. Repeatability is usefully defined as "closeness of the agreement between the results of successive measurements of the same measurand carried out under the same conditions of measurement" (10) where, in imaging studies, repeatability conditions include use of the same scanner, imaging protocol, observers, and repetition after a short interval (typically 1 hour to 7 days). In DW-MRI studies that report ADC estimates, the "measurand" is usually the mean or median of ADC estimates from voxels in a tumor. On the other hand, reproducibility may be defined as "closeness of the agreement between the results of measurements of the same measurand carried out under changed conditions of measurement" (10) e.g. using a different MR scanner. The inter-scanner reproducibility of ADC estimates is particularly important in multi-center studies where it has been shown that good quality diffusion-weighted images with reproducible ADC estimates across platforms can be obtained following careful optimization of imaging protocols (11).
Exploratory DW-MRI studies in clinical trials often incorporate ADC repeatability estimates, usually by obtaining two baseline examinations with the second examination during the same visit (so-called 'coffee-break' repeatability study) or at a second visit one or more days later. The requirement for two baseline examinations increases the burden on patients, which may reduce recruitment or retention rates, and requires additional scanner time and resources, which may be difficult to accommodate in busy radiology departments. It would be advantageous to estimate ADC repeatability from previous studies, but this would only be feasible if repeatability was broadly the same across studies, despite variations in imaging protocol, tumor type or patient cohort; large differences in repeatability would argue strongly for study-specific repeatability estimates. The variety of repeatability metrics reported in the literature hinders comparison between studies, and a framework for assessment of the technical performance of quantitative imaging biomarkers has been proposed by the Radiological Society of North America (RSNA) Quantitative Imaging Biomarkers Alliance (QIBA) (12,13). The QIBA framework recommends reporting repeatability using the within-subject standard deviation, limits of agreement, repeatability coefficient, intraclass correlation coefficient, and within-subject coefficient of variation; QIBA also emphasise the importance of reporting measurement conditions. A detailed investigation of ADC repeatability across a wide range of studies using the QIBA framework is therefore desirable.
The aim of this study was to assess ADC repeatability using the framework proposed by QIBA in extra-cranial soft-tissue DW-MRI studies to investigate whether ADC repeatability differs between studies carried out using different imaging protocols and patient populations over a period of 10 years at a single institution.

Study population
Nine patient studies and one healthy volunteer study were included in this analysis. All studies were approved by relevant National Research Ethics Committees. All patients and volunteers gave their written consent to participate in the studies. Only repeatability data from double-baseline examinations are reported here; post-treatment changes were outside the scope of this study but have been reported in the literature for some studies (14-18). Tables 1 and 2 describe the subjects and DW-MRI protocols for each study (labelled A to K); further information is available in the references given. All studies were carried out at 1.5T using Siemens MAGNETOM Avanto or Aera MR scanners (Table 2). In studies where the imaging study or ADC repeatability study formed a subset of the total cohort, only patients contributing to the ADC repeatability results are reported (studies C and G). In multi-center studies, only data from our center are reported (studies D, E, and K). In studies including intra-cranial and extra-cranial tumors, only extra-cranial data are reported (studies A and F). One result (coefficient of variation of ADC median in study K) has been reported previously (11) but other results from study K were not reported previously. No other results presented here have been reported previously, as publications from the original studies included data from intra-cranial tumors (14,15,19) or data from other centers (17), which are excluded from this analysis.

Image and data analysis
A total of 141 tumors/healthy organs were included in this analysis. All DW-MRI data were fitted using in-house software (Adept, The Institute of Cancer Research, London; or Matlab, Mathworks, Natick, MA). Regions of interest (ROIs) were drawn as described in Table 1. Software, methodology, and observers were fixed within each study; differences between studies reflect changes in technology and personnel (Table 1).
For each tumor/healthy organ, all fitted pixels in the ROIs were combined to create a volume of interest (VOI). Median and mean ADC (ADC median and ADC mean) were estimated for each VOI. Bland-Altman plots of untransformed data show a tendency for differences between pairs of baseline measurements to scale with their ADC value (see Supplemental Material, Figure 5 [online]), in which case it is recommended (13,20) that repeatability (and changes due to treatment) be quantified using a proportional, i.e. ratio-based, measure so that the same measure applies across the range of ADCs encountered. This can most easily be achieved by using the natural logarithm of the data (12,13,20-22), and this was done for all statistical analyses in this study. A paired t-test was used to assess whether there was a significant difference between the first and second baseline measurements in each study.
Repeatability was assessed using the methods recommended by QIBA (13). The within-subject standard deviation (s_W) of the log-transformed ADC estimates was estimated according to Eq. 1, where d_i is the difference between two baseline estimates of log(ADC median) or log(ADC mean) for the i-th VOI, and N is the number of VOIs.
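Eq. 1 is not reproduced in this text. A minimal sketch, assuming the standard two-replicate form s_W = sqrt(sum(d_i^2) / (2N)) that matches the definitions of d_i and N given above, is shown here. The function name `within_subject_sd` and the example values are hypothetical.

```python
import math

def within_subject_sd(baseline_1, baseline_2):
    """Within-subject SD of log(ADC) from paired baseline estimates.

    Assumes two replicates per VOI and the form
    s_W = sqrt(sum(d_i^2) / (2 N)), with d_i the difference of the logs.
    """
    if len(baseline_1) != len(baseline_2) or not baseline_1:
        raise ValueError("need equal-length, non-empty baseline lists")
    diffs = [math.log(a) - math.log(b) for a, b in zip(baseline_1, baseline_2)]
    n_voi = len(diffs)
    return math.sqrt(sum(d * d for d in diffs) / (2 * n_voi))

# Hypothetical example: two VOIs whose log(ADC) differs by 0.1 between baselines
first = [1.0e-3, 2.0e-3]
second = [1.0e-3 * math.exp(0.1), 2.0e-3 * math.exp(-0.1)]
s_w_example = within_subject_sd(first, second)
```

Because the differences are taken on the log scale, the result is unchanged if both baselines of a VOI are multiplied by the same factor, which is the proportional behavior motivated by the Bland-Altman plots of untransformed data.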
The within-subject coefficient of variation (CoV) (23), 95% limits of agreement (LoA), and repeatability coefficient (RC), which depend only on s_W, were estimated according to Eqs. 2 to 4. The intra-class correlation coefficient (ICC) was estimated according to Eq. 5, where s_B is the between-subject standard deviation.
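The equations for CoV, LoA, and RC are not reproduced in this text. The sketch below assumes forms commonly used for log-transformed data: RC = 1.96 sqrt(2) s_W on the log scale, CoV = sqrt(exp(s_W^2) - 1), and LoA = exp(±RC) - 1 expressed as percentages. These forms, and the function name `repeatability_metrics`, are assumptions rather than a transcription of the original equations.

```python
import math

def repeatability_metrics(s_w):
    """Derive RC, CoV, and 95% LoA from the within-subject SD of log(ADC).

    Assumed forms (log-transformed data, two replicates):
      RC  = 1.96 * sqrt(2) * s_W          (log scale)
      CoV = sqrt(exp(s_W^2) - 1) * 100    (percent)
      LoA = (exp(+/-RC) - 1) * 100        (percent)
    """
    rc = 1.96 * math.sqrt(2.0) * s_w
    return {
        "rc_log": rc,
        "cov_percent": 100.0 * math.sqrt(math.exp(s_w ** 2) - 1.0),
        "loa_lower_percent": 100.0 * (math.exp(-rc) - 1.0),
        "loa_upper_percent": 100.0 * (math.exp(rc) - 1.0),
    }

# Hypothetical input chosen for illustration
metrics_example = repeatability_metrics(0.041)
```

Note that the limits of agreement are asymmetric after back-transformation: the upper limit is larger in magnitude than the lower limit, as in the aggregate results reported in this study.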
is the within-subject mean squares, K is the number of replications (K=2 for all studies in this analysis), Y_ik is the observed value of log(ADC median) or log(ADC mean) for the i-th VOI at the k-th replication, Ȳ_i is the average over replications for the i-th VOI, and Ȳ is the grand mean of log(ADC median) or log(ADC mean) over all observations (24).
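Eq. 5 is not reproduced in this text. As a hedged illustration, the sketch below assumes the usual one-way random-effects construction: between-subject mean squares BMS = K/(N-1) sum (Ȳ_i - Ȳ)^2, within-subject mean squares WMS = 1/(N(K-1)) sum sum (Y_ik - Ȳ_i)^2, s_B^2 = (BMS - WMS)/K, and ICC = s_B^2 / (s_B^2 + WMS). The names `BMS`, `WMS`, and `icc_one_way`, and the clamping of a negative variance estimate to zero, are assumptions made for this example.

```python
def icc_one_way(values):
    """One-way random-effects ICC from replicate log(ADC) values.

    values: list with one entry per VOI, each a list of K replicates.
    Assumes the same K (>= 2) for every VOI and at least two VOIs.
    """
    n_voi = len(values)
    k = len(values[0])
    if n_voi < 2 or k < 2 or any(len(v) != k for v in values):
        raise ValueError("need >= 2 VOIs with the same number (>= 2) of replicates")
    voi_means = [sum(v) / k for v in values]
    grand_mean = sum(voi_means) / n_voi
    bms = k * sum((m - grand_mean) ** 2 for m in voi_means) / (n_voi - 1)
    wms = sum((y - m) ** 2 for v, m in zip(values, voi_means) for y in v)
    wms /= n_voi * (k - 1)
    # Design choice for this sketch: clamp a negative variance estimate to zero
    s_b_squared = max((bms - wms) / k, 0.0)
    return s_b_squared / (s_b_squared + wms)

# Hypothetical examples
icc_perfect = icc_one_way([[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]])
icc_none = icc_one_way([[1.0, 3.0], [1.0, 3.0]])
```

For K=2, WMS computed this way equals sum(d_i^2)/(2N), so it coincides with the square of the two-replicate within-subject standard deviation.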
The 95% confidence intervals (CI) for s_W were estimated from centiles of the χ² distribution with N degrees of freedom (24).
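The exact expression for this interval is not reproduced in this text. A minimal sketch, assuming the standard form sqrt(N s_W^2 / χ²(0.975, N)) to sqrt(N s_W^2 / χ²(0.025, N)), is given below. To keep the example dependency-free, the χ² centiles are approximated with the Wilson-Hilferty formula rather than computed exactly; the function names are hypothetical.

```python
import math
from statistics import NormalDist

def chi2_quantile_wh(p, dof):
    """Wilson-Hilferty approximation to the p-th quantile of chi-square.

    An approximation only; it degrades for very small dof and should be
    replaced by an exact quantile function where one is available.
    """
    z = NormalDist().inv_cdf(p)
    c = 2.0 / (9.0 * dof)
    return dof * (1.0 - c + z * math.sqrt(c)) ** 3

def s_w_confidence_interval(s_w, n_voi, level=0.95):
    """Approximate CI for s_W, assuming N * s_W^2 / sigma^2 ~ chi-square(N)."""
    tail = (1.0 - level) / 2.0
    upper_q = chi2_quantile_wh(1.0 - tail, n_voi)
    lower_q = chi2_quantile_wh(tail, n_voi)
    lower = math.sqrt(n_voi * s_w ** 2 / upper_q)
    upper = math.sqrt(n_voi * s_w ** 2 / lower_q)
    return lower, upper

# Hypothetical example: s_W = 0.04 estimated from 20 VOIs
ci_lower, ci_upper = s_w_confidence_interval(0.04, 20)
```

The interval is asymmetric about s_W and narrows as the number of VOIs increases, which is why larger repeatability cohorts give tighter estimates.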
The 95% CI for the ICC were estimated from the p-th centiles of the F distribution with d_1 and d_2 degrees of freedom (25).
In addition to analysis of each study individually, VOIs were grouped into small, medium, and large, regardless of study (i.e. smallest 1/3, middle 1/3 and largest 1/3 of VOIs) and repeatability assessed for the three groups (47 VOIs per group). Finally, VOIs were aggregated from all studies and repeatability assessed for 141 VOIs together.
Levene's test for homoscedasticity (LeveneAbsolute, vartestn, Matlab 2016a) was used to assess whether repeatability differed between studies (24). Baseline differences were calculated for each VOI for log(ADC mean ) and log(ADC median ) and Levene's test used to assess whether the variance of the differences was the same for all studies; Levene's test was also used to assess whether repeatability differed between small, medium, and large VOIs.
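To illustrate the idea behind this test, the sketch below computes the Levene statistic from absolute deviations about each group mean, W = ((N_total - k)/(k - 1)) × (between-group sum of squares of the deviations) / (within-group sum of squares of the deviations). It is a plain-Python illustration, not the Matlab routine used in this study; the function name `levene_statistic` is hypothetical, and the p-value (which requires the F distribution with k-1 and N_total-k degrees of freedom) is omitted.

```python
def levene_statistic(groups):
    """Levene's W using absolute deviations from each group's mean.

    groups: list of lists, e.g. baseline differences of log(ADC) per study.
    Returns the W statistic only; larger W suggests unequal variances.
    """
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    if k < 2 or n_total <= k:
        raise ValueError("need at least two groups and more values than groups")
    # Absolute deviation of every value from its own group mean
    deviations = []
    for g in groups:
        g_mean = sum(g) / len(g)
        deviations.append([abs(y - g_mean) for y in g])
    dev_means = [sum(z) / len(z) for z in deviations]
    dev_grand = sum(sum(z) for z in deviations) / n_total
    between = sum(len(z) * (zm - dev_grand) ** 2
                  for z, zm in zip(deviations, dev_means))
    within = sum((v - zm) ** 2
                 for z, zm in zip(deviations, dev_means) for v in z)
    return (n_total - k) / (k - 1) * between / within

# Hypothetical examples: unequal spread, then equal spread
w_unequal = levene_statistic([[1.0, 2.0, 3.0], [2.0, 4.0, 6.0]])
w_equal = levene_statistic([[1.0, 2.0, 3.0], [11.0, 12.0, 13.0]])
```

Shifting a group by a constant does not change W, so the statistic responds to differences in spread (here, repeatability) rather than differences in mean ADC between studies.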
Pearson's linear correlation coefficient (Matlab 2016a) was used to assess correlation between CoV and the year the study started, the number of VOIs in the study, and the median volume of VOIs in the study.

Results
The repeatability of ADC mean was similar to the repeatability of ADC median in all studies (Tables 3 and 4 [online]); for clarity, only ADC median is shown in Figures 1 to 4. Bland-Altman plots showed no relationship between differences between pairs of baseline measurements and their means (Figure 1). None of the studies showed a significant difference between pairs of baseline measurements (paired t-test, p>0.05). The repeatability of ADC median (Table 3) and ADC mean (Table 4 [online]) was good, with CoVs between 1.7% and 6.3% for ADC median and between 1.7% and 6.5% for ADC mean for all studies (Figure 2). Aggregating VOIs from all studies, CoV was 4.1% for ADC median and 3.9% for ADC mean, with upper and lower 95% LoA of 12.1% and -10.8% respectively for ADC median and 11.5% and -10.3% for ADC mean. Levene's test showed a significant difference between studies (p=0.01 for ADC median and ADC mean), which did not persist after excluding the study with the lowest CoV (study B, which included some of the largest VOIs).
There was no correlation between the CoV and the year the studies started (Figure 3a, r=-0.4, p=0.2 for ADC median; r=-0.3, p=0.3 for ADC mean) nor between the CoV and the number of VOIs in each study (Figure 3b, r=-0.3, p=0.3 for ADC median; r=-0.4, p=0.2 for ADC mean). Only weak correlation was demonstrated between the CoV and the median VOI volume in each study (Figure 3c, r=-0.5, p=0.1 for ADC median and ADC mean), although the CoV is noticeably lower in one study with very large tumors (study B) compared with other studies. Grouping into small, medium, and large VOIs showed a significant difference in ADC repeatability between sizes (Levene's test, p=0.02 for ADC median; p=0.04 for ADC mean) with the lowest CoV for large VOIs (Figure 4). Although 19 VOIs in the 'large' group were from study B, the majority (28 VOIs) were from other studies.

Discussion
The excellent repeatability of ADC median and ADC mean (CoV between 1.7% and 6.5% in all studies) demonstrates that ADC is a robust metric in clinical practice in oncology. (26). A study in head-and-neck squamous cell carcinoma reported a RC of 15% for ADC mean (27). A study of hepatocellular carcinoma reported a CoV of 8.3% and lower and upper LoA of -41.1% and 18.6% respectively for ADC mean (28). In healthy volunteers, a study in abdominal organs reported RCs between 6.4% and 9.6% for ADC mean (29). A study of normal thyroid glands in healthy volunteers, which also followed the QIBA framework, reported s_w² = 0.0147×10⁻³ mm² s⁻¹, RC = 0.3355×10⁻³ mm² s⁻¹, ICC = 0.9273, and CoV = 9.88% using reduced-field-of-view DW-MRI (30). Comparison between published studies is not straightforward since they report different repeatability metrics, but each result is similar to the present analysis for their respective metrics; however, most studies do not report CIs, which further hinders comparison.
The CoV and LoA, expressed as percentages, may be more intuitive for investigators to understand, compared with s_W or RC expressed on a log scale. Although the ICC is listed in the QIBA framework for reporting repeatability, ICC may not be an appropriate metric for comparison between studies as results are scaled to the inter-subject variability of the study cohort via s_B; a low ICC may therefore reflect a homogeneous cohort rather than poor repeatability (13). This is exemplified in study K where ICCs are low (ICC 0.126 to 0.677 in studies K1, K2, and K3) despite CoVs being comparable to other studies. Values of s_B are an order of magnitude lower than in studies A to J, reflecting the narrow range of ADC estimates in healthy organs in the tightly-controlled volunteer cohort. These results strongly suggest that the ICC should not be used to compare ADC repeatability between studies.
Knowledge of ADC repeatability is essential for assessment of post-treatment changes in an individual patient (as opposed to cohort changes, which can be assessed using a t-test, or similar); knowledge of measurement repeatability is also essential in power calculations to estimate the sample size necessary to detect a treatment effect in prospective cohort studies.
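As a rough illustration of how a repeatability estimate can enter a power calculation, the sketch below uses the normal-approximation sample size for a paired pre/post comparison on the log scale, n = (z_{1-α/2} + z_{power})² × 2 s_W² / δ², with δ = log(1 + fractional change). It assumes measurement error is the only source of variability in the pre/post difference, which is optimistic: between-patient variation in treatment response would increase the required sample size. The function name `paired_sample_size` and the input values are hypothetical.

```python
import math
from statistics import NormalDist

def paired_sample_size(s_w, fractional_change, alpha=0.05, power=0.8):
    """Approximate number of patients to detect a cohort-level ADC change.

    Assumes the variance of a pre/post difference in log(ADC) is 2 * s_W^2
    (measurement error only) and uses a two-sided normal approximation.
    """
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1.0 - alpha / 2.0)
    z_power = std_normal.inv_cdf(power)
    delta = math.log(1.0 + fractional_change)   # effect size on log scale
    variance_of_difference = 2.0 * s_w ** 2
    n = (z_alpha + z_power) ** 2 * variance_of_difference / delta ** 2
    return math.ceil(n)

# Hypothetical example: s_W = 0.065 and a 10% increase in ADC
n_example = paired_sample_size(0.065, 0.10)
```

The result should be read as a lower bound under the stated assumptions, not as a recommended cohort size.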
Considering changes in ADC post-treatment, an increase of 12% or more in ADC median or ADC mean would be outside the 95% LoA for all VOIs analysed together. Even considering the studies with the poorest repeatability (i.e. 'worst-case' studies), an increase of 20% would have been outside the 95% LoA in all studies. A tumor exhibiting such a change in ADC after treatment would therefore be assessed as exhibiting a post-treatment effect outside the expected variation of repeated measurements, with 95% confidence, when measured on the same scanner using the same imaging protocol, operator, and reader, i.e. under repeatability conditions. This can be compared with post-treatment changes reported elsewhere: 23% and 24% increases in ADC mean in responding patients with hepatic metastases of colorectal (3) and gastric cancers (4), respectively; and increases of 20% (ADC mean) and 22% (ADC median) in responding ovarian cancer patients treated with platinum-based chemotherapy (8). In studies reporting ADC changes in individual patients, as opposed to cohort changes, post-treatment increases in ADC mean up to 100% were reported in cervical cancer patients following chemoradiotherapy (5) and increases in ADC mean up to 50% were reported in patients with non-small cell lung cancer (9). The excellent repeatability demonstrated in the present analysis therefore shows that ADC is sensitive to changes of the size observed in clinical studies.
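The individual-patient assessment described above can be sketched as a comparison of the log-ratio of post- to pre-treatment ADC against the repeatability coefficient on the log scale. The sketch assumes RC = 1.96 sqrt(2) s_W and is valid only under repeatability conditions; the function name `exceeds_limits_of_agreement` and the example values are hypothetical.

```python
import math

def exceeds_limits_of_agreement(adc_pre, adc_post, s_w):
    """True if the change in ADC lies outside the 95% limits of agreement.

    Works on the log scale: |log(adc_post / adc_pre)| is compared with
    RC = 1.96 * sqrt(2) * s_W, where s_W is the within-subject SD of log(ADC).
    """
    rc = 1.96 * math.sqrt(2.0) * s_w
    log_ratio = math.log(adc_post / adc_pre)
    return abs(log_ratio) > rc

# Hypothetical examples with s_W = 0.041 (ADC in mm^2/s)
change_15_percent = exceeds_limits_of_agreement(1.00e-3, 1.15e-3, 0.041)
change_5_percent = exceeds_limits_of_agreement(1.00e-3, 1.05e-3, 0.041)
```

Working with the ratio rather than the absolute difference means the same threshold applies to tumors with low and high baseline ADC.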
The significant difference between small, medium, and large VOIs shows that volume is an important factor in ADC repeatability. The weak correlation between the CoV and the median VOI volume in each study may reflect the range of tumor sizes within each study. The low CoV of 1.7% in study B may relate to the large tumors in this study. For future studies, the assumption of a CoV of 6.5% would be a conservative choice. It is worthwhile noting that the VOIs did not always encompass the whole tumor: ROIs were drawn around the whole area of the tumor/healthy organ on at least three slices in all studies, but studies A, B, and E included considerably more slices. Larger VOIs may provide more robust estimates of ADC median and ADC mean due to larger sample sizes. Furthermore, larger tumors may be less affected by motion or partial volume effects, which may lead to better ADC repeatability. ADC repeatability in paediatric patients (study F) was not worse than other studies, despite the additional challenges associated with patient compliance in this group.
The apparent absence of a relationship between the CoV and the year the study commenced may suggest that ADC repeatability has not changed markedly over 10 years despite advances in scanner technology and imaging protocol methodology during that time. This suggests that ADC repeatability assessments from older studies may inform future studies, although this may not apply across substantial changes in hardware/methodology, such as a change in field strength. Whilst this analysis only considered ADC repeatability, imaging protocol variations may also affect overall image quality, qualitative interpretation, and absolute values of ADC estimates, but these effects are outside the scope of this analysis. Reasons for variations in imaging protocols include changes in hardware and software capabilities; advances in knowledge; requirements for imaging particular patient cohorts, such as size of field-of-view or orientation of imaging plane; requirements of study sponsors; and requirements to match protocols in multi-center studies.
The apparent absence of a relationship between the CoV and the number of VOIs in the study (over the range 6 to 26 VOIs) may suggest that an informative estimate of repeatability may be obtained from as few as 6 patients, indicating that double-baseline examinations from relatively small subsets of patients may be used to efficiently estimate repeatability for larger studies. Repeatability studies may thus be easily conducted if a center wishes to assess its DW-MRI protocols. Inclusion of larger numbers of subjects, however, allows narrower CIs to be placed on estimated quantities and is advocated in clinical trials.
Repeatability estimates for ADC median and ADC mean do not apply to all summary statistics; for example, other ADC histogram centiles may exhibit poorer repeatability (31). Alternative acquisition techniques, e.g. motion compensation, would also require new repeatability studies. Furthermore, it is common practice to use data from previous imaging studies to develop novel analysis methods, which require assessment of repeatability of resulting metrics in order to evaluate their potential value in clinical practice. Double-baseline studies therefore provide an invaluable resource for future developments of analysis methods.
There are limitations to our analysis. First, all studies were carried out at a single expert center and senior members of staff with extensive experience of extra-cranial DW-MRI were involved in development of imaging protocols for all studies. Second, all but one of the studies were carried out on the same scanner, with the remaining study carried out on a scanner from the same manufacturer; the generality of our conclusions for test-retest measurements across scanners from other manufacturers remains to be tested. Third, only one healthy volunteer study was included. Fourth, many of the studies were sub-studies that formed part of a larger clinical trial and there may be selection bias due to inclusion/exclusion criteria for these trials (e.g. including patients with lesions larger than 2 cm, or excluding patients who had difficulty lying still). Generalization to routine clinical practice remains to be tested but the repeatability of ADC estimates in less controlled situations might be expected to be worse than the repeatability reported here.
In conclusion, ADC is a robust imaging metric which demonstrates excellent repeatability in extra-cranial soft-tissue DW-MRI studies across a wide range of tumor sites, sizes, patient populations, and imaging protocol variations. Estimates of ADC repeatability obtained from similar data can inform studies where double-baseline measurements are not possible, but a double-baseline format remains critical for future studies.

Supplementary Material
Refer to Web version on PubMed Central for supplementary material.

Implications for patient care
DW-MRI can be used to estimate ADC with good repeatability in extra-cranial soft tissues, allowing a post-treatment increase of 12% or more in ADC to be distinguished from measurement variation.

Figure 2: CoV of ADC median for each study (A to K3); all VOIs analyzed together (All); and all tumor VOIs analyzed together (All tumors). Whiskers represent 95% confidence intervals for CoV estimates.

Figure 4: CoV of ADC median for small, medium, and large VOIs, all VOIs together, and all VOIs excluding study B. Error bars represent 95% confidence intervals of CoV estimates.

* Results are shown for each study and for all VOIs analysed together (denoted 'All').
† CoVs from K1, K2, and K3 reproduced from Winfield et al (11) for completeness.
Note: Estimates of ADC median for two baseline examinations for all tumors/organs are tabulated in the Supplemental Material (Table 5).