Reviews and Commentary

The Long Route to Standardized Radiomics: Unraveling the Knot from the End

Published Online: https://doi.org/10.1148/radiol.2020200059

See also the article by Zwanenburg et al in this issue.

Prof Kuhl is the chair of the Department of Diagnostic and Interventional Radiology at RWTH Aachen University. She is a board-certified radiologist, interventional radiologist, and neuroradiologist, with a main interest in oncologic imaging and interventional oncology. She has served, and continues to serve, as principal investigator for several large clinical trials, has authored numerous research papers, and has received a multitude of awards, including the European Magnetic Resonance Award.

Dr Truhn completed his training as a radiologist in the Department of Radiology at the University of Aachen, after studying physics and medicine in London and Aachen. He is currently pursuing a research fellowship at the Institute of Imaging and Computer Vision. He received the Gladys Locke Prize of Imperial College and the Borchers Badge of RWTH Aachen. His research focuses on applications of machine learning in radiology.

In this issue of Radiology, the Image Biomarker Standardization Initiative (IBSI) presents a set of consensus-based reference values for radiomics image features (1) to calibrate and validate radiomics software.

Why is this important? What is radiomics analysis anyway?

When we as radiologists assess what we observe on a CT or MRI scan, we rely on experience. For example, we judge a tumor based on what we have picked up through books and mentors but most of all through our own personal clinical experience. With this approach, radiology has become a success story that is increasingly indispensable to modern medicine. As humans, we must rely on building memories of the radiologic appearance (or imaging phenotype) of diseases according to the cases we have seen and then use pattern recognition to establish correct diagnoses. Yet, the amount of clinical experience we can gather during a lifetime as well as the amount of information passed on to us from former generations of radiologists is limited. Moreover, humans have varying abilities for pattern recognition. Differentiating a simple cyst from an invasive tumor usually poses no challenge. However, if the number of similar cases we have seen before is limited or the differentiating imaging pattern is more subtle, then we may struggle. Differentiating a small hepatocellular carcinoma from a regenerative liver nodule is not always possible. Beyond these typical diagnostic challenges, ample opportunity exists to expand the current role of radiology. For example, can subtle differences of imaging phenotypes be used to distinguish between tumors that are likely or not likely to shed metastases?

This is where radiomics analyses promise to help. Instead of relying on the inaccessible databases of imaging phenotypes stored in radiologists' heads, mathematical formulas translate an image (or rather a manually outlined, ie, segmented, part thereof) into a set of numbers (typically a few hundred), each describing a specific image feature of a disease. These features can be stored and used to build comprehensive, machine-readable, openly accessible databases of the imaging features of a specific disease entity (2). The principle of radiomic analysis is thus to parse image information into a set of machine-readable, quantitative, so-called radiomic features. The distributions or profiles of these features are then correlated with validated diagnoses or outcomes to identify radiomic profiles or patterns characteristic of specific diseases. Once this is accomplished, computers should be able to characterize or classify disease states. Such radiomic analyses factor in a broad variety of quantitative lesion features, which should allow for a more subtle and more reproducible analysis of a tumor's features than human interpretation.
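
To make the principle concrete, the following is a minimal sketch in Python, using a synthetic image and a handful of illustrative first-order features (the feature names and all values here are hypothetical; real radiomics packages compute hundreds of such features across several feature families):

```python
import numpy as np

def first_order_features(image: np.ndarray, mask: np.ndarray) -> dict:
    """Compute a few illustrative first-order radiomic features
    from the voxels inside a binary segmentation mask."""
    voxels = image[mask].astype(float)
    hist, _ = np.histogram(voxels, bins=32)
    p = hist / hist.sum()                      # discretized intensity probabilities
    p = p[p > 0]                               # drop empty bins before taking logs
    return {
        "mean": voxels.mean(),                 # average intensity
        "variance": voxels.var(),              # spread of intensities
        "skewness": (((voxels - voxels.mean()) / voxels.std()) ** 3).mean(),
        "entropy": -(p * np.log2(p)).sum(),    # histogram entropy (bin-dependent!)
    }

# Toy example: a noisy 3D "scan" with a spherical "lesion" mask.
rng = np.random.default_rng(0)
image = rng.normal(100, 10, size=(64, 64, 64))
z, y, x = np.ogrid[:64, :64, :64]
mask = ((z - 32) ** 2 + (y - 32) ** 2 + (x - 32) ** 2) < 15 ** 2
print(first_order_features(image, mask))
```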

In principle, extracting quantitative data from images is not new. Radiologists have long used quantitative measurements extracted from images, such as lung nodule size or the enhancement rates of lesions at CT or MRI, to uncover patterns in data (3). In theory, every hospital could contribute to comprehensive radiomic databases, which could then be exploited to assist in diagnosis or to find subtle patterns that might elude human readers. Not surprisingly, then, hopes were high that the all-embracing approach of casting as much information as possible into exploitable quantitative numbers would lead to radiomic breakthroughs in disease prognostication, treatment prediction, and tumor grading (4). The optimistic belief was that whenever an imaging pattern differentiates between two disease entities (even if it is unnoticeable to the human eye), it could be uncovered by radiomic analyses. However, unlike deep learning (DL) with neural networks, the success of radiomic approaches has lagged behind expectations.

In our own research, radiomic approaches were inferior to neural networks in characterizing breast lesions (5). Moreover, promising radiomic results are published but often cannot be reproduced when the same algorithms are applied in different environments. The reasons for this are manifold. For instance, researchers might report seemingly significant differences in radiomic features between two groups of diseases that are in fact attributable to pure chance: when hundreds of radiomic features are correlated with a given outcome, some features will almost certainly correlate with the condition in question by accident. Fortunately, such practical errors have become less frequent as the field of radiomic analysis matures. Still, research in radiomics is plagued by the problem that the quantitative results of radiomic analyses differ between research groups. One important reason for this lack of reproducibility is a lack of standardization, which is addressed by the current article.
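
This multiple-testing pitfall is easy to demonstrate. The following sketch (all numbers hypothetical) correlates 500 purely random "features" with a purely random outcome in 50 patients; without correction, roughly 25 features will typically appear "significant" at P < .05 by chance alone:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_patients, n_features = 50, 500

# Purely random "radiomic features" and a purely random binary outcome:
# by construction, there is no true association whatsoever.
features = rng.normal(size=(n_patients, n_features))
outcome = rng.integers(0, 2, size=n_patients)

# Pearson correlation of every feature with the outcome.
centered = features - features.mean(axis=0)
oc = outcome - outcome.mean()
r = (centered * oc[:, None]).sum(axis=0) / (
    np.sqrt((centered ** 2).sum(axis=0)) * np.sqrt((oc ** 2).sum())
)

# Two-sided p-values via the t distribution.
t = r * np.sqrt((n_patients - 2) / (1 - r ** 2))
p = 2 * stats.t.sf(np.abs(t), df=n_patients - 2)
print(f"'Significant' at p < .05: {(p < 0.05).sum()} of {n_features}")
# All of these pass by chance; correction for multiple testing
# (eg, Bonferroni or false discovery rate) is essential.
```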

Between image acquisition and extraction of radiomic features, a multitude of choices can influence the variability of radiomic output and affect reproducibility. Using CT as an example, image acquisition parameters must first be selected; reducing the tube current, for example, increases noise and thereby influences radiomic texture features. Second, the reconstruction algorithm must be chosen. Reconstruction algorithms differ both between and within vendors; choosing iterative reconstruction, for instance, will usually result in a smoother image, influencing texture and margin features. Third, the abnormality must be outlined. This manual segmentation is another source of variability, because different radiologists may define different lesion outlines, again influencing texture- and shape-related features. Fourth, the software used to extract the radiomic features will itself affect the resulting numbers.
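
This last point is easily underestimated. As a hypothetical illustration, even a feature as simple as intensity histogram entropy changes its value when two software packages merely use different default numbers of gray-level bins, with the image and segmentation held fixed:

```python
import numpy as np

rng = np.random.default_rng(7)
lesion_voxels = rng.normal(60, 15, size=5000)  # simulated HU values inside a lesion

def histogram_entropy(values: np.ndarray, n_bins: int) -> float:
    """Intensity histogram entropy: its value depends directly on how
    the software discretizes the gray levels."""
    hist, _ = np.histogram(values, bins=n_bins)
    p = hist / hist.sum()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

# The "same" feature from the same voxels, under three software defaults:
for n_bins in (16, 32, 64):
    print(n_bins, "bins ->", round(histogram_entropy(lesion_voxels, n_bins), 3))
# The output differs with each choice, even though image, lesion, and
# segmentation are identical -- hence the need for standardization.
```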

To unravel this knot, one needs to start from the end and first standardize the software framework used to extract radiomic features. This is the aim of the IBSI in this issue of Radiology.

Twenty-five teams were each given a standardized set of images (one digital phantom and one patient CT scan) with associated segmentations. Thus, the variability associated with image acquisition, reconstruction, and segmentation (steps one to three as described earlier) that could confound the values of the respective radiomic features was excluded. Only the variability due to different implementations of the radiomic algorithm remained.

A real-world scenario was simulated in which each research group used its own software environment to calculate radiomic features. Features were communicated with use of a short descriptor (eg, major axis length). A list of 174 features was distributed to the participating groups. Each group was then challenged to use its own radiomic algorithm to calculate these features and to return and compare values in an iterative process until consensus on the values was reached among the groups. Ultimately, the study found that 169 features yielded at least moderate agreement. To validate these features (ie, to determine whether they would yield reproducible quantitative results), the consortium calculated the 169 radiomic features in a small data set of CT, MRI, and PET images of 51 patients with soft-tissue sarcoma. Of the 169 preselected features, 167 demonstrated good to excellent reproducibility.
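
At its core, this process amounts to comparing independently computed values for the same feature against a common reference. The following is a deliberately simplified stand-in (teams, values, and tolerance are all hypothetical; the consortium's actual procedure was iterative and considerably more nuanced):

```python
import numpy as np

# Hypothetical submissions: each team's value for the same feature,
# computed on the shared phantom with its own software.
submissions = {
    "team_A": 42.1,
    "team_B": 42.3,
    "team_C": 42.2,
    "team_D": 55.0,   # outlier, eg, a different discretization default
}

def consensus(values, rel_tol=0.05):
    """Flag whether submitted values agree within a relative tolerance
    of their median, a crude stand-in for the iterative comparison."""
    vals = np.array(list(values))
    ref = np.median(vals)
    agree = np.abs(vals - ref) <= rel_tol * abs(ref)
    return ref, agree

ref, agree = consensus(submissions.values())
for (team, value), ok in zip(submissions.items(), agree):
    print(f"{team}: {value:g} {'agrees' if ok else 'must recheck implementation'}")
print(f"reference value: {ref:g}")
```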

These validated radiomic reference values will allow the scientific community to test and validate individual radiomic algorithms. Through this process, it is hoped that more reproducible and comparable results will be obtained in future radiomic research. However, as the authors acknowledge, not all groups implemented all of the radiomic features specified by the IBSI, thus limiting the testing of the more exotic radiomic features. Nevertheless, this work is an important first step toward identifying a stable and reproducible set of radiomic features. It has already been shown that only 71 of a set of 177 radiomic features remained reproducible across a comprehensive analysis of inter- and intrascanner variation in CT acquisition parameters (6). We hope that the results of this IBSI endeavor, together with further research on how radiomic features vary with segmentation by different radiologists, will result in a common canon of stable and reproducible features for use in clinical research. In any case, this will be a long and rocky road.

In view of this plethora of challenges and difficulties, the question arises whether radiomic analyses are here to stay or whether DL with convolutional neural networks (their artificial intelligence sister technology) will prevail. Although both radiomics analyses and DL belong to the broad field of machine learning, they differ in crucial aspects. As explained, a radiomic analysis uses digital data from a predefined, radiologist-delineated region of an image to calculate radiomic features according to human-defined, predefined mathematical formulas. DL, by contrast, requires neither a manual segmentation of a lesion nor the definition of mathematical formulas. Rather, for DL, a computer algorithm uses a large pool of digital images with validated diagnoses that are annotated as containing or not containing a given disease or condition. When the training pool of images is sufficiently large, the algorithm independently "learns" to distinguish disease from nondisease by contrasting images known to contain disease with images known not to contain disease. Accordingly, for DL, no individual features are separated or calculated; rather, the algorithm itself develops the criteria that are useful for distinguishing "condition present" from "condition absent." The output of such an algorithm is an overall probability that a disease state is present. The challenge with DL is that it requires large data sets with validated diagnoses (ground truth), as are typically found, for example, in screening applications. DL has already been shown to outperform experienced breast radiologists in the interpretation of screening mammograms (7).
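
The contrast between the two routes can be sketched in a few lines of code. This is a toy illustration with synthetic data: the features, the TinyCNN architecture, and all numbers are hypothetical and do not correspond to any published model:

```python
import numpy as np
import torch
import torch.nn as nn
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# --- Radiomics route: human-defined formulas applied to a segmented
#     region, then a conventional classifier on the feature vectors. ---
def handcrafted_features(image, mask):
    v = image[mask]
    return np.array([v.mean(), v.std(), v.max() - v.min()])

# Synthetic stand-in data: 40 labeled images with segmentation masks.
images = rng.normal(size=(40, 128, 128))
masks = np.ones((40, 128, 128), dtype=bool)
labels = rng.integers(0, 2, size=40)

X = np.stack([handcrafted_features(im, mk) for im, mk in zip(images, masks)])
clf = LogisticRegression().fit(X, labels)  # the features are fixed in advance

# --- Deep learning route: no segmentation, no predefined formulas;
#     a CNN learns its own discriminative features end to end. ---
class TinyCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(8, 16, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(16, 1),  # single logit: "condition present"
        )

    def forward(self, x):
        return self.net(x)

model = TinyCNN()
prob = torch.sigmoid(model(torch.randn(1, 1, 128, 128)))
print(f"P(condition present) = {prob.item():.2f}")  # overall probability
```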

Whether it is radiomic analyses, DL, or both approaches that will be exploited for different clinical applications, computer algorithms will soon expand and augment our ability to care for patients.

Disclosures of Conflicts of Interest: C.K.K. disclosed no relevant relationships. D.T. disclosed no relevant relationships.

References

1. Zwanenburg A, Vallières M, Abdalah MA, et al. The Image Biomarker Standardization Initiative: standardized quantitative radiomics for high-throughput image-based phenotyping. Radiology 2020;295(2):328–338.
2. Aerts HJ. The potential of radiomic-based phenotyping in precision medicine: a review. JAMA Oncol 2016;2(12):1636–1642.
3. MacMahon H, Naidich DP, Goo JM, et al. Guidelines for management of incidental pulmonary nodules detected on CT images: from the Fleischner Society 2017. Radiology 2017;284(1):228–243.
4. Aerts HJ, Velazquez ER, Leijenaar RT, et al. Decoding tumor phenotype by noninvasive imaging using a quantitative radiomics approach. Nat Commun 2014;5(1):4006.
5. Truhn D, Schrading S, Haarburger C, Schneider H, Merhof D, Kuhl C. Radiomic versus convolutional neural networks analysis for classification of contrast-enhancing lesions at multiparametric breast MRI. Radiology 2019;290(2):290–297.
6. Berenguer R, Pastor-Juan MDR, Canales-Vázquez J, et al. Radiomics of CT features may be nonreproducible and redundant: influence of CT acquisition parameters. Radiology 2018;288(2):407–415.
7. McKinney SM, Sieniek M, Godbole V, et al. International evaluation of an AI system for breast cancer screening. Nature 2020;577(7788):89–94.

Article History

Received: Jan 7 2020
Revision requested: Jan 13 2020
Revision received: Jan 14 2020
Accepted: Jan 15 2020
Published online: Mar 10 2020
Published in print: May 2020