Fully Automatic Volume Measurement of the Spleen at CT Using Deep Learning

Published Online: https://doi.org/10.1148/ryai.2020190102

Abstract

Purpose

To develop a fully automated algorithm for spleen segmentation and to assess the performance of this algorithm in a large dataset.

Materials and Methods

In this retrospective study, a three-dimensional deep learning network was developed to segment the spleen on thorax-abdomen CT scans. Scans were extracted from patients undergoing oncologic treatment from 2014 to 2017. A total of 1100 scans from 1100 patients were used in this study, and 400 were selected for development of the algorithm. For testing, a dataset of 50 scans was annotated to assess the segmentation accuracy and was compared against the splenic index equation. In a qualitative observer experiment, an enriched set of 100 scan-pairs was used to evaluate whether the algorithm could aid a radiologist in assessing splenic volume change. The reference standard was set by the consensus of two other independent radiologists. A Mann-Whitney U test was conducted to test whether there was a performance difference between the algorithm and the independent observer.

Results

The algorithm and the independent observer obtained comparable Dice scores (P = .834) on the test set of 50 scans of 0.962 and 0.964, respectively. The radiologist had an agreement with the reference standard in 81% (81 of 100) of the cases after a visual classification of volume change, which increased to 92% (92 of 100) when aided by the algorithm.

Conclusion

A segmentation method based on deep learning can accurately segment the spleen on CT scans and may help radiologists to detect abnormal splenic volumes and splenic volume changes.

Supplemental material is available for this article.

Keywords: CT, Convolutional Neural Network (CNN), Oncology, Segmentation, Spleen, Volume Analysis

© RSNA, 2020

Summary

Automatic spleen segmentation using deep learning is feasible in complex scenarios, such as oncologic follow-up, and may aid radiologists in accurately assessing splenic volume change over time.

Key Points

  ■ A deep learning segmentation method can robustly segment the spleen on CT scans in a heterogeneous dataset containing many abnormalities.

  ■ The performance of the deep learning algorithm was comparable to that of an independent observer on the test set of 50 CT scans.

  ■ An observer study showed that this algorithm may help radiologists when measuring splenic volume change.

Introduction

Splenic volume change (SVC) can occur as a result of infection, lymphoma, injury, variations in splenic vascularization, and other causes (1–8). Full manual segmentation of the spleen in three dimensions is time-consuming and not feasible in clinical practice. Instead, radiologists typically assess the size of the spleen by visual estimation or with an approximation equation. To the best of our knowledge, no studies have investigated whether substantial SVC goes undetected with these methods. During oncologic treatment, SVC can occur as an adverse effect of chemotherapy (9). A precise measurement of SVC may help clinicians in their treatment choices.

The first work in splenic volume approximation used the splenic index (10,11). The splenic index is calculated using the equation V = 30 + 0.58·D·L·H, where depth (D), length (L), and height (H) are two-dimensional measurements of the spleen in the axial or coronal plane. Figure 1 shows these measurements on two-dimensional sections of a CT scan. A precise three-dimensional segmentation can achieve an accurate volumetric measurement of the spleen. Methods such as multiatlas (12–14), graph-cut (13–15), active shape models (16), active contours (17), level sets (18), and random forest (19) have been extensively used to segment the spleen.
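
For illustration, the splenic index can be expressed as a small helper function; this is a minimal sketch in which the function name, the units (centimeters in, milliliters out), and the example values are our own assumptions, while the constants follow the equation above.

```python
# Minimal sketch of the splenic index approximation (V = 30 + 0.58 * D * L * H).
# Function name, units (cm in, mL out), and example values are illustrative.
def splenic_index_volume(depth_cm: float, length_cm: float, height_cm: float) -> float:
    """Approximate splenic volume (mL) from three orthogonal 2D measurements."""
    return 30.0 + 0.58 * depth_cm * length_cm * height_cm

# Example: a spleen measuring 5 x 10 x 9 cm yields approximately 291 mL.
print(splenic_index_volume(5.0, 10.0, 9.0))
```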

Figure 1: An example of the two-dimensional measurements needed to compute the splenic index: depth in blue, height in red, and length in green. Note that depth and height do not necessarily have to be measured on the same transversal section.

In recent years, deep learning (DL) approaches—convolutional neural networks in particular—have achieved high performance in many areas of computer vision and have been successfully applied in medical imaging (20–25). In these networks, a sequence of convolutional layers is applied to the image and optimized for the segmentation task; each convolution can highlight different features, and combining convolutional layers with pooling and nonlinear operations makes these networks very powerful. For medical imaging, the 2D U-Net (26), the 3D U-Net (27), and variant architectures have been successfully used to segment structures and organs (28–33). These architectures consist of a contracting path of convolutions followed by an expanding path of convolutions that produces voxel-wise predictions; the deepest convolutions learn global features, and the last convolutions produce the fine segmentation prediction.
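
To make the contracting and expanding structure concrete, the sketch below shows a compact 3D U-Net-style network in PyTorch; it is our simplified illustration, not the architecture or hyperparameters used in this study.

```python
# Compact 3D U-Net-style network (illustrative sketch, not the study's exact model).
# A contracting path (convolutions + pooling) is followed by an expanding path
# (transposed convolutions + skip connections) that yields voxel-wise predictions.
import torch
import torch.nn as nn

def conv_block(in_ch: int, out_ch: int) -> nn.Sequential:
    return nn.Sequential(
        nn.Conv3d(in_ch, out_ch, kernel_size=3, padding=1), nn.ReLU(inplace=True),
        nn.Conv3d(out_ch, out_ch, kernel_size=3, padding=1), nn.ReLU(inplace=True),
    )

class TinyUNet3D(nn.Module):
    def __init__(self, in_ch: int = 1, base: int = 16):
        super().__init__()
        self.enc1 = conv_block(in_ch, base)            # contracting path
        self.enc2 = conv_block(base, base * 2)
        self.bottleneck = conv_block(base * 2, base * 4)
        self.pool = nn.MaxPool3d(2)
        self.up2 = nn.ConvTranspose3d(base * 4, base * 2, kernel_size=2, stride=2)
        self.dec2 = conv_block(base * 4, base * 2)     # expanding path
        self.up1 = nn.ConvTranspose3d(base * 2, base, kernel_size=2, stride=2)
        self.dec1 = conv_block(base * 2, base)
        self.head = nn.Conv3d(base, 1, kernel_size=1)  # voxel-wise spleen logit

    def forward(self, x):
        # Input spatial dimensions are assumed to be divisible by 4.
        e1 = self.enc1(x)
        e2 = self.enc2(self.pool(e1))
        b = self.bottleneck(self.pool(e2))
        d2 = self.dec2(torch.cat([self.up2(b), e2], dim=1))   # skip connection
        d1 = self.dec1(torch.cat([self.up1(d2), e1], dim=1))  # skip connection
        return self.head(d1)
```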

In an end-user comparison of three commercial systems (34), liver and spleen segmentation volumes were obtained quickly and accurately, but the initial fully automatic segmentation failed in some cases and, for the remaining cases, differed by 0.4%–9.8% from the final segmentation after correction. The readers in that study took between 1 and 3 minutes on average to perform the corrections, showing that the performance of the fully automatic initial segmentation can still be improved (34).

In this study, an automatic segmentation algorithm for the spleen was developed on a large dataset of thorax-abdomen CT scans from patients undergoing oncologic workup. Because these patients undergo various types of cancer treatment (eg, chemotherapy and/or radiation therapy) and are at different stages of disease, the images contained both local and widespread abnormalities throughout the scan. Our system was developed using a dataset of 400 CT scans (selected from 1100 patients) and tested using a dataset of 50 scans. Finally, a qualitative observer experiment with an experienced radiologist was conducted to assess whether the algorithm can help radiologists in assessing SVC in 100 patients (selected from 500 patients).

Materials and Methods

Patient Data

The data in this retrospective study were collected from Radboud University Medical Center. The institutional review board granted a consent waiver for the clinical images used in this study. We retrieved all thorax-abdomen CT studies referred from the oncology department between January 2014 and December 2017. In total, 7415 studies from 2386 patients (mean age, 58 years; range, 19–92 years; 54.7% women) were retrieved. Part of this dataset (918 CT scans from 918 patients) was previously used for a different study on developing an algorithm for organ localization (35).

We only included contrast material–enhanced CT scans in this study (n = 6972 studies). From the included data, we randomly selected 2150 CT scans from 1650 patients to obtain four datasets (A, B, C, and D) as depicted in Figure 2. As the patients in this dataset underwent an oncology workup, the scans typically presented multiple abnormalities, such as tumors, cysts, and lesions, which may alter the normal anatomy of the spleen. Additional information on CT imaging and datasets is described in Appendix E1 (supplement).

Figure 2: Flowchart shows the criteria to distribute the CT scans used in this study into datasets. Dataset A was used for training system A, and dataset A plus dataset B300 were used for training system B. Dataset C was used for testing systems A and B. Dataset D100 was used for the qualitative observer experiment. Note that dataset B300 and D100 are subsets of datasets B and D, respectively.

CT Imaging

CT scanners from two manufacturers were used to acquire the CT scans: Toshiba (Aquilion One) and Siemens (Sensation 16, Sensation 64, and Somatom Definition AS). The reconstruction kernels were FC09, FC09-H, B30f, B30fs, and I30f. The contrast agents used were iomeprol, iohexol, iobitridol, and iopromide (Imeron [Bracco Imaging], Omnipaque [GE Healthcare], Xenetix [Guerbet], and Ultravist [Bayer], respectively) with amounts varying between 15 and 140 mL. The section thickness ranged from 0.5 to 3 mm, with most (98.9%) having a section thickness of 1 or 2 mm.

Reference Standard Annotation

On all CT scans in the first training set (dataset A), the spleen was manually segmented by medical students using a tool developed in-house. Students were instructed to verify that the segmentation was correct on all transversal sections and to peer-review each other's work. The annotations included the splenic hilum if it was surrounded by splenic parenchyma. Dataset A was used as training data for the first system. Subsequently, the first system was used to segment dataset B, and these segmentations were used to select 300 additional scans for training of the second system (dataset B300). The segmentations produced by the initial system on dataset B300 were manually corrected by the same medical students, and dataset A plus dataset B300 were then used to train the second system.

For the test set of 50 CT scans (dataset C), the same procedure was used to annotate the spleen in all scans; these annotations served as the reference standard for testing the systems. In addition, one medical student (herein referred to as the “independent observer”) annotated dataset C independently, without consulting other students or the experienced radiologist. An experienced radiologist (E.T.S., >30 years of experience in chest radiology) was consulted in difficult cases, performed a quality check, and adjusted (if necessary) the annotations of datasets A, B300, and C (reference standard).

Preprocessing and DL Network Settings for Automatic Spleen Segmentation

Values outside of the attenuation range (−500 to +400 HU) on the CT scans were clipped to discard unnecessary data for this task. The scans and reference masks were resampled to 1 × 1 × 1-mm resolution using cubic and nearest neighbor interpolation, respectively.
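
A minimal preprocessing sketch under these settings could look as follows; the use of scipy.ndimage and the variable names are our assumptions, as the exact implementation is not specified in the text.

```python
# Sketch of the described preprocessing: clip attenuation to [-500, 400] HU and
# resample to 1 x 1 x 1 mm (cubic for the image, nearest neighbor for the mask).
# Library choice (scipy) and names are illustrative assumptions.
import numpy as np
from scipy import ndimage

def preprocess(image_hu: np.ndarray, mask: np.ndarray, spacing_mm):
    clipped = np.clip(image_hu, -500, 400)
    # Zoom factor per axis to reach 1 mm spacing; spacing order must match the axes.
    zoom = np.asarray(spacing_mm, dtype=float)
    image_iso = ndimage.zoom(clipped, zoom, order=3)  # cubic interpolation
    mask_iso = ndimage.zoom(mask, zoom, order=0)      # nearest neighbor interpolation
    return image_iso, mask_iso
```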

We used the 3D U-Net network (27) as the architecture of our system because it uses three-dimensional context to predict the results. This segmentation network and its two-dimensional variant reached high performance in multiple applications (26,28,29,36–38). Because of the large memory footprint of the 3D U-Net, each scan was divided into patches. At the edges of the CT scan, mirroring was used as border handling when the patch covered an area outside the scan. Additional details on the inputs can be found in Appendix E2 (supplement).
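
The patch-based processing with mirrored borders might be organized as in the sketch below; the patch size and the simple non-overlapping tiling are illustrative assumptions, and the actual input configuration is given in Appendix E2 (supplement).

```python
# Sketch of tiling a resampled CT volume into patches, mirroring voxels beyond
# the scan border. Patch size and tiling strategy are illustrative assumptions.
import numpy as np

def extract_patches(volume: np.ndarray, patch_size=(108, 108, 108)):
    """Yield (corner, patch) pairs that tile the whole volume."""
    pz, py, px = patch_size
    # Pad the far end of each axis so the last patches still fit; "reflect"
    # mirrors the image content outside the scan.
    padded = np.pad(volume, ((0, pz), (0, py), (0, px)), mode="reflect")
    for z in range(0, volume.shape[0], pz):
        for y in range(0, volume.shape[1], py):
            for x in range(0, volume.shape[2], px):
                yield (z, y, x), padded[z:z + pz, y:y + py, x:x + px]
```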

Network Training

The network performance was evaluated after every epoch (in one epoch, every CT scan in the training set was used once) using the Dice score as the metric to select the optimal model. The training stopped when the mean Dice score stopped improving for 10 epochs. The optimal model of each network was used to evaluate the test set (dataset C).
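
The early-stopping rule described here can be illustrated with the following sketch; PyTorch-style model state handling is assumed, and train_one_epoch and evaluate_mean_dice are hypothetical callables standing in for the actual training and evaluation code.

```python
# Sketch of the described stopping rule: evaluate mean Dice after every epoch and
# stop once it has not improved for 10 epochs, keeping the best model so far.
# train_one_epoch / evaluate_mean_dice are caller-supplied placeholders.
import copy

def train_with_early_stopping(model, train_one_epoch, evaluate_mean_dice, patience=10):
    best_dice, best_state, epochs_without_improvement = 0.0, None, 0
    while epochs_without_improvement < patience:
        train_one_epoch(model)                 # one pass over every training scan
        mean_dice = evaluate_mean_dice(model)  # mean Dice on the evaluation scans
        if mean_dice > best_dice:
            best_dice = mean_dice
            best_state = copy.deepcopy(model.state_dict())
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
    model.load_state_dict(best_state)          # restore the optimal model
    return model, best_dice
```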

We trained our first network from scratch using dataset A. The network was evaluated after every epoch using 30% of the training scans. Training stopped after 21 epochs, and the optimal model was referred to as system A. We used system A to process dataset B (n = 1000) and visually identify relevant cases. These relevant scans formed dataset B300 (see Appendix E4 [supplement]). We then trained a new network from scratch using dataset A plus dataset B300: segmentation system B. Thus, system B was trained using 400 scans, and its training stopped after 43 epochs. We evaluated both segmentation systems A and B on the test set (dataset C) of 50 scans.

Postprocessing for Automatic Spleen Segmentation

To produce the final segmentation results, each patch was processed separately, and the results were stitched together and thresholded at 0.5 to obtain binary results. Afterward, we applied connected components analysis and only retained the largest connected component. The output was then resampled back to the original scan resolution using nearest neighbor interpolation.
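
A minimal sketch of this postprocessing, assuming scipy.ndimage for the connected components analysis (the stitching and the resampling back to the original resolution are omitted):

```python
# Sketch of the described postprocessing: threshold the stitched probability map
# at 0.5 and keep only the largest connected component. Names are illustrative.
import numpy as np
from scipy import ndimage

def postprocess(probability_map: np.ndarray) -> np.ndarray:
    binary = probability_map >= 0.5
    labels, n_components = ndimage.label(binary)
    if n_components == 0:
        return binary                          # nothing was segmented
    # Voxel count per component (labels start at 1; 0 is background).
    sizes = ndimage.sum(binary, labels, index=range(1, n_components + 1))
    largest = int(np.argmax(sizes)) + 1
    return labels == largest
```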

Qualitative Observer Experiment

To test the clinical usefulness (detection of growth or shrinkage of the spleen over time) of our segmentation system, we performed an observer study using an enriched set of cases. To define growth or shrinkage, we used a tolerance of ±25% in the SVC in this study. Thus, SVC of less than −25% was classified as shrinkage, and SVC of greater than +25% was classified as growth. Values within −25% to +25% were considered normal SVC.
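
This rule can be written as a small classification function; the category names and example volumes below are our own, while the ±25% tolerance follows the definition above.

```python
# Sketch of the +/-25% SVC classification rule defined above.
def classify_svc(baseline_volume_ml: float, followup_volume_ml: float,
                 tolerance: float = 0.25) -> str:
    change = (followup_volume_ml - baseline_volume_ml) / baseline_volume_ml
    if change < -tolerance:
        return "shrinkage"
    if change > tolerance:
        return "growth"
    return "normal"

# Example: 400 mL at baseline and 520 mL at follow-up is a +30% change ("growth").
print(classify_svc(400.0, 520.0))
```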

We computed the SVC over time in dataset D (500 new patients) to obtain the enriched dataset D100 (100 patients). See Appendix E5 (supplement) for more details. The scan-pairs in dataset D100 were presented in a random order to an experienced radiologist in a dedicated workstation.

We considered three different reading modes for splenic volume change assessments (SVCa): visual SVCa, automatic SVCa, and assisted SVCa. See Appendix E6 (supplement) for more details.

A radiologist and a 4th-year resident defined the reference standard for dataset D100. They classified the scan-pairs visually as is currently done in clinical practice. In case of disagreement, a consensus meeting was held. The consensus reference standard was used to compare against visual SVCa, assisted SVCa, and automatic SVCa.

Statistical Analysis and Evaluation

Dice scores, relative absolute volume difference, maximum Hausdorff distance, and average symmetric surface distance (ASSD) were used to measure the similarity between the predictions and the reference masks. Per metric, we reported the mean, standard deviation, and two-sided 95% confidence intervals (CIs) (computed using 1000 random bootstraps). We computed the P values using the Mann-Whitney U test to test whether there was a statistical difference between the final system and the human observer (primary objective), and between the prototype and the final system (secondary objective). A P value less than .05 (two-tailed) was considered statistically significant. The metrics in this article can be found in Appendix E3 (supplement). We used an in-house developed Python 3.6 (https://www.python.org/) script to perform the statistical analysis.
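
The per-scan comparison could be reproduced along the lines of the sketch below, covering the Dice score, the bootstrap confidence interval of the mean, and the Mann-Whitney U test; this is our reconstruction rather than the authors' script, and the surface distance metrics are omitted.

```python
# Sketch of the evaluation: per-scan Dice, 95% bootstrap CI of the mean
# (1000 resamples), and a two-sided Mann-Whitney U test between two methods.
import numpy as np
from scipy.stats import mannwhitneyu

def dice_score(prediction: np.ndarray, reference: np.ndarray) -> float:
    intersection = np.logical_and(prediction, reference).sum()
    return 2.0 * intersection / (prediction.sum() + reference.sum())

def bootstrap_ci(scores, n_resamples=1000, alpha=0.05, seed=0):
    rng = np.random.default_rng(seed)
    means = [rng.choice(scores, size=len(scores), replace=True).mean()
             for _ in range(n_resamples)]
    return np.quantile(means, [alpha / 2, 1 - alpha / 2])

def compare_methods(scores_a, scores_b) -> float:
    """Two-sided Mann-Whitney U test; returns the P value."""
    return mannwhitneyu(scores_a, scores_b, alternative="two-sided").pvalue
```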

Algorithm Availability

The segmentation algorithm can be tested online at http://grand-challenge.org/algorithms/spleen-segmentation/, where interested readers can register and upload anonymized thorax-abdomen CT scans; the output is presented in an online workstation that shows the segmentation overlays in three dimensions.

Results

Comparison of Segmentation Methods

First, we compared our automatic spleen segmentation methods and the independent observer on the test set (dataset C) (Table 1). System A obtained a Dice score of 0.950 ± 0.040 (95% CI: 0.938, 0.959), system B obtained 0.962 ± 0.016 (95% CI: 0.957, 0.966), and the independent observer obtained 0.964 ± 0.012 (95% CI: 0.961, 0.967). Figure 3 shows boxplots comparing the methods on the test set using the evaluation metrics presented in Appendix E3 (supplement). The surface distance–based metrics (maximum Hausdorff, 95% Hausdorff, and ASSD) show that system B had fewer outliers than system A, whereas system B and the independent observer were comparable. Table 1 and Figure 3, B, show that the splenic volumes computed with the splenic index were not reliable. A Mann-Whitney U test was performed to compare the Dice score performance of system A, system B, and the independent observer. The differences between system A and system B (P = .019) and between system A and the independent observer (P = .011) were statistically significant, whereas the difference between system B and the independent observer was not (P = .834). Table 2 compares our results with previous segmentation work and shows that not all methods can handle abnormalities in the spleen. The relative absolute volume differences for the splenic index, Gloger et al (18), system B, and the independent observer were 16.56%, 6.30%, 4.39%, and 3.93%, respectively. In addition, the relative absolute volume difference between system B and the independent observer (used as reference) was 2.37% (5.17 mL). On the basis of all the metrics, system B outperformed system A. Therefore, we used system B as the automatic SVCa in the qualitative observer experiment.

Table 1: Comparison of Performance among the Experiments on the Test Set

Figure 3: Boxplots show the performance of system A, system B, the independent observer, and the splenic index on the test set (dataset C). The methods are compared using, A, Dice score, B, relative absolute volume difference, C, maximum Hausdorff distance, D, 95% Hausdorff distance, and, E, average symmetric surface distance (ASSD). Mean and median values are depicted with black dashed and red lines, respectively. Note that C and D have two ranges for the y-axis to zoom-in to the boxplots body. Table 1 summarizes these results.

Table 2: Comparison between Our Best Performing System and Previous Work

Results of the Qualitative Observer Experiment

Comparison of SVCa.—For the qualitative observer experiment, the two readers who defined the reference standard disagreed on 13 scan-pairs, and a consensus meeting was held to define the final reference standard. In total, 59 cases were categorized as normal, 26 as growing, and 15 as shrinking in the reference standard. Table 3 compares the visual SVCa, automatic SVCa, and assisted SVCa against the reference standard. The visual SVCa classified 81% (81 of 100) of the patients correctly. During the visual SVCa, the radiologist visually estimated the SVC classification in 80 of the 100 patients; in the remaining 20 patients, the radiologist used the splenic index because the visual estimation was not evident. The automatic SVCa classified 89% (89 of 100) of the patients correctly, and the assisted SVCa classified 92% (92 of 100) correctly. In total, when observing the three-dimensional automatic segmentations and their volumes (ie, when going from visual SVCa to assisted SVCa), the radiologist changed the classification in 15% (15 of 100) of the patients. In 11 of these patients (73%, 11 of 15), this change resulted in the correct category according to the reference standard. For five of these 15 patients, the SVC values were close to the 25% threshold defined in this study (25.16%, 25.79%, 27.23%, 27.24%, and 27.44%).

Table 3: Comparison of the Visual SVCa, Assisted SVCa, and Automatic SVCa versus the Consensus-based Reference Standard (Dataset D100) in the Qualitative Observer Experiment

Figure 4 shows examples of the SVC analysis. Figure 4, A and B, shows the patients with the minimum (−58%) and maximum (+140%) SVC, respectively. Figure 4, C, shows a patient for whom the radiologist changed the classification from normal to growth SVC after seeing our segmentations (assisted SVCa); the probable reason for this change is that the spleen grew proportionally in all directions. Figure 4, D, shows a patient with −10% SVC computed by our method (automatic SVCa). In the visual SVCa, the radiologist classified this patient as shrinkage SVC because the spleen looks small in the sagittal plane; in the assisted SVCa, the radiologist changed the classification from shrinkage to normal SVC.

Figure 4: Examples of splenic volume change (SVC) classification of scan-pairs from dataset D100 (200 CT scans from 100 patients) in the qualitative observer experiment. Sections surrounded by blue and orange rectangles show the automatic segmentations in the sagittal and coronal orthogonal views, respectively. A, B, Scan-pairs in which the visual SVCa and automatic SVCa classification match. A, The scan-pair with the largest negative SVC. B, The scan-pair with the largest positive SVC in the dataset. C, D, Scan-pairs in which the visual SVCa and automatic SVCa classification differ. C, The radiologist classified the scan-pair as normal SVC in the visual SVCa but changed it to growth SVC in the assisted SVCa after seeing the segmentations produced by automatic SVCa (system B). D, Similarly, the radiologist classified the scan-pair as shrinking SVC in the visual SVCa but changed it to normal SVC in the assisted SVCa. All the sections show 230 × 230 mm and have a window center of 60 HU and a window width of 360 HU. SVCa = splenic volume change assessment.

Spleen segmentation ratings.—The independent radiologist visually rated the quality of the automatic spleen segmentations of 200 CT scans from 100 patients (dataset D100). The radiologist rated 87% (174 of 200) of the segmentations as excellent, 7% (14 of 200) as good, 3.5% (seven of 200) as bad, and 2.5% (five of 200) as failure. The radiologist considered 94% of the segmentations (87% excellent plus 7% good) reliable. Figure 5 shows examples of this classification using probability maps in which black contours highlight the final output of the algorithm after postprocessing. Figure 5a shows a patient with large tumors in the liver and left kidney; the radiologist rated this segmentation as excellent. Figure 5b depicts a patient with a beavertail liver (enlarged liver attached to the spleen); this segmentation was rated as good because of a small error (<5 mm). Figure 5c shows a bad segmentation in which the algorithm did not perform well in the upper region of the spleen, possibly because of the low contrast enhancement on this scan. Figure 5d depicts a segmentation failure caused by a dilated stomach.

Figure 5: Examples of (a) excellent, (b) good, (c) bad, and (d) failed segmentation of the qualitative classification performed by the radiologist on dataset D100 (200 CT scans from 100 patients, spleen masks are unavailable) as part of the qualitative observer experiment. The figures show raw probabilities (before postprocessing) obtained by system B. Red regions represent high probabilities (P ≥ 50%) of spleen presence. Green to transparent gradient regions represent low probabilities (P < 50%) of spleen presence. The black contour around the raw probabilities represents the final output after postprocessing that is used to compute the splenic volume. The cyan dashed lines and triangles point to mistakes. Coronal and axial planes are shown, but (d) shows coronal and sagittal planes for better visualization. All images have a window center of 60 HU and a window width of 360 HU.

Discussion

In this article, we developed an algorithm to segment the spleen using DL on three-dimensional thorax-abdomen CT scans from patients undergoing oncologic workup. The final system (system B, 0.962 Dice) and the independent observer (0.964 Dice) obtained comparable results, with no significant difference (P = .834). In the qualitative observer experiment, we showed that a radiologist's assessment of SVC improved when assisted by the algorithm.

For the development of the algorithm, an initial dataset of 100 random scans was annotated (dataset A) to train the first system (system A). System A obtained a mean Dice score of 0.950 (95% CI: 0.938, 0.959) on the test set (see Table 1). After adding the 300 relevant cases (dataset B300) from dataset B to the training set, a second system (system B) was trained, and this system reached a mean Dice score of 0.962 (95% CI: 0.957, 0.966) on the test set. The independent observer obtained a comparable mean Dice score of 0.964 (95% CI: 0.961, 0.967).

Figure 3 shows that system B had a better and more robust performance with fewer outliers than system A. The most challenging case in the test set contained a beavertail liver; system B obtained a Dice score of 0.88 for this case, and the independent observer obtained 0.94, showing that the case was also difficult for a human reader. Table 1 and Figure 3 show that our algorithm approximated the independent observer's performance for all metrics. Our selection process of relevant cases boosted the performance from a Dice score of 0.950 (system A, trained with the initial dataset A) to 0.962 (system B, trained with datasets A and B300).

Based on the visual ratings of the segmentation quality, our method could reliably handle difficult cases. Figures 5c and 5d show that abnormal anatomy can lead to less accurate spleen segmentation.

In the SVC analysis, the readers had to reach consensus in 13 cases to define the reference standard on dataset D100. In the observer experiment, the radiologist changed the classification of 15% (15 of 100) of the patients when going from the visual SVCa to the assisted SVCa. This resulted in a more reliable SVC classification because the SVC was then computed from precise segmentations rather than from an approximation such as the volume obtained with the splenic index. Figure 4, D, shows a scan-pair in which the radiologist was likely misled: the stomach of the patient is full in the baseline scan, which pushes the spleen toward the ribs, whereas on the follow-up scan the stomach is empty, giving the spleen more space to expand. Although the volume changed by −10% over time, this was within the range we defined as normal SVC. This indicates that our method can help radiologists reduce bias when measuring SVC. In this work, the threshold to classify shrinkage, normal (no substantial change), and growth SVC was defined as ±25%. Three cases obtained automatic SVCa values around this fixed threshold, and future investigations are needed to define thresholds better suited to clinical practice for classifying SVC. Note that the percentages from our qualitative observer experiment are not representative of a large random set because we used the enriched dataset D100. This subset was created by selecting 50 random patients classified as having substantial (either growth or shrinkage) SVC and 50 random patients classified as having normal (no substantial) SVC after automatic SVCa. A fully random selection from dataset D to obtain dataset D100 would have resulted in a higher number of normal cases, which would have been less interesting for the observer experiment of our study.

Previous work is summarized in Table 2. The methods that used DL (22–24) were methods for multiorgan segmentation. None of the mentioned articles selected relevant cases from a large set of scans as we did in this study. Gibson et al (24) obtained a 0.950 Dice score after applying transfer learning to improve their results. Linguraru et al (12) used a probabilistic atlas and registration to segment the liver and spleen (0.952 Dice). These methods were trained with fewer scans (from 90 to 331) than our best performing system (450 scans).

Our method showed reliable results; however, it had some limitations. For instance, scans of patients with severe anatomic distortions may yield irregular automatic segmentations. Similarly, when the spleen is absent (splenectomy), the algorithm may segment a small false-positive region where the spleen is usually located. These erroneous segmentations can be prevented by discarding candidates below a certain splenic volume threshold. Although we qualitatively measured the performance of our algorithm in a large dataset, a quantitative evaluation on a large set would validate our algorithm more thoroughly but would require a substantial annotation effort. Another limitation was the selection of the 300 relevant scans from dataset B to obtain dataset B300, which was performed by a single observer; another observer might have selected different cases. This may have introduced bias, but it only affected the training of our algorithm, and we expect this effect to be small. A trained medical student served as the human independent observer for the quantitative validation, and an experienced radiologist might have performed slightly better. Finally, this algorithm was trained and evaluated using data from a single hospital. Future studies should focus on training and validation using multicenter data to increase the robustness of the algorithm.

In conclusion, fully automated spleen segmentation is feasible in complex scenarios such as oncologic follow-up. The performance of the DL algorithm was comparable to that of an independent observer on the test set. This method showed potential to help radiologists in classifying SVC accurately. Future studies are needed to investigate how this algorithm can affect the workflow of a radiologist and what effect it has on the overall scan interpretation. Future validation studies should include multicenter data and should be performed prospectively to test whether this algorithm can be safely and reliably used in clinical practice.

Disclosures of Conflicts of Interest: G.E.H.M. disclosed no relevant relationships. J.B. disclosed no relevant relationships. E.T.S. disclosed no relevant relationships. M.P. Activities related to the present article: disclosed no relevant relationships. Activities not related to the present article: institution receives grants from Siemens Healthineers and Canon Medical Systems; institution paid for lectures by Bracco, Bayer, Canon Medical Systems, and Siemens Healthineers. Other relationships: disclosed no relevant relationships. B.v.G. Activities related to the present article: disclosed no relevant relationships. Activities not related to the present article: institution receives royalties from Delft Imaging Systems, Thirona, MeVis Medical Systems; author has stocks in Thirona. Other relationships: disclosed no relevant relationships. C.J. Activities related to the present article: disclosed no relevant relationships. Activities not related to the present article: institution receives grants from MeVis Medical Solutions (Bremen, Germany) (research grant); institution receives royalties from Veolity lung screening workstation from MeVis Medical Solutions (Bremen, Germany). Other relationships: disclosed no relevant relationships.

Author Contributions

Author contributions: Guarantors of integrity of entire study, G.E.H.M., B.v.G., C.J.; study concepts/study design or data acquisition or data analysis/interpretation, all authors; manuscript drafting or manuscript revision for important intellectual content, all authors; approval of final version of submitted manuscript, all authors; agrees to ensure any questions related to the work are appropriately resolved, all authors; literature research, G.E.H.M., J.B., M.P., B.v.G., C.J.; clinical studies, G.E.H.M., M.P., B.v.G., C.J.; experimental studies, G.E.H.M., J.B., E.T.S., B.v.G., C.J.; statistical analysis, G.E.H.M., J.B., C.J.; and manuscript editing, G.E.H.M., J.B., E.T.S., B.v.G., C.J.

Supported by the program Automation in Medical Imaging from the Fraunhofer Society and Radboud University.

References

  • 1. Harris A, Kamishima T, Hao HY, et al. Splenic volume measurements on computed tomography utilizing automatically contouring software and its relationship with age, gender, and anthropometric parameters. Eur J Radiol 2010;75(1):e97–e101.
  • 2. Robertson F, Leander P, Ekberg O. Radiology of the spleen. Eur Radiol 2001;11(1):80–95.
  • 3. Bergman RA, Heidger PM, Scott-Conner CE. The Anatomy of the Spleen. In: Bowdler AJ, ed. The Complete Spleen: Structure, Function, and Clinical Disorders. Totowa, NJ: Humana, 2002; 3–9.
  • 4. Moroz P, Anderson JE, Van Hazel G, Gray BN. Effect of selective internal radiation therapy and hepatic arterial chemotherapy on normal liver volume and spleen volume. J Surg Oncol 2001;78(4):248–252.
  • 5. Jacobs KE, Visser BC, Gayer G. Changes in spleen volume after resection of hepatic colorectal metastases. Clin Radiol 2012;67(10):982–987.
  • 6. De Odorico I, Spaulding KA, Pretorius DH, Lev-Toaff AS, Bailey TB, Nelson TR. Normal splenic volumes estimated using three-dimensional ultrasonography. J Ultrasound Med 1999;18(3):231–236.
  • 7. Joiner BJ, Simpson AL, Leal JN, D’Angelica MI, Do RK. Assessing splenic enlargement on CT by unidimensional measurement changes in patients with colorectal liver metastases. Abdom Imaging 2015;40(7):2338–2344.
  • 8. Cruz-Romero C, Agarwal S, Abujudeh HH, Thrall J, Hahn PF. Spleen volume on CT and the effect of abdominal trauma. Emerg Radiol 2016;23(4):315–323.
  • 9. Simpson AL, Leal JN, Pugalenthi A, et al. Chemotherapy-induced splenic volume increase is independently associated with major complications after hepatic resection for metastatic colorectal cancer. J Am Coll Surg 2015;220(3):271–280.
  • 10. Prassopoulos P, Daskalogiannaki M, Raissaki M, Hatjidakis A, Gourtsoyiannis N. Determination of normal splenic volume on computed tomography in relation to age, gender and body habitus. Eur Radiol 1997;7(2):246–248.
  • 11. Yetter EM, Acosta KB, Olson MC, Blundell K. Estimating splenic volume: sonographic measurements correlated with helical CT determination. AJR Am J Roentgenol 2003;181(6):1615–1620.
  • 12. Linguraru MG, Sandberg JK, Li Z, Shah F, Summers RM. Automated segmentation and quantification of liver and spleen from CT images using normalized probabilistic atlases and enhancement estimation. Med Phys 2010;37(2):771–783.
  • 13. Tong T, Wolz R, Wang Z, et al. Discriminative dictionary learning for abdominal multi-organ segmentation. Med Image Anal 2015;23(1):92–104.
  • 14. Wolz R, Chu C, Misawa K, Fujiwara M, Mori K, Rueckert D. Automated abdominal multi-organ segmentation with subject-specific atlas generation. IEEE Trans Med Imaging 2013;32(9):1723–1730.
  • 15. Okada T, Linguraru MG, Hori M, Summers RM, Tomiyama N, Sato Y. Abdominal multi-organ segmentation from CT images using conditional shape-location and unsupervised intensity priors. Med Image Anal 2015;26(1):1–18.
  • 16. Hammon M, Dankerl P, Kramer M, et al. Automated detection and volumetric segmentation of the spleen in CT scans [in German]. Rofo 2012;184(8):734–739.
  • 17. Wood A, Soroushmehr SMR, Farzaneh N, et al. Fully automated spleen localization and segmentation using machine learning and 3D active contours. In: 2018 40th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), Honolulu, HI, July 18–21, 2018. Piscataway, NJ: IEEE, 2018; 53–56.
  • 18. Gloger O, Tönnies K, Bülow R, Völzke H. Automatized spleen segmentation in non-contrast-enhanced MR volume data using subject-specific shape priors. Phys Med Biol 2017;62(14):5861–5883.
  • 19. Gauriau R, Ardori R, Lesage D, Bloch I. Multiple template deformation application to abdominal organ segmentation. In: 2015 IEEE 12th International Symposium on Biomedical Imaging (ISBI), New York, NY, April 16–19, 2015. Piscataway, NJ: IEEE, 2015; 359–362.
  • 20. Litjens G, Kooi T, Ehteshami Bejnordi B, et al. A survey on deep learning in medical image analysis. Med Image Anal 2017;42:60–88.
  • 21. Huo Y, Xu Z, Bao S, et al. Splenomegaly segmentation using global convolutional kernels and conditional generative adversarial networks. In: Angelini ED, Landman BA, eds. Proceedings of SPIE: medical imaging 2018—image processing. Vol 10574. Bellingham, Wash: International Society for Optics and Photonics, 2018; 1057409.
  • 22. Zhou X, Ito T, Takayama R, Wang S, Hara T, Fujita H. Three-Dimensional CT Image Segmentation by Combining 2D Fully Convolutional Network with 3D Majority Voting. In: Carneiro G, Mateus D, Loïc P, et al, eds. Deep Learning and Data Labeling for Medical Applications. DLMIA 2016, LABELS 2016. Lecture Notes in Computer Science, vol 10008. Cham, Switzerland: Springer, 2016; 111–120.
  • 23. Roth HR, Oda H, Zhou X, et al. An application of cascaded 3D fully convolutional networks for medical image segmentation. Comput Med Imaging Graph 2018;66:90–99.
  • 24. Gibson E, Giganti F, Hu Y, et al. Automatic multi-organ segmentation on abdominal CT with dense V-networks. IEEE Trans Med Imaging 2018;37(8):1822–1834.
  • 25. Landman BA, Bobo MF, Huo Y, et al. Fully convolutional neural networks improve abdominal organ segmentation. In: Angelini ED, Landman BA, eds. Proceedings of SPIE: medical imaging 2018—image processing. Vol 10574. Bellingham, Wash: International Society for Optics and Photonics, 2018; 105742V.
  • 26. Ronneberger O, Fischer P, Brox T. U-Net: Convolutional Networks for Biomedical Image Segmentation. In: Navab N, Hornegger J, Wells W, Frangi A, eds. Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015. Lecture Notes in Computer Science, vol 9351. Cham, Switzerland: Springer, 2015; 234–241.
  • 27. Çiçek Ö, Abdulkadir A, Lienkamp SS, Brox T, Ronneberger O. 3D U-Net: Learning Dense Volumetric Segmentation from Sparse Annotation. In: Ourselin S, Joskowicz L, Sabuncu M, Unal G, Wells W, eds. Medical Image Computing and Computer-Assisted Intervention – MICCAI 2016. Lecture Notes in Computer Science, vol 9901. Cham, Switzerland: Springer, 2016; 424–432.
  • 28. Chlebus G, Schenk A, Moltz JH, van Ginneken B, Hahn HK, Meine H. Automatic liver tumor segmentation in CT with fully convolutional neural networks and object-based postprocessing. Sci Rep 2018;8(1):15497.
  • 29. Dalmış MU, Litjens G, Holland K, et al. Using deep learning to segment breast and fibroglandular tissue in MRI volumes. Med Phys 2017;44(2):533–546.
  • 30. Aresta G, Araújo T, Jacobs C, et al. Towards an Automatic Lung Cancer Screening System in Low Dose Computed Tomography. In: Stoyanov D, Taylor Z, Kainz B, et al, eds. Image Analysis for Moving Organ, Breast, and Thoracic Images. RAMBO 2018, BIA 2018, TIA 2018. Lecture Notes in Computer Science, vol 11040. Cham, Switzerland: Springer, 2018; 310–318.
  • 31. Ehteshami Bejnordi B, Veta M, Johannes van Diest P, et al. Diagnostic assessment of deep learning algorithms for detection of lymph node metastases in women with breast cancer. JAMA 2017;318(22):2199–2210.
  • 32. Setio AAA, Traverso A, de Bel T, et al. Validation, comparison, and combination of algorithms for automatic detection of pulmonary nodules in computed tomography images: the LUNA16 challenge. Med Image Anal 2017;42:1–13.
  • 33. Zhu W, Huang Y, Zeng L, et al. AnatomyNet: Deep learning for fast and fully automated whole-volume segmentation of head and neck anatomy. Med Phys 2019;46(2):576–589.
  • 34. Pattanayak P, Turkbey EB, Summers RM. Comparative evaluation of three software packages for liver and spleen segmentation and volumetry. Acad Radiol 2017;24(7):831–839.
  • 35. Humpire GE, Setio AAA, van Ginneken B, Jacobs C. Efficient organ localization using multi-label convolutional neural networks in thorax-abdomen CT scans. Phys Med Biol 2018;63(8):085003.
  • 36. Isensee F, Petersen J, Kohl SAA, Jager PF, Maier-Hein KH. nnU-Net: Breaking the spell on successful medical image segmentation. arXiv:1904.08128 [cs] [preprint]. http://arxiv.org/abs/1904.08128. Posted April 17, 2019. Accessed June 4, 2019.
  • 37. Balagopal A, Kazemifar S, Nguyen D, et al. Fully automated organ segmentation in male pelvic CT images. Phys Med Biol 2018;63(24):245015.
  • 38. Guo Z, Zhang L, Lu L, et al. Deep LOGISMOS: deep learning graph-based 3D segmentation of pancreatic tumors on CT scans. In: 2018 IEEE 15th International Symposium on Biomedical Imaging (ISBI 2018), Washington, DC, April 4–7, 2018. Piscataway, NJ: IEEE, 2018; 1230–1233.

Article History

Received: June 13, 2019
Revision requested: July 17, 2019
Revision received: April 26, 2020
Accepted: May 1, 2020
Published online: July 22, 2020