Deep Learning at Chest Radiography: Automated Classification of Pulmonary Tuberculosis by Using Convolutional Neural Networks
Abstract
Purpose
To evaluate the efficacy of deep convolutional neural networks (DCNNs) for detecting tuberculosis (TB) on chest radiographs.
Materials and Methods
Four deidentified HIPAA-compliant datasets, exempted from review by the institutional review board, were used in this study; together they comprised 1007 posteroanterior chest radiographs. The datasets were split into training (68.0%), validation (17.1%), and test (14.9%) sets. Two different DCNNs, AlexNet and GoogLeNet, were used to classify the images as having manifestations of pulmonary TB or as healthy. Both untrained networks and networks pretrained on ImageNet were evaluated, as was augmentation of the dataset with multiple preprocessing techniques. Ensembles of the best-performing algorithms were also assessed. For cases in which the classifiers disagreed, an independent board-certified cardiothoracic radiologist blindly interpreted the images to evaluate a potential radiologist-augmented workflow. Receiver operating characteristic curves and areas under the curve (AUCs) were used to assess model performance, and the DeLong method was used for statistical comparison of the receiver operating characteristic curves.
Results
The best-performing classifier, an ensemble of the AlexNet and GoogLeNet DCNNs, had an AUC of 0.99. The AUCs of the pretrained models were greater than those of the untrained models (P < .001). Augmenting the dataset further increased accuracy (P = .03 for AlexNet and P = .02 for GoogLeNet). The DCNNs disagreed in 13 of the 150 test cases; these were blindly reviewed by a cardiothoracic radiologist, who correctly interpreted all 13 cases (100%). This radiologist-augmented approach resulted in a sensitivity of 97.3% and a specificity of 100%.
Conclusion
Deep learning with DCNNs can accurately classify TB at chest radiography, with an AUC of 0.99. A radiologist-augmented approach for cases in which the classifiers disagreed further improved accuracy.
© RSNA, 2017
Introduction
Tuberculosis (TB) is an infectious disease caused by the bacillus Mycobacterium tuberculosis. TB is a leading cause of death by infectious disease worldwide, alongside human immunodeficiency virus–acquired immune deficiency syndrome (known as HIV-AIDS).
While indiscriminate mass screening for TB should be avoided, the World Health Organization recommends broader use of screening by chest radiography and rapid molecular diagnostics for selected high-risk groups.
It has been reported (
Commercially available software (CAD4TB; Image Analysis Group, Nijmegen, the Netherlands) had an area under the curve (AUC) that ranged from 0.71 to 0.84 in five studies, according to one review.
Deep learning techniques are currently considered state of the art for image classification, a status that arises from recent success in the ImageNet Large Scale Visual Recognition Challenge.
Because of this recent success, there is growing interest in applying deep learning in radiology, with promising results. Some examples include detection of pleural effusion and cardiomegaly at chest radiography.
In this study, we evaluate the efficacy of DCNNs for the detection of TB on chest radiographs.
Materials and Methods
Datasets
All datasets were deidentified and compliant with the Health Insurance Portability and Accountability Act. The Belarus and Thomas Jefferson University datasets were exempted from institutional review board review at Thomas Jefferson University Hospital. The National Institutes of Health datasets were exempted from review by the institutional review board (No. 5357) by the National Institutes of Health Office of Human Research Protection Programs. This retrospective study involved four datasets (Table 1), including two publicly available datasets maintained by the National Institutes of Health, from Montgomery County, Maryland, and Shenzhen, China.
Methods
The chest radiographic images were resized to a 256 × 256 matrix and converted into Portable Network Graphics format. The images were loaded onto a computer running Linux (Ubuntu 14.04; Canonical, London, England) with the Caffe deep learning framework (http://caffe.berkeleyvision.org; BVLC, Berkeley, Calif).
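The resizing step might look like the following minimal NumPy sketch. The paper does not name the resizing tool, so this is only an illustration; a real pipeline would more likely use an image library such as PIL or OpenCV, and nearest-neighbor is just one possible interpolation.

```python
import numpy as np

def resize_nearest(img, size=256):
    """Nearest-neighbor resample of a 2-D grayscale image to size x size."""
    h, w = img.shape
    rows = np.arange(size) * h // size   # source row for each output row
    cols = np.arange(size) * w // size   # source column for each output column
    return img[np.ix_(rows, cols)]

# A synthetic stand-in for a full-resolution radiograph.
radiograph = np.random.default_rng(0).integers(0, 256, (2048, 2500), dtype=np.uint8)
print(resize_nearest(radiograph).shape)  # (256, 256)
```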
Two different deep convolutional neural network architectures were evaluated in this study: AlexNet and GoogLeNet.
All images were augmented by using random cropping to 227 × 227 pixels, mean subtraction, and mirror images, which are prebuilt options within the Caffe framework. Further augmentation was performed in training some of the DCNNs, including rotations of 90°, 180°, and 270° and Contrast Limited Adaptive Histogram Equalization processing by using ImageJ v. 1.50i (NIH, Bethesda, Md).
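A rough NumPy sketch of one augmentation pass over a 256 × 256 image follows. The function is illustrative rather than the actual Caffe configuration, and the Contrast Limited Adaptive Histogram Equalization step (performed in ImageJ) is omitted here.

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(img, crop=227):
    """One augmentation pass: random crop, optional mirror, random rotation,
    then mean subtraction."""
    h, w = img.shape
    top = rng.integers(0, h - crop + 1)
    left = rng.integers(0, w - crop + 1)
    out = img[top:top + crop, left:left + crop]
    if rng.random() < 0.5:
        out = out[:, ::-1]                      # horizontal mirror
    out = np.rot90(out, k=rng.integers(0, 4))   # 0, 90, 180, or 270 degrees
    return out - out.mean()                     # mean subtraction

patch = augment(np.ones((256, 256)))
print(patch.shape)  # (227, 227)
```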
Of the 1007 patients in the total dataset (Table 1), 150 (14.9%) were randomly selected for testing. Randomization was performed by using pseudorandom numbers generated with the random function in the Python Standard Library (Python 2.7.13; Python Software Foundation, Wilmington, Del). Of these 150 test patients, 75 were positive for TB and 75 were healthy. The remaining 857 patients were randomly split in an 80%:20% ratio into training (685 patients) and validation (172 patients) sets. The training set was used to train the algorithms, the validation set was used for model selection, and the test set was used for assessment of the final chosen model. In deciding the percentage split, the goal was to keep enough data for the algorithms to train on while retaining enough validation and test cases to maintain a reasonable confidence interval for the accuracy of the model.
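The patient-level split described above can be sketched with the standard library's random module, which the text names. The function below is a hypothetical reconstruction, not the authors' script; note that its rounding yields a 686/171 training/validation split, whereas the paper reports 685/172.

```python
import random

def split_patients(ids, test_n=150, val_frac=0.2, seed=42):
    """Shuffle patient IDs, hold out a fixed test set, then split the
    remainder 80:20 into training and validation sets."""
    ids = list(ids)
    random.Random(seed).shuffle(ids)
    test, rest = ids[:test_n], ids[test_n:]
    n_val = round(len(rest) * val_frac)
    return rest[n_val:], rest[:n_val], test  # train, val, test

train, val, test = split_patients(range(1007))
print(len(train), len(val), len(test))  # 686 171 150
```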
The 75 test patients positive for TB were analyzed by a cardiothoracic radiologist (P.L.) for degree of pulmonary parenchymal involvement by TB and placed into one of the following three categories: subtle (pulmonary parenchymal involvement, <4%), intermediate (pulmonary parenchymal involvement, 4%–8%), and readily apparent (pulmonary parenchymal involvement, >8%) (Table 2). To determine this, the right and left lungs were divided into three zones (upper, middle, and lower). Opacities that occupied half or more of one zone were considered readily apparent. Opacities occupying a fourth to half of a zone were considered intermediate. Opacities occupying less than a fourth of a zone were considered subtle.
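The zone-based rule above can be written as a small helper. The function name and the use of a single zone fraction as input are illustrative assumptions, not the radiologist's actual workflow.

```python
def conspicuity(zone_fraction):
    """Map the fraction of a lung zone occupied by opacity to the study's
    three categories: >=1/2 readily apparent, 1/4 to 1/2 intermediate,
    <1/4 subtle."""
    if zone_fraction >= 0.5:
        return "readily apparent"
    if zone_fraction >= 0.25:
        return "intermediate"
    return "subtle"

print(conspicuity(0.6))   # readily apparent
print(conspicuity(0.3))   # intermediate
print(conspicuity(0.1))   # subtle
```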
Statistical and Data Analysis
All statistical analyses were performed by using software (MedCalc v. 16.8; MedCalc Software, Ostend, Belgium). On the test datasets, receiver operating characteristic curves and AUCs were determined.
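The analyses in the paper were run in MedCalc, but the AUC itself has a simple rank interpretation: the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one. A minimal NumPy sketch of that definition follows (the DeLong curve comparison is not shown).

```python
import numpy as np

def auc(labels, scores):
    """AUC as the Mann-Whitney probability that a positive case outscores
    a negative case, with ties counted as half a win."""
    labels, scores = np.asarray(labels), np.asarray(scores)
    pos, neg = scores[labels == 1], scores[labels == 0]
    wins = (pos[:, None] > neg[None, :]).sum() \
         + 0.5 * (pos[:, None] == neg[None, :]).sum()
    return wins / (len(pos) * len(neg))

print(auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```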
Ensembles were formed by taking different weighted averages of the probability scores generated by the two classifiers (AlexNet and GoogLeNet), ranging from equal weighting (50% AlexNet and 50% GoogLeNet) to a weighting of up to 10-fold toward either classifier. Receiver operating characteristic curves, AUCs, and optimal sensitivity and specificity values were then determined for the various ensemble approaches.
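The weighted-average ensemble can be sketched directly. The helper below is illustrative, with `w` the weight given to AlexNet; sweeping `w` from 1/11 to 10/11 covers the 10-fold bias range the text describes.

```python
import numpy as np

def ensemble(p_alex, p_goog, w=0.5):
    """Weighted average of the two classifiers' TB probability scores."""
    return w * np.asarray(p_alex) + (1 - w) * np.asarray(p_goog)

# Equal weighting through 10-fold bias toward either classifier.
weights = [1 / 11, 0.25, 0.5, 0.75, 10 / 11]
p = ensemble([0.9, 0.2], [0.7, 0.4], w=0.5)
print(p)  # [0.8 0.3]
```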
For cases in which the AlexNet and GoogLeNet classifiers disagreed, an independent board-certified cardiothoracic radiologist (B.S., with 18 years of experience) blindly interpreted the images as either having manifestations of TB or as normal. Contingency tables and sensitivity and specificity values were then created from these results (Fig 1).
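This disagreement-routing rule can be sketched as follows. The 0.5 probability threshold and the function names are illustrative assumptions, and `radiologist` stands in for the blinded overread of a discordant case.

```python
def final_call(p_alex, p_goog, radiologist, threshold=0.5):
    """Return the TB call: the DCNNs' shared verdict when they agree,
    otherwise the radiologist's blinded overread."""
    alex_positive = p_alex >= threshold
    goog_positive = p_goog >= threshold
    if alex_positive == goog_positive:
        return alex_positive          # concordant: no overread needed
    return radiologist()              # discordant: route to the radiologist

print(final_call(0.9, 0.8, lambda: False))  # True  (concordant positive)
print(final_call(0.6, 0.3, lambda: False))  # False (radiologist decides)
```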

Figure 1: Contingency tables. A, Sensitivity, 92.0% (95% confidence interval: 83.3%, 96.6%); specificity, 98.7% (95% confidence interval: 92.1%, 100%); accuracy, 95.3% (95% confidence interval: 90.5%, 97.9%). B, Sensitivity, 92.0% (95% confidence interval: 83.3%, 96.6%); specificity, 94.7% (95% confidence interval: 86.7%, 98.3%); accuracy, 93.3% (95% confidence interval: 88.0%, 96.5%). C, Sensitivity, 97.3% (95% confidence interval: 90.2%, 99.8%); specificity, 94.7% (95% confidence interval: 86.7%, 98.3%); accuracy, 96.0% (95% confidence interval: 91.4%, 98.3%). D, Sensitivity, 97.3% (95% confidence interval: 90.2%, 99.8%); specificity, 100% (95% confidence interval: 95.8%, 100%); accuracy, 98.7% (95% confidence interval: 95.0%, 99.9%).
Results
A summary of the results is provided in Table 3. For both deep neural networks, the AUCs of the pretrained models (AlexNet-T, GoogLeNet-T) were greater than those of the untrained models (AlexNet-U, GoogLeNet-U) (P < .001). In addition, augmentation of the dataset with additional transformations, such as rotations and Contrast Limited Adaptive Histogram Equalization, further increased accuracy for both neural networks (AlexNet-UA, GoogLeNet-UA) relative to the untrained models (AlexNet-U, GoogLeNet-U) (P = .03 for AlexNet and P = .02 for GoogLeNet). The best-performing ensemble model had an AUC of 0.99, significantly greater than those of the untrained AlexNet-U and GoogLeNet-U models, which had AUCs of 0.90 and 0.88, respectively (P < .001).
A comparison of receiver operating characteristic curves for the untrained and pretrained augmented models for AlexNet and GoogLeNet, as well as for the ensemble approaches, is provided in Figure 2.

Figure 2: (a) Comparison of receiver operating characteristic curves for the untrained AlexNet-U and GoogLeNet-U models and the pretrained-with-augmentation AlexNet-TA and GoogLeNet-TA models. The AlexNet-TA and GoogLeNet-TA models had AUCs significantly greater than those of the untrained AlexNet-U and GoogLeNet-U models (P < .001) (Table 3). (b) Comparison of receiver operating characteristic curves for the AlexNet-TA and GoogLeNet-TA models and the ensemble of the two. The ensemble provided the best AUC (Table 3).

The contingency tables for the best-performing models (GoogLeNet-TA, AlexNet-TA, and the ensemble of the two) are provided in Figure 1.
The distribution of the conspicuity of the 75 test patients who were positive for TB is provided in Table 2.
Radiologist-augmented Approach
The classifiers (AlexNet-TA and GoogLeNet-TA) disagreed in 13 of the 150 test cases. The 13 discordant cases were then blindly reviewed by a cardiothoracic radiologist, who correctly interpreted all 13 cases (100%). The contingency table of this radiologist-augmented approach is provided in Figure 1.
Discussion
Machine learning is a branch of artificial intelligence in which computers are not explicitly programmed but instead learn to perform tasks by analyzing relationships in existing data.
One of the advantages of deep learning is its ability to excel with high-dimensional datasets, such as images, which can be represented at multiple levels. For example, a DCNN can represent an image at lower levels by pixel intensity values, edges, and blobs; at intermediate levels, by parts of objects; and at higher levels, by the object as a whole.
In this study, the DCNNs pretrained with everyday images on ImageNet performed better than the untrained networks, concordant with previously published work (Table 3).
Augmentation of the dataset with rotated images and image contrast enhancement with Contrast Limited Adaptive Histogram Equalization further improved performance (Table 3).
One of the problems with machine learning, including deep learning, is overfitting.

Figure 3: Training curve of the AlexNet-TA classifier. The orange line represents the accuracy over the course of training, which increases over time to a final accuracy of 98.2% at the final epoch. Training was performed for 120 epochs; each epoch represents one pass through the entire training dataset. The blue and green curves represent the loss on the training and validation datasets, which decreases over time. The loss represents the fit between a prediction and the ground truth label. As expected, loss decreases over the course of training as accuracy improves. The loss on the validation dataset is similar to that on the training dataset, which indicates no appreciable overfitting. These training curves are used for model selection; in this case, the best-performing model, at epoch 120, was used on the test data for final assessment. Val = validation.
The use of ensembles is another method to improve performance. This involves blending multiple algorithms to achieve better predictive performance than any one algorithm alone.
It was previously described that visualizing the internal activations of a deep neural network can help reveal which image regions drive its predictions.

Figure 4: (a) Posteroanterior chest radiograph shows upper lobe opacities with pathologic analysis–proven active TB. (b) Same posteroanterior chest radiograph, with a heat map overlay of one of the strongest activations obtained from the fifth convolutional layer after it was passed through the GoogLeNet-TA classifier. The red and light blue regions in the upper lobes represent areas activated by the deep neural network. The dark purple background represents areas that are not activated. This shows that the network is focusing on parts of the image where the disease is present (both upper lobes).

One potential method to improve accuracy is a radiologist-augmented system, in which some of the images are sent to a radiologist for a so-called overread. In this system, images that were discordant (classified by one DCNN as positive for TB and by the other as negative) were sent to the radiologist for final interpretation. Of the 150 test images, the best AlexNet and GoogLeNet classifiers agreed 137 times (91.3%) and disagreed 13 times (8.7%). A blinded board-certified radiologist then reviewed the 13 discordant images and correctly classified all 13 (100%). This radiologist-augmented approach increased the sensitivity to 97.3% and the specificity to 100% (Fig 1). There were two false-negative findings, which were missed by both DCNNs and therefore never reached the radiologist for review. The false-negative findings are shown in Figure 5.

Figure 5: Two images with false-negative findings missed by both classifiers. (a) An opacity in the right upper lobe (arrow) on a posteroanterior radiograph. (b) A more apparent right suprahilar opacity (arrow) on a posteroanterior radiograph.

It is interesting to note that the DCNNs in this study outperformed those described by Hwang et al.
There are limitations to this work. The DCNNs do not replace human radiologic interpretation beyond TB evaluation because they are not tailored to evaluate other pathologic findings. In this study, the 75 images with tests positive for TB had a relatively even distribution of subtle, intermediate, and readily apparent opacities (Table 2). However, more research is needed to determine the performance of the classifiers on subtle opacities alone, because obvious changes should be easier to detect than subtle ones. Another important factor to consider is that system performance will be affected by the selection of cases (percentage of normal vs abnormal). Also, the system is designed for use in TB-prevalent regions, with the goal of differentiating normal from abnormal in the context of TB evaluation, potentially as part of a chest radiography screening program. If the algorithms were used in non-TB-prevalent locations and not solely for TB evaluation, other pathologic findings with a similar radiographic appearance, such as lung cancers and bacterial pneumonia, might be flagged as positive. As with multiple other studies that use deep learning (
Advances in Knowledge
■ Deep learning with convolutional neural networks can accurately classify tuberculosis (TB) at chest radiography with an area under the curve of 0.99.
■ Pretrained neural networks (P < .001) and augmented datasets (P = .02 and P = .03) resulted in greater accuracy.
■ The most accurate model incorporated a radiologist overread when the machines were discrepant, which had a net sensitivity of 97.3% and a specificity of 100%.
Implication for Patient Care
■ Automated detection of pulmonary TB at chest radiography may facilitate screening and evaluation efforts in TB-prevalent areas with limited access to radiologists.
Author Contributions
Author contributions: Guarantor of integrity of entire study, P.L.; study concepts/study design or data acquisition or data analysis/interpretation, P.L., B.S.; manuscript drafting or manuscript revision for important intellectual content, P.L., B.S.; approval of final version of submitted manuscript, P.L., B.S.; agrees to ensure any questions related to the work are appropriately resolved, P.L., B.S.; literature research, P.L., B.S.; clinical studies, P.L., B.S.; experimental studies, P.L.; statistical analysis, P.L.; and manuscript editing, P.L., B.S.
References
- 1. Global tuberculosis report 2015. http://apps.who.int/iris/bitstream/10665/191102/1/9789241565059_eng.pdf. Published October 28, 2015. Accessed September 20, 2016.
- 2. Systematic screening for active tuberculosis: principles and recommendations. http://www.who.int/tb/publications/Final_TB_Screening_guidelines.pdf. Published April 2013. Accessed September 20, 2016.
- 3. Chest tuberculosis: radiological review and imaging recommendations. Indian J Radiol Imaging 2015;25(3):213–225.
- 4. An automated tuberculosis screening strategy combining X-ray-based computer-aided detection and clinical information. Sci Rep 2016;6:25265.
- 5. High sensitivity of chest radiograph reading by clinical officers in a tuberculosis prevalence survey. Int J Tuberc Lung Dis 2011;15(10):1308–1314.
- 6. Automated detection of lung diseases in chest X-rays: a report to the Board of Scientific Counselors. US National Library of Medicine. https://lhncbc.nlm.nih.gov/system/files/pub9126.pdf. Published April 2015. Accessed September 20, 2016.
- 7. Automatic screening for tuberculosis in chest radiographs: a survey. Quant Imaging Med Surg 2013;3(2):89–99.
- 8. Computer-aided detection of pulmonary tuberculosis on digital chest radiographs: a systematic review. Int J Tuberc Lung Dis 2016;20(9):1226–1230.
- 9. Detection of tuberculosis using digital chest radiography: automated reading vs. interpretation by clinical officers. Int J Tuberc Lung Dis 2013;17(12):1613–1620.
- 10. Automatic tuberculosis screening using chest radiographs. IEEE Trans Med Imaging 2014;33(2):233–245.
- 11. ImageNet large scale visual recognition challenge. Int J Comput Vis 2015;115(3):211–252.
- 12. Deep residual learning for image recognition. arXiv preprint. https://arxiv.org/abs/1512.03385. Published December 10, 2015. Accessed September 20, 2016.
- 13. Gradient-based learning applied to document recognition. Proc IEEE 1998;86(11):2278–2324.
- 14. Deep learning with non-medical training used for chest pathology identification. In: Hadjiiski LM, Tourassi GD, eds. Proceedings of SPIE: medical imaging 2015—computer-aided diagnosis. Vol 9414. Bellingham, Wash: International Society for Optics and Photonics, 2015; 94140V.
- 15. Deep convolutional neural networks for computer-aided detection: CNN architectures, dataset characteristics and transfer learning. IEEE Trans Med Imaging 2016;35(5):1285–1298.
- 16. Computer-aided classification of lung nodules on computed tomography images via deep learning technique. Onco Targets Ther 2015;8:2015–2022.
- 17. Deep convolutional networks for pancreas segmentation in CT imaging. In: Ourselin S, Styner MA, eds. Proceedings of SPIE: medical imaging 2015—image processing. Vol 9413. Bellingham, Wash: International Society for Optics and Photonics, 2015; 94131G.
- 18. Deep convolutional neural networks for multi-modality isointense infant brain image segmentation. Neuroimage 2015;108:214–224.
- 19. A novel approach for tuberculosis screening based on deep convolutional neural networks. In: Tourassi GD, Armato SG, eds. Proceedings of SPIE: medical imaging 2016—computer-aided diagnosis. Vol 9785. Bellingham, Wash: International Society for Optics and Photonics, 2016; 97852W.
- 20. Two public chest X-ray datasets for computer-aided screening of pulmonary diseases. Quant Imaging Med Surg 2014;4(6):475–477.
- 21. Belarus Public Health Web site. http://obsolete.tuberculosis.by/. Published September 1, 2011. Updated July 17, 2015. Accessed August 20, 2016.
- 22. Caffe: convolutional architecture for fast feature embedding. In: Proceedings of the 22nd ACM International Conference on Multimedia. New York, NY: ACM, 2014.
- 23. ImageNet classification with deep convolutional neural networks. In: Proceedings of the 25th International Conference on Neural Information Processing Systems. Red Hook, NY: Curran Associates, 2012; 1097–1105.
- 24. Going deeper with convolutions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2015; 1–9.
- 25. ImageJ. US National Institutes of Health, Bethesda, Maryland, USA. http://imagej.nih.gov/ij/. 1997–2016.
- 26. Model assessment and selection. In: The elements of statistical learning. 2nd ed. New York, NY: Springer, 2009; 219–259.
- 27. Receiver operating characteristic curves and their use in radiology. Radiology 2003;229(1):3–8.
- 28. Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach. Biometrics 1988;44(3):837–845.
- 29. Assessing the performance of prediction models: a framework for traditional and novel measures. Epidemiology 2010;21(1):128–138.
- 30. The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognit 1997;30(7):1145–1159.
- 31. ROC graphs: notes and practical considerations for researchers. Mach Learn 2004;31(1):1–38.
- 32. Approximate is better than "exact" for interval estimation of binomial proportions. Am Stat 1998;52(2):119–126.
- 33. Machine learning and radiology. Med Image Anal 2012;16(5):933–951.
- 34. Deep learning. Nature 2015;521(7553):436–444.
- 35. Deep image: scaling up image recognition. arXiv preprint. https://arxiv.org/abs/1501.02876. Published January 13, 2015. Updated July 6, 2015. Accessed September 21, 2016.
- 36. Dropout: a simple way to prevent neural networks from overfitting. J Mach Learn Res 2014;15(1):1929–1958.
- 37. Ensemble methods in machine learning. Lect Notes Comput Sci 2000;1857:1–15.
- 38. Understanding neural networks through deep visualization. arXiv preprint. https://arxiv.org/abs/1506.06579. Published June 22, 2015. Accessed September 21, 2016.
Article History
Received October 5, 2016; revision requested November 23; revision received December 12; accepted January 9, 2017; final version accepted January 19. Published online: Apr 24, 2017
Published in print: Aug 2017










