Reporting Diagnostic Accuracy Studies: Some Improvements after 10 Years of STARD
Abstract
Purpose
To evaluate how diagnostic accuracy study reports published in 2012 adhered to the Standards for Reporting of Diagnostic Accuracy (STARD) statement and whether there were any differences in reporting compared with 2000 and 2004.
Materials and Methods
PubMed was searched for studies published in 12 high-impact-factor journals in 2012 that evaluated the accuracy of one or more diagnostic tests against a clinical reference standard. Two independent reviewers scored reporting completeness of each article with the 25-item STARD checklist. Mixed-effects modeling was used to analyze differences in reporting with previous evaluations from articles published in 2000 and 2004.
Results
Included were 112 articles. The overall mean number of STARD items reported in 2012 was 15.3 ± 3.9 (standard deviation; range, 6.0–23.5). There was an improvement of 3.4 items (95% confidence interval: 2.6, 4.3) compared with studies published in 2000, and an improvement of 1.7 items (95% confidence interval: 0.9, 2.5) compared with studies published in 2004. Significantly more items were reported for single-gate studies compared with multiple-gate studies (16.8 vs 12.1, respectively; P < .001) and for studies that evaluated imaging tests compared with laboratory tests and other types of tests (17.0 vs 14.0 vs 14.5, respectively; P < .001).
Conclusion
Completeness of reporting improved in the 10 years after the launch of STARD, but it remains suboptimal for many articles. Reporting of inclusion criteria and sampling methods for recruiting patients, information about blinding, and confidence intervals for accuracy estimates are in need of further improvement.
© RSNA, 2014
Introduction
The Standards for Reporting of Diagnostic Accuracy (STARD) statement was first published in 2003 (1–3). The aim of STARD is to increase the transparency and completeness of reporting of diagnostic accuracy studies. The statement includes a list of 25 items that should be reported for studies to be scientifically and clinically informative to reviewers and readers.
Diagnostic accuracy studies are used to evaluate the ability of a test to identify patients with a target condition, typically a disease or a form of disease, and to distinguish them from those without the target condition. These studies are prone to several types of bias (4–6). Furthermore, the accuracy of a test is not a fixed property; it depends on the clinical setting, the type of patients, and how the test is performed and interpreted. This information should be provided in the study report; when reporting is adequate, readers can judge the validity and applicability of the study results.
Evaluations of the completeness of reporting for diagnostic accuracy studies in 12 high-impact-factor journals in 2000 (before STARD) and 2004 (after STARD) found that, of the 25 items, an average of 1.8 additional items were reported after the introduction of STARD (7–9). The overall completeness of reporting remained suboptimal; slightly more than half of the items were reported by studies published in 2004.
To our knowledge, it is unknown whether the initial small but statistically significant improvement in reporting quality grew over the years (10). Our purpose was to evaluate how diagnostic accuracy study reports published in 2012 adhered to the STARD statement and whether there were any differences in reporting compared with that in 2000 and 2004.
Materials and Methods
Literature Search
We made use of the search and selection methods developed for the evaluations of adherence to STARD among studies published in 2000 and 2004 (7,8). On September 17, 2013, we searched PubMed for diagnostic accuracy studies by using a previously validated search filter (“sensitivity AND specificity.sh” or “specificit*.tw” or “false negative.tw” or “accuracy.tw”) (11). The search was limited to studies on human subjects, reported in 2012 in six general medical journals (Annals of Internal Medicine, Archives of Internal Medicine, BMJ, JAMA, Lancet, and New England Journal of Medicine) and six discipline-specific journals (Archives of Neurology, Clinical Chemistry, Circulation, Gut, Neurology, and Radiology). All of these journals had an impact factor higher than 4 in 2000, 2004, and 2012.
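For readers who wish to approximate this search programmatically, the sketch below shows one way to run a similar query against PubMed from R. The rentrez package, the exact field tags, and the single-journal restriction shown here are our own assumptions for illustration, not part of the published search strategy.

```r
# A minimal sketch, assuming the rentrez package: an approximation of the
# published search filter, limited here to one journal and to 2012.
library(rentrez)

filter <- paste('("sensitivity and specificity"[MeSH Terms]',
                'OR specificit*[Title/Abstract]',
                'OR "false negative"[Title/Abstract]',
                'OR accuracy[Title/Abstract])')
query <- paste(filter,
               'AND "Radiology"[Journal]',
               'AND 2012[Date - Publication]',
               'AND humans[MeSH Terms]')

hits <- entrez_search(db = "pubmed", term = query, retmax = 1000)
hits$ids  # PubMed identifiers of candidate records
```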
Study Selection
We included articles if they reported in detail on a study that evaluated the diagnostic accuracy of one or more tests against a clinical reference standard in human subjects and reported an estimate of accuracy (sensitivity, specificity, likelihood ratios, predictive values, diagnostic odds ratio, or area under the receiver operating characteristic curve). We excluded studies about the predictive and prognostic accuracy of tests, as well as reviews, letters, viewpoints, and commentaries.
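For reference, all of the listed measures except the area under the receiver operating characteristic curve can be derived from the standard 2 × 2 cross-tabulation of index test results against the reference standard. The sketch below, with hypothetical cell counts, is illustrative only and is not taken from any included study.

```r
# A minimal sketch of the accuracy measures named above, from hypothetical
# 2 x 2 cell counts (index test result vs clinical reference standard).
# The area under the ROC curve is omitted: it requires results across the
# full range of test thresholds, not a single 2 x 2 table.
accuracy_measures <- function(tp, fp, fn, tn) {
  sens <- tp / (tp + fn)                     # sensitivity
  spec <- tn / (tn + fp)                     # specificity
  list(sensitivity = sens,
       specificity = spec,
       ppv         = tp / (tp + fp),         # positive predictive value
       npv         = tn / (tn + fn),         # negative predictive value
       lr_positive = sens / (1 - spec),      # positive likelihood ratio
       lr_negative = (1 - sens) / spec,      # negative likelihood ratio
       dor         = (tp * tn) / (fp * fn))  # diagnostic odds ratio
}

accuracy_measures(tp = 90, fp = 10, fn = 15, tn = 85)
```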
Two authors (D.A.K. and W.A.V.E., with 3 and 5 years of experience, respectively, in performance of systematic reviews) independently scanned titles, abstracts, and keywords of the search results to identify potentially eligible articles. In line with the previous evaluations of adherence to STARD (7,8), we assessed only a fourth of the potentially eligible articles published in Radiology because of the relatively large number of diagnostic accuracy studies reported in this journal. By using a random number generator (Excel; Microsoft, Redmond, Wash), we built a random list of the potentially eligible articles from this journal and selected at least two articles from each month of the year, starting at the top of the list.
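The sketch below illustrates one plausible reading of this sampling procedure in R; the data frame, its columns, and the exact stopping rule are our assumptions for illustration, not the authors' original procedure.

```r
# A minimal sketch of one plausible reading of the sampling procedure, using a
# hypothetical data frame `radiology` with columns id and month (1-12) for the
# 127 potentially eligible Radiology articles.
set.seed(2013)
radiology <- data.frame(id = 1:127, month = sample(1:12, 127, replace = TRUE))

shuffled <- radiology[sample(nrow(radiology)), ]   # random order of articles
target   <- ceiling(nrow(radiology) / 4)           # keep roughly one-quarter

# Take the first two articles per month from the random list, then keep filling
# from the top of the list until the target sample size is reached
per_month <- do.call(rbind, lapply(split(shuffled, shuffled$month), head, 2))
rest      <- subset(shuffled, !(id %in% per_month$id))
selected  <- rbind(per_month, head(rest, max(0, target - nrow(per_month))))
nrow(selected)
```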
If an article was considered to be potentially eligible by at least one author, the full text was assessed independently by both authors against the inclusion criteria. Disagreements were resolved through discussion. Whenever necessary, a third author (P.M.M.B.) made the final decision.
Data Extraction
On the basis of the study design, we classified reports of included studies as single-gate studies (or cohort studies, with a single set of inclusion criteria for participants) or multiple-gate studies (or case-control studies, with two or more sets of inclusion criteria) (12). Depending on the index test under investigation, studies were categorized as those that evaluated imaging tests, laboratory tests, or other types of tests (eg, physical examination). We examined the instructions to authors of the 12 included journals and categorized them as “adopters” if the use of STARD was required or recommended and as “nonadopters” if it was not.
Adherence to STARD
Between November 2013 and February 2014, we examined the extent to which included articles adhered to the 25 items on the STARD list by using a standardized score form previously developed and validated for the evaluation of studies published in 2000 and 2004 (7,8,13). For each included article, we counted the number of reported STARD items.
Six items on the STARD list concern both the index test and the reference standard: item 8 (technical specifications), item 9 (cutoffs and categories), item 10 (number and expertise of readers), item 11 (blinding), item 13 (methods for test reproducibility), and item 24 (results of test reproducibility). These items were evaluated separately for the index test and the reference standard. They could be fully reported (for both the index test and the reference standard), halfway reported (only for the index test or only for the reference standard), or not reported (for neither). If halfway reported, they were counted as one-half. We also assessed whether included articles contained a flowchart, which is strongly recommended by STARD.
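As an illustration of this scoring rule, the sketch below (with hypothetical item scores) shows how half-points for the six dual items enter the 0–25 article total; it is not the score form itself.

```r
# A minimal sketch of the half-point rule, with hypothetical scores:
# 1 = fully reported (index test and reference standard), 0.5 = halfway
# reported (only one of the two), 0 = not reported.
dual_items   <- c(item8 = 1, item9 = 0.5, item10 = 0.5,
                  item11 = 1, item13 = 0, item24 = 0.5)

# The remaining 19 items concern the study as a whole and are scored 0 or 1
# (only a few are shown here for brevity).
single_items <- c(item1 = 1, item2 = 1, item3 = 0)

total_stard <- sum(dual_items) + sum(single_items)  # contributes to the 0-25 total
total_stard
```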
Although previous studies have reported good interreviewer agreement regarding the scoring of STARD items (10,13), the list was originally designed to guide authors, editors, and peer reviewers, not as a tool to assess completeness of reporting. Inevitably, when it is used as a tool to assess completeness of reporting, scoring of some elements is subjective. To assure high interreviewer agreement for each item, a training session was organized. Two included articles were assessed by one author (N.S.) who had also scored all the reports in the 2000 and 2004 evaluations, and by all reviewers involved in the current evaluation. These two articles were discussed in a training session until consensus on all STARD items was reached. In addition, the principal reviewer of the previous evaluations (N.S.) had several meetings with the principal reviewer of the current analysis (D.A.K.), in which they discussed the scoring of STARD items in detail and any ambiguities encountered during the scoring process. After this, one principal reviewer (D.A.K., with 1 year of experience in performing literature reviews of diagnostic accuracy studies) and a second reviewer (one of the following: J.W., with 1 year of experience; W.A.V.E., with 3 years of experience; or M.M.L., L.H., or P.M.M.B., each with more than 10 years of experience performing literature reviews of diagnostic accuracy studies) independently reviewed all included articles. Disagreements were resolved through discussion, but judgment from a third reviewer (P.M.M.B.) was decisive, if necessary. Reviewers were not blinded to author or journal.
Statistical Analysis
For each article that was included, we counted the number of STARD items reported (range, 0–25 items) and calculated an overall mean, range, and standard deviation for the entire group. We calculated the percentage of agreement to score STARD items for the first, middle, and last article evaluated by each second reviewer (15 studies in total). For each item on the STARD list, the number and percentage of articles that reported the item were calculated.
We used Student t test statistics to compare the total number of STARD items reported between studies that were published in general medical journals and discipline-specific journals, and between single-gate and multiple-gate studies. We used one-way analysis of variance to compare studies that evaluated imaging tests, laboratory tests, and other types of tests. These subgroup analyses were also performed with nonparametric test statistics by using Mann-Whitney U and Kruskal-Wallis tests.
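These subgroup comparisons correspond to standard base R tests; the sketch below assumes a hypothetical data frame `d` with one row per article and is not the original analysis script.

```r
# A minimal sketch, assuming a hypothetical data frame `d` with one row per
# article: total (STARD items reported), journal_type (general vs
# discipline-specific), design (single vs multiple gate), and test_type
# (imaging, laboratory, other).
t.test(total ~ journal_type, data = d)      # general vs discipline-specific
t.test(total ~ design, data = d)            # single-gate vs multiple-gate
summary(aov(total ~ test_type, data = d))   # one-way ANOVA across test types

# Nonparametric counterparts
wilcox.test(total ~ design, data = d)       # Mann-Whitney U test
kruskal.test(total ~ test_type, data = d)   # Kruskal-Wallis test
```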
To determine whether the reporting of individual items had improved, for each item we compared the proportion of articles that reported the item in 2012 with the corresponding proportions for 2000 and 2004. By using logistic mixed-effects modeling, we accounted for systematic differences in STARD adherence between journals. The mean number of STARD items reported in 2000, 2004, and 2012 was compared by using linear mixed-effects modeling, which again accounted for between-journal differences. We used χ2 test statistics to evaluate whether features of included articles differed systematically from those in the 2000 and 2004 evaluations.
Data were analyzed by using statistical software (SPSS version 22; SPSS, Chicago, Ill). We performed mixed-effects modeling by using other statistical software (MASS package in R version 3.0; R Foundation for Statistical Computing, Vienna, Austria).
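The sketch below outlines how such mixed-effects models could be specified in R with journal as a random intercept. The stacked data set `all_years`, the variable names, and the use of nlme::lme for the linear model are our assumptions; the article names only the MASS package.

```r
# A minimal sketch, assuming a hypothetical stacked data set `all_years`
# (articles from 2000, 2004, and 2012) with columns total (number of STARD
# items reported), item11a (1 if reported, 0 if not), year, and journal.
library(MASS)   # glmmPQL, as named in the article
library(nlme)   # lme; glmmPQL itself builds on nlme

# Linear mixed-effects model: mean number of STARD items by publication year,
# with a random intercept per journal to account for between-journal differences
fit_total <- lme(total ~ factor(year), random = ~ 1 | journal, data = all_years)

# Logistic mixed-effects model: probability that a given item is reported
fit_item <- glmmPQL(item11a ~ factor(year), random = ~ 1 | journal,
                    family = binomial, data = all_years)

summary(fit_total)
summary(fit_item)
```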
Results
Search Results and Characteristics of Included Studies
The literature search resulted in 600 publications. On the basis of the title, abstract, and keywords, 273 articles were considered to be potentially eligible (Fig 1). As planned, we randomly excluded three-fourths (95 of 127) of the potentially eligible articles in Radiology. After assessment of the full texts of the remaining 178 articles, 112 diagnostic accuracy study reports published in 2012 were included. Reasons for exclusion of potentially eligible articles are provided in Figure 1. References to the included and excluded studies are available in Appendix E1 (online).
We considered all but one of the 12 journals to be STARD adopters. Eight of the 12 journals made a clear statement that they required adherence to STARD in their instructions to authors, while three journals only provided a reference to the STARD statement. In 2004, seven journals were considered to be STARD adopters (Table 1) (8).
The number and characteristics of diagnostic accuracy studies are provided in Table 1. We found that 82.1% (92 of 112) of the studies were reported in discipline-specific journals versus 17.9% (20 of 112) in general medical journals; 68.7% (77 of 112) were single-gate studies and 31.2% (35 of 112) were multiple-gate studies. These proportions did not differ significantly from those for studies published in 2000 and 2004 (P = .41 and .60, respectively). Imaging tests were evaluated in 40.2% (45 of 112) of included study reports, laboratory tests in 43.7% (49 of 112), and other types of tests in 16.1% (18 of 112). Seven of the 112 included articles (6.3%) explicitly referred to the STARD statement.
Item-specific Adherence to STARD
The percentage agreement for scoring STARD items was 82.5% (132 of 160 items) for the first article evaluated by each reviewer, 88.1% (141 of 160) for the middle article, and 85.6% (137 of 160) for the final article. Adherence to individual STARD items is reported in Table 2. There were large differences between items: only one article reported on methods for calculating reproducibility of the reference standard (item 13b), for example, while all but two articles discussed the clinical applicability of the study findings (item 25), although sometimes only in a general way. For all six items that applied to both the index test and reference standard, information that concerned the index test was better reported.
Of the 31 features evaluated (six of the 25 items concern both the index test and the reference standard), only three were reported in fewer than one-quarter of the articles: methods for assessing reproducibility of the reference standard (item 13b), adverse events (item 20), and estimates of reproducibility of the reference standard (item 24b).
Our analyses showed that the following features were significantly more often reported in 2012 than in 2004: study identification (item 1), study population (item 3), data collection (item 6), blinded readers of index test (item 11a), statistical methods (item 12), time interval between tests (item 17), distribution of severity of disease (item 18), and estimates of diagnostic accuracy with confidence intervals (item 21). A flowchart was reported by 35.7% (40 of 112) of the studies compared with only 1.6% (two of 124) in 2000 and 12.1% (17 of 141) in 2004 (P < .001). Compared with 2004, the following features were reported significantly less often: participant sampling (item 5), readers of index test (item 10a), and accuracy across subgroups (item 23).
Overall Adherence to STARD
The mean number of STARD items reported was 15.3 ± 3.9 (standard deviation; range, 6.0–23.5). Overall, 74.1% (83 of 112) of the articles reported more than half of the 25 items, while 9.8% (11 of 112) reported more than 20 items (Fig 2). Significantly more items were reported in studies that were published in general journals than in studies published in discipline-specific journals (17.7 vs 14.8, respectively; P = .002), for single-gate studies compared with multiple-gate studies (16.8 vs 12.1, respectively; P < .001), and for studies that evaluated imaging tests compared with laboratory tests and other types of tests (17.0 vs 14.0 vs 14.5, respectively; P < .001) (Fig 3). Repeated analyses with nonparametric instead of parametric testing did not affect conclusions about significance in these three subgroup analyses (P = .003, <.001, and .001, respectively).
In 2000 and 2004, the mean number of STARD items reported was 11.9 and 13.6, respectively. There was a significant increase in completeness of reporting over the years. Articles in 2012 reported, on average, 3.4 more items (95% confidence interval: 2.6, 4.3) than those published in 2000, and 1.7 (95% confidence interval: 0.9, 2.5) more than those published in 2004. Only 41.1% (51 of 124) of the articles in 2000 reported more than half of the 25 items and none reported more than 20, compared with 61.7% (87 of 141) and 2.1% (three of 141), respectively, in 2004 (Fig 2).
Figure 2 shows that the increase in completeness of reporting was not uniform across studies. The proportion of articles that reported less than half of the STARD items (top left-hand part of Fig 2) barely changed between 2004 and 2012, which indicates that the quarter of studies with the poorest reporting showed almost no improvement at all. In the lower right-hand corner of Figure 2, the difference between 2004 and 2012 is more visible, which indicates that the improvement between 2004 and 2012 was mainly driven by a subset of studies with substantially more complete reporting.
Discussion
We evaluated the extent to which diagnostic accuracy study reports that were published in 12 high-impact-factor journals in 2012 adhered to the STARD list and compared our findings with results from previous, comparable evaluations of articles published in 2000 and 2004 (7,8). We observed that the quality of reporting has slowly but steadily improved, but that completeness of reporting and transparency remain suboptimal in many articles.
This gradual increase in reporting completeness is in line with previous analyses of adherence to STARD. A recent meta-analysis of six of these evaluations showed that studies published after the launch of STARD reported, on average, 1.4 more items (10). All these evaluations were performed in the first few years after the publication of STARD, which may have been too early to expect large improvements. The results of our analysis indicate that the small initial improvement persisted and grew over the years, but also that the gain is not as large as might have been anticipated.
Over the years, the reporting of many individual STARD items improved, but improvement varied across items, and some domains clearly need further attention. A quarter of the evaluated articles reported less than half of the STARD items. Many articles do not adequately report the patient eligibility criteria, recruitment process, and sampling methods; such information is crucial for judging the applicability of study results.
Some items associated with sources of bias could also benefit from more complete reporting. It is often unclear whether readers of the tests were blinded to clinical information, which prohibits assessment of the risk of review bias. Many articles do not report how many eligible patients failed to undergo the index test or reference standard, which prohibits a judgment about verification bias. The time interval between the index test and reference standard was unclear in half of the articles. Changes in severity of the target condition, or the initiation or withdrawal of medical interventions could occur between tests and influence accuracy estimates.
Although the number of articles that reported confidence intervals around estimates of diagnostic accuracy doubled between 2000 and 2012, it is disappointing that about a third still failed to do so in 2012. Failure to report measures of precision around accuracy estimates facilitates an overly generous, optimistic interpretation of study results, a phenomenon that is common in diagnostic accuracy study reports (14,15).
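As an aside, reporting such intervals requires little effort; the one-line R sketch below computes an exact 95% confidence interval around a hypothetical sensitivity estimate and is given only to illustrate the point.

```r
# A minimal sketch: an exact (Clopper-Pearson) 95% confidence interval around a
# hypothetical sensitivity estimate (90 of 105 diseased patients test positive)
binom.test(x = 90, n = 105)$conf.int
```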
At the time of publication of this article, all but one of the evaluated journals had adopted STARD in their instructions to authors, yet adherence remains suboptimal for many published reports. This may indicate that authors, editors, and peer reviewers do not always recognize a diagnostic accuracy study as such, or that journals have not actively implemented the use of STARD in their editorial and peer-review processes. Journal editors and peer reviewers could be actively trained to identify diagnostic accuracy studies and to evaluate the quality of reporting. Reporting experts could be invited to peer review study reports. Previous studies have shown that peer reviewers often fail to identify reporting deficiencies in the methods and results of randomized trials (16), and that additional reviews on the basis of reporting guidelines increase the quality of articles (17).
Our study has some potential limitations. We acknowledge that we may have been strict in scoring some items. For example, identification of the study (item 1) was considered satisfactorily reported only when the term diagnostic accuracy was included in the title or abstract. Characteristics of the study population (item 15) were considered adequately reported only when some information (other than age and sex) about presentation of symptoms was also provided. We did this to keep our results comparable with those of the analyses of articles published in 2000 and 2004. Other items, especially those with the lowest adherence rates, may not always be applicable. Adverse events (item 20), for example, are not an issue for most imaging tests, and the reproducibility of the reference standard (items 13b and 24b) is often well established. STARD was launched in 2003, and an update is underway. Although, to our knowledge, there have been no major developments in concepts of study design and sources of bias since then, some of the items on the current list may be outdated or redundant, while other relevant items may be absent.
None of the reviewers involved in this evaluation had analyzed the articles in the 2000 and 2004 evaluations, but we made considerable efforts to achieve comparability with these previous evaluations. Nevertheless, it is possible that some features were interpreted somewhat differently.
We only included studies that were published in journals with an impact factor above 4. In other fields of research, quality of reporting was shown to be lower in journals with lower impact factors (18). We included studies that evaluated diagnostic accuracy, even if this was not their primary objective. We decided to do this because primary and secondary objectives are often not explicitly reported (19), and because we believe that any estimate of test accuracy should be accompanied by sufficient information to evaluate its validity and applicability. Because only one of 12 selected journals did not explicitly adopt STARD, we were unable to analyze differences between journals that adopted these standards and journals that did not.
Medical tests are the basis for almost every clinical decision. The tests we rely on are usually not perfect: patients with the target condition may have a negative test result, while other patients test positive but do not have the condition. When clinicians order tests and interpret test results, they should consider the likelihood that such errors occur. In modern evidence-based medicine, this should not be based on hearsay or personal experience; it should be informed by the results of diagnostic accuracy studies. However, readers will only be able to identify sources of bias and appreciate limitations regarding the applicability of study results to their own setting when the reporting is honest, transparent, and complete. We strongly encourage authors to use STARD to report their diagnostic accuracy studies, and we encourage editors and peer reviewers to encourage or remind authors to do so.
Advances in Knowledge
■ Diagnostic accuracy study reports published in 2012 reported, on average, 15.3 of the 25 items on the Standards for Reporting of Diagnostic Accuracy list.
■ Significantly more items were reported for single-gate studies compared with multiple-gate studies (16.8 vs 12.1, respectively; P < .001) and for studies that evaluated imaging tests compared with laboratory tests and other types of tests (17.0 vs 14.0 vs 14.5, respectively; P < .001).
■ There was a significant improvement of 3.4 items (95% confidence interval: 2.6, 4.3) compared with reports published in 2000 and of 1.7 items (95% confidence interval: 0.9, 2.5) compared with reports published in 2004.
■ A flowchart was reported for 36% (40 of 112) of the studies, compared with only 2% (two of 124) in 2000 and 12% (17 of 141) in 2004 (P < .001).
■ The reporting of items related to the risk of bias and the applicability of study results is in need of further improvement.
Author Contributions
Author contributions: Guarantor of integrity of entire study, D.A.K.; study concepts/study design or data acquisition or data analysis/interpretation, all authors; manuscript drafting or manuscript revision for important intellectual content, all authors; approval of final version of submitted manuscript, all authors; agrees to ensure any questions related to the work are appropriately resolved, all authors; literature research, D.A.K., J.W., W.A.V.E., L.H., N.S.; statistical analysis, D.A.K., P.M.M.B.; and manuscript editing, all authors
References
- 1. The STARD statement for reporting studies of diagnostic accuracy: explanation and elaboration. Clin Chem 2003;49(1):7–18.
- 2. Towards complete and accurate reporting of studies of diagnostic accuracy: The STARD Initiative. Radiology 2003;226(1):24–28.
- 3. Reporting the accuracy of diagnostic tests: the STARD initiative 10 years on. Clin Chem 2013;59(6):917–919.
- 4. Empirical evidence of design-related bias in studies of diagnostic tests. JAMA 1999;282(11):1061–1066.
- 5. QUADAS-2: a revised tool for the quality assessment of diagnostic accuracy studies. Ann Intern Med 2011;155(8):529–536.
- 6. QUADAS-2 Steering Group. A systematic review classifies sources of bias and variation in diagnostic test accuracy studies. J Clin Epidemiol 2013;66(10):1093–1104.
- 7. Quality of reporting of diagnostic accuracy studies. Radiology 2005;235(2):347–353.
- 8. The quality of diagnostic accuracy studies since the STARD statement: has it improved? Neurology 2006;67(5):792–797.
- 9. There is nothing staid about STARD: progress in the reporting of diagnostic accuracy studies. Neurology 2006;67(5):740–741.
- 10. Reporting quality of diagnostic accuracy studies: a systematic review and meta-analysis of investigations on adherence to STARD. Evid Based Med 2014;19(2):47–54.
- 11. Publications on diagnostic test evaluation in family medicine journals: an optimal search strategy. J Clin Epidemiol 2000;53(1):65–69.
- 12. Case-control and two-gate designs in diagnostic accuracy studies. Clin Chem 2005;51(8):1335–1341.
- 13. Reproducibility of the STARD checklist: an instrument to assess the quality of reporting of diagnostic accuracy studies. BMC Med Res Methodol 2006;6:12.
- 14. Overinterpretation and misreporting of diagnostic accuracy studies: evidence of “spin”. Radiology 2013;267(2):581–588.
- 15. Overinterpretation of clinical applicability in molecular diagnostic research. Clin Chem 2009;55(4):786–794.
- 16. Impact of peer review on reports of randomised trials published in open peer review journals: retrospective before and after study. BMJ 2014;349:g4145.
- 17. Effect of using reporting guidelines during peer review on quality of final manuscripts submitted to a biomedical journal: masked randomised trial. BMJ 2011;343:d6783.
- 18. A systematic scoping review of adherence to reporting guidelines in health care literature. J Multidiscip Healthc 2013;6:169–188.
- 19. Publication and reporting of test accuracy studies registered in ClinicalTrials.gov. Clin Chem 2014;60(4):651–659.
Article History
Received May 16, 2014; revision requested June 27; revision received July 24; accepted August 19; final version accepted August 29. Published online: Oct 27, 2014
Published in print: Mar 2015