Abstract
Background: Randomized controlled trials (RCTs) are thought to provide the most accurate estimation of “true” treatment effect. The relative quality of effect estimates derived from nonrandomized studies (nRCTs) remains unclear, particularly in surgery, where the obstacles to performing high-quality RCTs are compounded. We performed a meta-analysis of effect estimates of RCTs comparing surgical procedures for breast cancer relative to those of corresponding nRCTs.
Methods: English-language RCTs of breast cancer treatment in human patients published from 2003 to 2008 were identified in MEDLINE, EMBASE and Cochrane databases. We identified nRCTs using the National Library of Medicine’s “related articles” function and reference lists. Two reviewers conducted all steps of study selection. We included studies comparing 2 surgical arms for the treatment of breast cancer. We extracted information on treatment efficacy estimates, expressed as relative risk (RR), for outcomes of interest in both the RCTs and nRCTs.
Results: We identified 12 RCTs representing 10 topic/outcome combinations with comparable nRCTs. On visual inspection, 4 of 10 outcomes showed substantial differences in summary RR. The pooled RR estimates for RCTs versus nRCTs differed more than 2-fold in 2 of 10 outcomes and failed to demonstrate consistency of statistical differences in 3 of 10 cases. A statistically significant difference, as assessed by the z score, was not detected for any of the outcomes.
Conclusion: Randomized controlled trials comparing surgical procedures for breast cancer may demonstrate clinically relevant differences in effect estimates in 20%–40% of cases relative to those generated by nRCTs, depending on which metric is used.
Randomized controlled trials (RCTs) have been regarded as superior to nonrandomized observational studies (nRCTs) because of the biases that can arise in nRCTs and the diverging results of nRCTs relative to RCTs. For example, laparoscopic repair of inguinal hernia gained popularity over open repair as a result of favourable results in nRCTs. Subsequently, numerous RCTs were performed and assessed in a meta-analysis, which concluded that the recurrence rate after laparoscopic inguinal hernia repair was no different from that after open repair. The RCTs also warned the surgical community of significantly higher rates of major visceral complications from a laparoscopic approach. Such unexpected findings of RCTs compared with nRCTs have led to a call for careful examination of surgical research, including increased use of RCTs.1,2
Studies by Concato and colleagues,3 Benson and Hartz4 and Ioannidis and colleagues5 have renewed the debate regarding the role of RCTs relative to nRCTs. The first 2 groups reported that nRCTs provide similar results to RCTs and argue for an expanded role for nRCTs, particularly to exploit clinical databases. Meanwhile, Ioannidis and colleagues5 reported reasonable correlation between RCTs and nRCTs in many instances, but also found a greater than 2-fold difference in odds ratios in 33% of topics examined. Surgical studies made up a small proportion of the topics compared. Shikata and colleagues,6 using a similar methodology to Ioannidis and colleagues, reported on a comparison of effects in RCTs versus nRCTs of digestive surgery. The authors reported that one-quarter of nRCTs gave different results than RCTs. Similar to the studies by Concato and colleagues and Ioannidis and colleagues, Shikata and colleagues included only studies for which meta-analyses of RCTs were available, and they included comparisons between surgical procedures and nonsurgical treatments. Surgical RCTs, as a result of being infrequently conducted, may be somewhat under-represented in meta-analyses. Therefore, studies such as that of Shikata and colleagues may capture only a subset of surgical RCTs.
A study specifically designed to examine the effect estimates for surgical procedures alone in RCTs and nRCTs has, to our knowledge, not been performed. MacLeod7 and Hall and colleagues8 have highlighted methodologic issues, such as standardization of procedures, outcome assessment and sample size, which suggest that surgical trials may be different than nonsurgical trials. Concato and colleagues3 noted that observational investigations of surgical operations may be more prone to selection bias. This raises the possibility that discrepancies of effect estimates between RCTs and nRCTs examining surgical versus nonsurgical interventions may not reflect potential discrepancies in studies comparing 2 surgical interventions.
Given that the paradigm for adopting new surgical procedures differs somewhat from that of pharmacologic therapies and that nRCTs continue to play an important role in informing decisions about adoption of surgical therapies, there is an ongoing need to formally compare effect estimates from nRCTs with those from RCTs.
The purpose of our study was to determine whether the conclusions reached from an assessment of the body of knowledge obtained from nRCTs would be consistent with those obtained from RCTs, with specific focus on a surgically treated disease. Would decisions based on the collective evidence from nRCTs be consistent with those based on the “highest level of evidence,” an RCT addressing that same topic? We performed a systematic review and meta-analysis to compare the effect estimates of RCTs with those of nRCTs in the domain of breast cancer surgery. This comparison will help us to better understand the relation between study design, the resulting effect estimate and the subsequent impact on evolving knowledge of surgical procedures.
Methods
Search for RCTs
Our first goal was to identify all existing English-language RCTs published between January 2003 and May 2008 that included 2 surgical arms for the treatment of breast cancer. We limited our search to this 5-year period with the hope of identifying RCTs for which there would likely be a large cohort of earlier nRCTs available for comparison. We identified RCTs through a search of MEDLINE, EMBASE and the Cochrane databases. In MEDLINE and EMBASE, we used a sensitive search strategy proposed by the Cochrane Collaboration for identifying RCTs. Where possible, the databases were searched by linking 2 broad content areas: breast cancer and surgical procedures. Medical Subject Headings (MeSH) were used in the MEDLINE search, including the terms “neoplasm,” “breast neoplasms” and “surgical procedures, operative.” The EMBASE search consisted primarily of a keyword search, but we used EMTREE terms, such as “surgical technique,” “breast tumor” and “breast surgery.” This strategy was modified for searches in the Cochrane database. Detailed search strategies are available on request.
Two reviewers (J.P.E., and either E.J.K. or A.J.G.) independently screened citations by title and abstract to identify studies for full-text review. To be included as eligible RCTs, studies had to be truly randomized and had to involve comparisons of 2 surgical procedures used for the treatment of breast cancer. Individual full text articles were then independently reviewed to determine eligibility. Disagreements were resolved by consensus. We then searched the reference lists of included articles and consulted a clinical expert to identify other relevant RCTs.
Search for nRCTs
We identified nRCTs using a number of strategies. For each identified RCT, we used the “related articles” function of the National Library of Medicine database within PubMed to generate an extensive list of abstracts for subsequent screening. This “related article” search function identifies articles using a probabilistic content similarity algorithm based primarily on abstract text. This feature supports a qualitative approach to exploring large document collections.9 Next, we performed a manual search of reference lists of previously identified RCTs. The 2 reviewers then independently screened reference citations by title and abstract to identify studies with comparable populations, interventions and comparators to the matched RCTs and to determine which ones would be eligible for full text review. Satisfaction of all of the following criteria led to inclusion of nRCTs: publication in English, nonrandomized study design and comparison of 2 groups that were comparable to those represented in the relevant RCT. We excluded nRCTs that used a historical control group. As the purpose of our study was to determine whether the conclusions reached from an assessment of the findings from nRCTs published before a corresponding RCT would be consistent with those of that RCT, we included nRCTs without a formal evaluation of study quality.
The outcomes of interest were determined after identifying matched groups of RCTs and nRCTs in a hierarchical manner. If mortality or recurrence was assessed in both the RCTs and nRCTs, it was used as the outcome for analysis. If data on mortality or recurrence were not available from both study types, we used objectively measured outcomes found in both study types. Finally, if objectively measured outcomes were not available, we used subjectively assessed or self-reported outcomes as a last resort.
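The hierarchy described above can be illustrated with a short Python sketch. The category labels (`mortality_or_recurrence`, `objective`, `subjective`), the function name and the example outcomes are hypothetical and were introduced only for illustration; the selection in the study itself was made by the reviewers, not by software.

```python
# Outcome tiers in descending order of preference (labels are hypothetical).
PRIORITY = ["mortality_or_recurrence", "objective", "subjective"]

def select_outcomes(rct_outcomes, nrct_outcomes):
    """Pick the outcomes reported by both study types in the highest available tier.

    Each argument maps an outcome name to one of the PRIORITY labels.
    Returns (tier, sorted list of outcome names), or (None, []) if nothing is shared.
    """
    shared = set(rct_outcomes) & set(nrct_outcomes)
    for tier in PRIORITY:
        chosen = sorted(name for name in shared if rct_outcomes[name] == tier)
        if chosen:
            return tier, chosen
    return None, []

# Hypothetical topic: recurrence is reported only by the RCT, so the
# selection falls through to the shared objectively measured outcome.
rct = {"recurrence": "mortality_or_recurrence", "lymph node harvest": "objective"}
nrct = {"lymph node harvest": "objective", "sensory deficit": "subjective"}
tier, outcomes = select_outcomes(rct, nrct)
```

Because only outcomes present in both study types are considered, a higher-tier outcome reported by one study type alone does not block selection of a lower-tier outcome that both report.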
Data extraction
As primary outcomes were not specified in all of the RCTs, outcomes were chosen for each RCT/nRCT comparison in an inclusive manner, extracting data for outcomes represented in both study types in which event rates permitted calculation of relative risk (RR). For each study, we extracted information on the year of publication, number of patients in each study arm, duration of follow-up and number of events in each arm of the outcome(s) of interest.
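As an illustration of how the extracted event counts translate into an RR, the following Python sketch computes the RR for one study together with a Wald-type 95% confidence interval on the log scale. This is the standard large-sample formula and is assumed here; the text does not specify how study-level intervals were computed. The function name and the counts are hypothetical.

```python
import math

def relative_risk(events_a, total_a, events_b, total_b, z=1.96):
    """Return (RR, lower, upper) for arm A versus arm B.

    The confidence interval is a Wald interval built on log(RR).
    Arms with zero events would need a continuity correction, which is
    deliberately not handled in this sketch.
    """
    if min(events_a, events_b) <= 0:
        raise ValueError("zero-event arms are not handled in this sketch")
    rr = (events_a / total_a) / (events_b / total_b)
    # Large-sample standard error of log(RR).
    se_log_rr = math.sqrt(1 / events_a - 1 / total_a + 1 / events_b - 1 / total_b)
    lower = math.exp(math.log(rr) - z * se_log_rr)
    upper = math.exp(math.log(rr) + z * se_log_rr)
    return rr, lower, upper

# Hypothetical counts, not taken from any included study:
# 12 events among 100 patients in arm A, 20 among 100 in arm B.
rr, lower, upper = relative_risk(12, 100, 20, 100)
```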
Statistical analysis
To compare the effect estimates of the RCTs and nRCTs for each outcome identified, the data from the RCTs and nRCTs were combined within study type grouping to generate pooled effect estimates. A summary RR for each study type and outcome was determined using a DerSimonian and Laird random effects model. Heterogeneity was assessed using the Q statistic and was considered to be significant at p < 0.10. Statistical analysis was performed using Stata software version 9.2 (StataCorp).
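For readers who wish to see the pooling step spelled out, a minimal Python sketch of the DerSimonian and Laird estimator follows. It is not the Stata code used for the analysis; the function name, the input format (log RRs with their standard errors) and the example values are assumptions made for illustration.

```python
import math

def dersimonian_laird(log_rrs, ses):
    """Pool study-level log relative risks with DerSimonian-Laird random effects."""
    k = len(log_rrs)
    w = [1.0 / se ** 2 for se in ses]  # inverse-variance (fixed-effect) weights
    sum_w = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, log_rrs)) / sum_w
    # Cochran's Q: weighted squared deviations from the fixed-effect estimate.
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, log_rrs))
    # Method-of-moments estimate of the between-study variance, truncated at zero.
    if k > 1:
        c = sum_w - sum(wi ** 2 for wi in w) / sum_w
        tau2 = max(0.0, (q - (k - 1)) / c)
    else:
        tau2 = 0.0
    # Random-effects weights add tau^2 to each within-study variance.
    w_re = [1.0 / (se ** 2 + tau2) for se in ses]
    pooled = sum(wi * yi for wi, yi in zip(w_re, log_rrs)) / sum(w_re)
    se_pooled = math.sqrt(1.0 / sum(w_re))
    return {
        "rr": math.exp(pooled),
        "ci": (math.exp(pooled - 1.96 * se_pooled), math.exp(pooled + 1.96 * se_pooled)),
        "q": q,        # compare with a chi-square distribution on df degrees of freedom
        "df": k - 1,
        "tau2": tau2,
    }

# Hypothetical log RRs and standard errors for three studies of one outcome.
example = dersimonian_laird(
    [math.log(0.8), math.log(1.1), math.log(0.6)],
    [0.20, 0.25, 0.30],
)
```

The p value for heterogeneity is obtained by referring Q to a chi-square distribution with k − 1 degrees of freedom; that step is omitted here to keep the sketch free of external dependencies.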
As there are no established metrics for comparison of RCTs with nRCTs, we employed both qualitative and quantitative approaches. We used a combination of our 4 methods to draw conclusions from our data and did not prioritize any one over another. First, we performed a visual inspection of the results presented in a summarized forest plot format. An informal consensus process among all investigators was used to determine agreement or disagreement for each outcome. Second, we defined discrepancies based on differences in the relative magnitude of treatment effect. The RRs of nRCTs were deemed to be clinically significantly different from the RRs of the RCTs if they were at least double or at most half the RR of the RCTs. We selected this as the most clinically relevant comparison measure using the precedent found in the studies of Ioannidis and colleagues5 and Shikata and colleagues.6 Third, we evaluated whether there was a failure to demonstrate consistency of the statistical differences shown by the summary RRs. Agreement using this method implies that either both study types reflected no statistical difference in treatment effect or that there was a statistical difference shown and in the same direction for both study types. A similar method has been described by Barraclough and Govindan.10 Finally, we performed a z score comparison of RR estimates from RCTs versus nRCTs. This measure is somewhat underpowered to detect clinically significant differences between pooled RRs involving small sample sizes, but was also reported in the studies by Ioannidis and colleagues5 and Shikata and colleagues.6
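The 3 quantitative criteria can be expressed compactly in code. The Python sketch below is illustrative only: it assumes that each pooled RR is available with its 95% confidence interval, that standard errors are recovered from those intervals on the log scale, and that the z score is the difference in log RRs divided by the square root of the summed variances. The text does not state the exact z score formula, so that part in particular is an assumption, as are all names and example values. Visual inspection is a consensus judgment and is not modelled.

```python
import math

Z_CRIT = 1.96  # two-sided 5% level

def se_from_ci(lower, upper):
    """Recover the standard error of log(RR) from a 95% confidence interval."""
    return (math.log(upper) - math.log(lower)) / (2 * Z_CRIT)

def direction(lower, upper):
    """-1 if the interval lies below 1, +1 if above 1, 0 if it includes 1."""
    if upper < 1:
        return -1
    if lower > 1:
        return 1
    return 0

def compare_pooled(rr_rct, ci_rct, rr_nrct, ci_nrct):
    """Apply the 2-fold, consistency and z score criteria to one outcome."""
    ratio = rr_nrct / rr_rct
    twofold = ratio >= 2 or ratio <= 0.5
    # Disagreement if one design shows a significant effect and the other
    # does not, or if both are significant in opposite directions.
    inconsistent = direction(*ci_rct) != direction(*ci_nrct)
    pooled_se = math.sqrt(se_from_ci(*ci_rct) ** 2 + se_from_ci(*ci_nrct) ** 2)
    z = (math.log(rr_nrct) - math.log(rr_rct)) / pooled_se
    return {
        "twofold": twofold,
        "inconsistent": inconsistent,
        "z": z,
        "z_significant": abs(z) > Z_CRIT,
    }

# Hypothetical pooled estimates, not taken from the study results.
result = compare_pooled(0.5, (0.30, 0.83), 1.2, (0.80, 1.80))
```

Keeping the criteria as separate fields mirrors the decision not to prioritize any one method: an outcome can be flagged by one criterion and not by another.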
Results
Identification of breast cancer RCTs
Our database search strategy yielded a total of 4805 candidate abstracts. These were screened by 2 reviewers to identify 15 RCTs satisfying our inclusion criteria (Fig. 1).
Identification of corresponding nRCTs
A varying number of candidate nRCTs were identified (ranging from 173 for outcome 1 to 4315 for outcome 2; Table 1).11–48 The candidate articles were then evaluated by 2 independent reviewers to identify nRCTs addressing each of the topics represented by the previously identified RCTs and to determine whether inclusion criteria were met. The yield of this multistep search is described in Table 1. More detailed flow charts describing article selection for nRCTs for each topic are available on request.
Among the 15 RCTs identified, 3 studies49–51 were excluded from further analysis because no relevant nRCTs were identified. For the comparable RCTs and nRCTs, a range of 1–3 matched outcomes were identified, resulting in 10 outcomes available for comparison (Fig. 1). The clinical topics addressed by these RCT/nRCT combinations included some of the major controversies in breast cancer surgery of recent years, such as breast conserving surgery versus modified radical mastectomy and sentinel lymph node biopsy versus axillary lymph node dissection. Other topics included whether or not preserving the intercostobrachial nerve had substantial impact on sensory outcomes and whether there was a significant difference in lymph node harvest and complication rate with preservation of the pectoralis minor muscle.
Comparative meta-analysis of RCTs and nRCTs by topic
Random effects meta-analysis was performed for each of the 10 outcomes to produce pooled RR estimates for RCTs and nRCTs. Detailed forest plots presenting these individual topic/outcome analyses are presented in Appendix 1, available at cma.ca/cjs. Summarized results showing pooled RCT and nRCT RR estimates for all topics and outcomes are presented in Figure 2.
The summary RRs for each outcome were evaluated for concordance between the nRCTs and RCTs using our 4 methods of comparison (Table 2). On visual inspection of the RR estimates in Figure 2, 6 of 10 comparisons appeared to have generally close agreement, whereas 4 had apparent discrepancies (outcomes 2, 3, 6 and 8). If the formal definition of disagreement is a greater than or equal to 2-fold difference in RR, only 2 of 10 outcomes met the definition for substantial disagreement (outcomes 2 and 3). When failure to demonstrate consistency of the statistical differences was considered, 3 of 10 topics showed disagreement in the directionality or presence of treatment effect (outcomes 3, 5, 6). The z score resulted in no statistically significant differences in treatment effect size.
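The counts reported above can be tallied directly. The short Python sketch below records, for each comparison method, the outcome numbers reported as discordant in this section and derives the proportion of discordant outcomes per method. The variable names are illustrative; the outcome numbers are those given in the text.

```python
N_OUTCOMES = 10

# Outcome numbers reported as discordant under each comparison method.
disagreements = {
    "visual inspection": {2, 3, 6, 8},
    "2-fold difference in RR": {2, 3},
    "consistency of statistical differences": {3, 5, 6},
    "z score": set(),
}

# Proportion of the 10 outcomes flagged by each method.
proportions = {
    method: len(outcomes) / N_OUTCOMES
    for method, outcomes in disagreements.items()
}

# Outcomes flagged by at least one method.
flagged_by_any = set().union(*disagreements.values())

# Outcomes flagged by every method that flagged anything at all.
nonempty = [outcomes for outcomes in disagreements.values() if outcomes]
flagged_by_all_nonempty = set.intersection(*nonempty)
```

The tally shows how the 20%–40% range arises from the choice of metric, and that the methods do not flag identical sets of outcomes.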
Discussion
The importance of evaluating innovative surgical procedures has become increasingly recognized,52 with the role of RCTs compared with nRCTs under debate.3–6 The present study demonstrates that the effect estimates comparing 2 surgical procedures for breast cancer in RCTs and corresponding nRCTs showed clinically important differences in 20%–40% of cases. The proportion of clinically important differences varied depending on which of our metrics was used.
Our study extends and expands on the work of Concato and colleagues,3 Benson and Hartz,4 Ioannidis and colleagues5 and Shikata and colleagues6 as, to our knowledge, it is the first to explicitly examine comparisons of 2 surgical procedures rather than medical versus surgical treatment and to compare RCT versus nRCT studies not previously included in a meta-analysis of breast cancer surgery. Interestingly, our results are similar to previously published results. Ioannidis and colleagues,5 who examined primarily medical topics included in a previously published meta-analysis, found a greater than 2-fold difference in odds ratios (OR) in 33% of outcomes. This result is comparable to our finding of disagreement for 2 of 10 (20%) comparisons in breast cancer surgery. Shikata and colleagues,6 who reported on digestive surgery topics previously studied in meta-analyses, found a greater than 2-fold difference in effect estimates for 7 of 16 (44%) topics.
One of the challenges in this type of study is that an accepted definition of agreement of effect estimates between RCTs and nRCTs does not exist.53 We chose to consider several measures, including visual inspection, magnitude of difference in effect estimates, failure to demonstrate consistency of the statistical differences, and the z score for statistical differences.
Visual inspection of results, although qualitative, is an important first step in data analysis. The results of our study demonstrated discrepancies in 4 of 10 outcomes. A qualitative analysis such as this may lack interobserver reliability, but it does suggest divergent effect estimates between RCTs and nRCTs in a substantial proportion of comparisons.
Examining failure to demonstrate consistency of the statistical differences has an intuitive appeal, as it examines whether nRCTs comparing surgical procedures are identifying beneficial, harmful or ineffective treatments in the same manner as RCTs. Our study demonstrated discrepancies in 3 of 10 outcome comparisons. Although not previously used as a measure of agreement, clear divergence in effect estimates with regard to benefit or harm could impact treatment recommendations.
The final measure of agreement we used was a 2-fold or greater difference in RR. This measure of agreement has been used previously in studies similar to ours and has clear face validity. Our results demonstrated that in 20% of outcomes the effect estimate of nRCTs compared with RCTs differed by more than 2 times. We believe that a discrepancy in effect estimates of this magnitude could influence surgical practice and clinical practice guidelines in a divergent manner.
Limitations
An important caveat to our study, and to the preceding studies by Ioannidis and colleagues5 and Shikata and colleagues,6 is that we present pooled RRs and head-to-head comparisons of RCT and nRCT findings in the presence of heterogeneity across studies. For some outcomes, there was significant heterogeneity of results across studies (see Appendix 1, Figs. S1–S10). We found significant heterogeneity across studies for 2 of 5 RCT meta-analyses that included more than 1 RCT for breast cancer surgery and for 3 of 5 nRCT meta-analyses that we performed. Meanwhile, the study by Ioannidis and colleagues5 revealed significant heterogeneity in 23% of the RCT meta-analyses and 41% of the nRCT meta-analyses. Shikata and colleagues6 noted a similar estimate of heterogeneity (41%) in observational studies. The use of random-effects models partially mitigates this concern. Although it was not the intention of our study to make inferences with regard to efficacy of therapies, the pooling of studies with significant heterogeneity could still be criticized. Such pooling is not standard in rigorous meta-analysis owing to the risk of pooling unequal confounding variables and, thus, is a limitation of our approach.
A potential further limitation to our study was the inclusion of RCTs published only in English and within a recent 5-year time period. A similar time and language limitation was applied to our process of identifying comparable nRCTs. The recent 5-year period was selected in the hope of including higher-quality RCTs, which more likely would have been published recently. The more recent trials were also expected to have a larger number of related nRCTs. The selection of nRCTs involved manual searches and using the PubMed filter for related articles. This strategy may have missed studies that a more exhaustive search would have captured.
Despite these limitations, our study expands on and extends the findings of previous studies because, to our knowledge, we focused on a surgical area not previously assessed by such studies. We extend the work of Shikata and colleagues,6 which included only studies that were previously included in meta-analyses. Most surgical studies have not been included in meta-analyses owing to the small numbers of studies being performed in each topic area, thus our study methods were more inclusive in this respect. We also included a broader definition of agreement/disagreement, focusing on factors important to surgeons in clinical decision-making. We also perhaps reach a different global conclusion than Benson and Hartz,4 whose global conclusions hinted that nRCTs approximate RCTs much of the time and that RCTs may thus not always be needed. Our study findings in the domain of breast cancer surgery underline that effect estimates may differ by more than 2-fold in at least 20% of RCT versus nRCT comparisons.
Conclusion
Although nRCTs often produce comparable effect estimates to RCTs, in 20%–40% of cases, depending on the metric used, the results may differ. As we cannot predict when an nRCT will be misleading, the surgical community should strive to conduct more RCTs of surgical therapies wherever feasible.
Acknowledgements
We thank Dr. Daphne Mew for her contribution as an expert reviewer of the list of breast cancer RCTs. W.A. Ghali is supported by a Canada Research Chair in Health Services Research and by a Senior Health Scholar Award from the Alberta Heritage Foundation for Medical Research.
Footnotes
This paper is based on a previous communication at the Canadian Surgery Forum 2009. A 10-minute podium presentation summarized interim results of our study.
Competing interests: As above for W.A. Ghali; none declared for all other authors.
Contributors: J.P. Edwards, W.A. Ghali and A.J. Graham designed the study. J.P. Edwards, E.J. Kelly and A.J. Graham extracted data, whereas Y. Lin and T. Lenders developed and executed the search strategies. J.P. Edwards, E.J. Kelly and W.A. Ghali analyzed the data. J.P. Edwards and A.J. Graham prepared the manuscript, which E.J. Kelly, Y. Lin, T. Lenders, W.A. Ghali and A.J. Graham reviewed. All authors approved the manuscript submitted for publication.
Accepted November 24, 2010.