Abstract
Background: There is no systematic review of the methodological quality of randomized controlled trials (RCTs) of teaching surgical and emergency skills to undergraduates.
Methods: We searched the Cochrane Collaboration Controlled Trials Register, the Cochrane Database of Systematic Reviews, MEDLINE, EMBASE, ERIC, DARE and the University of Toronto Continuing Medical Education database for RCTs in all languages.
Results: We identified 19 RCTs. Four tested methods of IV access: 1 found intraosseous access faster than the umbilical vein in neonates, and 1 found that one type of intraosseous needle had higher success rates. Two RCTs of intubation skills did not identify a superior technique. One RCT of CPR found video instruction superior to the American Heart Association Heartsaver course. Of 2 RCTs of trauma skills, 1 found no improvement and 1 found improvement only on the day of instruction. One RCT found both computer and seminar training improved epistaxis management. One RCT gave students preoperative anatomy instruction, and they received higher ratings from surgeons. One RCT asked students to study surgical scenarios preoperatively, and they improved their surgical intensive care unit skills. One RCT gave students video and paper-cut instruction of the Whipple procedure; both groups improved, but there were no differences between groups. One RCT taught ureteroscopy and stone extraction and found that groups using low- and high-fidelity bench models improved, compared with the didactic group. Four of 5 RCTs of knot tying showed improvement.
Conclusions: This systematic review assessed the quality of RCTs used in teaching undergraduates surgical and emergency skills. There are many positive study outcomes, but there are significant methodological weaknesses in the study design. Students varied in their skills, and most did not demonstrate optimal performance in any of the procedures. This review provides a baseline for further work important to both medical education and clinical practice.
There are several randomized controlled trials (RCTs) of teaching undergraduates fundamental surgical and emergency skills, intended to assess whether the teaching of these skills can be improved, but there is no systematic review that evaluates the quality and findings of these trials. We used the international Quality of Reporting of Meta-Analyses (QUOROM)1 statement to assess the methodological quality of these RCTs, possible sources of bias and whether the conclusions drawn by the authors can be relied on by surgical teachers to improve the teaching of undergraduates.
Methods
Literature search
For RCTs in all languages, we searched the Cochrane Controlled Trials Register, using the term “medical student.” We searched MEDLINE using the following terms: “medical students” and “randomized controlled trials” or “systematic reviews,” “meta-analysis,” “crossover studies,” “intervention studies,” “Latin squares,” “factorial,” “multicentre studies,” “cohort studies,” “prospective studies” or “longitudinal studies” (and spelling variations of these terms). We used similar terms to search DARE, EMBASE, the University of Toronto Continuing Medical Education database and ERIC.
Study selection
Two reviewers independently assessed whether the study was an RCT and taught surgical or emergency procedures to medical students. We excluded any study where the outcomes for medical students could not be separated from other health professional groups, or where a common fundamental surgical or emergency procedure was not taught. Surgical procedure simulators and virtual reality simulators are generally not used to teach fundamental procedures to undergraduates worldwide, thus we excluded RCTs of simulators.
Validity assessment
All studies that appeared from their titles or abstracts to be RCTs, or where the abstract did not reveal a decision about the study design, were evaluated by independent assessment of the full text of each study.
RCTs were categorized according to the criteria of the Cochrane Collaboration Reviewers’ Handbook2 as having low, moderate or high risk of bias according to methodological strength. We based our estimate of bias on the 4 Cochrane criteria for minimizing bias:
1) Selection bias. We assessed the study as being at low risk of bias if participants were randomly assigned to experimental or control groups. We assessed whether randomization was concealed from the experimenters.
2) Performance bias (inadequate delivery of the intervention). We noted whether a process analysis was performed, to assess whether the interventions were fully delivered to all participants according to the study protocol. We also assessed whether membership in the intervention or control groups was blinded to the participants and experimenters.
3) Attrition bias. If an attrition analysis was not performed, or if known biasing effects of attrition were not adjusted for in the analysis, attrition bias was considered likely, and the study was considered to be at moderate risk of bias.
4) Ascertainment bias (if studies did not use the same methods of ascertainment for both experimental and control groups). We also assessed whether ascertainment of outcomes in the intervention or control groups was blinded to the experimenters.
We assessed 3 additional aspects of study design that affect the quality of RCTs and that are also common problems in the field of medical education studies:
5) Inadequate sample size. If the results for the key hypotheses were statistically significant, the study was assessed as at low risk of bias for type II error, even if the study did not have a power computation. If the results were negative and there was no power computation, we assessed the study as at risk of type II error.
6) Intention-to-treat analysis. If the authors did not plan an intention-to-treat analysis and there was no attrition analysis showing that loss of subjects from the experimental and control groups did not affect the outcomes, we assessed the study as being at risk of overestimating the effects of interventions.
7) Statistical bias. Studies that randomize by cluster (group) but analyze at the level of the individual are at risk of drawing false positive conclusions because part of the outcome may be due to discussions between class members; the cluster, not the individual, is then the sampling unit. Failure to take account of clustering and the size of intraclass correlations may lead to inadequate sample size and the risk of drawing false nonsignificant conclusions (type II error).3–6 We assessed studies as at moderate risk of bias if they did not control for clustering.
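Criteria 5 and 7 involve two standard computations that the reviewed trials largely omitted: an approximate power calculation and the design-effect correction for cluster randomization. The sketch below, in Python with illustrative numbers only (the function names and the sample sizes are ours, not drawn from any reviewed study), shows the standard normal-approximation forms of both.

```python
import math
from statistics import NormalDist

_norm = NormalDist()  # standard normal distribution


def two_group_power(d: float, n_per_group: int, alpha: float = 0.05) -> float:
    """Approximate power of a two-sided, two-sample z-test for a
    standardized difference d (Cohen's d), normal approximation."""
    z_crit = _norm.inv_cdf(1.0 - alpha / 2.0)
    noncentrality = d * math.sqrt(n_per_group / 2.0)
    return _norm.cdf(noncentrality - z_crit)


def design_effect(cluster_size: float, icc: float) -> float:
    """Design effect for cluster randomization: DEFF = 1 + (m - 1) * ICC.
    Dividing the nominal n by DEFF gives the effective sample size."""
    return 1.0 + (cluster_size - 1.0) * icc


# Illustrative numbers: 20 students per arm, moderate effect (d = 0.5).
power = two_group_power(0.5, 20)  # roughly 0.35 -- badly underpowered

# Teaching in clusters of 8 students with an intraclass correlation of 0.1
# inflates the variance, shrinking the effective sample size.
deff = design_effect(8, 0.1)      # 1.7
effective_n = 20 / deff           # about 12 students per arm
```

The point of criterion 5 follows directly: with 20 students per arm, a moderate effect has only about a one-in-three chance of reaching significance, so a nonsignificant result in such a trial says little; criterion 7 shows why clustering shrinks that chance further.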
The Cochrane Reviewers’ Handbook2 recommends the following approach to summarizing the risk of bias in RCTs:
Low risk – plausible bias unlikely to seriously alter the results; all of the criteria met.
Moderate risk – plausible bias that raises some doubt about the results; 1 or more criteria partly met.
High risk – plausible bias that seriously weakens confidence in the results; 1 or more criteria not met.
The Handbook states the following:
The relationships suggested above will most likely be appropriate if only a few assessment criteria are used and if all the criteria address only substantive, important threats to the validity of study results.2
Based on our assessment that these 4 sources of bias and 3 additional aspects of study quality might threaten the validity of a study, studies were assigned to 3 categories: low risk, moderate risk and high risk of bias. In synthesizing the results, conclusions were based on those with low or moderate risk of bias.
Data were independently extracted by the 2 reviewers, and discussion continued until agreement was achieved.
Based on considerable heterogeneity in study design, intervention, outcome measures and statistical reporting, we determined that quantitative synthesis was not appropriate, and we used a narrative systematic review.
Results
Trial flow
We identified 88 potential RCTs. On examination of the full text, 21 were excluded because they were not RCTs7–27 and 48 because they were either not on the topic of learning procedures in surgery and emergency medicine or the outcomes for medical students could not be identified.28–75 Nineteen RCTs remained for evaluation (Table 1).76–94 No replications of RCTs were identified (see Fig. 1 for QUOROM flow chart).
Methodological quality
Strengths
The strengths of these 19 RCTs are as follows: a) complete delivery of the experimental stimuli, because the interventions were closely supervised by faculty; b) minimal drop-outs, because the students completed the experiments as part of required university courses (although no study planned an intention-to-treat analysis, most accomplished one because of these high completion rates); and c) although most studies did not blind students, researchers or assessors, most of the outcomes were performance times, success in a procedure, or scores on a multiple choice questionnaire (MCQ); thus the absence of blinding might have introduced little bias, except where unblinded surgeons assessed the quality of work using personal judgement. Table 2 summarizes each study’s methodological quality.
Weaknesses
The weaknesses of the RCTs are as follows: a) although all 19 were described in the text as RCTs, only 5 described the actual method of randomization (Abe et al76 used a coin toss, Ali et al77 alternated removing names from a box, the students in the study by Carr et al78 chose from face-down cards, Matsumoto et al79 chose by candidate number and Todd et al80 chose from a table of random numbers). Only Abe and colleagues76 and Todd and colleagues80 used a strong method of randomization. Only From and colleagues81 and Todd and colleagues80 concealed randomization, only Matsumoto and colleagues79 reported a power computation, and Carr and colleagues78 reported a post hoc computation. None of the studies blinded participants; 4 blinded instructors (From et al,81 Matsumoto et al,79 Rogers et al,95 Todd et al80), and 4 blinded assessors (From et al,81 Hong et al,82 Matsumoto et al,79 Todd et al80). None of the studies described possible cointerventions during the study period (these would have been improbable during brief experiments but likely during rotations lasting up to 12 wk). If students were assigned to groups, it was difficult from the descriptions to assess how much communication between students occurred during the interventions and, thus, how much of the result was due to learning from fellow students in addition to the program. Based on the description of how groups operated, it appears that 7 did not allow for the effects of clustering in the statistical analysis (Carr et al,78 From et al,81 Gilbart et al,83 Rogers et al,84,85 Summers et al,86 Todd et al80). In assessing whether these RCTs are subject to bias, failure to deliver the intervention would be minimal because of the close observation and certainty that the interventions were delivered to all participants; bias due to attrition would also be minimal because nearly all participants completed the experiments.
Weight should be given to the low incidence of blinding and concealment, the absence of power computations (inadequate sample size could be the reason for nonsignificant results) and failure to adjust for clustering in the statistical analysis (Table 3).
Previous research
Four authors state that they identified no previous studies to guide their research (From et al,81 Hong et al,82 Matsumoto et al,79 Talan et al88); 5 conducted literature searches and analyzed the studies to guide their own research design (Carr et al,78 Rogers et al,84,93,95 Summers et al86); 9 cited studies but did not build on them to improve the design or execution of their studies, or cited them only in the concluding discussion section (Abe et al,76 Ali et al,77 Gilbart et al,83 Jun et al,89 Mann et al,91 Petroianu et al,90 Rogers et al,85 Rogers et al,94 Todd et al80). One author (Engum et al87) cited none of the relevant previous RCTs identified in our review. None of the studies had a section titled “literature search,” none stated which databases were searched or the search terms used, and none mentioned whether they consulted a health sciences librarian or expert to identify studies.
Although medical students vary in their skills and application, no author described difficulties or ease in instruction; thus we can learn nothing from these studies about which aspects of the procedures were more difficult for some students, how to help students who encounter problems, or whether the instruction technique for these procedures requires modification or improvement to help specific students.
Outcome measures
Most of the outcome measures had face validity (Table 2). When authors used scales, few described their scales in sufficient detail or reproduced them in the text so that readers could understand each step in the learning process.
As outcome measures, Abe and colleagues,76 Jun and colleagues89 and Talan and colleagues88 used time taken and success in identifying the vein, and Engum and colleagues87 described their success in accessing the vein. Petroianu and others90 measured time to intubation and success, and Todd and others80 assessed intubation on a 5-point scale from 1 (not competent) to 5 (outstanding).
Researchers who used objective structured clinical exam (OSCE) stations used OSCE scores; Gilbart and colleagues,83 Rogers and colleagues94 and Ali and colleagues77 used trauma evaluation and management (TEAM) protocol scores to assess the adequacy of advanced trauma life support (ATLS) learning; Hong and colleagues82 used surgeons’ performance ratings, and Mann and colleagues91 used scores of understanding the anatomic steps in the Whipple procedure. Rogers and colleagues95 evaluated students’ knot tying by independent evaluation of videotapes by 3 surgeons, using a 7-point rating scale. Three other studies93,84,85 used independent evaluations, with 3 surgeons using a 24-point scale.
Several authors used multiple aspects of evaluation: From and colleagues81 rated airway management skills from 4 (excellent) to 1 (inadequate) for overall mask airway skill, overall intubation skill, tooth pressure, initial tube placement and efficient ventilation after placement. Evaluators classified patients from 1 (easy to intubate) to 4 (difficult to intubate) and from Mallampati class I (soft palate, fauces, uvula, pillars visible) to class IV (soft palate not visible). Carr and colleagues78 assessed knowledge of the technique of anterior nasal packing with a 17-item short answer test and performance on a model assessed by a 16-item checklist; Matsumoto and colleagues79 measured skills in removal of a midureteral stone with a semirigid ureteroscope and a basket by a global rating scale, a checklist, a pass rating and the time needed to complete the task. Summers and colleagues86 rated each suturing technique on instrument handling, body position, accuracy, tightness, alignment and time to perform. Students also took a 50-item multiple choice test.
Only 2 authors used evaluative terms to describe the students’ performance: Todd and colleagues80 described intubations as not competent to outstanding, and From and colleagues81 described airway management skills as inadequate to excellent. It was not the stated purpose of any author to set standards for undergraduate achievement but, rather, to find more efficient ways to teach undergraduates. The setting of ideal and minimal safe scores is a topic for further research.
Overall risk of bias
In assessing the overall risk of bias in this group of 19 RCTs, we estimated that those studies that obtained significant results (thus not subject to type II error) are at low risk if the study randomized participants to individual tasks rather than to clusters and at moderate risk if clustering in the sample could have contributed to the outcomes and the effects of clustering were not adjusted for.
There were 2 RCTs of IV access. Engum and colleagues87 found no differences between groups who learned on a computer simulator and by self-study, and Talan and colleagues88 found no differences in success at IV access by cutdown at the cephalic or saphenous vein.
There were 2 RCTs of interosseous access. Abe and colleagues76 found IV access in neonates was faster with the intraosseous route than the umbilical vein. Jun and colleagues89 found that, after training, intraosseous access was more successful with the Cook Sur-Fast screw-tipped needle than with a standard bone marrow needle.
Because most of the outcomes were performance times, the absence of concealment and blinding might not have biased these results. None of these 4 RCTs performed a power computation. The absence of significant results and of a power computation in Engum and colleagues87 and Talan and colleagues88 places these 2 RCTs at moderate risk of bias from type II error.
There were 2 RCTs of airway management. From and others81 found no differences in success at intubating patients undergoing general anesthesia between the groups who took the American Heart Association self-study course and those who received a 1-hour lecture by anesthesiologists. Petroianu and others90 found no differences with the Trachlight between the nasal or oral route in time to intubation or success by the tenth attempt to intubate. The absence of significant results and a power computation in From and colleagues81 and Petroianu and colleagues,90 and failure to allow for a clustered design in From and colleagues,81 place both at moderate risk of bias.
There was 1 RCT of CPR. Todd and others80 found that those who took a 34-minute self-instruction CPR video were more competent when tested 100 days later than were those who took the American Heart Association Citizen Heartsaver CPR course A. The absence of statistical allowance for the clustered design places Todd and colleagues80 at moderate risk of bias.
There were 2 RCTs of trauma assessment. Ali and colleagues77 found that students who took the TEAM trauma evaluation and management program had a statistically significant improvement in scores, compared with the control groups (p < 0.0001) on the day of instruction, but no longer-term follow-up was undertaken. Gilbart and others83 found no differences on trauma or nontrauma OSCE scores between groups who received computer and seminar instruction. Because of the absence of significant results, the absence of a power computation and the failure to statistically adjust for the cluster design, the study by Gilbart and others83 is at moderate risk of bias.
There were 10 RCTs of surgical technique. Carr and colleagues78 found that the groups that received instruction on anterior nasal packing for epistaxis by self-instruction on a computer or by face-to-face seminars had higher scores, compared with their pretests (p < 0.05). Hong and others82 found that students who viewed interactive anatomy software about 2 operations they were to witness had higher test scores. Mann and others91 found both groups of students who were taught about the Whipple procedure, either by watching a video or by cutting out paper shapes, improved in their understanding of anatomic relations, but there were no differences between the groups. The validity of the questionnaire used in determining test scores was not tested, but the authors had piloted the approach in a previous study of inguinal hernia repair.92
Matsumoto and colleagues79 found that groups that practised ureteric stone extraction with either an inexpensive or an expensive (high fidelity) model supervised by a urologist had significantly higher scores than a group that received didactic instruction. Rogers and colleagues94 found that students who studied clinical scenarios before their surgical intensive care elective improved their average test scores (p < 0.0001) on 3 of the OSCE stations (intubation, ventilator, hypotension) but not on the stations for airway, breathing and circulation or pulmonary artery data.
Rogers and others95 tested several methods of teaching students to tie knots and found that a) a group that was shown the correct method and errors in tying knots improved their scores (p < 0.01), but groups shown no errors and only correct methods and those shown only errors did not improve95; b) students who took a computer-assisted teaching session with individualized feedback from surgical faculty had greater improvement than did those who only received computer instruction (p < 0.001)93; c) students who practised in groups of 6 to 8 students, with each using an individual computer, and those in pairs using 1 computer and giving each other feedback both improved in the proportion of correctly tied knots (p < 0.001). The authors concluded that peers could not substitute for expert faculty in this exercise.84 A computer and a lecture group had no significant differences in the proportions of correctly tied square knots or in the average time per knot.85 Summers and colleagues86 compared computer, videotape and didactic groups and found that the computer group had more complete knots at both immediate (p < 0.01) and 1-month follow-up (p < 0.01) than did the didactic or videotape groups, and the didactic group scored higher on MCQs both times.
Because of the absence of significant results and failure to conduct a power computation, the study by Mann and colleagues91 is at moderate risk of bias. Because of the failure to correct statistically for the clustered design, 5 others (Carr et al,78 Rogers et al,84,85 Summers et al,86 Todd et al80) are at moderate risk of bias.
Discussion
The interventions in this review were developed by practising academic surgeons to enhance teaching and learning in existing rotations. The interventions were thoughtfully designed, but the execution of some of the trials exposed them to the risk of bias. It was not the purpose of the researchers to establish ideal or safe minimum scores. It remains for further research to establish whether more instruction, more practice or both would enable students to achieve higher average scores and, presumably, greater procedural proficiency.
Norman96 notes that controlled trials in medical education often have a large unexplained variance in their results. He advocates the model of psychological research whereby the circumstances of the experiment are tightly controlled, and factors that might contribute to the results are systematically varied over a series of experiments based on a theory of causation, so that the effective causes are understood. This model could be applied to this field of study.
Educators and researchers could correct the design and execution problems noted in these RCTs. Improvements could include making a power computation; concealing randomization by blinding students, instructors and assessors; evaluating the different components of the interventions to assess which need strengthening; psychometric analysis of outcome measures to optimize their reliability and validity; assessing which aspects of instruction improve outcomes; describing and correcting problems in the learning process; identifying which students have difficulty with which aspects of the learning process; and correcting for the effects of clustering in the statistical analysis.
This review has described approaches to manageable research questions (i.e., how to improve and measure the ability of students to tie square knots), but there are no reported RCTs on many common surgical procedures, such as foreign body extraction, wound assessment, common fractures or burns. Communication between surgical and primary care program directors could begin a planning process to conceptualize how to improve the learning of surgical and emergency skills and to identify the high-priority procedural skills requiring educational research and a specific curriculum. Consortia of programs or medical schools could cooperate to achieve the larger sample sizes.
Conclusion
RCTs of teaching undergraduates surgical and emergency skills have many positive study outcomes, but there are significant methodological weaknesses in study design. This systematic review provides a baseline for further work important to both medical education and clinical practice.
Footnotes
Competing interests: None declared.
- Accepted October 21, 2005.