If a court fails to convict a defendant because of incomplete evidence, does that establish his innocence beyond doubt? Not necessarily. Indeed, in Scotland, if sufficient uncertainty remains, the court can give a verdict of “not proven” instead of “not guilty”. If a randomised controlled trial (RCT) fails to show a significant difference between the treatment and the control group, does that prove that the treatment has no useful clinical effect? Again, not necessarily. The treatment may work, but the trial may have been unable to prove it.1 Despite this, many such “negative” trials,1 2 including many published in this journal, may wrongly be taken as evidence that the treatment is not clinically useful.

For example, in an RCT of women at risk of preterm delivery that was not published as a full report,3 respiratory distress syndrome (RDS) occurred in three of 23 babies born to the treated group and three of 22 babies born to the untreated group. The difference is not significant (2p > 0.9). Had this been the first and only study of this treatment, many people might have decided that it was not effective and thus lost interest. In fact, overviews of this,3 and at least 14 other trials, eventually showed that the treatment—antenatal steroids—is highly effective because it reduced RDS and neonatal mortality in over 3500 preterm infants by about half.4 5 Note that the results of the single trial were quite consistent with this finding.3 The correct conclusion from that single trial is not that antenatal steroids do not work, but that the trial lacked sufficient power to detect anything but the most spectacular treatment effect. About half of all RCTs reported in *Archives of Disease in Childhood* between 1982 and 1996 recruited fewer than 40 children in total.6 Trials as small as this lack the power to detect moderate treatment effects and carry a significant risk of false negative results.6

This is easier to see if trial data are presented with a point estimate of the effect, such as a relative risk or an odds ratio, and a measure of precision, such as a confidence interval (CI). If a treatment truly has no effect, the probability of a poor outcome should be the same for treated and untreated patients, so the relative risk and the odds ratio will each tend to be about 1. In the example just cited,3 the odds ratio for RDS *v* no RDS between treated and untreated groups is 0.95 (3/20 divided by 3/19), and the 95% CI around it ranges between 0.17 (a reduction of 83%) and 5.21 (an increase of 421%). So, although the odds ratio is close to 1, this particular trial rules out neither a substantially beneficial nor a substantially harmful effect because the CI is wide. An overview of all 15 trials gives an odds ratio for the effect of antenatal steroids on RDS of 0.53,5 with a much narrower 95% CI (0.44 to 0.63). In other words, it suggests that treatment with antenatal steroids is likely to reduce the odds of RDS by between 37% and 56%, an unequivocally substantial benefit, which is highly significant.
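The arithmetic in this example can be reproduced with the standard log-odds (Woolf) approximation for the CI of an odds ratio. This is a sketch, not necessarily the exact method used in the published overview, so the upper limit it yields (about 5.3) differs slightly from the published 5.21 while the point estimate and lower limit agree.

```python
import math

def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio (a/b)/(c/d) with an approximate 95% CI on the log-odds scale."""
    or_ = (a / b) / (c / d)
    se = math.sqrt(1/a + 1/b + 1/c + 1/d)   # SE of the log odds ratio (Woolf)
    lo = math.exp(math.log(or_) - z * se)
    hi = math.exp(math.log(or_) + z * se)
    return or_, lo, hi

# Steroid trial: 3 RDS / 20 no RDS (treated) v 3 RDS / 19 no RDS (untreated)
or_, lo, hi = odds_ratio_ci(3, 20, 3, 19)
print(f"OR = {or_:.2f}, 95% CI {lo:.2f} to {hi:.2f}")
```

The interval spans roughly 0.17 to 5.3: the data are compatible with anything from a large benefit to a large harm, which is the point the text makes.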

When should readers conclude that a treatment really is not clinically useful? Again, a CI is helpful, and surprisingly large numbers may be needed. In the fourth international study of infarct survival (ISIS-4), 58 050 patients with suspected myocardial infarction were randomly allocated to intravenous magnesium sulphate or placebo.7 There were 2216 deaths and 26 795 survivors in the treated group and 2103 deaths and 26 936 survivors in the placebo group, a difference that gives an odds ratio for increased mortality with magnesium of 1.06, with a 95% CI of 1.00 to 1.13 (2p = 0.07). In other words, magnesium, at least as it was given in this particular study, was not effective because it was unlikely to reduce mortality (and may even have increased it by up to 13%). Similarly, readers can only reliably conclude that two active treatments are equivalent—or that any difference between them is too small to be clinically important—when the sample is large enough.8
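The ISIS-4 figures can be checked the same way. Again this uses the simple log-odds approximation as a sketch rather than the trial's own analysis, but it reproduces the published interval to two decimal places.

```python
import math

# ISIS-4: 2216 deaths / 26 795 survivors (magnesium) v 2103 / 26 936 (placebo)
a, b, c, d = 2216, 26795, 2103, 26936
or_ = (a / b) / (c / d)
se = math.sqrt(1/a + 1/b + 1/c + 1/d)        # SE of the log odds ratio
lo = math.exp(math.log(or_) - 1.96 * se)
hi = math.exp(math.log(or_) + 1.96 * se)
print(f"OR = {or_:.2f}, 95% CI {lo:.2f} to {hi:.2f}")  # 1.06 (1.00 to 1.13)
```

With over 58 000 patients the interval is narrow enough to exclude any worthwhile benefit, which is what justifies the conclusion in the text.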

How can researchers design RCTs powerful enough to show that no clinically important differences exist between treatment and placebo or between two active treatments? This requires prior estimation of appropriate sample sizes, which may require consultation with a statistician, but can easily be done for dichotomous outcomes (for example, survival or death) using software such as Epi Info.9 This software package allows calculation of relative risks, odds ratios, and 95% CIs, and can be downloaded free of charge from the internet (http://www.soton.ac.uk/∼medstats/epiinfo/). Calculating sample sizes when the outcome is a continuous variable (for example, blood pressure or length of stay) is more complicated and will almost certainly require consultation with a statistician. It may be added that the “null hypothesis”, to the effect that a treatment difference is *exactly* equal to 0 or a relative risk or an odds ratio *exactly* equal to 1, is often neither plausible nor interesting. Far more important is the question of whether the size of the treatment effect is large enough to be of clinical interest, or small enough to be ignored. A conventional significance test (p value) cannot provide this information; only a range that covers the true value of the treatment difference with known confidence can do so.
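For a dichotomous outcome, the usual normal-approximation sample size formula is short enough to compute directly. The event rates and power below are illustrative assumptions, not figures from any of the trials discussed.

```python
import math
from statistics import NormalDist

def n_per_group(p1, p2, alpha=0.05, power=0.80):
    """Patients per arm to detect event rates p1 v p2 (two-sided alpha),
    using the standard normal approximation for two proportions."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)   # e.g. 1.96 for alpha = 0.05
    z_b = NormalDist().inv_cdf(power)           # e.g. 0.84 for 80% power
    p_bar = (p1 + p2) / 2
    n = (z_a * math.sqrt(2 * p_bar * (1 - p_bar))
         + z_b * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2 / (p1 - p2) ** 2
    return math.ceil(n)

# e.g. to detect a halving of a poor-outcome rate from 20% to 10% (hypothetical)
print(n_per_group(0.20, 0.10))  # about 200 patients in each arm
```

Even a seemingly large effect, halving a 20% event rate, needs about 400 patients in total; detecting more moderate effects requires far more, which is why so many small trials are inconclusive.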

Many investigators report in their tables of results two columns of means or percentages for the control and treated arms of the trial. In the former case, standard deviations, standard errors, or confidence limits for each column are commonly included. In fact, the quantities of interest to the reader are the differences between the two columns (or odds ratios for percentages), and these should always be shown with their standard errors or confidence limits. This is especially important when the data involve pairing or matching of treated and control subjects, as in crossover studies, because then the precision of the difference cannot be derived from the individual standard deviations.
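The point about paired designs can be illustrated with a toy crossover-style data set (the numbers are invented): the standard error of the mean difference must come from the within-pair differences, and when measurements on the same subjects are positively correlated it is far smaller than the value wrongly pieced together from the two columns' separate standard errors.

```python
import math
from statistics import stdev

# Hypothetical paired measurements on the same five subjects
control = [10, 12, 14, 16, 18]
treated = [9, 11, 13, 16, 17]
n = len(control)

diffs = [t - c for t, c in zip(treated, control)]
se_paired = stdev(diffs) / math.sqrt(n)           # correct: from within-pair differences

se_naive = math.sqrt(stdev(control) ** 2 / n
                     + stdev(treated) ** 2 / n)   # wrong for paired data: ignores pairing

print(f"paired SE = {se_paired:.2f}, naive SE = {se_naive:.2f}")
```

Here the paired standard error is about a tenth of the naive one, so deriving the precision of the difference from the individual standard deviations would grossly understate what the trial actually showed.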

The presentation of trial results has important implications for readers, authors, editors, referees, and patients. Wrongly discounting treatments as ineffective will deprive patients of better care. Wrongly accepting treatments as effective exposes patients to needless risk and wastes resources. We can all help to address these problems by expecting, and routinely including, CIs or other measures of the precision of estimates of outcome in trial summaries and reports, and by stating whether and how the sample size was calculated in advance.10 These measures have been recommended in the CONSORT statement,11 which *Archives of Disease in Childhood* has endorsed (see editors’ note in reference 6). We can also design and support larger trials with the power to detect realistically moderate, rather than overoptimistically large, effects of treatment.6 12 Increasingly, such trials will require multicentre collaboration and should be simple, so that busy centres can contribute without taking on too great a burden of extra work.

## Authors’ note

The demand in the CONSORT guidelines11 that clinical trial reports should count and characterise all patients not included in the trial imposes further work on busy participants and has been criticised as being frequently of little value and often impossible.13 It seems more important to describe key characteristics of the patients when randomised into the trial and report outcomes in prespecified subgroups, so that the results can be generalised to other patients with similar characteristics.

## Acknowledgments

We thank Richard Peto and the anonymous referee for helpful comments. The Perinatal Epidemiology Group is part of the Medical Research Council Health Services Research Collaboration.
