Part 9 — Why Size Really Does Matter
A common failing of clinical research studies in the pain management field is having too few subjects enrolled and/or insufficient effect sizes to yield significant and valid results. An antidote to this is called “power analysis,” performed during study design; yet, relatively few researchers report having done this or they sometimes proceed with trials despite having inadequate power. An understanding of statistical power, and its limitations, can help consumers of research appreciate why small-scale studies are often of little value for clinical decision-making purposes.
Clinical research in pain medicine can be costly in terms of time, effort, and expense, as well as incurring burdens and potential risks for study subjects. So, a goal of most research designs is to consume the least amount of resources and involve the smallest number of subjects to still achieve valid outcomes. But how large does a study need to be (or, more to the point, how few subjects are enough) to yield statistically significant and clinically important effects? This is a question that a power analysis can answer, at least to a considerable extent.
Pondering the Parameters of Power
In clinical investigations comparing two or more groups, a statistical power analysis allows researchers to determine in advance how many subjects they need to enroll for detecting a certain effect size as being statistically significant, such as a meaningful reduction in pain scores. In other words, power helps to determine the chances of detecting a true difference between groups if one actually does exist.
One danger in studies with relatively few subjects (that is, having low power) is concluding that there is a significant difference between groups (eg, that a new treatment is beneficial) when such an effect actually does not exist. As was discussed in Part 4 of this Series [here], this is called a “Type I error,” or “false positive” outcome.
On the other hand, with too few subjects in each group, the power of the study may be too low to identify important differences that truly do exist. This sort of “false negative” outcome is called a “Type II error” (see Part 4 of this Series). It is saying that any observed effects were purely due to chance when, in actuality, true effects of possible significance might have been found with a more adequate research design.
A basic principle of power is that the smaller the expected effect size, the more subjects need to be studied to detect a statistically significant outcome; conversely, for larger effect sizes far fewer subjects might be adequate. A number of factors, or parameters, enter into power calculations, which are performed these days by statistical software packages. While readers do not need to know how to do the computing, a basic understanding of the components is essential for critiquing research designs and outcomes.
- To begin, researchers choose the level of certainty for statistical tests of significance [as discussed in Part 4 of this Series here]. Generally, this is set at P ≤ 0.05, which allows for up to a 5% probability that a particular effect size, such as a difference between group means on an outcome, is due merely to chance and not a true difference. This also is saying the researchers will allow up to a 5% chance of making a Type I error, or falsely concluding that an effect is statistically significant when it is not.
- Next, they must select the statistical power for the study. Customarily, this is set at 0.80, which is saying that up to a 20% chance of making a Type II, or false negative, error would be acceptable; that is, rejecting an outcome, considering it non-significant, when it actually is significant.
Technically, the 20%, or 0.20, also is called “β” (beta), the probability of making a Type II error, and power = 1.0 minus β. In this sense, power is the probability of NOT making a Type II error.
Stated another way, statistical power is the likelihood that a study will detect an effect when there is a true effect present to be detected. As the statistical power increases, the probability of making a Type II error — concluding there is no true effect when, in fact, there is one — goes down.
- Then, given the parameters for statistical significance and power, such as those above, there are 3 further data factors that enter into the calculation for achieving adequate power to improve the likelihood of observing statistically significant outcomes. These also may be described in terms of a target-shooting analogy, as introduced [here] in Part 8 of this Series:
- Increase the sample size (ie, more subjects), which is analogous to increasing the number of shots taken at a target;
- reduce the variability (eg, Standard Deviation) of the outcome measures — akin to decreasing the spread of the shots, such as using a rifle rather than a shotgun, so more shots are on target;
- increase the magnitude of the observed treatment effect, perhaps by modifying study characteristics, which is like increasing the size of the target.
These 3 parameters, together with power and the P-value, form a closed system — if any one of them is modified it will affect the others and the potential for observing outcomes as being statistically significant. A purpose of power analysis is to find an appropriate balance among these factors by taking into account the substantive goals of the study and the resources available to the researcher (eg, how many subjects can be recruited and enrolled).
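The interplay among these parameters can be made concrete with a small calculation. The following is only a minimal sketch, assuming a two-sided comparison of two group means and a normal approximation (real statistical packages use more exact methods); the function names and the pain-score numbers (a 2-point difference, a Standard Deviation of 4, 30 subjects per group) are hypothetical, chosen purely for illustration.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution (mean 0, SD 1)


def standardized_effect(mean_difference, standard_deviation):
    """Effect size: size of the 'target' relative to the 'spread of the shots'."""
    return mean_difference / standard_deviation


def approx_power(n_per_group, effect_size, alpha=0.05):
    """Approximate power for a two-sided, two-group comparison of means.

    Normal approximation (an assumption of this sketch); the tiny chance of
    a significant result in the wrong direction is ignored.
    """
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    expected_z = effect_size * sqrt(n_per_group / 2)
    return STD_NORMAL.cdf(expected_z - z_crit)


# Hypothetical starting point: 2-point difference, SD of 4, 30 per group.
baseline = approx_power(30, standardized_effect(2.0, 4.0))
more_subjects = approx_power(60, standardized_effect(2.0, 4.0))   # more shots
less_spread = approx_power(30, standardized_effect(2.0, 3.0))     # rifle, not shotgun
bigger_effect = approx_power(30, standardized_effect(3.0, 4.0))   # bigger target
stricter_alpha = approx_power(30, standardized_effect(2.0, 4.0), alpha=0.01)

for label, value in [
    ("baseline", baseline),
    ("more subjects", more_subjects),
    ("less spread", less_spread),
    ("bigger effect", bigger_effect),
    ("stricter alpha", stricter_alpha),
]:
    print(f"{label:<15} approximate power = {value:.2f}")
```

Under these assumptions, each of the three data factors raises power when it is moved in the favorable direction, while tightening the significance level lowers it, which is the "closed system" behavior described above.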
In their articles, researchers should report on the considerations that went into their power analysis during study design and the parameters they decided upon. If this is not described in the published article, it cannot be assumed that any power analysis was conducted or, in many cases, that the study was sufficiently large at the outset to achieve statistically significant and valid outcomes.
Example: Hypothetically, in an article describing a trial comparing Drug X with placebo, the authors might state: “Power analysis based on mean differences observed in an earlier clinical trial of Drug X suggested that 64 patients in each group would be needed to detect a medium effect size (0.50) as significant at the P<0.05 level (per a two-sided t-test of significance) with 80% power.” Therefore, they randomly divided 128 subjects into two groups: active treatment and placebo.
Keeping all other factors the same, increasing the sample size to 121 subjects per group would have allowed more stringent criteria; P=0.01, power=90%. Or, if the researchers only were interested in detecting a large effect size of about 0.80 — with P=0.05, power=80% — then 26 subjects per group could have been sufficient. So there are a number of ways of adjusting study parameters to achieve desired goals and finding significant outcomes.
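For readers curious about where such numbers come from, the sketch below approximates the per-group sample size for a two-sided comparison of two means. It is not the method used by any particular software package: it assumes a normal approximation plus a commonly cited small-sample adjustment for the t-test (adding one quarter of the squared critical value), and the function name `n_per_group` and its `t_adjust` switch are hypothetical.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(effect_size, alpha=0.05, power=0.80, t_adjust=True):
    """Approximate subjects needed per group (two-sided, two-group means).

    effect_size is the standardized difference (mean difference / SD).
    t_adjust adds a small correction often used to mimic a t-test; both the
    approximation and the correction are assumptions of this sketch.
    """
    norm = NormalDist()
    z_alpha = norm.inv_cdf(1 - alpha / 2)   # critical value for significance
    z_beta = norm.inv_cdf(power)            # value corresponding to desired power
    n = 2 * (z_alpha + z_beta) ** 2 / effect_size ** 2
    if t_adjust:
        n += z_alpha ** 2 / 4
    return ceil(n)


medium_effect = n_per_group(0.50)                           # P=0.05, power=80%
stricter = n_per_group(0.50, alpha=0.01, power=0.90)        # P=0.01, power=90%
large_effect = n_per_group(0.80)                            # P=0.05, power=80%

print("medium effect:", medium_effect)
print("stricter criteria:", stricter)
print("large effect:", large_effect)
```

With these assumptions the sketch gives 64, 121, and 26 subjects per group, in line with the figures in the example. Without the adjustment the plain normal approximation comes out about one subject lower in each case, a reminder that such calculations are estimates rather than exact requirements.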
At best, however, power analyses suggest minimum sample sizes, based on certain parameters (some of which may be guesses as to what will be observed in a study), to achieve statistically significant outcomes. Small studies that are overtly underpowered might be easily dismissed as invalid; however, it should be appreciated that extremely small sample sizes might in some cases achieve the designated power but still be of little value for either research or clinical purposes, which is sometimes the case with published articles in the pain research literature.
To reiterate the above concepts, it is common for researchers to calculate required group sizes based on a power of at least 80% and a P-value of 0.05. This provides for an 80% likelihood of reporting a statistically significant effect size between groups when one actually does exist. However, there is still a 20% chance of a “false negative, Type II error”; that is, not observing a significant difference that truly does exist and declaring the trial a failure. This problem is compounded in trials that end up with too few subjects at their conclusion due to drop-outs or those that do not enroll a sufficient number of subjects at the outset.
Example: Investigators conducted a randomized, controlled trial to examine efficacy of a procedure called PIRFT (percutaneous intradiscal radiofrequency thermocoagulation) in treating chronic low back pain due to intervertebral disc degeneration [discussed in a Pain-Topics UPDATE here]. Twenty eligible patients were randomized to either PIRFT or a sham intervention, 10 subjects per group. However, after 6 months the trial was stopped, since a preliminary analysis detected that there were no significant differences between groups emerging regarding changes in pain intensity (the primary endpoint), and some patients appeared to actually fare worse with PIRFT.
The authors had determined in advance that they would need at least 25 subjects in each group to have an 80% probability of detecting a 2-point statistically significant (P=0.05) improvement in pain scores — or, 80% power for this effect size. Since this study did not enroll sufficient numbers of patients there was inadequate power at the start to detect significant differences, raising questions as to whether the study should have been undertaken. As it stands, the study results, published in the journal Pain [Kvarstein et al. 2009], provide unreliable evidence of either success or failure, and a larger study might have demonstrated much more definitive results.
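A rough calculation suggests how far short of the planned power a 10-per-group trial would fall. This is only an illustrative sketch under a normal approximation: the standardized effect size is back-solved from the design stated above (25 per group, 80% power, P=0.05) rather than taken from the published paper, and the function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

NORM = NormalDist()
ALPHA = 0.05
Z_CRIT = NORM.inv_cdf(1 - ALPHA / 2)  # two-sided critical value


def implied_effect_size(n_per_group, power):
    """Standardized effect a design is just able to detect (normal approx.)."""
    return (Z_CRIT + NORM.inv_cdf(power)) * sqrt(2 / n_per_group)


def approx_power(n_per_group, effect_size):
    """Approximate power of a two-sided, two-group comparison of means."""
    return NORM.cdf(effect_size * sqrt(n_per_group / 2) - Z_CRIT)


# Effect size implied by the planned design: 25 per group, 80% power.
planned_d = implied_effect_size(25, 0.80)

# Power for that same effect with only 10 subjects per group.
power_as_run = approx_power(10, planned_d)

print(f"implied standardized effect: {planned_d:.2f}")
print(f"approximate power with 10 per group: {power_as_run:.2f}")
```

Under these assumptions the power with 10 subjects per group is well under 50%, so a true 2-point improvement would more likely be missed than detected.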
Size Affects Confidence Intervals
It is important to consider that just because a study is calculated to be adequately powered and of a necessary size to produce statistically significant outcomes (achieving a certain P-value) does not mean that those results will have a high degree of accuracy and/or will be clinically significant. A better measure of this is the Confidence Interval (CI) for the outcome effect, as was described in Part 5 of this Series [here].
Narrow CIs suggest considerable accuracy in the result and confidence that the point estimate of effect is a reasonable representation of the true value; however, the width of any CI, representing the range of possible outcomes, is directly dependent on sample size. Larger studies with more participants tend to give more precise estimates of outcomes and, hence, much narrower CIs than smaller studies. The Figure at right depicts 3 scenarios with 95% CIs and a 10% mean improvement (the risk reduction point estimate, with ≥5% judged as clinically important) for identical hypothetical studies of the same treatment, except for sample size.
- Scenario #1 has a large sample size, with superior power and a tight CI ranging from 7% to 13%. The result is clearly statistically significant, since the range does not include zero (0), and we might somewhat confidently conclude that the treatment could be beneficial.
- In scenario #2, fewer subjects were enrolled but the trial still has adequate power to detect a statistically significant outcome. With the same 10% risk reduction point estimate, the CI more widely ranges from 2% to 18%, which raises some doubt about the clinical importance of the treatment since there is a possibility that the true effect could be quite minimal.
- Scenario #3 depicts an even wider CI, from an underpowered study enrolling few subjects. The range includes the null value of a 0% difference, so the result is not statistically significant and, even though the point estimate of effect is still 10%, we cannot be confident that the treatment has any real effect whatsoever.
While power can help to determine whether or not a statistically significant result may be detected, if it does exist, the CI is more telling of the strength of evidence and where the possible “true” effect may lie. Even in the underpowered, non-significant study #3 there is a range of possible effect sizes that could be important, but these would not be detected as significant unless sample size is increased.
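The dependence of CI width on sample size can be sketched numerically. The example below uses a simple Wald-type interval for a difference between two proportions; the event rates (30% in the control group, 20% with treatment) and the group sizes are hypothetical values chosen so that the intervals roughly resemble the three scenarios, and are not taken from any actual study.

```python
from math import sqrt
from statistics import NormalDist


def risk_difference_ci(p_control, p_treated, n_per_group, level=0.95):
    """Wald-type confidence interval for a risk reduction (equal group sizes).

    A simple large-sample approximation, used here only for illustration.
    """
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    difference = p_control - p_treated
    standard_error = sqrt(
        p_control * (1 - p_control) / n_per_group
        + p_treated * (1 - p_treated) / n_per_group
    )
    return difference - z * standard_error, difference + z * standard_error


# Same 10% point estimate in every case; only the sample size changes.
large_study = risk_difference_ci(0.30, 0.20, 1600)
medium_study = risk_difference_ci(0.30, 0.20, 220)
small_study = risk_difference_ci(0.30, 0.20, 50)

for label, (low, high) in [
    ("large study", large_study),
    ("medium study", medium_study),
    ("small study", small_study),
]:
    print(f"{label:<13} 95% CI: {low:+.1%} to {high:+.1%}")
```

With these illustrative inputs, the largest study yields an interval lying entirely above the 5% threshold of clinical importance, the medium study yields an interval that excludes zero but dips below 5%, and the smallest study yields an interval that includes zero.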
In pain research reports, authors frequently do not include 95% CIs for key outcomes, which can be misleading. On the one hand, successful trials, adequately powered and demonstrating statistically significant outcomes, may not be as accurate and consequential as one might be led to believe (as in Scenario #2). In other cases, treatment failures might have more promise than depicted in the study and more positive results might be evident if adequate power were achieved (Scenario #3).
Power Manipulation Strategies
As noted above, one of the easiest ways to increase power is by enrolling many more subjects in a study; however, this may require added time, expense, and other resources unavailable to the researchers. They could decrease their power requirement, to less than 80%, or use a more liberal probability level for denoting significance, say P=0.10, but these are obvious “red flags” that downgrade the quality and validity of any research study.
Another strategy, common to most research designs, is the selective recruitment of patients as subjects. Among other things, meticulous selection helps to constrain the variability of response to a treatment or intervention (eg, the Standard Deviations of outcome measurement become smaller), which is one of the factors that enters into calculations to increase power. An added benefit of this is that, by including only subjects who are expected to respond to the therapy in question, it could increase the effect size and thereby further boost power.
Example: In studying a new analgesic treatment for fibromyalgia, researchers might enroll only middle-aged white women with severe pain of long duration, while excluding those with depression/anxiety, prior exposure to the class of drug being tested, high body mass index, concurrent medical conditions, certain functional limitations, and/or other factors. In other words, there is seemingly an attempt to match human subjects to the homogeneous qualities of inbred laboratory animals that have previously demonstrated responsiveness to the treatment being tested. While this might seem deceptive, it is sometimes desirable to limit the inherent and numerous variations in humans that may otherwise confound response and distort outcome measures.
However, there is a critical question to ask: Are the research subjects, of whom there may be relatively few, representative of patients seen in everyday clinical practice? In many cases, the answer is “no.” Although the subjects could be typical of a select group of patients, the fact is that most patients in practice may not respond as did study subjects and this needs to be carefully considered by cautious and critical readers of the research.
Even with careful subject selection, smaller sample sizes can result in important but unknown prognostic variables not being in balance across study groups after randomization. Such situations may lead to biased outcomes if, by chance, the patients in one group have a more favorable prognosis at baseline [Koes et al. 1995].
Finally, in pain research, more often than not, relatively large effect sizes are of primary interest; for example, ≥50% improvement in pain scores among half or more of treated subjects. With such large effect sizes and low variability, quite small numbers of subjects per group might be calculated as necessary for achieving 80% or even greater power at a 0.05 or lower significance level. However, even in the best of cases, statistically significant outcomes of such trials would incur very wide Confidence Intervals and uncertainties of interpretation, as noted above.
How Big is Enough?
Other than satisfying requirements of a power analysis, which provides approximate estimates of sample size, there are no rigid rules for how many subjects must be enrolled in a study for it to be considered of value (or, to be published in the pain literature, apparently). Some authors have suggested that for interventional pain management studies a sample size of 50 in each group is the smallest number that is appropriate [Koes et al. 1995]. In his classic paper, “A Power Primer,” Cohen proposed that for detecting small, medium, and large effect sizes at the P=0.05 level with power=0.80, individual group sizes of 393, 64, and 26, respectively, would be required. Those are minimum approximations, since smaller sample sizes starkly reduce the methodological quality and strength of evidence in the study even if statistically significant results are found.
Simple online calculators for determining crude sample sizes and power in studies comparing two means or proportions as outcome measures are available [here] and [here]. However, it must be noted that power calculations also are influenced by the type of significance testing being applied to the data — such as a t-test, chi-square test, correlation statistics, analyses of variance or regression, etc. — and whether the testing is one- or two-tailed. Research authors should describe their power analyses within the context of these statistics and, ideally, in language that readers can understand.
In studies with relatively few subjects and too little power a finding of significant results cannot be trusted as valid. On the other hand, if there is too much power — with many more subjects enrolled than necessary — time and resources may be wasted for little gain, or, more troublesome, statistically significant results might be generated for effects that are of little if any clinical importance. This is sometimes the case with retrospective data-mining studies that tap into large reservoirs of information gathered on many thousands of patients.
Data-mining approaches have been previously discussed in an UPDATE [here]. Such studies have very high statistical power, but they may be like shooting at a large target with a machine gun (using the target-shooting analogy from above), and then looking for results that cluster together in significant ways. There is a high probability — a statistically significant chance (P < 0.05) — that something, usually multiple factors, of interest will be found and deemed worthy of reporting, even though the outcomes might be well off the bull’s-eye in terms of being clinically valid and important.
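Two rough calculations illustrate why very large datasets call for caution. The sketch below assumes a normal approximation for a two-sided, two-group comparison of means and, for the second calculation, that the statistical tests are independent of one another (a simplification that rarely holds exactly in practice). The dataset size, the tiny effect size of 0.05, and the count of 20 tests are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

NORM = NormalDist()


def approx_power(n_per_group, effect_size, alpha=0.05):
    """Approximate power of a two-sided, two-group comparison of means."""
    z_crit = NORM.inv_cdf(1 - alpha / 2)
    return NORM.cdf(effect_size * sqrt(n_per_group / 2) - z_crit)


def chance_of_any_false_positive(n_tests, alpha=0.05):
    """Chance that at least one of n independent tests is falsely significant."""
    return 1 - (1 - alpha) ** n_tests


# A clinically trivial effect (0.05 SD) with 10,000 patients per group.
tiny_effect_power = approx_power(10_000, 0.05)

# Twenty unrelated comparisons, none with any true effect behind it.
any_false_positive = chance_of_any_false_positive(20)

print(f"power to detect a 0.05 SD effect: {tiny_effect_power:.2f}")
print(f"chance of at least one false positive in 20 tests: {any_false_positive:.2f}")
```

Under these assumptions, an effect far too small to matter clinically would very probably be declared statistically significant, and testing 20 unrelated factors would more likely than not turn up at least one spurious "finding."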
At the other extreme are small, underpowered studies that are overrepresented in the pain-research literature. Often, these are described by authors as “proof-of-principle” or “pilot” studies to justify their publication, and there are some reasons for conducting small-scale studies:
- They can be helpful for generating hypotheses and gauging effect sizes to be used in designing full-scale clinical trials in the future.
- Grant review committees may want to see a demonstration of treatment effects on a small scale before funding larger trials.
- If there are considerable risks to subjects — eg, an invasive or potentially toxic treatment — it could be preferable to start out with a small-scale test.
- The researchers simply may not have the resources to conduct a larger trial, but believe a small study would be better than none at all for beginning to answer important research questions.
While such studies may serve a purpose, there is some question as to whether they should be published so prominently, or at all, in mainstream pain-field literature. Too often, even when there are positive outcomes, the researchers do not go on to conduct larger, more definitive, and valid studies for confirming results. The small, underpowered studies may be (uncritically) cited in future articles as evidence of treatment efficacy, even though their quality is minimal and they actually demonstrate little of value for clinical practice. When this happens, readers of review articles and other literature usually are unaware of the deficiency.
Some have argued that multiple, similar small-scale studies might be combined into more powerful meta-analyses to demonstrate robust outcomes. However, both quantitatively and qualitatively it is doubtful that any number of underpowered studies would add up to even a single, large, well-powered and valid trial worthy of analysis. At best, a collection of similar small-scale studies might be viewed as a case series — of interest but a low quality of evidence.
Furthermore, experts have questioned the ethical propriety of conducting small-scale, underpowered studies [Halpern et al. 2002; Kraemer et al. 2006]. They assert that, even in the most innocuous of small-scale studies, patients are being exposed to the burdens and possible hazards of experimental treatments or procedures, and time and money are expended, for very limited ends. In many cases, studies worth pursuing are aborted prematurely due to disappointing early results, and studies that are not aborted are underpowered and of questionable clinical value.
Example: In the study of PIRFT described above, there may be concerns regarding size that raise questions about its propriety. For one thing, the procedures (actual PIRFT or sham) were highly invasive and patient inclusion was extremely selective: only 74 patients were recruited for the study out of 700 referrals, and merely 20 actually participated in the trial (10 in each group). In an accompanying editorial, van Kleef and Kessels [2009] suggest that stopping the study after the inclusion of only 20 patients was unacceptable, even though the study was underpowered. They observe that all outcomes did show positive trends for PIRFT, although these were not statistically significant with so few subjects. Still, it is understandable that the researchers were reluctant to continue their study when favorable results seemed doubtful. There might have been ethical concerns about subjecting patients to invasive treatments in a study that was underpowered to produce valid outcomes at the outset, and that therefore might provide little of value for the pain management field.
An added concern is that insurance plans are unlikely to allow reimbursement for treatments or procedures based on inadequate studies, or they may deny payment for therapies demonstrated as inefficacious in published trial reports. In the PIRFT example above, it was feared that the study, published in a major pain-field journal, would result in rejection of compensation for the procedure by insurance providers based on the negative outcomes of a single, underpowered, aborted trial; plus, any future research in support of PIRFT would need to overcome the negative perceptions to qualify the intervention for payment.
Conclusion: Power Can Be Precious
Sample sizes in clinical pain research are often inadequate and, consequently, genuinely worthwhile treatments may be dismissed as ineffective. More commonly, marginally beneficial therapies may be reported as being of significant efficacy when more rigorous and larger-scale investigations might demonstrate otherwise.
The accurate reporting of clinical trials — whether outcomes are favorable or unfavorable — can be critical for establishing a valid and trustworthy base of evidence in pain management. However, the value of conducting and publishing small, underpowered studies must be questioned.
At times these investigations, such as pilot studies, might be helpful for designing more extensive future research, but reporting them in the literature as if they are of consequence may prematurely skew or bias perceptions of the treatments or interventions in question. This is particularly troublesome when such small studies end up being included as evidence in review articles, meta-analyses, or guidelines — which often happens.
An important point, however, is that outcomes demonstrated in small-scale underpowered studies should not be extrapolated to broader populations until larger, robustly powered, studies are conducted and there is a sufficient accumulation of confirming evidence. A frequent failing in the pain literature (and news media) is ignoring this and sometimes making medical mountains out of the exploratory mole hills portrayed by small-scale investigations.
> Altman DG, Bland JM. Absence of evidence is not evidence of absence. BMJ. 1995;311:485 [article here]
> Dworkin RH, Turk DC, Katz NP, et al. Evidence-based clinical trial design for chronic pain pharmacotherapy: A blueprint for ACTION. Pain. 2011(Mar);152(suppl-3):S107-S115 [access by subscription].
> Glantz SA. Primer of Biostatistics. 4th ed. New York, NY: McGraw-Hill, Health Prof. Div.; 1997.
> Greenhalgh T. How to read a paper: Assessing the methodological quality of published papers. BMJ. 1997;315(7103).
> Halpern SD, Karlawish JH, Berlin JA. The continuing unethical conduct of underpowered clinical trials. JAMA. 2002;288(3):358-362 [abstract here].
> Kraemer HC, Mintz J, Noda A, et al. Caution regarding the use of pilot studies to guide power calculations for study proposals. Arch Gen Psychiatry. 2006;63:484-489.
> Ingelfinger JA, Mosteller F, Thibodeau LA, Ware JH. Biostatistics in Clinical Medicine. 3rd ed. New York, NY: McGraw-Hill, Health Prof. Div.; 1994.
> Koes BW, Scholten R, Mens J, Bouter LM. Efficacy of epidural steroid injections for low-back pain and sciatica: a systematic review of randomized clinical trials. Pain. 1995;63:279-288 [abstract].
> Kvarstein G, Mawe L, Indahl A, et al. A randomized double-blind controlled trial of intra-annular radiofrequency thermal disc therapy – a 12-month follow-up. Pain. 2009(Oct);145(3):279-286 [abstract].
> Manchikanti L, Hirsch JA, Smith HS. Evidence-Based Medicine, Systematic Reviews, and Guidelines in Interventional Pain Management: Part 2: Randomized Controlled Trials. Pain Physician. 2008;11:717-773 [article PDF].
> Pandit JJ, Yentis SM. All that glisters… How to assess the ‘value’ of a scientific paper. Anaesthesia. 2005;60:373-383 [article PDF].
> Power Analysis. StatSoft: Electronic Statistics Textbook. Undated [online here].
> van Kleef M, Kessels AGH. Underpowered clinical trials: time for a change. Pain. 2009;145(3):265-266.
> Williams F. Reasoning with Statistics. New York: Holt, Rinehart and Winston; 1968.
> What is power analysis? Power & Precision. Undated [online here].