Part 10 – Interpreting Effect Sizes in Research Data
One of the greatest and most frequent failings of pain research reports is authors presenting the statistical significance of outcome results, but not assessing the clinical importance of those findings. This is especially vexing when pain therapies are compared with each other or against placebo and there are differences between groups, as is commonly the case. What do the differences, when considered as effect sizes, mean for patients?
As statistician Darrell Huff stated more than half a century ago, “A difference is a difference only if it makes a difference” [Huff 1954]. When research demonstrates that one treatment or intervention might be better than another, healthcare providers and their patients want to know whether the differences can be trusted as being valid and also if they are of any consequence for better patient care; that is, what is the clinical significance?
An important approach for assessing the clinical significance of study data is the determination of effect sizes. Some organizations have recommended that researchers should always provide an estimate of effect size whenever reporting a P-value for an outcome, along with the Confidence Interval for the effect and a discussion of previously reported effect sizes for the outcome of interest [Durlak 2009].
What Are Effect Sizes?
If the only question of interest is — “Does the treatment or intervention produce an outcome that is probably not due to chance alone?” — then statistical significance, as denoted by a P-value, provides an answer. This was discussed in Part 4 of this series [here] and is mentioned further below.
However, if the question is — “Is the outcome sufficiently large to be of clinical significance?” — then effect size must be considered. In simplest terms, an effect size provides information about the magnitude and direction of a difference between outcome measures or of the relationships among the measures [Durlak 2009]. The ultimate goal is to assess the strength of the findings in ways that have meaning for clinical practice.
By some counts there are more than 40 effect-size statistics that can be calculated, but only a handful are considered here as being typical of primary endpoint measures in the pain research literature. Depending on what is being examined, an effect size can be a difference between mean values, a ratio or percentage, or a correlation. Some of these effects have been discussed in previous articles in this series on “Making Sense of Pain Research”: for example, relative risks [here], odds ratios [here], and standardized mean differences [here].
Ideally, readers of pain research reports should not have to know how to calculate effect sizes; rather, these should be provided by report authors along with interpretations and discussions of what they are and what they mean. Unfortunately, this is often neglected in pain research reports; frequently, authors seem focused more on demonstrating statistical significance of their results without further discussion of clinical significance in terms of effect size, which can be misleading. Therefore, critical readers need some basic tools for understanding effect sizes and their importance.
The following Table lists some common effect measures, and magnitudes of effect, that can be useful “rules of thumb” for interpreting data often found in pain research reports. Additionally, the accompanying explanations and caveats that follow below should be carefully studied. (Updated in February 2012.)
Explanation of Effect Sizes in the Table (also see notes regarding PTCalcs below)…
- Mean % change on VAS pain scale — The size categories for this effect represent average (mean) percentage changes in pain scores from baseline, as measured on a Visual Analog Scale (VAS), which have been considered of importance to patients. This also has been discussed in an UPDATE [here; and, in Dworkin et al. 2008]. VAS scales typically range from either 0 to 100 millimeters or 0 to 10 centimeters, with 0 being no pain and 10 or 100 being the worst pain imaginable.
Percentages falling between the ranges indicated in the Table are interpreted as straddling the categories. For example, a decrease from 50 mm to 28 mm, or 44%, might be considered as a “medium-to-large” effect.
A large 50% decrease is often, but not always, considered a standard for adequate pain relief in clinical trials [Seres 1999]. However, the pain level at baseline needs to be considered; for example, should changes from 90 mm to 45 mm and 60 mm to 30 mm — both 50% effect sizes — be considered of equivalent clinical importance? At least one study found that patients must achieve an outcome score of ≤4.0 cm (≤40 mm) as a result of an intervention in order to continue to improve without returning to initially higher baseline pain levels [discussed in UPDATE here; and Miller 2010].
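The percentage arithmetic in the examples above can be checked with a few lines of Python. This is only an illustrative sketch: the function name is hypothetical, and the 40 mm threshold is simply the ≤40 mm outcome score mentioned above.

```python
def percent_change(baseline_mm: float, outcome_mm: float) -> float:
    """Percentage reduction in pain score from baseline (positive = less pain)."""
    if baseline_mm <= 0:
        raise ValueError("baseline must be greater than 0")
    return 100.0 * (baseline_mm - outcome_mm) / baseline_mm


# The three changes mentioned in the text, all on a 100 mm VAS.
for baseline, outcome in [(50, 28), (90, 45), (60, 30)]:
    pct = percent_change(baseline, outcome)
    absolute = baseline - outcome
    print(
        f"{baseline} mm -> {outcome} mm: {pct:.0f}% decrease, "
        f"{absolute} mm absolute change, "
        f"outcome at or below 40 mm: {outcome <= 40}"
    )
```

Both 50% decreases have the same relative effect size, yet they differ in absolute change (45 mm vs 30 mm) and only one of them ends at or below 40 mm, which is why the baseline level needs to be considered.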
- This effect represents the absolute mean change on a 100 mm VAS [from Chou et al. 2007, Table 3], representing a small/slight, moderate, or large/substantial effect for pain relief. A 10 cm VAS can be converted to this effect measure simply by multiplying the mean outcome values by a factor of 10; eg, a mean difference of 2.5 cm = 25 mm.
- Often, 11 point numeric rating scales (NRS) are used to assess pain, with 0=no pain to 10=worst pain imaginable [discussed in UPDATE here; and in Dworkin et al. 2008, 2009]. The values given in the Table for each category of effect magnitude are approximate mean changes from baseline.
- This effect is based on Standardized Mean Difference (SMD), or Cohen’s d score [previously discussed here]. This represents the difference between the means of two groups divided by the pooled (ie, roughly average) Standard Deviation (SD) of the two groups [Cohen 1992; Livingston et al. 2009]. Variants of the d score include Hedges’ g and Glass’s delta scores. An online calculator, PTCalcs [Excel worksheet here], can be useful for deriving Cohen’s d when group means and SDs are known or, if only means and Confidence Intervals (CIs) are provided in the report data, from those values [see PTCalcs worksheet here]. (Also see notes below regarding PTCalcs.)
Alternate ranges for these effect magnitudes, as suggested in the Cochrane Collaboration Handbook [Cochrane 2011, sec. 12.6.2] and used by some researchers, include: small=<0.41, medium=0.41-0.70, large=>0.70.
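For readers who prefer to see the arithmetic, here is a minimal Python sketch of Cohen’s d using the usual pooled-SD formula. The function names are hypothetical, and the input figures are borrowed from Example 2 later in this article (mean±SD of 5.0±1.5 with N=37 vs 3.9±1.3 with N=43).

```python
import math


def pooled_sd(sd1: float, n1: int, sd2: float, n2: int) -> float:
    """Pooled standard deviation of two groups, weighted by degrees of freedom."""
    return math.sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2))


def cohens_d(mean1: float, sd1: float, n1: int,
             mean2: float, sd2: float, n2: int) -> float:
    """Difference between group means divided by the pooled SD."""
    return (mean1 - mean2) / pooled_sd(sd1, n1, sd2, n2)


# Figures taken from Example 2 in this article.
d = cohens_d(5.0, 1.5, 37, 3.9, 1.3, 43)
print(f"Cohen's d = {d:.2f}")
```

With these inputs the result rounds to 0.79, the value quoted in Example 2. Other calculators may use slightly different formulas (eg, Hedges’ g applies a small-sample correction), so results can differ a little.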
- It is often wrongly assumed that the effect size for outcomes expressed between groups as proportions (or percentages) is derived merely by subtracting the two; however, proportional data do not follow a normal distribution. Therefore, for example, a change from 0.10 to 0.30 is not equivalent in effect size to a change from 0.40 to 0.60, even though the difference for each set is 0.20 or 20%. Calculating effect sizes for proportions requires an arcsine transformation before the two are subtracted, producing what is called the “h” statistic [Livingston et al. 2009]. This can be easily calculated using PTCalcs [Excel worksheet here].
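As a rough illustration of the arcsine transformation just described, the following Python sketch computes the “h” statistic for the two pairs of proportions mentioned above. The function name is hypothetical; the formula is the standard one attributed to Cohen, and it is assumed here that other calculators follow the same approach.

```python
import math


def cohens_h(p1: float, p2: float) -> float:
    """Effect size h: difference between arcsine-transformed proportions."""
    phi1 = 2 * math.asin(math.sqrt(p1))
    phi2 = 2 * math.asin(math.sqrt(p2))
    return phi1 - phi2


# Both pairs differ by 0.20 in raw terms, but not in effect size.
h_low = cohens_h(0.30, 0.10)
h_mid = cohens_h(0.60, 0.40)
print(f"0.10 -> 0.30: h = {h_low:.3f}")
print(f"0.40 -> 0.60: h = {h_mid:.3f}")
```

The change from 0.10 to 0.30 yields h of roughly 0.52, while the change from 0.40 to 0.60 yields roughly 0.40, even though the simple difference is 0.20 in both cases.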
- Chi-square (χ²) analysis is often reported in research reports for comparing relationships between categorical or count data, usually forming a 2 x 2 cell contingency table [Dennis et al. 2009; Livingston et al. 2009]. An example is a table depicting patients who do and who do not respond to treatment with a new analgesic compared with those who do and do not respond to placebo. There are various approaches for deriving the effect size, denoted as “w,” and a simplified method using only the chi-square statistic and total number of subjects or observations (N) is available in PTCalcs [Excel worksheet here].
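The simplified method can be sketched in Python as w = √(χ²/N). This is an illustration only: the function names and the 2 x 2 counts are hypothetical, the chi-square is computed without a continuity correction, and it is an assumption that other calculators use exactly this formula.

```python
import math


def chi_square_2x2(a: int, b: int, c: int, d: int) -> float:
    """Chi-square for a 2 x 2 table [[a, b], [c, d]], no continuity correction."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


def effect_size_w(chi_square: float, n_total: int) -> float:
    """Effect size w derived from chi-square and total observations N."""
    return math.sqrt(chi_square / n_total)


# Hypothetical counts: drug 30 responders / 20 nonresponders,
# placebo 20 responders / 30 nonresponders.
chi2 = chi_square_2x2(30, 20, 20, 30)
w = effect_size_w(chi2, 100)
print(f"chi-square = {chi2:.1f}, w = {w:.2f}")
```

In this made-up table, χ² is 4.0 with N=100, so w is 0.20.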
- Risk effects and their calculations were previously discussed in Part 6 of this series [here]. The null value for the Risk Ratio (also called Relative Risk, or RR) is 1.00, so as values diverge from 1.00 — either increasing above 1.00 or decreasing below 1.00 toward 0 — they represent larger effect sizes for the RR. For example, from the Table, an RR of 1.20 would be considered a medium effect size, whereas RR=0.85 would be a small effect [Cochrane 2011].
However, as considered in Part 6, the interpretation of the clinical importance of a given Risk Ratio also should be made with knowledge of the typical risk of events: a medium RR effect size of 1.25 might correspond to an absolute increase in events from 60% to 75% or a much smaller increase from 6% to 7.5%, depending on what is being measured.
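The point about baseline risk can be illustrated with a short Python sketch using the two scenarios just mentioned. The function names are hypothetical.

```python
def risk_ratio(risk_treated: float, risk_control: float) -> float:
    """Relative risk: event rate in one group divided by the rate in the other."""
    return risk_treated / risk_control


def risk_difference(risk_treated: float, risk_control: float) -> float:
    """Absolute difference in event rates between groups."""
    return risk_treated - risk_control


# Same RR of 1.25, very different absolute increases.
for control, treated in [(0.60, 0.75), (0.06, 0.075)]:
    rr = risk_ratio(treated, control)
    diff = risk_difference(treated, control)
    print(
        f"{control:.1%} -> {treated:.1%}: RR = {rr:.2f}, "
        f"absolute increase = {diff * 100:.1f} percentage points"
    )
```

Both scenarios give RR=1.25, but the absolute increase is 15 percentage points in the first and only 1.5 in the second.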
- Odds and Odds Ratios (ORs) were previously discussed in this series [here]. The ORs for each category of effect-size magnitude in the Table were approximated from the respective Risk Ratios noted in item 7 of the Table. In some calculators [Excel worksheet here] OR of 1.30±, 1.90±, and 2.70± are considered as small, medium, and large effects, respectively [DeCoster 2009]. In any case, ORs in each category are of greater magnitude than the comparable RRs.
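To see why ORs run larger than the comparable RRs, the following Python sketch converts a Risk Ratio into an Odds Ratio for a given risk in the comparison group. The function name and the baseline risks are illustrative assumptions (the risks are reused from the RR example in Part 6 terms, 60% and 6%).

```python
def odds_ratio_from_rr(rr: float, control_risk: float) -> float:
    """Odds ratio implied by a risk ratio at a given control-group risk."""
    treated_risk = rr * control_risk
    if not 0 < treated_risk < 1:
        raise ValueError("implied treated-group risk must be between 0 and 1")
    treated_odds = treated_risk / (1 - treated_risk)
    control_odds = control_risk / (1 - control_risk)
    return treated_odds / control_odds


for control_risk in (0.06, 0.60):
    or_value = odds_ratio_from_rr(1.25, control_risk)
    print(f"RR = 1.25 at {control_risk:.0%} baseline risk -> OR = {or_value:.2f}")
```

With a 6% baseline risk the OR is about 1.27, close to the RR of 1.25; with a 60% baseline risk the same RR corresponds to an OR of 2.00. The gap between the two measures widens as events become more common.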
- The correlation coefficient, typically denoted by r, is an effect measure of the association or linear dependence between two variables, giving a value between +1.00 and –1.00. At the positive extreme, +1.00, there is a perfect linear relationship of values for one variable changing in the same direction and rate as values for the other variable; for example, if there is a directly corresponding increase in pain relief as analgesic dose is increased. The negative extreme, –1.00, indicates a perfect inverse linear relationship; that is, as values for one variable increase, values for the other variable decrease at the same rate — for example, therapeutic response to a drug decreasing as patient age increases. At the midpoint, or 0, there is no meaningful relationship at all between the two variables in question.
Some flexibility is necessary when interpreting the magnitude of r. Effect sizes for an unadjusted r-value in the Table present one general rule of thumb [Glantz 1997], while Cohen proposed the following categories: No Correlation, 0.0 to 0.09; Small/Weak, 0.1 to 0.3; Medium/Moderate, 0.3 to 0.5; Large/Strong, >0.5. As noted above, any r measure can be positive or negative in value, depending on the direction of the association.
- As with proportional data (point 5 above), distributions for r-values (correlation coefficients) are skewed, so a transformation is needed before two r-values can be subtracted from each other to derive an effect size for a difference between them, known as the “q” statistic. PTCalcs [Excel worksheet here] provides an easy way to derive this effect size.
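A minimal Python sketch of the “q” statistic follows, assuming the transformation in question is the commonly used Fisher z transformation (the inverse hyperbolic tangent of r). The function name and the example r-values are hypothetical.

```python
import math


def cohens_q(r1: float, r2: float) -> float:
    """Effect size q: difference between Fisher z-transformed correlations."""
    return math.atanh(r1) - math.atanh(r2)


# Both pairs differ by 0.20 in raw terms, but not in effect size.
q_weak = cohens_q(0.30, 0.10)
q_strong = cohens_q(0.80, 0.60)
print(f"r = 0.10 vs 0.30: q = {q_weak:.3f}")
print(f"r = 0.60 vs 0.80: q = {q_strong:.3f}")
```

The same 0.20 difference between r-values gives q of roughly 0.21 among weak correlations but roughly 0.41 among strong ones, which is why simple subtraction is misleading.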
- The numbers in this portion of the Table are not effects per se; however, sample sizes are important for determining effect sizes that can be detected as being statistically significant and are a component of study power [discussed in Part 9 of this series here]. If an effect size is large, fewer subjects in a study could be adequate to detect it as being statistically significant; however, as the effect size decreases in magnitude more subjects are required in each group to achieve statistical significance.
The Table suggests a minimum number of subjects in each study group to detect small, medium, and large effect sizes [Cohen 1992]. These are for 2-tailed tests; that is, when the outcome may be either positive or negative, on either side of the null value, which is typical in pain research. Others have recommended that the very minimum sample size in any pain study should be at least 50 to 100 subjects/group. Sample sizes smaller than this may starkly reduce the methodological quality and strength of evidence in a study, even if a statistically significant outcome and clinically meaningful effect size are found.
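The relationship between effect size and required group size can be approximated in Python as follows. This sketch rests on several assumptions not stated in the article: a comparison of two group means, 2-tailed alpha of 0.05, power of 0.80, Cohen’s conventional d benchmarks of 0.2, 0.5, and 0.8, and a normal approximation rather than an exact t-based calculation. Published tables may therefore show slightly different numbers.

```python
import math
from statistics import NormalDist


def n_per_group(d: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate subjects per group to detect effect size d (2-tailed test)."""
    standard_normal = NormalDist()
    z_alpha = standard_normal.inv_cdf(1 - alpha / 2)  # critical value, 2-tailed
    z_power = standard_normal.inv_cdf(power)
    return math.ceil(2 * (z_alpha + z_power) ** 2 / d**2)


for label, d in [("small", 0.2), ("medium", 0.5), ("large", 0.8)]:
    print(f"{label} effect (d = {d}): about {n_per_group(d)} subjects per group")
```

Under these assumptions the estimates are about 393, 63, and 25 subjects per group, showing how quickly the required sample grows as the effect to be detected shrinks.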
CAVEATS: The effect sizes listed in each Table category — small, medium, large — are only suggested approximations and are somewhat simplistic since, in many cases, the relationship between the magnitude of an effect and its clinical value in everyday practice is complex [Durlak 2009]. For example, it should be remembered that pain score improvement is only one dimension of treatment that is clinically important to patients; so, pain-relief effects often are considered in association with improvements in functionality, mood, sleep quality, and other factors.
Ultimately, the clinical importance of an effect size must be considered within the context of what is being measured and, if available, outcomes from other studies of the same intervention in question. For example, even a quite small effect size when the outcome is reducing deaths could be very important. On the other hand, a large effect in reducing pain for a new, costly drug with substantial adverse effects and little improvement in other areas may not be of great clinical value in comparison with currently available therapies.
HELP with CALCULATIONS: The calculator links above are to an online Microsoft® Excel® spreadsheet from Pain Treatment Topics, PTCalcs.xls [available online here]. This is a multipart worksheet template that can be helpful for determining effect sizes and other parameters using outcome data commonly found in published research reports. There is a series of 11 easy-to-use statistical calculators; however, because varying methods may be used for calculating effect sizes, depending on the source consulted, these are intended only as an aid for assessing the quality and validity of research outcomes. Results are approximate and should not be used for inclusion in research papers submitted for publication.
Statistical Significance & Effect Size
Statistical significance, denoted by a P-value, was discussed in Part 4 of this series [here], and this statistic should be provided by report authors for effect sizes just as for other measures in a research study [or, the P-value can be calculated here in PTCalcs]. However, the statistical significance of an outcome effect is probably the least remarkable aspect of research results, since, as was observed in Part 4, “not everything that is statistically significant is significant.”
There is no direct relationship between a P-value and the magnitude of an effect; a small and statistically significant P-value (eg, P<0.001) can occur with either a small-, medium-, or large-sized effect [Durlak 2009] — which is not to say that statistical significance is completely unimportant. The P-value for an effect measure does suggest the degree to which it might have come about by chance alone; so, if the effect is not significant statistically in the first place it cannot be assumed to be of any valid consequence clinically.
As a hypothetical example, in a study comparing a new drug with placebo the researchers report a large effect size for overall improvement in functionality: Cohen’s d=0.85, P≤0.05. Essentially, this is saying that, if there were truly no difference between the groups, an effect at least this large would be expected to arise by chance in 5% or fewer of such studies. However, the P-value is a function of both sample size and effect size. In a study with a quite small number of subjects per group (say, N<30) that finds a large effect size, the addition of only a single patient to the study might shift a P-value from below 0.05 to a nonsignificant value >0.05, without any substantive change in effect size [Durlak 2009].
On the other hand, studies with very large numbers of subjects pose a different sort of problem. The likelihood of finding statistically significant outcomes of interest increases with increased sample sizes, which is a common situation with data-mining studies that tap into large reservoirs of patient data. A study may need only a large enough number of participants for an effect size of small magnitude and trivial clinical importance to be statistically significant.
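How sample size alone can drive a P-value is sketched below in Python. The example holds a trivially small effect size constant (a hypothetical d of 0.10) and varies the number of subjects per group. It assumes two equal-sized groups and uses a normal approximation, so the P-values are indicative only.

```python
import math
from statistics import NormalDist


def approx_p_value(d: float, n_per_group: int) -> float:
    """Approximate 2-tailed P-value for effect size d with equal group sizes."""
    z = abs(d) * math.sqrt(n_per_group / 2)
    return 2 * (1 - NormalDist().cdf(z))


# Same small effect size, increasingly large studies.
for n in (50, 500, 5000):
    p = approx_p_value(0.10, n)
    print(f"d = 0.10, {n} per group: P = {p:.4f}")
```

With 50 subjects per group the effect is far from statistically significant (P of about 0.6), whereas with 5,000 per group the very same effect size yields P<0.001.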
Confidence Intervals & Effect Sizes
Along with P-values, report authors should always provide Confidence Intervals, or CIs, for effect measures. CIs were previously discussed in Part 5 in this series [here]. All effect measures are essentially point estimates, and the CI represents the range within which the true effect is likely to fall. This range depends on sample size, variability or spread of the data, and limits set by the researchers, such as the 95% confidence level that is most typical. A 95% CI is essentially saying that, if the exact same study were repeated 100 times and a CI calculated each time, about 95 of those intervals would be expected to contain the true effect size.
So, the CI is an indicator of the precision and accuracy of the point estimate of the effect measure reported in the study. With smaller sample sizes, typical of many pain research studies, the CI for an effect becomes wider, or less precise, because there is usually more variation of data within groups. Conversely, with large group sizes, the CI tends to become more narrow and precise, and one can have more confidence that the reported effect size is accurate.
CIs also suggest the statistical significance of the effect measure. If the CI for the effect crosses the null value (the line of no effect, such as 0 or 1 depending on the measure), then the effect is not statistically significant; that is, chance cannot be ruled out as an explanation. Here are some examples…
A correlation coefficient r=0.25, 95% CI –0.05 to 0.45, includes the 0.0 null value for this statistic, so it is not statistically significant and may be only a chance finding.
An RR=1.30 with a 95% CI of 0.90 to 1.70 would not be statistically significant since the range includes the null value of 1.00.
A Standardized Mean Difference, or d score = 0.45, 95% CI 0.05 to 0.85, would be a statistically significant effect; however, the CI ranges from negligible to large in effect size, demonstrating how statistical significance has its limits for assessing the importance of an effect measure.
The last example demonstrates how, without knowing the range of possible outcomes as represented by the CI, effect size data can be misleading; that is, if only the d score point estimate of 0.45 were reported, readers would assume a statistically significant medium effect size, whereas, the true effect might be of trivial clinical importance. Regrettably, authors frequently do not include CIs for effect measures and, in such cases, readers should have doubts about the reliability of the data.
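When authors report a d score and group sizes but no CI, an approximate interval can be estimated. The Python sketch below uses a common large-sample approximation for the standard error of d; the function names are hypothetical, and the group size of 50 subjects each is an assumption chosen only for illustration.

```python
import math
from statistics import NormalDist


def d_confidence_interval(d: float, n1: int, n2: int,
                          level: float = 0.95) -> tuple[float, float]:
    """Approximate CI for Cohen's d using a large-sample standard error."""
    se = math.sqrt((n1 + n2) / (n1 * n2) + d**2 / (2 * (n1 + n2)))
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    return d - z * se, d + z * se


def crosses_null(low: float, high: float, null_value: float = 0.0) -> bool:
    """True if the interval includes the null value (not statistically significant)."""
    return low <= null_value <= high


low, high = d_confidence_interval(0.45, 50, 50)
print(f"d = 0.45, 95% CI {low:.2f} to {high:.2f}, "
      f"crosses null: {crosses_null(low, high)}")
```

With 50 subjects per group, a d score of 0.45 has an approximate 95% CI of 0.05 to 0.85, the same wide range described in the example above: statistically significant, yet compatible with anything from a negligible to a large effect.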
Considering Effect Sizes to Clarify Outcomes
As suggested above, researchers have many statistical approaches for presenting study data, and some of these may skirt around tough questions regarding the validity and relevance of outcomes for treating patients. Sometimes effect-size measures are not calculated at all by authors or, if present, they are not interpreted or discussed in a clinically meaningful context. The burden of data interpretation and questioning the validity of outcomes then falls on critical readers, else they run the risk of being misinformed.
In some cases, readers will find effect-size data in the above Table of assistance. In lieu of that, readers can at the least become more skeptical about whether the data presented by report authors are telling a complete and unbiased story. Here are some examples.
This first example demonstrates how statistical significance was used to support the validity of study conclusions; whereas, effect sizes (not calculated by the report authors) tell a different story.
Example 1: A large-scale investigation assessed the healthcare-utilization burden imposed by long-term opioid analgesic users and whether nonadherence with prescribed opioid regimens, as determined by urine drug testing (UDT), increases healthcare costs in these patients [discussed in UPDATE here]. Reporting mean differences between groups, but not standardized effect sizes, the authors concluded that, (a) opioid users incur significantly greater healthcare utilization as compared with nonusers (P<0.001), and (b) patients who were nonadherent incurred significantly higher healthcare costs (P=0.036).
There were many biases inherent in this research and a close examination of the data, including performing simple calculations of effect sizes (which were not provided by the authors), raises important questions challenging the validity of this study:
- First, most of the difference in healthcare utilization between groups was associated with ambulatory/outpatient care visits, which had a very large effect size (Cohen’s d = 0.95). In contrast, effect sizes for emergency care and hospital admissions were 0.26 and 0.34, respectively, which signify small and probably immaterial differences between groups (see Table above). It might be expected that any patients with chronic pain would naturally visit their healthcare providers more frequently, whether or not they are being treated with opioids, so the validity of the first conclusion (a) is clinically doubtful.
- Secondly, for healthcare costs, the effect size for differences between opioid users and nonusers was only modest overall (d = 0.57) though statistically significant (P<0.001), and a small effect size distinguishing between adherent vs nonadherent opioid users was not clinically meaningful (d = 0.11) even though this also was statistically significant (P=0.036); therefore, the second conclusion (b) also is questionable.
This study is an example of researchers focusing on statistical significance, P-values, to “prove” their hypotheses of important differences between groups while neglecting to calculate and discuss effect sizes. And, having a large database from which to extract records (total N ≈ 100,000) resulted in small or moderate effect sizes that did achieve statistical significance (P<0.05), yet were of questionable value in supporting the clinical conclusions of the study. Readers would be unaware of these shortcomings and disparities unless they calculated and considered effect sizes on their own.
Close examination of another study suggests how effect-size data also must be considered in a qualitative sense, within a clinical context, for a more realistic interpretation of whether outcomes are of importance for patients.
Example 2: Researchers enrolled 80 patients with painful irritable bowel syndrome (IBS) in a novel study to test placebo effects [discussed in UPDATE here]. The experimental group (N=37) was given placebo pills and was explicitly informed that the “medication” was a placebo, while the control group (N=43) had the same quality of interaction with the researchers but received no study medication or placebo for the disorder. On a 7-category IBS-Global Improvement Scale (IBS-GIS), placebo-group participants demonstrated a statistically significant improvement compared with controls (mean±SD: 5.0±1.5 vs 3.9±1.3; P=0.002). This also represented a large effect size between groups on this measure (Cohen’s d = 0.79), suggesting, as the authors concluded, “placebos administered without deception may be an effective treatment for IBS.”
Was this a clinically important finding? The numbers of subjects per group were barely adequate to detect even a large effect as being statistically significant, so one must question if this would still be valid in a larger study. However, of greatest importance, the absolute mean difference between groups on the IBS-GIS of about 1 point represented an average rating of “no change” in the control group compared with only “slightly improved” in the placebo-treated group; this vastly diminishes any clinical importance of the large effect size. There also were other limitations of this study; however, it illustrates how statistical significance and effect size must be considered within a clinical context to assess what might be of importance for actual patients.
Often, even when effect size measures or data for calculating these are not provided by authors, there are other clues that can be used to estimate the validity and potential clinical relevance of outcomes.
Example 3: An investigation [discussed in an UPDATE here] examined trends in opioid prescribing for chronic noncancer-related abdominal pain in the United States, using data from large national surveys collected from 1997 through 2008. The researchers divided the total 12-year time span into four 3-year increments for analyses (see Figure). The primary outcome measure was the probability of receiving an opioid during a visit, expressed as the proportion of visits for chronic abdominal pain in which an opioid analgesic was prescribed for the disorder.
On this measure there was more than a 2-fold increase, from an opioid being prescribed during 5.9% of visits in 1997-1999 to 12.2% of visits in 2006-2008. The Figure illustrates this statistically significant trend (P=0.03), with 95% Confidence Intervals (vertical error bars). Although the overall trend was statistically significant, was this a valid and clinically important change in prescribing patterns as the researchers claimed?
The greatest increases were between 1997-1999 and 2006-2008; however, the researchers do not calculate an effect size for this difference or provide necessary data for interested readers to do so themselves. Still, from a qualitative perspective, it can be observed that the 95% CIs in the Figure for those two time periods are somewhat wide and inclusive of each other — as represented by the lengthy and overlapping vertical error bars. This suggests the possibility that there could be only small, clinically negligible differences in opioid prescribing over time (even though the trend was statistically significant). As further corroboration of this likelihood, the researchers concede that their results were skewed by a single opioid agent, tramadol, included among the prescribed opioids. When they performed an analysis excluding tramadol the 12-year trend for increasing opioid prescriptions was no longer statistically significant.
Finally, researchers sometimes erroneously imply cause-effect relationships on the basis of small but statistically significant outcome measures of effect, when a broader look at the data might suggest otherwise.
Example 4: There has been considerable and ongoing debate about a potentially life-threatening cardiotoxicity of methadone, allegedly evidenced by prolongation of the QTc interval on the electrocardiogram (ECG) in association with increasing methadone dose. In research reports, various effect-size correlations of methadone dose and QTc prolongation have ranged from nonexistent to rather low, for example r = +0.01 to +0.37 [Schmittner and Krantz 2006; Peles et al. 2007]. Many of these small effect sizes for r have been found to be statistically significant (P<0.05); however, this does not necessarily denote clinical significance, or a cause-effect relationship, in actually harming patients.
Still, some authors have falsely argued that, because the correlation coefficients (no matter how small) are statistically significant, the relationship of methadone dose to long QTc must be of clinical significance as well. Whereas, in truth, they should be acknowledging that in a statistical sense the associations were not likely due to chance alone, but the correlations were probably still too small to be considered of much clinical consequence.
Interestingly, one of the strongest and statistically significant correlations between methadone dose and QTc prolongation — r = +0.40, P = 0.03 — was still a small-to-medium effect size, at best, and occurred specifically in a subgroup of patients who also were abusing cocaine, which is well known to be cardiotoxic. There was practically no relationship at all in those methadone-treated patients who were not abusing cocaine: r = +0.04, P = 0.4 [Peles et al. 2007].
In the final analysis, both quantitative and qualitative factors must be taken into account to assess clinical significance of a treatment or intervention [Dworkin et al. 2009]. Even when quantitative effect measures are lacking, critical readers can still attempt to examine research evidence from qualitative perspectives to help gauge validity and clinical importance of the outcomes for actual patients with pain. Most crucially, readers should not be led astray by data presentations in research reports that convey more of an illusion of clinical significance than a reality.
A listing of this entire series on “Making Sense of Pain Research,” including a consolidated document in MS Word format, is available [here].
> Chou R, Qaseem A, Snow V, et al. Diagnosis and treatment of low back pain: A joint clinical practice guideline from the American College of Physicians and the American Pain Society. Ann Intern Med. 2007;147(7):478-491 [available here].
> Cochrane Collaboration (Higgins JPT, Green S, eds.). Cochrane Handbook for Systematic Reviews of Interventions; Ver 5.1.0. 2011(Mar) [available here].
> Cohen J. A Power Primer. Psych Bull. 1992;112(1):155-159 [article PDF here].
> DeCoster J. Conversion between d, r, f, odds ratio, eta-squared, and AUC. 2009 [Excel worksheet here].
> Dennis ML, Lennox RI, Foss M. Power Analysis Worksheet. 2009 [Excel worksheet here].
> Durlak JA. How to Select, Calculate, and Interpret Effect Sizes. J Ped Psych. 2009;34(9):917-928 [article PDF here].
> Dworkin RH, Turk DC, McDermott MP, et al. Interpreting the clinical importance of group differences in chronic pain clinical trials: IMMPACT recommendations. Pain. 2009;146:238-244 [accessible here].
> Dworkin RH, Turk DC, Wyrwich KW, et al. Interpreting the clinical importance of treatment outcomes in chronic pain clinical trials: IMMPACT recommendations. J Pain. 2008;9(2):105-121 [accessible here].
> Glantz SA. Primer of Biostatistics. 4th ed. New York, NY: McGraw-Hill; 1997.
> Huff D. How to Lie With Statistics. New York, NY: WW Norton; 1954: p. 58.
> Livingston EH, Elliot A, Hyman L, Cao J. Effect Size Estimation. Arch Surg. 2009;144(8):706-712 [abstract here].
> Miller G. Studying the Difference Between ‘Statistical’ and ‘Clinically Meaningful.’ Pain Medicine News. 2010(Sep);8(9) [article here].
> Peles E, Bodner G, Kreek MJ, Rados V, Adelson M. Corrected-QT intervals as related to methadone dose and serum level in methadone maintenance treatment (MMT) patients - a cross-sectional study. Addiction. 2007(Feb);102(2):289-300.
> Schmittner J, Krantz MJ. QTc prolongation in methadone maintenance: fact and fiction. Heroin Addict Relat Clin Probl. 2006;8(4):41-52 [article in journal PDF here].
> Seres JL. The fallacy of using 50% pain relief as the standard for satisfactory pain treatment outcome. Pain Forum. 1999;8(4):183-188.