Part 5 – Helping to Determine Clinical Significance
Whether or not outcomes of pain research studies are found to be statistically significant tells only part of the story, and a small part at that. What healthcare providers and their patients really need to know is whether the outcomes have importance for improving pain practice. And, for determining such clinical significance, understanding a statistical concept called the “Confidence Interval” can be critically helpful.
This fifth article in our series on “Making Sense of Pain Research” begins examining how data can be mathematically — that is, statistically — interpreted for providing answers to clinical questions in pain management. When many people hear the word “statistics” they want to hide, and even experienced readers of research often suffer from a certain amount of “statistiphobia.” However, it is not necessary to be a biostatistician to understand and apply basic statistical concepts for the interpretation of pain research; yet, it will require some careful study.
The prior UPDATE article in this series [Part 4 here] discussed P-values as estimates of how probable an outcome at least as extreme as the one observed would be if chance alone were at work; that is, if the hypothesis of no true difference or effect (the Null Hypothesis) were correct. However, a P-value provides only “yes” or “no” information about statistical significance, based on the single investigation being considered; it is of limited use for understanding and determining the possible clinical significance of the outcome, even if the P-value is statistically significant. Of greater use is the Confidence Interval, or CI.
CIs Defining the “Neighborhood”
A basic tenet of clinical research is that data gathered during any single investigation are only estimates, since they represent results from a small portion of the entire population of patients that might be affected. In short, the outcomes are but points along a continuum or range of possible outcomes if much larger populations were studied and/or the exact same research was repeated many times to determine its reliability.
Yet, no matter how many times the same study is repeated, slightly different outcomes will naturally emerge each time. So, then, what are the “true” or “real” results?
Any outcome measured during an investigation is only an approximation of the true result, so it is called a point estimate. This reminds us that, although the true value may reside somewhere in the “neighborhood” of the outcome measure, it is unlikely to be at that precise location. The larger neighborhood in which the true value is likely to reside includes the point estimate but is portrayed by a range of points — the Confidence Interval.
For example, imagine that the exact same research study can be repeated 100 times; where might the values for a particular point estimate fall? Would they all be the same? Highly unlikely. So, how wide would the range of point estimates be?
More specifically, suppose a clinical trial finds that a treatment improves the average, or mean, pain score by 40% (this is the point estimate). And, suppose that same study is repeated 99 more times (100 total), randomly selecting different subjects from the same overall population, and the point estimate ranges between 20% and 60%. We might feel confident, but we cannot be absolutely certain, that the “true” point estimate falls somewhere between 20% and 60%.
Since studies cannot be repeated so many times, calculation of the Confidence Interval, using data relating to the point estimate from a single study, provides a statistical solution for estimating the range of possibilities as to where the true value for an outcome might occur a certain percentage of the time.
Typically, researchers calculate the 95% CI, to infer where the point estimate might fall in 95 out of 100 of those repeated studies. In our hypothetical example above, the 95% CI might range from 25% to 55%, with the 40% point estimate (the original mean result found in the study) at its center (also see below).
In sum, the Confidence Interval around a point estimate in a study indicates the limits within which the “true” outcome measure is likely to lie (although we never know for certain), and hence the strength of the inferences that can be drawn from the result. A 95% CI represents a range that likely includes the true value of interest 95% of the time. Seldom will the true result be at the extremes of the range, and it might be outside the range only 5% of the time; that is, 2.5% of the time (1 chance in 40) it might be above the range and 2.5% of the time below it.
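As a rough illustration of how a CI is obtained from a single study, the sketch below computes a 95% CI for a mean. The 12 improvement scores are invented for this example (they are not from any study), the function name is hypothetical, and the calculation uses the simple normal approximation; a study this small would normally use the t distribution, which gives a somewhat wider interval.

```python
import math
from statistics import NormalDist, mean, stdev

def mean_confidence_interval(values, confidence=0.95):
    """Return (point_estimate, lower, upper) for the mean of `values`.

    Assumption: normal (z) approximation, used here only for simplicity.
    """
    n = len(values)
    point_estimate = mean(values)
    standard_error = stdev(values) / math.sqrt(n)
    # Two-sided critical value, about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    margin = z * standard_error
    return point_estimate, point_estimate - margin, point_estimate + margin

# Hypothetical percentage improvements in pain score for 12 participants.
improvements = [15, 62, 38, 41, 5, 70, 33, 48, 27, 55, 44, 42]
estimate, lower, upper = mean_confidence_interval(improvements)
print(f"mean improvement {estimate:.1f}%; 95% CI, {lower:.1f}% to {upper:.1f}%")
```

The interval is symmetrical around the point estimate because the same margin is subtracted from and added to the mean.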
In research articles, authors should report outcome results as point estimates together with values for the associated confidence intervals. The text might be similar to, “The mean improvement in pain scores due to the treatment was 40%; 95% CI, 25-55” (from the above example).
Graphic Visualization Tells the Tale
Graphically visualizing CIs by plotting them as lines to scale can make their interpretation easier, especially when comparing multiple outcomes. Consider what the figure below — with point-estimates (large dots) and CIs (horizontal lines) for a Placebo group compared with 3 Treatments (A, B, and C) — can tell about outcomes. These hypothetical data follow from the example above, with values representing percentage improvements in pain scores.
- The Placebo condition (mean, 10%; 95% CI, -3% to 23%) is not statistically significant because the confidence interval includes (ie, crosses) the vertical line of no effect, which in this case is 0%. A calculated P-value would be >0.05 and the Null Hypothesis of no effect due to Placebo would not be rejected. As often happens with placebo conditions, although the point estimate is a mean 10% improvement in pain scores, it also is possible that patients’ pain might actually worsen to a degree (a mean change as low as -3% lies within the 95% CI in this case).
- Treatment A is from our example above (mean, 40%; 95% CI, 25% to 55%). Its CI does not overlap the line of no effect or the CI of either the Placebo group or Treatment C, so it is a statistically significant outcome in terms of the Null Hypothesis (of no effect) and significantly greater in comparison with Placebo or Treatment C. The absolute mean difference between the A and Placebo point estimates is 30 percentage points (40% minus 10%), which might be judged as being a clinically significant improvement in pain compared with Placebo.
However, in terms of comparing the limits of their respective CIs, it also is possible within our 95% confidence ranges that the mean difference might only be 2% (25% lower limit of Treatment A minus 23% upper limit of Placebo). The overall CI ranges for both Placebo and Treatment A are somewhat wide, indicating less precision in their point estimates than might be desired (and, often is found in studies that have inadequate numbers of participants; see discussion below).
- Treatment B (mean, 28%; 95% CI, 22% to 34%) is significantly greater than Placebo statistically, even though the lower limit of its CI overlaps the upper limit of the Placebo CI by a tiny amount. NOTE: a general rule of thumb, described by Cumming and Finch, is that when two 95% CIs from independent groups overlap by no more than about half of their average margin of error (a margin of error being half the width of a CI), the difference between them is still statistically significant at the P ≤ 0.05 level; in such cases, however, calculated P-values for the respective data can confirm this and should be reported by authors.
Similarly, Treatment B is statistically different from and superior to Treatment C, as there is little overlap of their CIs. In contrast, the CI ranges for Treatments A and B largely overlap, so they are not statistically different from each other. However, it is important to note that the CI for Treatment B is narrower than those of all the other conditions, suggesting greater precision in its point estimate (eg, it is possibly a more accurate and “true” depiction of the outcome). Consequently, even though the point estimates for A and B are quite different, with A appearing to be the superior treatment, they might be clinically equivalent in importance relative to Placebo. Only further research would determine if A is truly better than B compared with Placebo.
- Treatment C (mean, 16%; 95% CI, 8% to 24%) is clearly comparable to Placebo, since each point estimate falls within the CI range of the other. Treatment C is statistically different from and inferior to Treatments A and B. In isolation, however, Treatment C represents a statistically significant effect, since its CI does not cross the 0% line of no effect, and one also might say that this result was unlikely to be a random or chance occurrence [also see the discussion in Part 4 of this series].
So, from examining the CIs and point estimates above, it appears that Treatment A is the superior alternative; however, there is some question as to whether it is truly better than Treatment B. Certainly, we would not have known that, or have as much information for gauging possible clinical effectiveness, if only mean point estimates and P-values were provided. It also is important to note that, while percentage data were used in the above examples, CIs can be calculated (and should be reported by authors) for almost any outcome measures, such as means, differences between means, proportions or ratios, correlations, numbers-needed-to-treat (NNTs), and others.
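For readers who prefer to check such a figure numerically, the following minimal sketch tabulates the four hypothetical conditions described above. It reports the width of each CI (a rough guide to precision) and whether the interval excludes the 0% line of no effect. The dictionary layout and the function name are illustrative choices only.

```python
# Point estimate, lower limit, and upper limit of the 95% CI for each
# condition in the hypothetical figure (percentage improvement in pain score).
conditions = {
    "Placebo": (10, -3, 23),
    "Treatment A": (40, 25, 55),
    "Treatment B": (28, 22, 34),
    "Treatment C": (16, 8, 24),
}
NO_EFFECT = 0  # line of no effect for a percentage improvement

def summarize(estimate, lower, upper, null_value=NO_EFFECT):
    """Return the CI width and whether the CI excludes the null value."""
    excludes_null = not (lower <= null_value <= upper)
    return {"width": upper - lower, "excludes_null": excludes_null}

summaries = {name: summarize(*values) for name, values in conditions.items()}
for name, s in summaries.items():
    verdict = "excludes 0%" if s["excludes_null"] else "crosses 0%"
    print(f"{name:12s} width {s['width']:2d} points, {verdict}")
```

Only the Placebo interval crosses 0%, and Treatment B has the narrowest interval, which matches the reading of the figure given above.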
It often is worth the trouble to plot point estimates and their related CIs from a research report to scale on graph paper. There also is a convenient template for Microsoft Excel that will plot CIs and even calculate CIs from data given in a report when CIs are not provided by the authors [download the ESCI JPP template here]. Research articles that do not provide CIs for important outcomes, and do not provide data for a reader to calculate them, are highly suspect and should be considered cautiously.
The Importance of Precision
To recap, a point estimate is the best guess of the magnitude and direction of an outcome measure, while the Confidence Interval expresses the uncertainty inherent in this estimate as a range of values within which we can be reasonably confident that the true or real effect actually lies. If the CI is relatively narrow, the outcome is known more precisely; conversely, for wider CIs the uncertainty about just where the true value may lie is greater.
In some cases, a CI may be so wide as to be almost meaningless, providing little or no information for interpreting the importance or validity of the result, even if it is statistically significant in terms of its P-value. This is often the case with small studies that enroll relatively few participants.
On the other hand, a very narrow CI suggests considerable accuracy in the result and confidence that the point estimate is a reasonable representation of the true value; however, it still needs to be interpreted within the context of the overall study. That is, a precise and statistically significant point estimate (such as a mean difference) might still represent such a small effect that it is not significant from a clinical perspective.
As noted above, the width of any Confidence Interval is highly dependent on sample sizes. Larger studies, with more participants, tend to give more precise estimates of outcomes (and hence much narrower CIs) than smaller studies. One reason is that all of the statistical formulas used to calculate CIs, and there are many variations of these, include the ‘n,’ or number of study participants, in their denominators. Hence, as the n increases in size, the CI usually decreases, and vice versa.
Technical note: For the benefit of statistics mavens we must acknowledge that study size is not the only influence on precision, and certain other parameters also used in the computations may affect CI widths. For continuous outcome measures, precision also depends on the variability in the outcome measurements (eg, standard deviation); for dichotomous outcomes it can also depend on the risk or frequency of the event; and, for time-to-event outcomes precision also depends on the number of events observed.
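To make the dependence on study size and variability concrete, here is a small sketch based on the common large-sample formula for the CI of a mean, in which the margin of error is the critical value times the standard deviation divided by the square root of n. The standard deviation of 30 and the sample sizes are assumptions chosen only for illustration.

```python
import math
from statistics import NormalDist

def ci_half_width(sd, n, confidence=0.95):
    """Margin of error for a mean: z * sd / sqrt(n) (normal approximation)."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return z * sd / math.sqrt(n)

ASSUMED_SD = 30.0  # hypothetical standard deviation of the improvement scores

for n in (10, 40, 160, 640):
    margin = ci_half_width(ASSUMED_SD, n)
    print(f"n = {n:3d}: 95% CI is the point estimate +/- {margin:.1f} points")
```

Because n sits under a square root, quadrupling the number of participants only halves the width of the interval, while a larger standard deviation widens it in direct proportion.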
A CI may be reported for any level of confidence, such as 80%, 90%, 95%, or 99%. As the confidence level increases, the CI widens. For example, the 99% CI for a given value would be very much wider than an 80% CI for the same value. The 95% level is considered standard; studies reporting lower levels will have narrower CIs that could deceptively portray important differences between results that might not really exist.
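The effect of the confidence level can be sketched with the running example of a 40% point estimate. The standard error of 7.65 is an assumption, picked only because it reproduces roughly the 25% to 55% interval at the 95% level; the normal approximation is assumed as well.

```python
from statistics import NormalDist

def interval(estimate, standard_error, confidence):
    """Return (lower, upper) using the normal approximation."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return estimate - z * standard_error, estimate + z * standard_error

ESTIMATE = 40.0
ASSUMED_STANDARD_ERROR = 7.65  # hypothetical: gives about 25% to 55% at 95%

for level in (0.80, 0.90, 0.95, 0.99):
    lower, upper = interval(ESTIMATE, ASSUMED_STANDARD_ERROR, level)
    print(f"{level:.0%} CI: {lower:.1f}% to {upper:.1f}%")
```

The same data yield a visibly narrower interval at the 80% level than at the 95% level, which is why the confidence level should always be checked before comparing CIs across reports.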
The 95% Confidence Interval is loosely interpreted as indicating a range within which we can be 95% certain that the true effect lies. The strictly-correct interpretation of this CI is based on the hypothetical notion that, if a study were repeated an infinite number of times and on each occasion a 95% CI is calculated, then 95% of these Confidence Intervals would contain the true effect.
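The repeated-study interpretation can be tried out with a small simulation. In this sketch the "true" mean improvement (40), the standard deviation (30), the study size (50), and the number of simulated studies are all assumptions, and the outcome is drawn from a normal distribution purely for convenience.

```python
import math
import random
from statistics import NormalDist, mean, stdev

def coverage(true_mean, sd, n, studies, confidence=0.95, seed=1):
    """Fraction of simulated studies whose CI contains the true mean."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    hits = 0
    for _ in range(studies):
        sample = [rng.gauss(true_mean, sd) for _ in range(n)]
        margin = z * stdev(sample) / math.sqrt(n)
        # The CI contains the true mean when the estimate is within the margin.
        if abs(mean(sample) - true_mean) <= margin:
            hits += 1
    return hits / studies

share = coverage(true_mean=40, sd=30, n=50, studies=2000)
print(f"{share:.1%} of the simulated 95% CIs contain the true mean")
```

The share should come out close to 95%, typically a little under because the normal approximation is slightly optimistic for a sample of 50.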
How Do CIs Relate to P-Values?
Confidence Intervals are related to P-values, but CIs provide more meaningful information for clinical decision-making purposes. Some authorities have even suggested that CIs could replace P-values, since, as we noted in Part 4 of this series [here], hypothesis testing using only P-values applies cut-off points for statistical significance that are arbitrary and do not portray a broader picture of the most accurate or true effects.
As portrayed in the graphic display of Confidence Intervals above, there is a logical correspondence between the CI and the P-value. The 95% CI for an effect will not cross the line of no effect or null value (such as a Risk Ratio of 1.0 or a mean value or difference of 0) if the test of statistical significance yields a P-value of less than 0.05. If the P-value is exactly 0.05, then one limit of the 95% CI (the upper or the lower, depending on which side of the null value the interval falls) will just touch the line of no effect.
What if two or more Confidence Intervals overlap to some extent? Generally, when two 95% CIs overlap by no more than about half of their average margin of error (a margin of error being half the width of a CI), the P-value for the difference is a significant 0.05 or less. This was the case in the graphic example above with Treatment B compared with Treatment C and Placebo. If the two CIs are just barely touching or there is even a small gap between their ends, the P-value is usually less than or equal to 0.01, as was the case with Treatment A compared with Treatment C and Placebo, above.
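This correspondence can be sketched by working backward from a reported CI to an approximate P-value. The function below is a hypothetical helper that assumes a symmetrical interval built with the normal approximation; for ratio measures the same idea would be applied on the logarithmic scale, which is not shown here.

```python
from statistics import NormalDist

def p_value_from_ci(estimate, lower, upper, null_value=0.0, confidence=0.95):
    """Approximate two-sided P-value implied by a symmetrical CI."""
    z_crit = NormalDist().inv_cdf(0.5 + confidence / 2)
    # The CI spans 2 * z_crit standard errors, so recover the standard error.
    standard_error = (upper - lower) / (2 * z_crit)
    z = (estimate - null_value) / standard_error
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Values from the hypothetical figure discussed earlier.
print(f"Placebo:     P = {p_value_from_ci(10, -3, 23):.3f}")
print(f"Treatment C: P = {p_value_from_ci(16, 8, 24):.5f}")
# An interval whose lower limit sits exactly on the null value.
print(f"Touching 0:  P = {p_value_from_ci(10, 0, 20):.3f}")
```

An interval that crosses 0 gives a P-value above 0.05, one that clears 0 gives a value below 0.05, and one that just touches 0 gives 0.05.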
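The overlap rule of thumb can be expressed as a "proportion overlap", following the approach of Cumming and Finch: the length of the overlap divided by the average margin of error (half-width) of the two intervals. The sketch below applies it to the hypothetical figure; the function name is an illustrative choice, and the rule is only an approximation intended for two independent groups whose intervals are of broadly similar width.

```python
def proportion_overlap(ci_a, ci_b):
    """Overlap of two CIs divided by their average margin of error.

    Each CI is a (lower, upper) pair. A negative result means there is
    a gap between the two intervals rather than an overlap.
    """
    (lo_a, hi_a), (lo_b, hi_b) = ci_a, ci_b
    overlap = min(hi_a, hi_b) - max(lo_a, lo_b)
    average_margin = ((hi_a - lo_a) / 2 + (hi_b - lo_b) / 2) / 2
    return overlap / average_margin

placebo, a, b, c = (-3, 23), (25, 55), (22, 34), (8, 24)
for label, pair in [("B vs C", (b, c)), ("B vs Placebo", (b, placebo)),
                    ("A vs Placebo", (a, placebo)), ("A vs B", (a, b))]:
    print(f"{label:13s} proportion overlap = {proportion_overlap(*pair):.2f}")
```

Values of about 0.5 or less suggest P ≤ 0.05, and zero or negative values (touching intervals or a gap) suggest P ≤ 0.01. The intervals for A and B differ considerably in width, so the rule is less reliable for that pair.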
Assessing Clinical Significance
Close inspection of Confidence Intervals can be invaluable for assessing the clinical significance and importance of an outcome, but this must be considered within appropriate context. For example, suppose a research article reports on 3 analgesic therapies expected to reduce the occurrence of a particular adverse event and the authors claim that a risk reduction of at least 5% would be clinically important. Several outcomes of their research are depicted in the graphic.
All 3 scenarios depict 95% CIs and report a mean improvement of 10% for each treatment (point estimate risk reduction) — considerably greater than the 5% reduction deemed clinically important. Scenario #1 has a tight CI ranging from 7% to 13%, and even the lower limit exceeds the desired 5%; so, we might confidently conclude that this treatment is probably beneficial.
In scenario #2, with the same 10% risk reduction point estimate, the CI more widely ranges from 2% to 18%. This raises some doubt about the clinical importance of the treatment, since there is a possibility that the real effect could be less than the desired 5%.
Scenario #3 depicts an even wider CI, perhaps from a much smaller group of participants than the other two. It includes the null value of a 0% difference, so the result is not statistically significant and, even though the point estimate is 10%, we cannot be confident that the treatment has any real effect whatsoever. However, there is an important question to ask: Would a larger trial of this treatment have shown better, more significant results?
Data can be illusive. If we are provided only the 10% point estimates, all 3 treatments would seem promising. If P-values also are provided, the P > 0.05 for treatment #3 would rule it out as a contender. The outcomes for treatments #1 and #2 are each statistically significant (P < 0.05) and it is only by comparing their respective CIs that we can conclude treatment #1 might be the more clinically prudent choice.
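The reasoning across the three scenarios can be written as a simple decision sketch. The limits for scenarios #1 and #2 are those given above; the limits for scenario #3 (-4% to 24%) are an assumption, since only its crossing of 0% is described. The function name and the wording of the categories are illustrative, and the function assumes that larger positive values mean more benefit.

```python
def interpret(lower, upper, minimal_important=5.0, null_value=0.0):
    """Classify a CI against the null value and a clinical threshold."""
    if lower <= null_value <= upper:
        return "not statistically significant"
    if lower >= minimal_important:
        return "significant; whole CI exceeds the clinical threshold"
    if upper < minimal_important:
        return "significant; whole CI falls below the clinical threshold"
    return "significant; clinical importance uncertain"

scenarios = {
    "#1": (7, 13),
    "#2": (2, 18),
    "#3": (-4, 24),  # hypothetical limits; the text only says the CI includes 0%
}
for name, (lower, upper) in scenarios.items():
    print(f"Scenario {name} (95% CI, {lower}% to {upper}%): {interpret(lower, upper)}")
```

All three scenarios share the same 10% point estimate, so the different verdicts come entirely from the intervals.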
In pain research reports, authors sometimes do not include 95% CIs for key outcomes and this is of some concern. Even worse is when researchers do not provide necessary data for a knowledgeable reader to calculate CIs, or they report CIs at lower confidence levels, such as 90% or 80%, which have deceptively narrower ranges. Such reports merit close inspection to question what the data are actually depicting and their validity.
A final point to note: Confidence Intervals are usually, but not always, symmetrical; that is, with the point estimate in the exact center of the CI range. However, there also can be asymmetry with the range on one side of the point estimate being wider than the other. This is particularly so for certain data that are constrained by upper or lower limits.
For example, the correlation coefficient, r, can only range from -1 to +1; so, the CI for a strong positive correlation, say r=+0.9, is constrained at the upper limit but not as much at the lower limit. Hence, the resulting 95% CI range for this +0.9 point estimate might be something like +0.5 to +0.95 with point estimate placement skewed toward the upper limit of the CI.
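One common way to obtain such an asymmetrical interval for a correlation is the Fisher z-transformation, sketched below. This is offered as one standard approach, not necessarily the method behind the example figures above. The sample size of 10 is an assumption, so the limits it produces will differ from the illustrative +0.5 to +0.95.

```python
import math
from statistics import NormalDist

def correlation_ci(r, n, confidence=0.95):
    """Approximate CI for a correlation using the Fisher z-transformation."""
    z_crit = NormalDist().inv_cdf(0.5 + confidence / 2)
    z = math.atanh(r)                      # transform r to an unbounded scale
    standard_error = 1 / math.sqrt(n - 3)  # approximate SE on that scale
    lower_z = z - z_crit * standard_error
    upper_z = z + z_crit * standard_error
    # Transform the limits back to the -1 to +1 correlation scale.
    return math.tanh(lower_z), math.tanh(upper_z)

ASSUMED_N = 10  # hypothetical number of participants
r = 0.9
lower, upper = correlation_ci(r, ASSUMED_N)
print(f"r = {r:+.2f}; 95% CI, {lower:+.2f} to {upper:+.2f}")
print(f"distance below r: {r - lower:.2f}; distance above r: {upper - r:.2f}")
```

The interval is symmetrical on the transformed scale but, once converted back, extends much further below the point estimate than above it, because the upper limit can never exceed +1.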
In sum, the clear presentation and discussion of CIs in pain research reports supports better interpretation and communication of outcomes. Confidence Intervals help to make the precision of study results understandable and their clinical significance more apparent. In many cases, however, CIs also reveal otherwise unknown weaknesses in results that challenge the external validity or clinical relevance of study outcomes and conclusions.
> Cumming G, Finch S. Inference by Eye: Confidence Intervals and How to Read Pictures of Data. Amer Psychologist. 2005(Feb-Mar);60(2):170-180 [PDF here].
> Cumming G. ESCI JPP (Exploratory Software for Confidence Intervals). Undated [available here].
> Cumming G. Inference by Eye: Pictures of Confidence Intervals and Thinking About Levels of Confidence. Teaching Statistics. 2007;29(3):170-180 [PDF here].
> Finch S, Cumming G. Putting research in context: Understanding confidence intervals from one or more studies. Journal of Pediatric Psychology. 2009;34(9):903-916 [PDF here].
> Fink A. How to Analyze Survey Data. Thousand Oaks, CA: Sage Publications; 1995.
> Greenhalgh T. How to read a paper: Statistics for the non-statistician, II — “Significant” relations and their pitfalls. BMJ. 1997;315(7105).
> Guyatt GH, Sackett DL, Cook DJ. Users’ guides to the medical literature: II. How to use an article about therapy or prevention; A. Are the results of the study valid? JAMA. 1993;270(21):2598-2601.
> Guyatt GH, Sackett DL, Cook DJ. Users’ guides to the medical literature: II. How to use an article about therapy or prevention; B. What were the results and will they help me in caring for my patients? JAMA. 1994;271(1):59-63.
> Higgins JPT, Green S (eds). Confidence Intervals. Section 12.4.1. Cochrane Handbook for Systematic Reviews of Interventions, 2010 [available here].
> Ingelfinger JA, Mosteller F, Thibodeau LA, Ware JH. Biostatistics in Clinical Medicine. 3rd ed. New York, NY. McGraw-Hill, Health Prof. Div.; 1994.
> Leavitt SB. EBAM (Evidence-Based Addiction Medicine) for Practitioners. Addiction Treatment Forum. March 2003 [PDF here].
> Statistical Significance Terms. Medical University of South Carolina, 2001 [online here].
> Sterne JAC, Smith GD. Sifting the evidence - what’s wrong with significance tests? BMJ. 2001;322:226-231.