Part 8 — Precision, Accuracy, & Significance of Mean Values
Research outcomes are commonly reported as average, or mean, values; however, averages can be misleading, so it is important to examine them in terms of their precision and accuracy. A better understanding of statistical measures such as the Standard Deviation (SD), Standard Error of the Mean (SEM), and Standardized Mean Difference (SMD) is essential for assessing the validity and clinical significance of pain research data.
As a reminder, a mean (average) value is calculated by adding up all of the individual data in a group for a particular measure — eg, pain scores, global improvement ratings, etc. — and dividing by the number of data-points that were added. Outcomes in pain research are often presented as means, but there are several considerations:
- A study group of any size is merely a small representation of the overall patient population that might be of concern; so, at best, the group mean on a particular outcome measure is merely an estimation of what the true population mean might be.
- Individual data or scores of group participants might be widely scattered over a scale of measurement — high, low, or in between — and their mean alone does not reveal the data variability or extreme differences between subjects.
- The mean suggests a single point estimate of effect representing a hypothetical average patient who might not actually exist in everyday practice.
Therefore, a mean value may or may not be a good indicator of individual patient performance on the measure, and it is important to know the precision and accuracy of the mean when assessing clinical significance of the outcome in question. Two statistical parameters assist in this task: the Standard Deviation (SD) and the Standard Error of the Mean (SEM). These help to describe the “spread” of data: the SD portrays the precision of the mean and the SEM characterizes its accuracy. Both are calculated statistically from data collected during a study, are often reported by researchers in their articles, and can be very misleading unless readers understand how to interpret them.
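The point about a hypothetical average patient can be illustrated with a minimal Python sketch. The six scores below are invented purely for illustration (they are not from any study), and the "within 1 point" cutoff is an arbitrary assumption.

```python
from statistics import mean

# Hypothetical 0-10 pain scores, invented for illustration only.
scores = [1, 1, 2, 8, 9, 9]

group_mean = mean(scores)

# Patients whose own score lies within 1 point of the group mean
# (the 1-point cutoff is an arbitrary choice for this sketch).
near_mean = [s for s in scores if abs(s - group_mean) <= 1]

print(f"mean = {group_mean:.1f}")
print(f"patients within 1 point of the mean: {len(near_mean)} of {len(scores)}")
```

In this made-up group the mean is 5.0, yet no individual patient scored anywhere near 5.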
Targeting the Standard Deviation (SD)
The SD reflects the variation among individual data collected on a particular measure. A large SD is like shooting at a target with a shotgun, where the pellets splay out; however, individual placements on the target might still average out to a bull’s-eye. A small SD, by contrast, is more like shooting with a rifle, where all shots are clustered closer together in one area of the target, whether or not it is at the center bull’s-eye. The smaller the SD the better, because it suggests that all of the shots, or data, are more tightly grouped around the mean; that is, the mean is a more precise or true measure of individual performance.
For example, consider two sample groups of patients with pain scores recorded on a 0-10 scale after treatment. The mean score for both groups is 5.0; however, the scores for Sample A range from only 4 to 6, whereas the scores in Sample B are much more variable, ranging from 2 to 9. Accordingly, in this case the SD for Sample A is only ±0.80, whereas the SD for Sample B is three-fold greater, or about ±2.45. Even without knowing the individual scores, which is usually the case in research articles, we know from the SDs that for Sample A the mean of 5 is a much more precise portrayal of the pain-score outcome; that is, the mean ± SD, or 5.0±0.80, indicates relatively little variation in the scores around the mean.
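As a rough sketch of how such SDs arise, the Python snippet below computes the mean and sample SD for two sets of scores. The individual scores and group sizes (14 and 12) are assumptions invented here so that both groups average 5.0 with ranges like those described; the article does not give the underlying data.

```python
from statistics import mean, stdev

# Hypothetical scores, invented so that both samples average 5.0.
sample_a = [4, 4, 4, 4, 5, 5, 5, 5, 5, 5, 6, 6, 6, 6]  # tightly grouped ("rifle")
sample_b = [2, 2, 3, 3, 4, 4, 5, 5, 7, 8, 8, 9]        # widely scattered ("shotgun")

for name, scores in (("A", sample_a), ("B", sample_b)):
    # stdev() is the sample SD, which divides by n - 1.
    print(f"Sample {name}: mean = {mean(scores):.2f}, "
          f"SD = {stdev(scores):.2f}, "
          f"range = {min(scores)} to {max(scores)}")
```

With these invented scores the SDs come out at about 0.78 and 2.45, close to the values quoted in the example.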
In sum, the SD estimates the potential spread or variability of data around the mean and, in a normal distribution, almost all of the data (>99%) would be expected to fall within three Standard Deviations above or below the mean (±3 × SD). When SDs among the outcome means of groups being compared in the same study are vastly different from each other, one might suspect that something is amiss, as if the groups represent vastly different populations (eg, a shotgun versus rifle mismatch in the target-shooting analogy above).
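The ">99% within three SDs" figure can be checked with a short Python sketch using the standard library's NormalDist. It assumes an idealized normal distribution, which real pain scores on a bounded 0-10 scale only approximate; the helper name share_within is introduced here for illustration.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0.0, sigma=1.0)

def share_within(k: float) -> float:
    """Fraction of a normal distribution lying within k SDs of its mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"within ±{k} SD: {share_within(k):.1%}")
```

Under that assumption, roughly 68% of values fall within one SD, 95% within two, and 99.7% within three.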
Beware the Standard Error of the Mean (SEM)
The Standard Error of the Mean (SEM) — sometimes just referred to as the Standard Error, or SE — represents the hypothetical spread that mean values themselves would have if the researchers kept collecting measurements from same-sized samples drawn from the same population. In our analogy above, the SEM would depict the variation of mean scores accumulated from multiple rounds of target shooting, with the same shooter using the same weapon and the same number of bullets each time.
Therefore, in a research study, the SEM suggests how accurate the observed mean might be in depicting the true population mean on the measure or outcome in question. In a sense, the SEM is like a Standard Deviation of the means, portraying estimated variation of sample means.
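The idea of the SEM as an "SD of the means" can be sketched with a small simulation in Python. Everything here is an assumption made for illustration: the population mean and SD, the sample size of 12, the number of repeated "studies," and the use of an unbounded normal distribution rather than a true 0-10 scale.

```python
import random
from statistics import mean, stdev

random.seed(0)  # fixed seed so the sketch is repeatable

POP_MEAN, POP_SD = 5.0, 2.45  # assumed population values
N = 12                        # assumed size of each sample
ROUNDS = 5000                 # number of simulated repeat studies

# Draw many same-sized samples and keep only each sample's mean.
sample_means = [
    mean(random.gauss(POP_MEAN, POP_SD) for _ in range(N))
    for _ in range(ROUNDS)
]

observed_spread = stdev(sample_means)  # SD of the collected means
predicted_sem = POP_SD / N ** 0.5      # SD divided by the square root of N

print(f"SD of the sample means: {observed_spread:.2f}")
print(f"SD / sqrt(N):           {predicted_sem:.2f}")
```

The two printed values should agree closely, which is why the SEM can be estimated from a single study without actually repeating it.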
Continuing the example above, the SEM for Sample A is ±0.21; whereas, the SEM for Sample B is a larger ±0.70. This tells us that the mean for Sample A is a much more accurate depiction of an average value for the group. However, this can be rather misleading, since SEMs are typically small numbers; less than 1 point on a 10-point scale in this example. So, taken by itself, even the larger 0.70 SEM for Sample B might be misinterpreted as suggesting good accuracy of the mean.
Some authorities believe that means reported in articles always should be accompanied by SEMs to indicate their accuracy. And when comparing two means, they argue that the respective SEMs give an idea of whether there is a statistically significant difference between the means. However, this is largely erroneous.
For one thing, SEMs are usually deceptively small values, and data expressed as the mean ± SEM for each of two means can imply significant differences that do not actually exist (see also graphical explanation below). Showing SDs with means, by contrast, conveys a better idea of the magnitude of the differences, or lack thereof, between the means based on variability of the individual data within each group.
Confusion of SD, SEM, and Confidence Intervals
As noted above, the SEM suggests the accuracy of the mean, and the SD connotes the variability of individual observations and the precision of the mean. The two are statistically related, with the SEM always being smaller than the SD for any sample of more than one observation, and much smaller in samples of typical size: SD = SEM × √N (where N is the sample size); conversely, SEM = SD/√N. The much more helpful measure is the SD, which can be calculated from the SEM and sample size if necessary.
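The two conversions can be written as a pair of small Python helpers. The function names are introduced here for illustration, and the sample size of 12 in the usage lines is an assumption, since the article does not state the group sizes.

```python
from math import sqrt

def sem_from_sd(sd: float, n: int) -> float:
    """SEM = SD / sqrt(N)."""
    return sd / sqrt(n)

def sd_from_sem(sem: float, n: int) -> float:
    """SD = SEM x sqrt(N)."""
    return sem * sqrt(n)

# Usage with an assumed sample size of 12:
sem = sem_from_sd(2.45, 12)
print(f"SEM = {sem:.2f}")                          # about 0.71
print(f"back to SD = {sd_from_sem(sem, 12):.2f}")  # 2.45
```

When an article reports only mean ± SEM and the group size, sd_from_sem recovers the more informative SD.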
Furthermore, SDs are used to calculate another vital statistic — Confidence Intervals, or CIs, which were discussed in Part 5 of this series [here]. Essentially, the CI indicates a range of values within which the true mean is likely to reside a certain percentage of the time — usually 95%. The CI not only indicates if an individual mean is statistically significant, but it will suggest whether two means being compared are significantly different from each other. It is an extremely useful statistical measure.
From our example above, each sample had a mean=5.0 and here are the respective SD, SEM, and 95% CI:
Sample A: SD=±0.80; SEM=±0.21; 95% CI is ±0.45, or 4.55 to 5.45.
Sample B: SD=±2.45; SEM=±0.70; 95% CI is ±1.40, or 3.60 to 6.40.
Several relationships are worth noting: 1) the spread of the CIs around each mean is more telling of the differences in variation of the two sample means than the SDs or SEMs; 2) in each case, the SD is a much larger value than the respective CI, so the SD should not be thought of as a substitute for the CI; 3) the CI value in each sample is roughly twice the SEM value, which can be helpful for estimating CIs when only SEMs are denoted or shown on a graph in a report. NOTE: These relationships are only valid for group sizes >10.
Therefore, probably the most helpful application of SEMs is converting them to approximations of the respective 95% CIs: CI ≈ mean ± (2 × SEM). In turn, this can help to determine statistical significance of the respective means [see Part 5 in this series for a further explanation].
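The rule of thumb can be sketched as a small Python function. The name approx_ci95 is introduced here for illustration, and the multiplier of 2 is the article's approximation; the exact multiplier depends on sample size and is somewhat larger than 2 for small groups.

```python
def approx_ci95(mean_value: float, sem: float) -> tuple[float, float]:
    """Approximate 95% CI as mean ± 2 x SEM (a rule of thumb, not exact)."""
    half_width = 2 * sem
    return mean_value - half_width, mean_value + half_width

# Means and SEMs taken from the Sample A / Sample B example.
for name, sem in (("A", 0.21), ("B", 0.70)):
    low, high = approx_ci95(5.0, sem)
    print(f"Sample {name}: approx. 95% CI {low:.2f} to {high:.2f}")
```

For Sample B this reproduces the 3.60 to 6.40 interval; for Sample A it gives 4.58 to 5.42, slightly narrower than the ±0.45 quoted above, which is consistent with the doubling being only an approximation.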
As a further example, consider the graph at left depicting mean values (dots) for two groups on a measure, such as pain scores, at 4 time points. Depending on what the “error bars” (the vertical lines extending above/below each mean) represent, the interpretation of data changes dramatically.
If the error bars depict SEMs, the means “appear” to be significantly different at each time point. However, if the SEMs are roughly converted to CIs (by visually doubling the length of each SEM error bar) there would be extensive overlap of the bars indicating the two groups are NOT significantly different. Therefore, SEMs can falsely make unimportant differences in mean data seem of some consequence.
On the other hand, if the error bars depict SDs or CIs the two groups actually ARE significantly different at each time point, since there is very little if any overlap of the error bars. In some cases, authors do not label their graphs to indicate what the error bars represent, and this can create confusion in data interpretation; eg, readers might mistakenly believe that SEMs are CIs or SDs and that significant differences exist when they do not.
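The visual check described above can be mimicked in a few lines of Python. The means and SEMs below are hypothetical values for a single time point (the graph's actual numbers are not given), and checking for overlap is only the rough rule of thumb used in this article, not a formal significance test.

```python
def bars_overlap(mean_1: float, half_1: float,
                 mean_2: float, half_2: float) -> bool:
    """True if the intervals mean ± half-width overlap."""
    low_1, high_1 = mean_1 - half_1, mean_1 + half_1
    low_2, high_2 = mean_2 - half_2, mean_2 + half_2
    return low_1 <= high_2 and low_2 <= high_1

# Hypothetical means and SEMs for two groups at one time point.
m1, sem1 = 4.0, 0.4
m2, sem2 = 5.2, 0.5

# SEM bars: 3.6-4.4 versus 4.7-5.7
sem_overlap = bars_overlap(m1, sem1, m2, sem2)
# Doubled bars (approximate 95% CIs): 3.2-4.8 versus 4.2-6.2
ci_overlap = bars_overlap(m1, 2 * sem1, m2, 2 * sem2)

print(f"SEM bars overlap: {sem_overlap}")
print(f"approx. CI bars overlap: {ci_overlap}")
```

With these invented numbers the SEM bars stay clear of each other while the doubled bars overlap, which is the pattern that can make an unimportant difference look meaningful.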
What Do Differences Between Means Say About Significance?
Very often in pain research, investigators (and readers) are most interested in differences between groups; eg, how an experimental group compares with a placebo group on some outcome measure of importance. For this purpose, the mean value of the measure in one group is subtracted from the mean value of the other group.
But, what do the mean differences say about clinical significance of the outcome? How can mean differences be compared across different studies that examine the same outcome?
In response to these two questions, a measure of effect size called the Standardized Mean Difference, or SMD, may be utilized. This is the difference between two group means divided by the “pooled standard deviation” of the two groups, which, in simplest terms, is the sum of the two respective SDs divided by 2 (a simple averaging). However, if two group SDs are quite different from each other, and/or group sizes vary greatly, more sophisticated calculations might be performed [see Ellis 2009]. In brief, the SMD indicates the amount of difference between two means converted to a standardized unit of measure.
For example: say a study reports pain score outcomes (0-10 scale) with a treatment (Group 1) compared with placebo (Group 2); 30 subjects per group. Mean (±SD) for Group 1 might be 5.0 ± 2.45 compared with 3.0 ± 2.0 for Group 2, a statistically significant difference (P<0.01). The approximate SMD = (M1 – M2) / [(SD1 + SD2)/2] = 2.0/2.23 = 0.90, which is qualitatively a large and clinically significant effect size.
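The worked example can be reproduced with a short Python sketch. The function smd_simple follows the simple SD-averaging approach described above; smd_pooled is an added variant, not taken from the article, that uses the root-mean-square pooled SD commonly used for Cohen's d when group sizes are equal. Both function names are introduced here for illustration.

```python
from math import sqrt

def smd_simple(m1: float, sd1: float, m2: float, sd2: float) -> float:
    """Mean difference divided by the simple average of the two SDs."""
    return (m1 - m2) / ((sd1 + sd2) / 2)

def smd_pooled(m1: float, sd1: float, m2: float, sd2: float) -> float:
    """Mean difference divided by the root-mean-square pooled SD
    (assumes equal group sizes)."""
    return (m1 - m2) / sqrt((sd1 ** 2 + sd2 ** 2) / 2)

# Values from the example: Group 1 = 5.0 ± 2.45, Group 2 = 3.0 ± 2.0
print(f"simple average of SDs: {smd_simple(5.0, 2.45, 3.0, 2.0):.2f}")  # 0.90
print(f"pooled SD variant:     {smd_pooled(5.0, 2.45, 3.0, 2.0):.2f}")  # 0.89
```

When the two SDs are similar, as here, the two approaches give nearly the same answer.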
Converting differences between group means to standardized effect sizes in this way — by adjusting the difference for the variability of the measures (SDs) — also makes it possible to compare mean effects on the same outcome measure across studies. For example, via standardization, mean differences in pain scores in two studies of an experimental medication compared with placebo can be compared, even if one study used a 0-10 pain scale and the other a 0-100 scale, and/or if there was a much larger SD in one of the studies than in the other.
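A brief Python sketch shows why standardization helps when scales differ. Both studies, their labels, and all of their numbers are hypothetical, invented only to illustrate the comparison.

```python
def smd(m1: float, sd1: float, m2: float, sd2: float) -> float:
    """Mean difference divided by the simple average of the two SDs."""
    return (m1 - m2) / ((sd1 + sd2) / 2)

# Hypothetical results as (placebo mean, SD, treatment mean, SD).
study_x = (6.0, 2.0, 4.0, 2.0)      # pain measured on a 0-10 scale
study_y = (60.0, 25.0, 45.0, 25.0)  # pain measured on a 0-100 scale

for label, (m_pl, sd_pl, m_tx, sd_tx) in (("X", study_x), ("Y", study_y)):
    raw = m_pl - m_tx
    print(f"Study {label}: raw difference = {raw:.1f}, "
          f"SMD = {smd(m_pl, sd_pl, m_tx, sd_tx):.2f}")
```

The raw differences (2 points versus 15 points) cannot be compared directly, but the SMDs (1.00 versus 0.60) are on a common footing and suggest the larger effect was in the hypothetical Study X.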
The SMD effect size conveys a qualitative estimate of the importance of the difference between two means. A common representation of SMD is “Cohen’s d” [Cohen 1992], which suggests how big or small an effect size is in more clinically meaningful terms. An effect size d, or SMD, of about 0.20 is considered small (possibly clinically non-significant), about 0.50 is a medium effect, and about 0.80 or greater is considered large (and clinically significant).
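These benchmarks can be turned into a simple labeling helper, sketched below in Python. Treating 0.20, 0.50, and 0.80 as hard cutoffs, and calling anything below 0.20 "negligible," are simplifying assumptions made here; the benchmarks themselves are approximate ranges rather than sharp boundaries.

```python
def describe_effect(d: float) -> str:
    """Rough qualitative label for an SMD (Cohen's d).

    The hard cutoffs and the "negligible" label are assumptions
    of this sketch, not part of the benchmarks themselves.
    """
    size = abs(d)  # the sign only indicates direction
    if size >= 0.80:
        return "large"
    if size >= 0.50:
        return "medium"
    if size >= 0.20:
        return "small"
    return "negligible"

for d in (0.10, 0.25, 0.55, 0.90):
    print(f"d = {d:.2f}: {describe_effect(d)}")
```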
It is important to note that there is no direct relationship between a P-value and the magnitude of effect represented by the SMD [Durlak 2009]. That is, a difference between means can be statistically significant (eg, P≤0.05) yet be clinically unimportant if the respective SMD denotes a small-to-medium effect size. In some cases, research authors may report differences between means as being statistically significant, and allegedly important, but do not calculate or discuss the respective SMD effect sizes that might suggest the outcomes are not clinically significant. This sort of misleading bias in research reporting also was discussed in a prior Pain-Topics UPDATE [here].
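The disconnect between P-values and effect sizes can be sketched in Python. This uses a normal (z) approximation for two equal-sized groups rather than an exact t-test, and the effect sizes and group sizes are hypothetical; the function name two_sided_p is introduced here for illustration.

```python
from math import sqrt
from statistics import NormalDist

def two_sided_p(d: float, n_per_group: int) -> float:
    """Approximate two-sided P-value for an SMD of d with two equal groups.

    Uses a normal approximation (z = d x sqrt(n/2)), which is an
    assumption of this sketch; small samples call for a t-test.
    """
    z = abs(d) * sqrt(n_per_group / 2)
    return 2 * (1 - NormalDist().cdf(z))

# A tiny effect in a very large trial versus a large effect in a tiny one.
print(f"d = 0.10, n = 2000 per group: P = {two_sided_p(0.10, 2000):.4f}")
print(f"d = 0.90, n = 8 per group:    P = {two_sided_p(0.90, 8):.4f}")
```

Under these assumptions the negligible effect comes out statistically significant (P below 0.01) simply because the groups are large, while the large effect in the small trial does not reach P≤0.05.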
Easy-to-use effect size calculators — when mean values and their standard deviations are known — can be found [here] and [here]. So, interested readers can calculate SMDs themselves from data typically given in a report.
Summary Practice Pointers
Mean data, and differences between means, described in pain research reports can be either helpful or misleading depending on how they are presented and interpreted. All too often, merely by omission of key analyses or insufficient explanations by authors, weak evidence can be misconstrued as having greater significance than is justified. Therefore, critical consumers of pain research must be able to interpret the data on their own to determine clinical significance. Here is a recap of helpful pointers mentioned above:
- Some experts have suggested that research authors should always report confidence intervals and effect sizes, such as SMDs, for key outcome data, and some journals are requiring this [Durlak 2009] — but not in the pain field as yet.
- If only mean ± SEM data are given for two means being compared, or depicted in a graph, the SEM values can be multiplied by 2 to approximate the respective 95% CIs for better interpretation. That is, CI ≈ mean ± 2SEM.
- If only mean ± SD data are given, or depicted in a graph, and their values do not overlap (or overlap only slightly), it can be assumed that the means are significantly different statistically from each other (since SDs are larger than CIs).
- Beware of graphs, or data in tables, that are unclearly labeled or not labeled as to whether mean ± SEM, SD, or CI is being represented; at best, this suggests laziness by the authors (and editors) and, at worst, it might portend some sort of bias. One investigation found that 14% of the time research authors failed to report what the measure after the ± sign represented [ref in Altman and Bland 2005].
- Just because a difference between two means is reported as being statistically significant (in terms of P-value) does not mean that it represents a clinically important effect size.
- There are different types and variations of effect sizes that can be calculated from the same data. The presentation of SMD above, SMD = (M1 – M2) / [(SD1 + SD2)/2], approximating Cohen’s d, is one of the most common and easiest to understand approaches for determining the clinical significance of mean differences.
- If effect sizes are not provided in a pain research report, there also are easy-to-use online calculators available that readers can use to compute SMDs on their own (see above).
> Altman DG, Bland JM. Standard deviations and standard errors. BMJ. 2005;331:903 [access here].
> Cohen J. A Power Primer. Psych Bull. 1992;112(1):155-159 [article PDF here].
> Cumming G, Finch S. Inference by Eye: Confidence Intervals and How to Read Pictures of Data. Amer Psychologist. 2005(Feb-Mar);60(2):170-180 [PDF here].
> Durlak JA. How to select, calculate, and interpret effect sizes. J Ped Psych. 2009;34(9):917-928 [article PDF here].
> Dworkin RH, Turk DC, Katz NP, et al. Evidence-based clinical trial design for chronic pain pharmacotherapy: A blueprint for ACTION. Pain. 2011(Mar);152(3suppl):S107-S115 [access by subscription].
> Dworkin RH, Turk DC, McDermott MP, et al. Interpreting the clinical importance of group differences in chronic pain clinical trials: IMMPACT recommendations. Pain. 2009;146(3):238-244 [abstract].
> Dworkin RH, Turk DC, Wyrwich KW, et al. Interpreting the clinical importance of treatment outcomes in chronic pain clinical trials: IMMPACT recommendations. J Pain. 2008;9(2):105-121 [abstract].
> Ellis PD. Effect size equations. 2009; online [access here].
> Hopkins WG. A New View of Statistics: Mean+SD or Mean+SEM? SportSci.org. 2000; online [available here].