Part 4 – The Problem with P-Values
There is a common misperception that “statistical significance” as denoted by low P-values signifies the strength and importance of research results, and is the hallmark of a “good” study. However, this is far from the truth and, in fact, many allegedly worthwhile research studies with statistically significant findings actually depict small differences or effects that are trivial or have little relevance for pain practice. In short, there are significant problems with P-values.
In preceding articles of this series [here] we noted that some experts have warned of how researchers’ uses of statistics often spawn faulty findings. And, as a saying often attributed to Albert Einstein goes, “Not everything that counts can be counted, and not everything that can be counted counts.” Along those same lines we would add: not everything that is statistically significant is significant. Understanding this is vital for becoming a more critical consumer of pain research.
Research Begins with a Negative: The Null Hypothesis
This discussion of “significance” was introduced in Part 1 of this series [here] in which we noted that all medical research is inherently imperfect and absolute certainty is an illusion. Most researchers appreciate that there is always a certain probability of unknown, chance, or random influences that might produce false or misleading results. This also recognizes that all things in nature are possible and may occur with or without outside intervention; for example, some pain conditions in some persons get better on their own, without treatment.
As a way of compensating for such serendipity, researchers start with a conservative but seemingly negative assumption that any differences or associations found in their outcome measures are due to random influences or chance and, in essence, there are no real differences or effects. This is known as the null hypothesis.
Obviously, researchers hope to demonstrate that important and valid differences, associations, or effects actually do exist and are of a sufficient size that the null hypothesis can be rejected — that the observed outcomes did not come about merely by chance or due to some sort of random fluke. However, there are two types of error that are of concern in the analysis of research outcomes:
- Concluding that there was a true effect — eg, saying a treatment made an important difference — when none actually existed (called a “false positive,” “alpha error,” or “Type I error”)
- Conversely, saying that any effects were due to chance — eg, the treatment failed to make a difference — when there actually were true effects (called a “false negative,” “beta error,” or “Type II error”).
Taming Type I Error & Finding Significance
Every researcher wants to be able to state that they found differences, associations, effects, or other measures in their study that were robust and important; that is, they were “significant.” The most common, but not necessarily the best, way to identify such findings is to use inferential statistical tests of significance. Such tests estimate how likely it is that a given outcome occurred by chance alone. The “inference” part comes from the fact that research looks at only a small sampling of an overall population that might be affected and, based on limited data (often expressed in average numbers), researchers predict or infer what the outcome might be in the much larger population.
An extensive array of such significance tests is available, especially with today’s computerized biostatistics programs: chi-square (χ²), t-tests, multivariate analyses of variance, logistic regression, correlation, and many more. Their proper application depends on the type of data being examined and, while these tests are sometimes used incorrectly by researchers, assessing their appropriateness and validity for testing the particular data is beyond the skills of most readers. We simply need to trust that the researchers (or, at least, their statistician consultants) knew what they were doing, and hope that they will explain in their report what they did and why.
All tests of significance have one thing in common: they generate P-values, or the probability of an outcome at least as extreme as the one observed occurring by chance alone if the null hypothesis is actually true. In other words: what are the chances of finding the particular result (the difference, effect, etc.) even if it is just a fluke? A P-value can be generated for practically any type of data and is a number ranging in size theoretically from 1.0 (ie, the finding is entirely consistent with chance) to infinitely small (ie, the finding would be extremely unlikely to arise by chance alone).
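To make the idea of "how often would chance alone produce this result" concrete, here is a minimal sketch of a permutation test, one of many possible significance tests and not necessarily the one any given study would use. The pain scores, the function name `permutation_p_value`, and its parameters are all hypothetical and invented for illustration.

```python
import random

def permutation_p_value(group_a, group_b, n_permutations=10_000, seed=0):
    """Estimate a two-sided P-value for a difference in group means.

    Group labels are shuffled repeatedly to mimic the null hypothesis
    (labels do not matter), and we count how often the shuffled difference
    is at least as large as the one actually observed.
    """
    rng = random.Random(seed)
    n_a = len(group_a)
    observed = abs(sum(group_a) / n_a - sum(group_b) / len(group_b))
    pooled = list(group_a) + list(group_b)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        mean_a = sum(pooled[:n_a]) / n_a
        mean_b = sum(pooled[n_a:]) / (len(pooled) - n_a)
        # Small tolerance guards against floating-point rounding.
        if abs(mean_a - mean_b) >= observed - 1e-12:
            extreme += 1
    # Add-one correction keeps the estimate from ever being exactly zero.
    return (extreme + 1) / (n_permutations + 1)

# Hypothetical pain scores on a 0-10 scale (not data from any real study).
treated = [3, 4, 2, 5, 3, 4, 2, 3]
placebo = [5, 6, 4, 6, 5, 7, 5, 4]
p = permutation_p_value(treated, placebo)
print(f"Estimated P-value: {p:.4f}")
```

A small P-value here says only that shuffled labels rarely reproduce a gap this large; it says nothing about whether a 2-point difference matters to patients.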
Researchers decide in advance what level of statistical significance (the threshold that the P-value must meet) is required to conclude that the finding is unlikely to have occurred by chance (so the null hypothesis can be rejected). For example, a standard minimum level of significance at or below P = 0.05 is often considered sufficient to reject the null hypothesis of no difference; however, in doing that, the researchers still accept that, when there truly is no effect, up to 5% of the time their conclusion will be in error, such as stating that a treatment had a statistically significant effect when it actually did not.
Stated from a more positive perspective, with a P = 0.05 level of significance, the researchers might feel 95% (1.0 minus 0.05) confident that their finding was not due merely to chance. If the researchers want to be more conservative, achieving more certainty, they might set the minimum lower, such as P ≤ 0.01; that is, only 1% or less of the time, or ≤1 time in 100, the finding might be erroneously stated as being significant when it is not.
Finding a low P-value for primary endpoint measures is often viewed as the most important sign of a “good” or “positive” study. However, as noted above, the P-value merely describes the risk of error in asserting that there was a real difference or effect when the result may have come about simply by chance or randomness.
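The 5% error rate described above can be checked with a small simulation. The sketch below is illustrative only: it assumes normally distributed outcomes with a known standard deviation of 1 and uses a simple z-test, and the function names and default settings are our own inventions. Every simulated "study" compares two groups drawn from the same population, so any "significant" result is by construction a Type I error.

```python
import math
import random

def two_sided_p_from_z(z):
    """Two-sided P-value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))

def false_positive_rate(n_studies=5000, n_per_group=30, alpha=0.05, seed=1):
    """Fraction of null studies (no true effect) declared significant."""
    rng = random.Random(seed)
    standard_error = math.sqrt(2 / n_per_group)  # known SD of 1 in each group
    hits = 0
    for _ in range(n_studies):
        group_a = [rng.gauss(0, 1) for _ in range(n_per_group)]
        group_b = [rng.gauss(0, 1) for _ in range(n_per_group)]
        diff = sum(group_a) / n_per_group - sum(group_b) / n_per_group
        if two_sided_p_from_z(diff / standard_error) <= alpha:
            hits += 1
    return hits / n_studies

rate = false_positive_rate()
print(f"Share of null studies reaching P <= 0.05: {rate:.3f}")
```

The share should land close to 0.05, which is what the significance threshold promises: it controls how often chance alone is mistaken for an effect, and nothing more.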
Significant Problems with P-Values
Despite the fact that P-values often are considered a central focus of published research reports, and indicators of strong and important evidence, this statistic is somewhat of a “best guesstimate” and actually does not tell us much about the data or validity of the findings. Furthermore, there are numerous ways in which P-values are misinterpreted or misleading:
- A major mistake is to assume that the smaller the P-value (eg, P = 0.0001) the stronger and more important the result. As described above, the P-value only addresses the question of whether the result probably came about by chance and, therefore, the null hypothesis of no real difference or effect should be accepted. It does not examine or indicate whether the result itself is of any particular magnitude of strength or importance. Whether the P-value is 0.05 or 0.001, a result might be deemed “statistically significant”; whether or not it is of any relevance or consequence for clinical practice or other purposes is another matter (judged by other types of statistics noted at the end of this article).
- While a P-value suggests the probability that a result might have occurred merely by chance, it does not tell which “bucket” the data actually fall into: the one due to chance or the one not due to chance. For example, a mean difference between groups with P = 0.01 tells us that a difference this large would be expected only about 1 time in 100 if chance alone were at work, but it remains possible that the value was a mere coincidence, and we do not know which is true.
- Statistical significance is an all-or-nothing proposition: either data reach or surpass the predetermined level (eg, P ≤ 0.05) or they do not — coming close does not count. Sometimes, when data go slightly over the mark (eg, P = 0.052), researchers will say that there was a “trend toward significance” to imply that there probably was a “good” result but it was weak; this is blatantly misleading and a violation of statistical principles.
- Calculations of P-values are very sensitive to sample size and can produce results that are irrelevant. Especially in studies with large sample sizes, differences between groups or data may be statistically significant but so small as to not be clinically meaningful.
Example: In a large study examining two groups totaling more than 54,000 persons, the mean [standard deviation] ages were 81.69 [7.11] and 81.96 [7.17] years, which were found to be significantly different (P < 0.001) despite the difference being only a few months. However, the Standardized Mean Difference, or SMD, in ages between groups — a better test of clinical significance — was only 0.038, which was of no importance. [SMD will be discussed in a future article.]
- On the other hand, small studies are quite prone to chance effects. For example, clinical trials with too few participants are highly subject to Type II errors (false negatives, or not finding statistically significant outcomes when they actually exist). Conversely, in small studies there can be participant selection biases that could skew data in a more positive direction than would normally occur, producing Type I error (false positive) findings.
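The age example in the list above can be roughly reproduced from the summary statistics alone. This is a sketch under stated assumptions: the source reports only that the two groups totaled more than 54,000 persons, so we assume two equal groups of 27,000, pool the standard deviations with equal weight, and use a large-sample normal approximation in place of whatever test the original study ran. The function names are hypothetical.

```python
import math

def standardized_mean_difference(mean_1, sd_1, mean_2, sd_2):
    """Difference in means divided by the pooled SD (equal weighting assumed)."""
    pooled_sd = math.sqrt((sd_1 ** 2 + sd_2 ** 2) / 2)
    return (mean_2 - mean_1) / pooled_sd

def two_sided_p_large_sample(mean_1, sd_1, n_1, mean_2, sd_2, n_2):
    """Two-sided P-value for a difference in means, normal approximation."""
    standard_error = math.sqrt(sd_1 ** 2 / n_1 + sd_2 ** 2 / n_2)
    z = (mean_2 - mean_1) / standard_error
    return math.erfc(abs(z) / math.sqrt(2))

# Means and SDs are from the example; group sizes of 27,000 are ASSUMED.
smd = standardized_mean_difference(81.69, 7.11, 81.96, 7.17)
p = two_sided_p_large_sample(81.69, 7.11, 27_000, 81.96, 7.17, 27_000)
print(f"SMD = {smd:.3f}, P = {p:.6f}")
```

Under these assumptions the P-value falls well below 0.001 while the SMD stays near 0.038, showing how a very large sample can make a negligible difference statistically significant.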
HARKing & Deceptions of Multiple P-Values
Particularly troublesome misrepresentations of data arise when multiple tests of significance are performed on the same set of data. This is becoming common in pain research, as investigators use large electronic databases from government (eg, public welfare beneficiaries) or private (eg, insurance plans, healthcare systems) sources to retrospectively explore hypotheses of interest involving multiple variables. We have previously cautioned about this sort of “data mining” approach [here], which sometimes amounts to a fishing expedition seeking to snag “significant” results that fit researchers’ agendas.
Problems arise because if researchers examine the same large body of data long and hard enough, “significant” results will inevitably be found, if by nothing other than chance. For example, it has been calculated that if 20 independent tests of significance are run on the same dataset there is a 64% chance that one or more significant results will be found at the P = 0.05 level even when no real effects exist [see Bland and Altman 1995]. And, while each individual test carries only a 1 in 20 chance of producing a spurious result, the researchers will still have some “significant” outcomes to discuss in their report.
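The 64% figure follows from simple arithmetic, provided the tests are independent and every null hypothesis is true: the chance that none of the tests is falsely significant is 0.95 multiplied by itself once per test. A minimal sketch (the function name is ours):

```python
def familywise_error_rate(n_tests, alpha=0.05):
    """Chance of at least one false positive among independent null tests."""
    return 1 - (1 - alpha) ** n_tests

for n_tests in (1, 5, 20, 100):
    rate = familywise_error_rate(n_tests)
    print(f"{n_tests:>3} tests: {rate:.2f}")
```

With 20 tests the result rounds to 0.64, and with 100 tests a false positive is nearly guaranteed.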
There are simple ways of correcting for this misrepresentation, although such corrections are rarely reported in the pain literature. One of these is the “Bonferroni correction,” which essentially says that when multiple comparisons are made using the same data a more stringent level of significance should be required. For this approach, the predetermined level of significance (eg, P = 0.05) is divided by the number of tests run on the data; eg, in the case of 20 tests of significance, the required level becomes P = 0.05/20, or 0.0025.
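The Bonferroni correction is easy to apply by hand or in a few lines of code. In the sketch below the function names and the list of P-values are hypothetical, chosen only to show how results that look significant at P = 0.05 can fail the stricter threshold.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance level after Bonferroni correction."""
    return alpha / n_tests

def significant_after_bonferroni(p_values, alpha=0.05):
    """Flag which P-values survive the corrected threshold."""
    threshold = bonferroni_threshold(alpha, len(p_values))
    return [p <= threshold for p in p_values]

# The article's example: 20 tests at an overall level of 0.05.
print(bonferroni_threshold(0.05, 20))

# Hypothetical P-values from four tests run on the same dataset.
p_values = [0.001, 0.02, 0.04, 0.3]
print(significant_after_bonferroni(p_values))
```

In the hypothetical four-test case the corrected threshold is 0.0125, so only the first result remains significant even though three of the four fall below 0.05.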
With large datasets of electronic medical records and computerized statistics programs it is easy to run hundreds of tests that produce interesting and statistically significant results. This may lead to what psychologist Norbert Kerr described as “HARKing”: Hypothesizing After the Results are Known. That is, researchers present outcomes as significant findings based on results that were discovered during their investigation, or post hoc, as if the results were the original object of their study all along and the hypotheses had been determined in advance (a priori) as should have been done.
Of course, researchers are not eager to admit to HARKing, and there is no way of detecting this since such investigations are not registered in advance, as are most prospective clinical trials these days. However, retrospective research on large databases lends itself to the fortuitous discovery of seemingly important outcomes that are worthy of reporting, even though the probability of Type I error (ie, false positives) is quite high in such cases. This may not bode well for the legitimacy of future research that exploits the ever-expanding databases of electronic healthcare records worldwide.
How Should “Significance” Be Judged?
In sum, statistical significance testing asks the question: “Is the result merely a random or chance outcome, or does it represent some real difference, effect, association, or other measure?” The P-value gives an estimate of the extent to which the outcome may or may not have been a chance finding. However, what healthcare providers really want to know is: “Given that the outcome probably did not come about purely by chance, does it have any meaning or significance for pain practice?”
Too often, researchers stop short in their analyses, seemingly satisfied when calculated P-values suggest their findings are statistically significant. In many cases this is tantamount to their saying something like, “We found a slight difference in treatment effects, but it did not come about merely by chance so it must be important.” They often do not go on to explain how their “statistically significant” findings actually have little clinical relevance or significance. Researchers and journal editors are loath to publish studies portraying lackluster outcomes, so the burden of determining clinical relevance often falls upon consumers of pain research literature.
Experts in evidence-based medicine have long advocated for the usefulness of measures other than P-values for determining clinical significance and for decision-making purposes. These include: confidence intervals, estimations of qualitative effect sizes, power analyses, absolute risk/benefit differences, and numbers-needed-to-treat (NNT). These concepts are often ignored in pain research reports, or not presented in a helpful context, and will be discussed in upcoming articles in this series on “Making Sense of Pain Research.”
> Bland JM, Altman DG. Multiple significance tests: the Bonferroni method. BMJ. 1995(Jan 21);310:170.
> Gaudia G. When Is a Difference Not a Difference: Medicine or Shooting Craps? Medscape Gen Med [online]. 2005;7(3) [available here].
> Glantz SA. Primer of Biostatistics. 4th ed. New York: McGraw-Hill; 1997.
> Greenhalgh T. Statistics for the non-statistician. II: "Significant" relations and their pitfalls. BMJ. 1997;315(7105) [available here].
> Guyatt GH, Sackett DL, Cook DJ. Users’ guides to the medical literature: II. How to use an article about therapy or prevention; B. What were the results and will they help me in caring for my patients? JAMA. 1994;271(1):59-63.
> Higgins JPT, Green S (eds). P values and statistical significance. Section 12.4.2. Cochrane Handbook for Systematic Reviews of Interventions, 2010 [available here].
> Kerr NL. HARKing: Hypothesizing After the Results are Known. Pers Soc Psychol Rev. 1998(Aug);2(3):196-217 [abstract here].
> Siegfried T. Odds Are, It’s Wrong. Science News. 2010(Mar 27);77(7):26 [here by subscription].
> Statistical Significance Terms. Medical University of South Carolina, 2001 [online here].
> Sterne JAC, Smith GD. Sifting the evidence – what’s wrong with significance tests? BMJ. 2001(Jan);322:226-231.
> Williams F. Reasoning With Statistics. New York: Holt; 1968.