Significance Stars: A Cautionary Tale

As a quantitative researcher, you will inevitably face immense pressure to achieve “statistical significance.” The p-value, particularly the p < 0.05 threshold, is often treated as a bright line separating a successful study from a failed one. This binary thinking, however, masks the nuanced reality of statistical evidence and can lead to flawed scientific conclusions.

Consider a scenario depicted in a 2019 Nature article where two studies investigate the same phenomenon and find the exact same observed effect. Yet, one is deemed a “significant” study (low P value) while the other is labeled a “non-significant” study (high P value). How can this be? The apparent paradox is resolved by observing that the “non-significant” study has a much wider confidence interval, which overlaps the “no effect” line. This difference in precision, not the underlying effect, drives the contradictory statistical conclusions and reveals a deep-seated misunderstanding in how we interpret evidence. Moving beyond a simplistic p < 0.05 mindset requires a more robust understanding of two related concepts: statistical power and the different dimensions of significance.
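To see the arithmetic behind this paradox, here is a minimal sketch in Python. The numbers are hypothetical (not taken from the Nature article): both studies report the same point estimate, and only the standard error differs.

```python
# Two hypothetical studies with the identical observed effect but different
# precision. Only the standard error changes, yet one lands below p = 0.05
# and the other does not.
from scipy import stats

effect = 0.40                          # same point estimate in both studies
studies = {"precise study": 0.15,      # hypothetical standard errors
           "noisy study": 0.30}

for label, se in studies.items():
    z = effect / se
    p = 2 * stats.norm.sf(abs(z))                      # two-sided p-value
    lo, hi = effect - 1.96 * se, effect + 1.96 * se    # 95% confidence interval
    print(f"{label}: estimate={effect:.2f}, "
          f"95% CI=({lo:.2f}, {hi:.2f}), p={p:.3f}")
```

The wider interval of the noisy study crosses zero, so its p-value exceeds 0.05 even though the estimated effect is identical.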


A “Non-Significant” Finding Is Not Proof of No Effect

Takeaway: Failing to reject the null hypothesis does not mean the null is true—it may simply mean your test lacked power.

When a researcher fails to reject the null hypothesis, it is tempting to conclude that there is no effect. However, this interpretation ignores a critical detail. In this situation, there are two possible states of the world: either the null hypothesis is true (there is genuinely no effect), or the null hypothesis is false, but the test had insufficient statistical power to detect the effect that truly exists.

To illustrate this distinction, consider a hypothetical scenario:

A company is being sued for racial discrimination in hiring. A judge orders them to conduct a hypothesis test where the null hypothesis is that the hiring rates for applicants from different racial groups are equal. The company analyzes a very small sample size—for instance, 10 hiring records from each group—and fails to reject the null hypothesis. They then declare to the judge, “Look, we don’t have any racial discrimination!”

This claim is fundamentally flawed. The test’s small sample size means it had very low power. It was unlikely to detect even a substantial difference in hiring rates if one existed. A failure to reject the null in a low-power study is not “evidence of absence”; it is an absence of evidence powerful enough to detect the effect.
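A quick simulation makes the power problem concrete. Assume, purely for illustration, that the true hiring rates differ by 20 percentage points, and ask how often a test run on 10 records per group would actually detect that gap.

```python
# Hypothetical simulation of the hiring example: even with a real 20-point
# gap in hiring rates, a test on 10 records per group rarely reaches p < 0.05.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_per_group, n_sims = 10, 5_000
p_a, p_b = 0.50, 0.30            # assumed true hiring rates (a substantial gap)

rejections = 0
for _ in range(n_sims):
    hires_a = rng.binomial(n_per_group, p_a)
    hires_b = rng.binomial(n_per_group, p_b)
    table = [[hires_a, n_per_group - hires_a],
             [hires_b, n_per_group - hires_b]]
    _, p_value = stats.fisher_exact(table)
    rejections += p_value < 0.05

print(f"Power at n = 10 per group: {rejections / n_sims:.2f}")
# Far below the conventional 0.80 target, so "failing to reject" says little.
```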


Power Is Your Lens for Detecting an Effect

Takeaway: Statistical power determines whether your study can detect an effect if one exists—design accordingly.

Statistical power is the probability that a test rejects the null hypothesis when the null is in fact false, evaluated at a specific true effect size. Think of it as the sensitivity of your experiment: its ability to detect an effect if an effect is actually there. Power is not just a post-hoc calculation; it is a crucial component of study design. A power analysis helps answer the critical question: “Which effects will I be able to detect with high probability at my chosen sample size n?”

The relationship between statistical power, effect size, and sample size is fundamental to rigorous research design. Power increases dramatically as the hypothesized effect size moves away from zero. A study with a sample size of 500 has far less power to detect a small effect than a study with 10,000 subjects. This relationship underscores a critical reality: small effect sizes require very large sample sizes to be detected reliably.
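As a sketch of what a prospective power analysis looks like, the snippet below uses statsmodels to ask how many subjects per group a two-sample t-test needs to reach 80% power at alpha = 0.05. The effect sizes are illustrative assumptions (Cohen's d), not values from any particular study.

```python
# Prospective power analysis: required sample size per group for 80% power
# at alpha = 0.05, across a few illustrative standardized effect sizes.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
for d in (0.1, 0.3, 0.5):  # hypothetical small, moderate, and large effects
    n = analysis.solve_power(effect_size=d, alpha=0.05, power=0.80,
                             alternative="two-sided")
    print(f"Cohen's d = {d}: roughly {n:.0f} subjects per group")
```

The pattern matches the point above: halving the effect size roughly quadruples the sample size required to detect it.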


There Are Three Kinds of Significance—And They Are Not the Same

Takeaway: Statistical, causal, and practical significance are distinct—a result can have one without the others.

The word “significant” is often used imprecisely, leading to confusion. To interpret research findings correctly, it is essential to distinguish between three distinct types of significance. A result can possess one type without possessing the others.

  1. Statistical significance: We can reject the null hypothesis of no effect. For example, in the social pressure experiment on voter turnout, the difference between the “Neighbors” treatment and the control group was statistically significant, with a p-value near zero.

  2. Causal significance: We can interpret our estimated difference as a causal effect. The social pressure experiment had a strong randomized design, which allows researchers to claim that the observed effect was causal.

  3. Practical significance: The estimated effect is large enough in magnitude to be meaningful in the real world. The 8.1 percentage point increase in voter turnout from the “Neighbors” treatment is a massive effect in the context of political campaigns, making it highly practically significant.

These three dimensions can come apart. With an extremely large sample size, you might find a statistically significant effect that is too small in magnitude to have any practical importance. Conversely, an experiment might reveal a practically meaningful effect that fails to reach statistical significance because the sample size is too small. Causal significance, meanwhile, depends entirely on the study design (e.g., a randomized controlled trial) and cannot be inferred from a p-value alone.
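A small simulation illustrates the first of these gaps. With an assumed effect of 0.01 standard deviations and a million observations per arm (both numbers hypothetical), the p-value is vanishingly small even though the effect is practically negligible.

```python
# Statistical vs. practical significance: a huge sample makes a trivial
# difference in means "highly significant." All numbers are hypothetical.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 1_000_000
control = rng.normal(loc=0.00, scale=1.0, size=n)
treated = rng.normal(loc=0.01, scale=1.0, size=n)   # a 0.01 SD true effect

t_stat, p_value = stats.ttest_ind(treated, control)
print(f"difference in means: {treated.mean() - control.mean():.4f}")
print(f"p-value: {p_value:.2e}")   # tiny p-value, practically trivial effect
```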


The Quest for Significance Can Corrupt Science

Takeaway: The pressure to publish significant results incentivizes p-hacking and publication bias—distorting the scientific record.

The intense focus on achieving statistical significance can incentivize poor scientific practices. One of the most well-known is p-hacking, which involves performing numerous statistical analyses and selectively reporting those that yield a desirable p-value (e.g., below 0.05). The famous XKCD comic about jelly beans causing acne provides an intuitive illustration of this problem: if you test enough different hypotheses, one of them is bound to be “significant” by random chance alone.
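A simulation version of the jelly-bean comic makes the arithmetic explicit: run 20 tests in which the null hypothesis is true by construction and count how often at least one of them clears p < 0.05. The number of tests and the sample sizes are arbitrary choices for illustration.

```python
# The jelly-bean problem: with 20 independent tests of true null hypotheses,
# at least one "significant" result appears most of the time by chance alone.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n_sims, n_tests, n = 2_000, 20, 100
runs_with_false_positive = 0

for _ in range(n_sims):
    p_values = []
    for _ in range(n_tests):
        a = rng.normal(size=n)   # both groups drawn from the same distribution,
        b = rng.normal(size=n)   # so every null hypothesis is true
        _, p = stats.ttest_ind(a, b)
        p_values.append(p)
    runs_with_false_positive += min(p_values) < 0.05

print(f"At least one 'significant' result: {runs_with_false_positive / n_sims:.0%}")
# Close to the theoretical 1 - 0.95**20, about 64%.
```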

A related and equally damaging issue is publication bias: the tendency for studies with statistically significant results to be published while those with null or contradictory results are not. The “Chart of the Decade” from Kaplan and Irvin (2015) visualizes this phenomenon starkly. It shows a notable shift in the distribution of published study outcomes after the year 2000, when pre-registration of primary outcomes became a requirement on ClinicalTrials.gov. The subsequent increase in published null results suggests that, before this change, studies with null or even harmful results were less likely to be published, skewing the scientific record toward positive findings.


Toward a More Thoughtful Practice

A responsible and effective research practice involves moving beyond the simple pursuit of significance stars. The concepts of power and the different facets of significance provide a framework for a more nuanced and honest approach to quantitative analysis. Robust research requires designing studies with adequate power to find meaningful effects and interpreting the results through the multi-faceted lens of statistical, causal, and practical significance. Encouragingly, academic norms are beginning to evolve. Major journals across disciplines, from the Journal of Marketing to The Economic Journal, are changing their reporting policies to de-emphasize arbitrary thresholds and significance stars in favor of more comprehensive reporting. As these norms evolve, how can we, as individual researchers, better align our daily statistical practices with our ultimate goal of discovering the truth?



