Statistical Significance: A Survey of the Literature for the Pragmatic Data Scientist

“Objectivity, though in practice as unattainable as infinity, is useful in the same way, at least as a fixed point of theoretical reference. A knowledge of one’s own subjectivity is necessary in order even to contemplate the ‘objective’… Terms such as ‘neutral’, ‘detached’, let alone ‘fair-minded’, ‘disinterested’ or ‘even-handed’ do not all convey the same meaning; they are merely aestheticized forms of the same subjective aspiration.”

– Christopher Hitchens, Why Orwell Matters[1]

The debate surrounding the use of null hypothesis significance testing (NHST)[2] is not just an academic matter. The use of NHST, and p-values in particular, may be contributing to the replication crisis in academia[3], but it also presents problems for model evaluation worthy of consideration for the pragmatic data scientist working in increasingly model-driven businesses like The General®. 

The rapid adoption of statistical methods applied to large datasets in recent years has allowed businesses to make evidenced-based decisions; however, according to the American Statistical Association’s (ASA)[4] 2016 statement clarifying the correct use of p-values, the rapid adoption raises concerns on the nature of

…conclusions drawn from research data. The validity of scientific conclusions, including their reproducibility, depends on more than statistical methods themselves. Appropriately chosen techniques, properly conducted analyses and correct interpretation [emphasis added] also play a key role in ensuring that conclusions are sound and that uncertainty surrounding them is represented properly.[5]

In its statement, the ASA provided the following six principles for the use of p-values, reproduced verbatim:

  1. P-values can indicate how incompatible the data are with a specified statistical model.

  2. P-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone.

  3. Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold.

  4. Proper inference requires full reporting and transparency.

  5. A p-value, or statistical significance, does not measure the size of an effect or the importance of a result.

  6. By itself, a p-value does not provide a good measure of evidence regarding a model or hypothesis.

The ASA’s statement on best practices was the first of its kind for the 180-year-old organization, and it is warranted: many studies indicate that misuse of p-values is commonplace. A 2019 analysis in Nature of 791 articles from five leading journals found that in 51% of them the authors mistakenly interpreted non-significance as meaning “no effect”.[6] Nuijten et al. (2016) used the R package statcheck [7] to analyze 250,000 p-values reported in psychology journals from 1985 to 2013 and found that half of the articles contained at least one p-value inconsistent with its reported test statistic and degrees of freedom. Furthermore, Nuijten et al. found that reporting errors were more frequent when studies indicated statistical significance than when none was indicated, suggesting a “systematic bias in favor of significant results.”[8]

Errors in reporting are bound to complicate replication. In the Reproducibility Project, 270 psychologists sought to replicate results from 100 psychology studies published over a 15-year period in the top journals Science and, ironically, Nature (given that publication’s 2019 analysis of p-values mentioned above). They were able to replicate only two-thirds of the results, often only partially and typically with weaker effects than the original studies reported.[9]

The rise of NHST from the 1930s didn’t begin to draw significant criticism until the 1970s, and even then the tide didn’t begin to swell until the 1990s. Currently, with calls for the ban of p-values increasing, critics appear to have momentum on their side. In the spirit of not letting prevailing wisdom go unexamined—as it did for nearly four decades after Fisher’s insight—we’ll attempt an “objective” survey of the literature and offer our view on the correct use of NHST in guiding business decision making. 

Brief History Lesson (i.e., “How We Came to the Present Disagreement”)

Scotsman John Arbuthnot[10] is frequently credited with calculating the first p-value in 1710. In An Argument for Divine Providence, taken from the Constant Regularity observed in the Births of both Sexes[11], Arbuthnot examined the male-female birth ratio in London christening records over the 82-year period from 1629 to 1710. In every year, male births exceeded female births, and given his prior assumption of an equal birth rate, Arbuthnot calculated the probability of his observation to be about 1 in 4.836e24, attributing the ultimate cause of the high male ratio to an evolutionary benefit intended by a “provident Nature, by the Disposal of its wise Creator.”[12] French scholar Pierre-Simon Laplace[13] later investigated the same question with a parametric test using the binomial distribution, though unlike Arbuthnot, Laplace did not state a cause for the difference in birth ratios.
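
Arbuthnot’s argument maps onto what we would now call a one-sided sign test. The short Python sketch below is ours, not Arbuthnot’s, and simply reproduces the order of magnitude of his probability under his stated assumption of an equal birth rate:

    from scipy.stats import binomtest  # SciPy >= 1.7

    # Under the null hypothesis of equal birth rates, the probability that
    # male christenings exceed female christenings in all 82 observed years
    # is (1/2)^82.
    n_years = 82
    print(f"(1/2)^82 = {0.5 ** n_years:.3e}")  # ~2.07e-25, i.e., about 1 in 4.836e24

    # The same figure falls out of a one-sided exact binomial (sign) test:
    result = binomtest(k=n_years, n=n_years, p=0.5, alternative="greater")
    print(f"one-sided sign test p-value = {result.pvalue:.3e}")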

While the logic of hypothesis testing was evident in these works, the formal introduction of p-values is attributed to Karl Pearson[14], though it was popularized by Sir Ronald Fisher in his Lady Tasting Tea experiment.[15],[16] Fisher’s approach prevailed over the following decades; indeed, for “the next 50 years almost all theoretical statisticians were completely parameter bound, paying little or no heed to inference about observables.”[17]
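
The tea-tasting experiment itself is easy to reproduce with elementary combinatorics. A minimal sketch, assuming the standard design of eight cups (four milk-first, four tea-first) described in [15] and [16]:

    from math import comb

    # Eight cups: 4 milk-first, 4 tea-first; the lady must identify the 4 milk-first cups.
    # Under the null hypothesis of pure guessing, her count of correct picks is hypergeometric.
    total_ways = comb(8, 4)                 # 70 ways to choose 4 cups out of 8
    p_all_correct = 1 / total_ways          # probability of a perfect identification by chance
    p_three_or_more = (comb(4, 3) * comb(4, 1) + 1) / total_ways

    print(f"P(all 4 correct by chance) = {p_all_correct:.4f}")    # ~0.0143
    print(f"P(3+ correct by chance)    = {p_three_or_more:.4f}")  # ~0.2429

Only a perfect identification (about 1 chance in 70) falls below Fisher’s conventional 5 per cent level; three out of four does not.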

Interestingly, Fisher in a way anticipated the current debate, which revolves around two positions: that NHST is (1) used improperly, or (2) faulty by design. He himself saw the potential for misuse, warning that “a test of significance does not authorize us to make any statement about the hypothesis in question in terms of mathematical probability” and that such tests “contain no criterion for ‘accepting’ a hypothesis”; on the other hand, we cannot safely reject a hypothesis without knowing the priors, and significance without priors is the “flaw in our method.”[18]

However, in terms of design, Fisher saw the threshold for p-values—often seen by critics as the main “bug” due to its tendency to lead to binary decision making—as a feature of his experimental design:

It is convenient to draw the line at about the level at which we can say: “Either there is something in the treatment, or a coincidence has occurred such as does not occur more than once in twenty trials.” …. If one in twenty does not seem high enough odds, we may, if we prefer it, draw the line at one in fifty (the 2 per cent point) . . . Personally, the writer prefers to set a low standard of significance at the 5 per cent point, and ignore entirely all results which fail to reach this level. A scientific fact should be regarded as experimentally established only if a properly designed experiment rarely fails to give this level of significance.[19]

To be sure, the following survey is not exhaustive, though we have attempted to offer a variety of perspectives.

In Defense of NHST

There are several strains running through the papers in the defense camp: one group seeks to reform current practice by adjusting how p-values are used (Benjamin and 71 co-authors, 2017); another argues that despite substantial flaws, p-values are better than the available alternatives (Verhulst, 2016). A third group believes p-values work well in theory (Savalei and Dunn, 2015; Hoover and Siegler, 2008), with misuse arising from human error; a final group holds that p-values are flawed in theory but remain a viable tool for the pragmatic statistician (Krueger, 2001).

Benjamin and 71 co-authors (2017)[20] suggest changing the threshold from p < 0.05 to p < 0.005 to address the “high rate of false positives even in the absence of other experimental, procedural and reporting problems.” Their recommendation is “restricted to claims of discovery of new effects” and is not intended to address the “appropriate threshold for confirmatory or contradictory replications of existing claims.”
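
The false-positive concern motivating Benjamin et al. can be illustrated with a standard false-positive-risk calculation. The sketch below is illustrative only; the prior probability of a true effect and the assumed power are our own choices, not figures from the paper:

    def false_positive_risk(alpha, power, prior_true):
        """Share of 'significant' findings that are false positives, given a
        significance threshold, study power, and the prior probability that
        a tested effect is real."""
        false_positives = alpha * (1 - prior_true)
        true_positives = power * prior_true
        return false_positives / (false_positives + true_positives)

    # Illustrative assumptions: 1 in 10 tested hypotheses is true, power = 0.8.
    for alpha in (0.05, 0.005):
        fpr = false_positive_risk(alpha=alpha, power=0.8, prior_true=0.1)
        print(f"alpha = {alpha:<6} -> false positive risk ~ {fpr:.0%}")

    # Under these assumptions, tightening the threshold from 0.05 to 0.005
    # cuts the false positive risk from roughly 36% to roughly 5%.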

Verhulst (2016)[21] concedes that criticisms of p-values are valid: they are over-relied upon, misunderstood, and easy to abuse. However, when used properly and with consideration given to effect size, they remain useful. The prevalence of p-values owes to the quick decision making they offer about the validity of methods, a use that violates ASA principle #6. The preference for simple solutions to complicated problems leads Verhulst to believe “alternative methods of hypothesis testing [referring to confidence intervals] will likely fall victim to the same criticisms currently leveled at p-values if more fundamental changes are not made in the research process.” Rather than replacing p-values with confidence intervals, Verhulst suggests “increasing the general level of statistical literacy and enhancing training in statistical methods to provide a potential avenue for identifying, correcting, and preventing erroneous conclusions from entering the academic literature.” Emphasizing education as a means of addressing the replication crisis is also supported by Leek and Peng (2015).[22]

Savalei and Dunn (2015)[23] also argue that there is no suitable replacement for NHST that is not fraught with more errors. Focusing on the replication crisis in psychology, Savalei and Dunn argue that the idea that NHST alone is causing the crisis is incorrect. They cite several analyses showing that studies relying on confidence intervals instead of p-values are prone to the same errors, owing to a similar reliance on “binary thinking” with both methods, the very behavior ASA principle #6 warns against with regard to p-values.[24],[25]

Savalei and Dunn also discuss a second alternative to p-values, Bayes Factors.[26] Such a move would shift from the binary thinking inherent in testing H1: μ > 0 to a more holistic method that asks, “is my hypothesis supported by the data?” They note that the use of Bayes Factors is not without problems: usage may lead to “group thinking” in journals, and “b-hacking is possible just as p-hacking is.” Until sufficient evidence indicates that abandoning p-values will “fundamentally change the credibility and replicability of psychological research,” Savalei and Dunn suggest a “return to shared core values by demanding rigorous empirical research before instituting major changes.”
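
For readers unfamiliar with Bayes Factors, the sketch below shows the basic idea on a toy coin-flip example (60 heads in 100 flips, with a uniform prior on the coin’s bias under H1). The example and its numbers are ours, not Savalei and Dunn’s:

    from scipy.integrate import quad
    from scipy.stats import binom

    k, n = 60, 100  # observed: 60 heads in 100 flips

    # H0: p = 0.5 exactly; its marginal likelihood is just the binomial pmf.
    marginal_h0 = binom.pmf(k, n, 0.5)

    # H1: p unknown, uniform prior on (0, 1); integrate the likelihood over the prior.
    marginal_h1, _ = quad(lambda p: binom.pmf(k, n, p), 0, 1)

    bf_10 = marginal_h1 / marginal_h0
    print(f"Bayes factor BF10 = {bf_10:.2f}")

    # BF10 answers "how much better does H1 predict the data than H0?" rather
    # than the reject/fail-to-reject question posed by NHST. Here BF10 is close
    # to 1 even though the one-sided p-value for 60 heads in 100 is below 0.05.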

Hoover and Siegler (2008)[27] offer a rebuttal to fellow economist Deirdre McCloskey, a fierce critic of the use of NHST in her profession. Hoover and Siegler reject McCloskey’s central claim that economists “systematically” mistake statistical significance for sizeable meaning (“oomph” in her words). Reviewing McCloskey’s own surveys, they find no evidence that economists by and large misunderstand the distinction between statistical and economic significance. They see the combination of a large effect and a lack of significance as a valid call for further investigation and data collection, and argue this is the position of economists in general. Hoover and Siegler outline the “proper” approach to statistical significance in line with the ASA principles, and demonstrate through analyses of top journals that, contrary to McCloskey’s claims, the “real sciences” such as physics and chemistry do rely on NHST (though their analysis also indicates that economics relies on it more than the other disciplines).

In a widely cited paper, Krueger (2001)[28] assesses NHST on both philosophical and statistical grounds. On philosophical grounds Krueger claims that a “logical analysis reveals irreparable limitations to NHST” (see his compelling reasoning on page 17); however, he concludes that NHST “rewards the pragmatic scientist,” who tends to focus on replication. To Krueger, NHST facilitates replication, and “successful replications push the null hypothesis forward toward improbability.” Replication need only be pursued to validate results at a high level, in a directional sense, rather than to reproduce exactly the same estimates. Differences in estimates are inevitable due to variance in experimental design, processes, assumptions, and samples. Krueger ultimately concludes that “given the pragmatic benefits of NHST and the lack of a superior alternative, an all-out ban seems unnecessary.”

In Opposition to NHST

Voices in opposition to NHST reflect several strains. The fiercest detractors, exemplified by Ziliak and McCloskey (2008)[29], seek a paradigm shift away from NHST entirely. McShane et al. (2019) stop short of seeking an outright ban but receive considerable support from their peers; Trafimow (2019) offers guidelines for academic journals; and Killeen (2005) proposes a metric to estimate the probability of replication as an alternative to p-values.

Ziliak and McCloskey (2008)[30] state that “for the past eighty-five years it appears that some of the sciences have made a mistake by basing decisions on statistical significance.” In their view, the prevalence of statistical significance has allowed “sizeless scientists” to prefer “precision” over “oomph”, that is, to prefer small, reliable effects over the sizeable effects that actually matter. The authors offer an analogy in which the preference for precision means choosing a weight-loss pill that reliably delivers small results within a narrow (predictable) range over one that reliably delivers large results with more variance. While the force with which the authors press their claim would have one believe they are lone voices in the wilderness, their claims are nearly identical to ASA principles #5 (significance is not importance) and #3 (decisions should not rest on a p-value threshold alone).

McShane et al. (2019)[31] discuss the problems with NHST and the alternatives proposed by Benjamin et al., but ultimately reject them. For instance, adjusting the p < 0.05 threshold for future research is rejected on the grounds that moving the goalposts does not solve the reproducibility problems of the existing body of work.

In the view of McShane et al., the use of NHST contributes to “erroneous scientific thinking,” in that

too often scientific conclusions are largely if not entirely based on whether or not a p-value crosses the 0.05 threshold instead of taking a more holistic view of the evidence that includes the consideration of the currently subordinate factors.

Such a practice violates ASA principle #3 (scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold). McShane et al. seek to “demote” p-values from their position as the dominant metric used for research in the biomedical and social sciences, and suggest they be used in a limited way alongside a variety of other tools and domain knowledge. This may be taken as a slight walk-back from their titular prescription, “Abandon Statistical Significance”. McShane et al.’s position was endorsed by over 800 signatories in a subsequent article in Nature.[32]

Trafimow (2019)[33] offers guidelines for how journal editors should evaluate submissions in a “Post p < 0.05 Universe.” Notable among these is Trafimow’s suggestion that editors tolerate some ambiguity in results, to reduce the incentive for researchers to produce results that are not only directionally certain but also offer precise point estimates. He further stresses the importance of “emphasizing thinking and execution, not results,” including placing importance on replication to get around “p-value thinking.” This could be done by publishing a distribution of p-values rather than a single p-value possibly reached only once after many experiments. Trafimow also suggests “increased transparency, disclosure of conflicts of interests, and rendering data accessible for others to perform analyses,” all compatible with the spirit of ASA principle #4 (proper inference requires full reporting and transparency).

Killeen (2005)[34] proposes an alternative to p-values, the probability of replication, Prep, also favored by NHST supporter Krueger (2001). Replication here is not intended to reproduce specific parameter estimates; Killeen instead defines “replication as an effect of the same sign as that found in the original experiment.” This view might not satisfy McCloskey or others concerned with effect size, but it appeals to psychologists “who rarely need parameter estimates,” being “more typically interested in whether a causal relationship exists between independent and dependent variables.” Killeen cites empirical evidence supporting the ability of Prep to predict replication; for example, Lorber (2004) concluded that 70% of 37 studies showed a negative correlation between heart rate and aggressive behavior patterns, and the median value of Prep calculated for those studies was .71.
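
Killeen derives Prep from the observed effect and its sampling error; for a one-tailed p-value this reduces, in our reading of the paper (so treat the details as an assumption), to Prep = Φ(z/√2), where z is the standard normal quantile of 1 − p. A minimal sketch:

    from scipy.stats import norm

    def p_rep(p_one_tailed):
        # Probability that an equally powered replication yields an effect of
        # the same sign, per our reading of Killeen (2005): Phi(z / sqrt(2)),
        # where z is the normal quantile of (1 - p).
        z = norm.ppf(1 - p_one_tailed)
        return norm.cdf(z / 2 ** 0.5)

    for p in (0.05, 0.01, 0.001):
        print(f"one-tailed p = {p:<6} -> Prep ~ {p_rep(p):.2f}")

    # ~0.88, ~0.95, and ~0.99 respectively: a one-tailed p of .05 corresponds to
    # roughly a 7-in-8 chance that a replication lands on the same side of zero.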

Summary

“Statistics is hard.”

– McShane et al. (2019)[35]

The vigorous NHST/p-value debate has helped shed light on problems contributing to the current replication crisis. In the process, the debate has caused research scientists to proceed with more caution when engaging in NHST, nudged journal editors to reevaluate publishing practices, and pushed those at the research frontier to discover and validate superior alternatives.

The ASA’s six principles clarifying the correct use of p-values receive support from defenders and detractors alike. In our assessment of the literature, we find consensus among the defenders and detractors in the following ways:

  1. There is no single clear adjustment to, or replacement for, p-values that does not bring with it a handful of new challenges. Those challenges will be similar to those with p-values, largely due to the presence of “binary thinking” and,

  2. No one is advocating that p-values be used as a binary check for validity; in fact, the papers show consensus on the opposite: statistics is not the place for black-and-white thinking, and model evaluation calls for a holistic attitude. Along these lines,

  3. No one is advocating that p-values be used in isolation, absent consideration of coefficients, confidence intervals, sample size, or domain knowledge. In our view, the studies show consensus that these are important considerations when interpreting p-values, which,

  4. According to no one, should be interpreted incorrectly. There is considerable consensus that more education on the correct role and use of NHST and p-values is warranted, and some see apprenticeships in the private sector as a means of achieving this goal.

  5. Transparency is a universal value, and, being vital to successful experimentation and replication, should be increased by both researchers and journal editors.

At The General® we use NHST and p-values in the spirit of the ASA’s principles and this limited consensus, not as stand-alone checks, but in a holistic way alongside a variety of other methods. We are less concerned with novel findings than are those at the research frontier, and focus on building practical models that stand up to rigorous validation, extensive testing, and replication. Variable selection and model output are heavily influenced not just by knowledge of statistics, but by the decades of auto insurance industry experience held by our senior data science team leaders and business owners throughout the company.

After extensive exploratory analysis, modeling, testing, and exposure of our work to the constructive criticism of domain experts, we deploy[36] and monitor our models.[37] If our models aren’t meeting the expectations set during the training phase, this cannot be ignored, and we must go back to the drawing board. In this environment, long-run significant errors are likely to have a limited lifespan, lending support to Krueger’s position that NHST and p-values, while fraught with errors both by design and often in use, can nonetheless remain a valuable tool for the pragmatic data scientist.


WORKS CITED

[1] Wikipedia contributors. “Why Orwell Matters.” Wikipedia, The Free Encyclopedia, February 26, 2019. Web. June 3, 2019.  https://en.wikipedia.org/wiki/Why_Orwell_Matters

[2] Wikipedia contributors. “Statistical hypothesis testing.” Wikipedia, The Free Encyclopedia. May 11, 2019. Web. June 3, 2019.  https://en.wikipedia.org/wiki/Statistical_hypothesis_testing

[3] Wikipedia contributors. “Replication crisis.” Wikipedia, The Free Encyclopedia. June 3, 2019. Web. June 3, 2019.  https://en.wikipedia.org/wiki/Replication_crisis

[4] American Statistical Association,  https://www.amstat.org

 [5] Ronald L. Wasserstein and Nicole A. Lazar, “The ASA’s Statement on p-Values: Context, Process, and Purpose,” The American Statistician, 2016.  https://amstat.tandfonline.com/doi/pdf/10.1080/00031305.2016.1154108?needAccess=true

[6] Valentin Amrhein et al., supplementary information to “Retire statistical significance,” Nature, 2019.  https://www.nature.com/magazine-assets/d41586-019-00857-9/data-and-list-of-co-signatories

 [7] Statcheck, library documentation, https://www.rdocumentation.org/packages/statcheck/versions/1.3.0/topics/statcheck

 [8] Michèle B Nuijten et al. “The prevalence of statistical reporting errors in psychology (1985-2013)” Behavior research methods, 2016.  https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5101263/

 [9] Eric Loken, “Confronting the Replication Crises Could Lead to Better Science,” Undark, April 9, 2019.  https://undark.org/2019/04/09/confronting-the-replication-crisis/

[10] Wikipedia contributors. “John Arbuthnot.” Wikipedia, The Free Encyclopedia, May 17, 2019. Web. June 3, 2019. https://en.wikipedia.org/wiki/John_Arbuthnot

 [11] John Arbuthnot, “An argument for Divine Providence, taken from the constant regularity observed in the births of both sexes,” Philosophical Transactions of the Royal Society of London, 1710.  https://www.york.ac.uk/depts/maths/histstat/arbuthnot.pdf

 [12] John Arbuthnot, “An argument for Divine Providence, taken from the constant regularity observed in the births of both sexes,” Philosophical Transactions of the Royal Society of London, 1710. https://www.york.ac.uk/depts/maths/histstat/arbuthnot.pdf

 [13] Wikipedia contributors. “Pierre-Simon Laplace.” Wikipedia, The Free Encyclopedia, May 24, 2019. Web. June 3, 2019. https://en.wikipedia.org/wiki/Pierre-Simon_Laplace

 [14] Wikipedia contributors. “Karl Pearson.” Wikipedia, The Free Encyclopedia, May 8, 2019, Web. June 3, 2019. https://en.wikipedia.org/wiki/Karl_Pearson

 [15] For a high-level explanation see: Wikipedia contributors. “Lady tasting tea.” Wikipedia, The Free Encyclopedia, March 27, 2019. Web. June 3, 2019.  https://en.wikipedia.org/wiki/Lady_tasting_tea

 [16] Section II below discusses the experiment: Sir Ronald A. Fisher, The Design of Experiments. New York: 1971.  https://www.phil.vt.edu/dmayo/PhilStatistics/b%20Fisher%20design%20of%20experiments.pdf

 [17] Geisser (1992) as quoted in: Peter R. Killeen, “An Alternative to Null-Hypothesis Significance Tests,” Psychological Science, May 16, 2005.  https://pdfs.semanticscholar.org/cbe7/8b8bc56440319c1c9724d2413868321c22c4.pdf

 [18] Peter R. Killeen, “An Alternative to Null-Hypothesis Significance Tests,” Psychological Science, May 16, 2005.  https://pdfs.semanticscholar.org/cbe7/8b8bc56440319c1c9724d2413868321c22c4.pdf

 [19] Stephen T. Ziliak and Deirdre N. McCloskey, The Cult of Statistical Significance: How the Standard Error Costs Us Jobs, Justice, and Lives, 2008.  https://www.deirdremccloskey.com/docs/jsm.pdf

[20] Daniel J. Benjamin et al., “Redefine statistical significance,” Nature Human Behaviour, September 1, 2017. https://www.nature.com/articles/s41562-017-0189-z

 [21] Brad Verhulst, “In Defense of P Values,” AANA Journal, 2016. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5375179/

 [22] Jeffrey T. Leek and Roger D. Peng, “Statistics: P values are just the tip of the iceberg,” Nature, Comment, April 28, 2015.  https://www.nature.com/news/statistics-p-values-are-just-the-tip-of-the-iceberg-1.17412

[23] Victoria Savalei and Elizabeth Dunn, “Is the call to abandon p-values the red herring of the replicability crisis?,” Frontiers in Psychology, March 6, 2015. https://www.frontiersin.org/articles/10.3389/fpsyg.2015.00245/full

[24] Hoekstra et al., in the lone experimental study, concluded there was no discernible reduction in binary thinking when confidence intervals were shown instead of p-values. See: Hoekstra, R., Johnson, A., and Kiers, H. A. L., “Confidence intervals make a difference: effects of showing confidence intervals on inferential reasoning,” Educ. Psychol. Meas., 2012, 72, 1039–1052. doi: 10.1177/0013164412450297

[25] Misinterpretation of confidence intervals could be as widespread as that of p-values, according to: Belia, S., Fidler, F., Williams, J., and Cumming, G., “Researchers misunderstand confidence intervals and standard error bars,” Psychol. Methods, 2005, 10, 389–396. doi: 10.1037/1082-989X.10.4.389

 [26] FelixS, “What does a Bayes factor feel like,” R-bloggers, January 5, 2015. https://www.r-bloggers.com/what-does-a-bayes-factor-feel-like/

 [27] Kevin D. Hoover and Mark V. Siegler, “Sound and fury: McCloskey and significance testing in economics,” Journal of Economic Methodology. March 2008. http://public.econ.duke.edu/~kdh9/Source%20Materials/Research/Sound%20and%20Fury%20Published%20Version.pdf

 [28] Joachim Krueger, “Null Hypothesis Significance Testing: On the Survival of a Flawed Method,” American Psychologist, January 2001.  http://files.clps.brown.edu/jkrueger/journal_articles/krueger-2001-null.pdf

 [29] Stephen T. Ziliak and Deirdre N. McCloskey, The Cult of Statistical Significance: How the Standard Error Costs Us Jobs, Justice, and Lives, 2008. https://www.deirdremccloskey.com/docs/jsm.pdf

 [30] Stephen T. Ziliak and Deirdre N. McCloskey, The Cult of Statistical Significance: How the Standard Error Costs Us Jobs, Justice, and Lives, 2008.  https://www.deirdremccloskey.com/docs/jsm.pdf

[31] Blakeley B. McShane, David Gal, Andrew Gelman, Christian Robert, and Jennifer Tackett, “Abandon Statistical Significance,” The American Statistician, 2019, Vol. 73, No. S1, 235-245. https://www.tandfonline.com/doi/full/10.1080/00031305.2018.1527253

 [32] Amrhein, Greenland, and McShane, “Scientists rise up against statistical significance,” Nature, March 20, 2019. https://www.nature.com/articles/d41586-019-00857-9

 [33] David Trafimow, “Five Nonobvious Changes in Editorial Practice for Editors and Reviewers to Consider When Evaluating Submissions in a Post p < 0.05 Universe,” The American Statistician, 2019.

 [34] Peter R. Killeen “An Alternative to Null-Hypothesis Significance Tests,” Psychological Science, May 16, 2005.  https://pdfs.semanticscholar.org/cbe7/8b8bc56440319c1c9724d2413868321c22c4.pdf

[35] Blakeley B. McShane, David Gal, Andrew Gelman, Christian Robert, and Jennifer Tackett, “Abandon Statistical Significance,” The American Statistician, 2019, Vol. 73, No. S1, 235-245. https://www.tandfonline.com/doi/full/10.1080/00031305.2018.1527253

 [36] Michael Kehayes, “Machine Learning Deployment – Part I: Awareness,” AutoRegressed, December 17, 2018. https://www.autoregressed.com/blog/machine-learning-deployment-part-i-awareness

 [37] Jack Pitts, “Day in the Life: Machine Learning Engineer,” AutoRegressed, April 22, 2019.  https://www.autoregressed.com/blog/day-in-the-life-machine-learning-engineer