The figure above shows a histogram of p-values across papers. Tests for p-hacking attempt to distinguish the shapes of such distributions (of which there are many, depending on the power of the tests and the true values of the parameters being tested) from the shapes that arise when results are p-hacked.
Empirical work usually allows multiple bites of the cherry: there are many choices, from data cleaning to model selection, estimator selection and decisions about what to report, that allow a researcher to present the results of a data analysis in the best possible light for the point they are trying to make. We worry about the costs of this. As Wasserstein and Lazar (2016, The American Statistician) point out, "Valid scientific conclusions based on p-values and related statistics cannot be drawn without at least knowing how many and which analyses were conducted, and how those analyses (including p-values) were selected for reporting."
In collaboration with Nikolay Kudrin and Kaspar Wüthrich, I have two papers addressing meta-tests that use p-values to assess the honesty, i.e. the absence of p-hacking, of a literature. The first of these, "Detecting p-Hacking" (published version, working paper), does three things. First, it makes clear the conditions under which, in the absence of p-hacking, the distribution of reported p-values (the p-curve) is monotonically non-increasing. Second, it goes further, showing that for popular tests (those based on test statistics that are asymptotically normal) p-curves are also continuous and satisfy upper bounds. Third, we suggest tests new to this literature that examine these features.
On the first of these, the result had been conjectured in previous papers and shown (mostly via Monte Carlo) for some special cases. Our results are not only very general (the tests essentially have to satisfy a condition akin to the monotone likelihood ratio property), but they also hold for any possible distribution of true effects being tested (these typically differ from study to study).
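To see the monotonicity result in a simple simulated setting, here is an illustrative sketch (not code from the paper; the mixture of effects, sample size and seed are made up): one-sided z-test p-values are drawn with true effects that differ across studies and then binned. Under no p-hacking the bin counts fall from left to right whatever mixture of non-negative effects is used.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Hypothetical mixture of true (standardized) effects across studies:
# 30% exact nulls, 70% drawn from |N(0, 2)| -- purely illustrative numbers.
n_studies = 100_000
h = np.where(rng.random(n_studies) < 0.3, 0.0,
             np.abs(rng.normal(0.0, 2.0, n_studies)))

# One-sided z-tests of H0: effect = 0 against a positive effect, no p-hacking.
z = rng.normal(loc=h, scale=1.0)
pvals = 1 - norm.cdf(z)

# Bin the p-curve: under no p-hacking the counts should fall (up to sampling
# noise) from the first bin to the last, whatever mixture of effects is used.
counts, _ = np.histogram(pvals, bins=np.linspace(0, 1, 21))
print(counts)
```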
The second set of results suggests new approaches to testing by restricting the set of possible p-curves under the null of no p-hacking. In particular, 'intuitive' tests have looked for humps in the p-curve near traditional significance levels, which would violate our conditions; however, other tests are likely to have more power, for example tests for discontinuities in the p-curve or for violations of the upper bounds.
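For intuition, consider the simplest case of a one-sided z-test with statistic $T \sim N(h, 1)$ and reported p-value $P = 1 - \Phi(T)$ (the paper's treatment is far more general, but this case shows where continuity, monotonicity and the bounds come from). For a study with true effect $h$, the density of the reported p-value is

$$
g(p; h) \;=\; \frac{\phi\!\big(\Phi^{-1}(1-p) - h\big)}{\phi\!\big(\Phi^{-1}(1-p)\big)}
\;=\; \exp\!\Big(h\,\Phi^{-1}(1-p) - \tfrac{h^2}{2}\Big).
$$

This is continuous in $p$ and, for any $h \ge 0$, non-increasing in $p$; averaging over any distribution of such effects preserves both properties. Moreover, for each fixed $p$ the exponent is maximized at $h = \Phi^{-1}(1-p)$, so the density can never exceed $\exp\{\Phi^{-1}(1-p)^2/2\}$, which gives the flavor of the upper-bound restrictions (the paper derives such bounds with much greater generality).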
The final set of results follows directly from the second. We re-purpose existing tests for discontinuities and suggest their use in this setting. We also consider tests of the monotone non-increasing property that examine the whole curve rather than only its behavior around particular significance levels.
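To make the ideas concrete, here are two crude stand-ins (deliberately simpler than the tests studied in the paper; the cutoff, window width and number of bins are arbitrary choices): a caliper-style binomial comparison of the mass just below and just above 0.05, and a bin-by-bin look for increases in the p-curve.

```python
import numpy as np
from scipy.stats import binomtest

def caliper_check(pvals, cutoff=0.05, width=0.005):
    """Crude discontinuity check: if the p-curve is continuous, a p-value that
    lands within `width` of the cutoff is roughly equally likely to fall on
    either side; a significant excess just below the cutoff suggests a jump."""
    below = int(np.sum((pvals >= cutoff - width) & (pvals < cutoff)))
    above = int(np.sum((pvals >= cutoff) & (pvals < cutoff + width)))
    return binomtest(below, n=below + above, p=0.5, alternative="greater").pvalue

def monotonicity_check(pvals, n_bins=20):
    """Crude global check: bin counts of a non-increasing p-curve should not
    rise from one bin to the next; report the largest standardized increase."""
    counts, _ = np.histogram(pvals, bins=np.linspace(0, 1, n_bins + 1))
    diffs = np.diff(counts.astype(float))
    se = np.sqrt(counts[1:] + counts[:-1])          # rough binomial-style scaling
    return np.max(diffs / np.where(se > 0, se, 1.0))  # large positive => suspicious
```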
A second paper, "The Power of Tests for Detecting p-Hacking" (working paper), examines when tests (both existing ones and ours) are actually able to detect p-hacking. To understand power, we need to understand what types of p-curves arise when researchers p-hack. To this end we consider four situations in which researchers might engage in this activity: selecting control variables in a linear regression, choosing between different sets of instruments in linear IV, selecting between (or combining) multiple independent datasets measuring the same thing, and choosing the window length in an estimator of a variance in a time-series setting.
Within each problem we consider two approaches to p-hacking: a 'thresholding' approach, in which the obvious model is reported if its p-value is below 5% and otherwise the researcher searches among the alternatives, and a 'minimum' approach that simply reports the smallest p-value found.
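The following is a stylized sketch of the control-variable example under both approaches, using statsmodels for the regressions (the sample size, number of candidate controls, effect sizes and the fallback to the smallest alternative p-value are illustrative assumptions, not the paper's exact designs).

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)

def one_study(h=0.0, n=200, n_controls=5):
    """Simulate one study: regress y on x, optionally adding each of several
    candidate controls, and return the p-values on x from every specification."""
    x = rng.normal(size=n)
    controls = rng.normal(size=(n, n_controls))
    y = h * x + rng.normal(size=n)
    pvals = []
    specs = [None] + list(range(n_controls))            # baseline + each single control
    for j in specs:
        X = x[:, None] if j is None else np.column_stack([x, controls[:, j]])
        res = sm.OLS(y, sm.add_constant(X)).fit()
        pvals.append(res.pvalues[1])                     # p-value on x
    return np.array(pvals)

def threshold_hack(pvals, alpha=0.05):
    # report the baseline spec if significant, otherwise the best alternative
    return pvals[0] if pvals[0] < alpha else pvals.min()

def min_hack(pvals):
    # always report the smallest p-value across specifications
    return pvals.min()

studies = [one_study(h=rng.choice([0.0, 0.2])) for _ in range(2000)]
reported_thr = np.array([threshold_hack(p) for p in studies])
reported_min = np.array([min_hack(p) for p in studies])
print(np.histogram(reported_thr, bins=np.linspace(0, 1, 21))[0])
print(np.histogram(reported_min, bins=np.linspace(0, 1, 21))[0])
```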
Overall, as is intuitive, p-hacking shifts the p-curve to the left, since its general effect is to lower p-values. For the thresholding approach, in each of the p-hacking situations a discontinuity in the p-curve arises at the 5% significance level; to the extent that some researchers use other cutoffs, we would expect the same at those levels. In nearly all cases, p-curves built from tests with asymptotically normal statistics also violate the upper bounds. Humps in the p-curve around the chosen significance level are rarer and depend strongly on the distribution of true effects being tested. If most of the hypotheses are of the 'strawman' type (obviously false, so p-values are mostly small anyway), humps generally do not arise (except in the window-length example). They can arise when the majority of the hypotheses being tested are true, or when the tests of them are not very powerful.
For the minimum approach, the distributions are simply shifted to the left. They remain continuous, so tests for discontinuities are not helpful; similarly, no humps appear, so tests for humps are not useful either. The only tests that can possibly detect this type of p-hacking are tests of the bounds introduced in our Econometrica paper, and we show that such tests have power in each of the examples.
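Continuing the one-sided z-test illustration from above (again a crude stand-in rather than the paper's test, and the interval [a, b) is an arbitrary choice), a bound check can compare the share of p-values landing in an interval with the most mass the bound allows there.

```python
import numpy as np
from scipy.stats import norm, binomtest

def bound_check(pvals, a=0.01, b=0.05):
    """Crude upper-bound check for one-sided z-tests: under no p-hacking the
    p-curve density on (0, 1/2) is at most exp(z_p**2 / 2), with z_p the normal
    quantile at 1 - p, so the probability of landing in [a, b) is at most the
    integral of that cap; test whether the observed share significantly exceeds it."""
    grid = np.linspace(a, b, 1001)
    cap = np.mean(np.exp(norm.ppf(1 - grid) ** 2 / 2)) * (b - a)  # bound on Pr(p in [a, b))
    k = int(np.sum((pvals >= a) & (pvals < b)))
    return binomtest(k, n=len(pvals), p=min(cap, 1.0), alternative="greater").pvalue
```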
Overall, though, power is limited: even when p-hacking is widespread (we consider situations where a proportion of researchers p-hack and the rest do not), many of the tests that have been used in the literature have low power. This accords with many of the meta-studies of p-hacking that have been undertaken, which tend to find either no evidence of p-hacking or only weak evidence.
We can ask what all of this means. The costs flagged in the statement above can be thought of as a mixture of two problems. First, p-hacking makes results appear more significant than they really are; stated differently, the true sizes of the tests are larger than the reported nominal sizes. Second, the effect estimates that accompany p-hacked tests overstate the effects being measured (for example, p-hacked regression coefficients will be larger in absolute value than the truth and hence biased 'upward').
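As a stylized illustration of both costs (the sample size, number of datasets and normal approximation are arbitrary choices for this sketch, not from the paper), the following mimics the 'multiple independent datasets' example with a true null effect.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)

# The researcher estimates the same coefficient on K independent datasets and
# reports the estimate with the smallest p-value, even though the effect is zero.
n, K, reps, alpha = 200, 5, 5000, 0.05
rejections, reported = 0, []
for _ in range(reps):
    betas, pvals = [], []
    for _ in range(K):
        x = rng.normal(size=n)
        y = rng.normal(size=n)                    # true coefficient on x is zero
        b = x @ y / (x @ x)                       # OLS slope through the origin
        se = np.sqrt(np.sum((y - b * x) ** 2) / (n - 1)) / np.sqrt(x @ x)
        betas.append(b)
        pvals.append(2 * (1 - norm.cdf(abs(b / se))))
    j = int(np.argmin(pvals))
    rejections += pvals[j] < alpha
    reported.append(abs(betas[j]))
print(rejections / reps)      # roughly 1 - 0.95**K, far above the nominal 5%
print(np.mean(reported))      # larger on average than |b| from a single dataset
```

With five independent datasets and a true null, reporting the minimum p-value rejects at roughly 23% rather than the nominal 5%, and the reported coefficient is biased away from zero.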