The majority of research studies use 0.05 as the threshold for statistical significance. If p < 0.05, we say that the evidence against the null hypothesis is strong enough that we reject the null hypothesis in favor of the alternative hypothesis. The null hypothesis is the default position that there is no relationship between two measured phenomena, or no difference among groups; the alternative hypothesis contradicts the null.
The term ‘significant’ is subject to a lot of misinterpretation. Statistical significance is not the same as clinical significance, which is defined by a meaningful effect size.
Understanding the ‘significance’ of various tests becomes even more important in research studies which include multiple comparisons between different groups.
Let’s look at an example.
Your friend just won a lottery and you want to know what your chances are of winning the lottery if you bought a ticket. You decide to collect data on all the lottery winners of the past 10 years. You look at the area where the ticket was bought, the date, the time, the winner’s astrological sign, age, sex, marital status, etc. Each of these comparisons is a statistical test in itself, and if you test long enough, you will find something that is (statistically) significant. For example, if you look at women and their astrological signs, you may find that Gemini females had a significantly higher chance of winning the lottery (p = 0.05) than non-Gemini females! You may assume that this statistically significant finding implies that there is only a 5% chance that this observation is due to chance alone. This is a classic misconception of the p-value, which doesn’t tell you how likely it is that your hypothesis is true given the data, but how likely the data is, given the (null) hypothesis.
One thing that often surprises people about the p-value is its distribution under a true “null” effect. Let’s imagine I flip a coin 100 times and get 55 heads. That corresponds to a p-value of 0.36. Now imagine I do the 100 flips again, this time getting 60 heads. That’s a p-value of about 0.06. Many people assume that, when there is no effect (it’s a fair coin), the p-values cluster around some number like 0.50 or 0.99. This is not true. Under no effect, the distribution of p-values is UNIFORM. The p-value you get is basically like picking a random number between 0 and 1. Five percent of the time, you get a number less than 0.05.
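If you want to convince yourself of this, here is a minimal simulation sketch (an illustration in Python using numpy and scipy, not part of the original commentary; the seed and the number of repetitions are arbitrary choices) that repeats the fair-coin experiment many times and looks at where the p-values land.

```python
# Minimal simulation sketch: repeat the "100 flips of a fair coin" experiment
# many times and test each run for fairness. The coin IS fair, so every null
# hypothesis is true, yet the p-values spread across the whole 0-1 range.
import numpy as np
from scipy.stats import binomtest

rng = np.random.default_rng(seed=2018)        # arbitrary seed
n_flips, n_experiments = 100, 10_000

p_values = np.array([
    binomtest(rng.binomial(n_flips, 0.5), n_flips, 0.5).pvalue
    for _ in range(n_experiments)
])

# Roughly flat histogram across [0, 1], and a small fixed fraction of
# "significant" results even though there is nothing to find. (Because the
# binomial test is discrete, the fraction below 0.05 comes out a little under
# 5% here; with a continuous test statistic the null distribution of p-values
# is exactly uniform.)
print(np.histogram(p_values, bins=10, range=(0, 1))[0] / n_experiments)
print("fraction with p < 0.05:", (p_values < 0.05).mean())
```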
So when multiple tests of hypotheses are conducted, even if every null hypothesis is true, about 5% of the tests are expected to come out statistically significant purely by chance. If we test 10 such hypotheses, the probability that none will be significant is 0.95^10 ≈ 0.60, which leaves a probability of 0.40 (1 – 0.60 = 0.40) that at least one of the 10 will be significant. Thus, doing multiple comparisons (‘x’ statistical tests) at a significance level of 0.05 shrinks the probability that none of them is significant (0.95^x) and increases the probability of finding a spurious “significant” difference, i.e. a type I error. In plain English, if you test 10 independent hypotheses at a significance level of 0.05, the chance that at least one of them will be incorrectly rejected is a whopping 40%, and it only climbs as you add more tests, dramatically increasing the risk of false positives.
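The arithmetic generalises in an obvious way. Here is a small sketch (plain Python; the test counts are made up for illustration) that prints the chance of at least one false positive for a few numbers of independent tests:

```python
# Family-wise error rate for x independent tests, each at alpha = 0.05:
# P(at least one false positive) = 1 - 0.95^x
alpha = 0.05
for x in (1, 2, 5, 10, 20, 50):
    fwer = 1 - (1 - alpha) ** x
    print(f"{x:>2} tests -> P(at least one false positive) = {fwer:.2f}")
```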
One of the solutions suggested to avoid this proliferation of type I errors is the Bonferroni correction, which is commonly used because it is easy to apply. In the above example, if you are testing 10 hypotheses at an overall significance level of 0.05, you simply divide 0.05 by 10 and use the result, 0.005, as the significance level for each individual hypothesis. The problem arises when there is a large number of tests, which makes the Bonferroni correction too conservative: as the number of tests increases, the corrected threshold becomes smaller and smaller, and the null hypothesis may not be rejected despite there being a real difference between the groups. Thus, the Bonferroni correction errs on the side of non-significance and can cause type II errors. Another problem is that the Bonferroni correction implicitly treats the tests as independent, which is usually not true. For example, if you compare heights in two groups, first in feet and then in meters, using a threshold of 0.025 (0.05 ÷ 2) makes no sense, because the two tests are not independent: if either has a p-value below 0.05, both will.
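As a concrete illustration of how the correction is applied, here is a short sketch in Python; the p-values are invented purely for this example and do not come from any real study.

```python
# Bonferroni correction for the 10-hypothesis example above.
# These p-values are made up for illustration only.
p_values = [0.001, 0.004, 0.008, 0.020, 0.030, 0.040, 0.045, 0.200, 0.500, 0.900]
alpha = 0.05
threshold = alpha / len(p_values)   # 0.05 / 10 = 0.005

for i, p in enumerate(p_values, start=1):
    verdict = "reject the null" if p < threshold else "do not reject"
    print(f"test {i:>2}: p = {p:.3f} -> {verdict} (threshold = {threshold:.3f})")
```

Note that five of these invented p-values sit below 0.05 but above 0.005, so tests that would have been called “significant” on their own are no longer significant after the correction, which is exactly the conservatism (and the type II error risk) described above.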
What’s the solution? Pre-specifying a primary hypothesis is standard operating procedure: well-executed studies specify, in advance, the primary comparison being made between the groups and the statistical test used to make that comparison.
Other p-value “corrections” may be used in the setting of multiple comparisons. Most popular are various “false discovery rate control” procedures such as the Benjamini-Hochberg procedure, which is less prone to type II error than the Bonferroni correction.
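For readers who want to see the mechanics, here is a minimal sketch of the Benjamini-Hochberg procedure applied to the same invented p-values used in the Bonferroni sketch above (again, an illustration rather than a recipe):

```python
# Benjamini-Hochberg: sort the p-values, find the largest rank k such that
# p_(k) <= (k / m) * alpha, and reject the nulls for the k smallest p-values.
p_values = [0.001, 0.004, 0.008, 0.020, 0.030, 0.040, 0.045, 0.200, 0.500, 0.900]
alpha, m = 0.05, len(p_values)

ranked = sorted(enumerate(p_values, start=1), key=lambda pair: pair[1])

k_max = 0
for k, (_, p) in enumerate(ranked, start=1):
    if p <= (k / m) * alpha:
        k_max = k

rejected = {original_index for original_index, _ in ranked[:k_max]}
for i, p in enumerate(p_values, start=1):
    verdict = "reject the null" if i in rejected else "do not reject"
    print(f"test {i:>2}: p = {p:.3f} -> {verdict}")
```

With these invented numbers, Bonferroni rejects only the two smallest p-values while Benjamini-Hochberg rejects four, which is the sense in which false discovery rate control is less prone to type II error. (If you prefer a library call, the multipletests function in statsmodels, with method='fdr_bh', implements the same procedure.)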
Read more about the misconceptions surrounding the p value here.
Commentary by Manasi Bapat, Nephrology Fellow, Mount Sinai, New York
NSMC intern, class of 2018
with input from Perry Wilson and Laurie Tomlinson, NephJC statistical advisors