5.6 Multiplicity correction

In the previous section we performed pairwise t-tests for the following comparisons: spA vs spB, spA vs spC, and spB vs spC.

The obtained p-values were

postHocPvals
[0.025111501405268754, 0.0003985445257645916, 0.049332195639921715]

Based on that we concluded that every group mean differs from every other group mean (all the p-values are below the cutoff level of \(\alpha\) = 0.05). However, there is a small problem with this approach (see the explanation below).

In Section 4.7.5 we said that it is impossible to reduce the probability of a type 1 error (\(\alpha\)) to 0. Therefore, if all our null hypotheses (\(H_{0}\)s) were true, we would still have to accept the fact that we will report some false positive findings. All we can do is keep their number low.

Imagine you are testing a set of 100 random substances to see if they reduce the size (e.g. the diameter) of a tumor. Most likely the vast majority of the tested substances will not work (so let's assume that in reality all the \(H_{0}\)s are true). Now imagine that the effect of each substance on the tumor is presented in a separate graph, e.g. a boxplot (like the one you will draw in the upcoming Section 5.7.5). Now the question: how many graphs would contain false positive results if the cutoff level for \(\alpha\) is 0.05? Pause for a moment and come up with the number. That is easy: 100 graphs times 0.05 (the probability of a false positive) gives us the expected 100 * 0.05 = 5 figures with false positives. BTW, if you got it, congratulations. If not, compare the solution with the calculations we did in Section 4.5. Anyway, you decided that this will be your gold standard, i.e. no more than 5% (\(\frac{5}{100}\) = 0.05) of the figures may contain false positives.
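
If you like, you can verify that expectation with a small simulation, like the sketch below (here I assume the HypothesisTests package, imported as Ht, and arbitrary groups of 10 measurements each; treat it as a rough sketch rather than a recipe).

import Random as Rand
import HypothesisTests as Ht

Rand.seed!(303) # make the sketch reproducible

# for each "substance" compare two groups of 10 measurements drawn from
# the same distribution (so every H0 is true) and count the rejections
function countFalsePositives(nSubstances::Int, alpha::Float64)::Int
    getPval() = Ht.pvalue(Ht.EqualVarianceTTest(randn(10), randn(10)))
    pvals = [getPval() for _ in 1:nSubstances]
    return sum(pvals .<= alpha)
end

countFalsePositives(100, 0.05) # expected: 100 * 0.05 = 5 (a single run may differ a bit)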

But here (in postHocPvals above) you got 3 comparisons and therefore 3 p-values. Imagine that you place these three results in a single figure. Now, the question is: under the conditions given above (all \(H_{0}\)s true, cutoff for \(\alpha\) = 0.05) how many graphs would contain false positives if you placed three such comparisons per graph in 100 figures? Think for a moment and come up with the number.

OK, so we got 100 graphs, each reporting 3 comparisons (3 p-values), which gives us 300 results in total. Out of them we expect 300 * 0.05 = 15 to be false positives. Now, we pack those 300 results into 100 figures. In the best case scenario the 15 false positives will land in the first five graphs (three false positives per graph, 5*3 = 15), and the remaining 285 true negatives will land in the remaining 95 figures (three true negatives per graph, 95*3 = 285). The gold standard seems to be kept (5/100 = 0.05). The problem is that we don't know which figures will get the false positives. Murphy's law states: "Anything that can go wrong will go wrong, and at the worst possible time" (or in the worst possible way). If so, then the 15 false positives will go to 15 different figures (1 false positive + 2 true negatives per graph), and the remaining 285 - 2*15 = 255 true negatives will go to the remaining 255/3 = 85 figures. Here, your gold standard (no more than 5% of figures with false positives) is violated (15/100 = 0.15).
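
A rough back-of-the-envelope check points in the same direction. If we assume, as a simplification (pairwise comparisons are not fully independent), that the three results in a figure are independent, then:

# probability that a figure with 3 true-H0 comparisons shows
# at least one false positive (assuming independence)
probFigWithFalsePositive = 1 - (1 - 0.05)^3 # ≈ 0.143

# expected number of such figures out of 100
100 * probFigWithFalsePositive # ≈ 14.3, close to the worst case (15) above

So even without invoking Murphy's law roughly 14 out of 100 figures would be expected to contain at least one false positive, well above the 5% gold standard.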

This is why we cannot just leave the three postHocPvals as they are. We need to act, but what can we do to counteract the problem? Well, if the initial cutoff level for \(\alpha\) were 3 times smaller (0.05/3 ≈ 0.017), then in the case above we would have 300 * (0.05/3) = 5 false positives to put into 100 figures and everything would be OK even in the worst case scenario. Alternatively, since division is the inverse operation of multiplication, we could just multiply every p-value by 3 (the number of comparisons) and check its significance at the cutoff level of \(\alpha\) = 0.05, like so

# multiply a p-value by the number of comparisons (by);
# cap the result at 1, since a probability cannot exceed 1
function adjustPvalue(pVal::Float64, by::Int)::Float64
    @assert (0 <= pVal <= 1) "pVal must be in range [0-1]"
    return min(1, pVal*by)
end

# adjust every p-value in the vector by the total number of p-values
function adjustPvalues(pVals::Vector{Float64})::Vector{Float64}
    return adjustPvalue.(pVals, length(pVals))
end

# p-values for comparisons: spA vs spB, spA vs spC, and spB vs spC
adjustPvalues(postHocPvals)
[0.07533450421580626, 0.0011956335772937748, 0.14799658691976514]

Notice that since an input p-value may be, let's say, 0.6, multiplying it by 3 would give us 1.8, which is an impossible value for a probability (see Section 4.3.1). That is why we set the upper limit to 1 by using min(1, pVal*by) (see the quick check after this paragraph). Anyway, after adjusting for multiple comparisons only one pair of species differs from each other (spA vs spC, adjusted \(p-value \le 0.05\)). And this is our final conclusion.
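
A quick check of that capping behavior (the input values below are made up for the sake of illustration):

adjustPvalue(0.6, 3) # 1.0 (capped), not 1.8
adjustPvalue(0.25, 3) # 0.75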

The method we used above (in adjustPvalue and adjustPvalues) is called the Bonferroni correction. It is probably the simplest method out there and it is useful if we have a small number of independent comparisons/p-values (let's say up to 6). For a large number of comparisons you are likely to end up with a paradox: in the effort to avoid even a single false positive the correction becomes so strict that you will most likely miss the real (true positive) findings.
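
To get a feel for the problem, consider a small illustration (the numbers are made up): with 20 comparisons even a p-value of 0.02, which on its own looks convincing, gets multiplied by 20 and lands far above the 0.05 cutoff.

# 20 identical p-values of 0.02, each multiplied by 20 by our correction
adjustPvalues(fill(0.02, 20)) # every entry becomes 0.4, i.e. non-significant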

Therefore, for a large number of comparisons you may choose a different (less strict) method, e.g. the Benjamini-Hochberg procedure. Both of them (Bonferroni and Benjamini-Hochberg) are available in the MultipleTesting package. Observe

import MultipleTesting as Mt
# p-values for comparisons: spA vs spB, spA vs spC, and spB vs spC
resultsOfThreeAdjMethods = (
    adjustPvalues(postHocPvals),
    Mt.adjust(postHocPvals, Mt.Bonferroni()),
    Mt.adjust(postHocPvals, Mt.BenjaminiHochberg())
)

resultsOfThreeAdjMethods
([0.07533450421580626, 0.0011956335772937748, 0.14799658691976514],
 [0.07533450421580626, 0.0011956335772937748, 0.14799658691976514],
 [0.03766725210790313, 0.0011956335772937748, 0.049332195639921715])

As expected, the first two lines give the same results (since both use the same adjustment method). The third line, which uses a different method, produces a different result (and hence yields a different interpretation).
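
To make those verdicts explicit we may compare every adjusted p-value against the 0.05 cutoff (the broadcasting below is plain Julia applied to the tuple defined above):

# true = significant at the 0.05 cutoff
map(pvs -> pvs .<= 0.05, resultsOfThreeAdjMethods)
# -> only spA vs spC is flagged by the Bonferroni-adjusted p-values,
#    whereas all three comparisons are flagged after Benjamini-Hochberg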

A word of caution: you shouldn't just apply 10 different adjustment methods to the obtained p-values and choose the one that produces the greatest number of significant differences. Instead, you should choose a correction method a priori (up front, in advance) and stick to it later (make the final decision about which group(s) differ based on the adjusted p-values). Therefore, it takes some consideration to choose the multiplicity correction well.

OK, enough of theory, time for some practice. Whenever you’re ready click the right arrow to go to the exercises for this chapter.


