We started Section 6.3 with a fictitious eye color distribution [`blue` and `other`, rows (top-down) in the matrix below] in the UK and US [columns (left-right) in the matrix below].
```julia
mEyeColor
```

```
2×2 Matrix{Int64}:
 220  161
 279  320
```
But in reality there are more eye colors than just blue and other. For instance, let's say that humans have three eye colors: blue, green, and brown. Let's adjust our table accordingly:
```julia
# 3 x 2 table (DataFrame)
dfEyeColorFull = Dfs.DataFrame(
    Dict(
        # "other" from dfEyeColor is split into "green" and "brown"
        "eyeCol" => ["blue", "green", "brown"],
        "us" => [161, 78, 242],
        "uk" => [220, 149, 130]
    )
)
mEyeColorFull = Matrix{Int}(dfEyeColorFull[:, 2:3])
mEyeColorFull
```

```
3×2 Matrix{Int64}:
 220  161
 149   78
 130  242
```
Can we say that the two populations differ (with respect to the eye color distribution) given the data in this table? Well, we can; that's a job for the … chi squared (\(\chi^2\)) test.
Wait, but I thought it was used to compare two proportions found in some samples. Granted, it can be used for that, but in a broader sense it is a non-parametric test that determines the probability that the difference between the observed and expected frequencies (counts) occurred by chance alone. Here, non-parametric means that it does not assume a specific underlying distribution of the data (like the normal or binomial distributions we met before). As we learned in Section 6.3, the expected distribution of frequencies (counts) is assessed based on the data itself.
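To make that last point concrete, here is a minimal sketch (with the counts of `mEyeColorFull` hardcoded, so it runs standalone) of how such expected counts are derived from the observed table: each cell's expected value is its row total times its column total divided by the grand total.

```julia
# Sketch: deriving expected counts from the data itself, i.e.
# expected[i, j] = rowTotal[i] * colTotal[j] / grandTotal.
# The counts from mEyeColorFull are hardcoded for self-containment.
observed = [220 161; 149 78; 130 242]
rowTotals = sum(observed, dims = 2)  # 3x1 matrix of row sums
colTotals = sum(observed, dims = 1)  # 1x2 matrix of column sums
grandTotal = sum(observed)           # 980
expected = rowTotals .* colTotals ./ grandTotal
round.(expected, digits = 1)
# 3×2 Matrix{Float64}:
#  194.0  187.0
#  115.6  111.4
#  189.4  182.6
```

Notice that no external distribution is assumed anywhere: the expected counts are what we would see if both columns shared the same eye color proportions, estimated from the pooled data.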
Let’s give it a try with our new data set (`mEyeColorFull`) and compare it with the previously obtained results (for `mEyeColor` from Section 6.3).
```julia
chi2testEyeColor = Ht.ChisqTest(mEyeColor)
chi2testEyeColorFull = Ht.ChisqTest(mEyeColorFull)
(
    # chi^2 statistics
    round(chi2testEyeColorFull.stat, digits = 2),
    round(chi2testEyeColor.stat, digits = 2),
    # p-values
    round(chi2testEyeColorFull |> Ht.pvalue, digits = 7),
    round(chi2testEyeColor |> Ht.pvalue, digits = 7)
)
```

```
(64.76, 11.62, 0.0, 0.0006538)
```
That’s odd. All we did was to split the `other` category from `dfEyeColor` (and therefore `mEyeColor`) into `green` and `brown` to create `dfEyeColorFull` (and therefore `mEyeColorFull`), and yet we got different \(\chi^2\) statistics and different p-values. How come?
Well, because we are comparing different things (and different populations).
Imagine that in the case of `dfEyeColor` (and `mEyeColor`) we actually compare not the eye color, but the currencies of the two countries. So, we change the labels in our table: instead of `blue` we've got `heads`, instead of `other` we've got `tails`, instead of `us` we've got `eagle`, and instead of `uk` we've got `one pound`. We want to test whether the proportion of heads to tails is roughly the same for both coins.
Whereas in the case of `dfEyeColorFull` (and `mEyeColorFull`) imagine we actually compare not the eye color, but three-sided dice produced in those countries. So, we change the labels in our table: instead of `blue` we've got `1`, instead of `green` we've got `2`, and instead of `brown` we've got `3` (`1`, `2`, `3` is a convention; equally well one could write, e.g., `Tom`, `Alice`, and `John` on the sides of a die). We want to test whether the distribution of `1`s, `2`s, and `3`s is roughly the same for both types of dice.
Now, it so happened that the number of dice throws was the same as the number of coin tosses from the example above. It also happened that the number of `1`s was the same as the number of `heads` from the previous example. Still, we are comparing different things (coins and dice), and so we would not expect to get the same results from our chi squared (\(\chi^2\)) test. And that is how it is: the test is label blind. All it cares about is the difference between the observed and expected frequencies (counts).
Anyway, the value of the \(\chi^2\) statistic for `mEyeColorFull` is 64.76 and the probability that such a value occurred by chance alone approximates 0. It is therefore below our customary cutoff level of 0.05, and we may conclude that the populations differ with respect to the distribution of eye color (as we did in Section 6.5).
Now, let’s get back for a moment to the label blindness issue. The test may be label blind, but we are not. It is possible that sooner or later you will come across a data set where splitting the groups into different categories will lead you to different conclusions, e.g. the p-value from the \(\chi^2\) test for `mEyeColorPlSp` for Poland and Spain would be 0.054, whereas for `mEyeColorPlSpFull` it would be 0.042 (so it both is and isn't statistically different at the same time). What should you do then?
Well, it happens. There is not much to be done here; we need to live with that. It is like the accused-and-judge analogy from Section 4.7.5. In reality the accused is guilty or not. We don't know the truth; the best we can do is to examine the evidence. After that, one judge may be inclined to declare the accused guilty while another will give him the benefit of the doubt. There is no certainty or great solution here (at least none that I know of). In such a case some people suggest presenting both results with the author's conclusions and letting the readers decide for themselves. Others suggest collecting a greater sample to make sure which conclusion is right. Still others suggest that you should plan your experiment (its goals and the ways to achieve them) carefully beforehand; once you have got your data, you stick to the plan even if the result is disappointing to you. So, if we had decided to compare `blue` vs `other` and failed to establish statistical significance, we ought to have stopped there. We should not go fishing for statistical significance by splitting `other` into `green` and `brown`.