6.5 Bigger table

We started Section 6.3 with a fictitious eye color distribution in the US and UK [rows (top to bottom) of the matrix below: blue and other; columns (left to right): the two countries].

mEyeColor
2×2 Matrix{Int64}:
 220  161
 279  320

But in reality there are more eye colors than just blue and other. For instance, let’s say that humans have three eye colors: blue, green, and brown. Let’s adjust our table accordingly:

# 3 x 2 table (DataFrame)
dfEyeColorFull = Dfs.DataFrame(
    Dict(
        # "other" from dfEyeColor is split into "green" and "brown"
        "eyeCol" => ["blue", "green", "brown"],
        "us" => [161, 78, 242],
        "uk" => [220, 149, 130]
    )
)

# a DataFrame built from a Dict sorts its columns alphabetically
# (eyeCol, uk, us), hence columns 2:3 hold the uk and us counts
mEyeColorFull = Matrix{Int}(dfEyeColorFull[:, 2:3])
mEyeColorFull
3×2 Matrix{Int64}:
 220  161
 149   78
 130  242
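
By the way, since all we did was to split other into green and brown, the per country (column) totals and the blue counts should be the same in both tables. Here is a quick sanity check (a sketch of my own, not part of the original example):

# splitting "other" into "green" and "brown" should change neither
# the per-country (column) totals nor the counts for "blue"
(
    sum(mEyeColor, dims = 1) == sum(mEyeColorFull, dims = 1),
    mEyeColor[1, :] == mEyeColorFull[1, :]
)
(true, true)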

Can we say that the two populations differ (with respect to the eye color distribution) given the data in this table? Well, we can; that’s a job for … the chi squared (\(\chi^2\)) test.

Wait, but I thought it was used to compare two proportions found in some samples. Granted, it can be used for that, but in a broader sense it is a non-parametric test that determines the probability that the difference between the observed and expected frequencies (counts) occurred by chance alone. Here, non-parametric means that it does not assume a specific underlying distribution of the data (like the normal or binomial distributions we met before). As we learned in Section 6.3, the expected distribution of frequencies (counts) is assessed based on the data itself.
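
If you wonder how those expected counts are obtained, here is a minimal sketch (getExpectedCounts is a helper of my own, not a function from any package): for every cell we multiply its row total by its column total and divide the product by the grand total.

# expected count for a cell = (its row total * its column total) / grand total,
# i.e. the counts we would expect if eye color and country were unrelated
function getExpectedCounts(m::Matrix{Int})::Matrix{Float64}
    rowTotals = sum(m, dims = 2) # column of row totals
    colTotals = sum(m, dims = 1) # row of column totals
    return (rowTotals * colTotals) ./ sum(m)
end

round.(getExpectedCounts(mEyeColorFull), digits = 2)
3×2 Matrix{Float64}:
 194.0   187.0
 115.58  111.42
 189.42  182.58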

Let’s give it a try with our new data set (mEyeColorFull) and compare it with the previously obtained results (for mEyeColor from Section 6.3).

chi2testEyeColor = Ht.ChisqTest(mEyeColor)
chi2testEyeColorFull = Ht.ChisqTest(mEyeColorFull)

(
    # chi^2 statistics
    round(chi2testEyeColorFull.stat, digits = 2),
    round(chi2testEyeColor.stat, digits = 2),

    # p-values
    round(chi2testEyeColorFull |> Ht.pvalue, digits = 7),
    round(chi2testEyeColor |> Ht.pvalue, digits = 7)
)
(64.76, 11.62, 0.0, 0.0006538)
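
We can (roughly) verify these \(\chi^2\) statistics by hand with the getExpectedCounts helper sketched earlier in this section, since \(\chi^2 = \sum \frac{(observed - expected)^2}{expected}\), summed over all the cells of a table.

# chi^2 statistic: for every cell take
# (observed - expected)^2 / expected, then sum it all up
function getChi2Statistic(m::Matrix{Int})::Float64
    expected = getExpectedCounts(m)
    return sum((m .- expected) .^ 2 ./ expected)
end

(
    round(getChi2Statistic(mEyeColorFull), digits = 2),
    round(getChi2Statistic(mEyeColor), digits = 2)
)
(64.76, 11.62)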

That’s odd. All we did was split the other category from dfEyeColor (and therefore mEyeColor) into green and brown to create dfEyeColorFull (and therefore mEyeColorFull), and yet we got a different \(\chi^2\) statistic and a different p-value. How come?

Well, because we are comparing different things (and different populations).

Imagine that in the case of dfEyeColor (and mEyeColor) we actually compare not the eye colors but the currencies of both countries. So, we change the labels in our table: instead of blue we got heads, instead of other we got tails, instead of us we got eagle, and instead of uk we got one pound. We want to test whether the proportion of heads/tails is roughly the same for both coins.

Whereas in the case of dfEyeColorFull (and mEyeColorFull) imagine we actually compare not the eye colors but three-sided dice produced in those countries. So, we change the labels in our table: instead of blue we got 1, instead of green we got 2, and instead of brown we got 3 (1, 2, 3 is a convention; equally well one could write, e.g., Tom, Alice, and John on the sides of a die). We want to test whether the distribution of 1s, 2s, and 3s is roughly the same for both types of dice.

Now, it so happened that the number of dice throws was the same as the number of coin tosses from the example above. It also happened that the number of 1s was the same as the number of heads from that example. Still, we are comparing different things (coins and dice), and so we would not expect to get the same results from our chi squared (\(\chi^2\)) test. And that is how it is: the test is label blind. All it cares about is the difference between the observed and expected frequencies (counts).
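
We can easily convince ourselves of that label blindness. In the sketch below (my own check, not taken from any package’s docs) we reorder the rows of mEyeColorFull, as if we had relabeled the eye colors, and the test statistic stays exactly the same.

# the test cares only about the observed and expected counts,
# not about the labels (nor their order)
shuffledRows = mEyeColorFull[[3, 1, 2], :] # brown, blue, green
round(Ht.ChisqTest(shuffledRows).stat, digits = 2)
64.76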

Anyway, the value of the \(\chi^2\) statistic for mEyeColorFull is 64.76 and the probability that such a value occurred by chance alone approximates 0. Since that is below our customary cutoff level of 0.05, we may conclude that the populations differ with respect to the distribution of eye color (as we did for mEyeColor in Section 6.3).
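
As a side note, the two tests also differ in their degrees of freedom [for a contingency table that is (number of rows - 1) * (number of columns - 1)], so even equal \(\chi^2\) statistics would not yield equal p-values. If you are curious, here is a sketch of how the p-values could be recovered from the statistics (it assumes Distributions.jl imported as Dsts):

# df = (number of rows - 1) * (number of columns - 1), hence
# df = 1 for mEyeColor (2x2) and df = 2 for mEyeColorFull (3x2)
import Distributions as Dsts

# p-value: the probability of getting a chi^2 statistic at least this
# big by chance alone, under a chi^2 distribution with the given df
(
    Dsts.ccdf(Dsts.Chisq(2), 64.76), # mEyeColorFull, approximates 0
    Dsts.ccdf(Dsts.Chisq(1), 11.62)  # mEyeColor, compare with 0.0006538
)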

Now, let’s get back for a moment to the label blindness issue. The test may be label blind, but we are not. It is possible that sooner or later you will come across a data set where splitting the groups into different categories will lead you to different conclusions. For example, the p-value from a \(\chi^2\) test for some mEyeColorPlSp (Poland and Spain) could be 0.054, while for the corresponding mEyeColorPlSpFull it could be 0.042 (so the difference both is and isn’t statistically significant at the same time). What should you do then?

Well, it happens and there is not much to be done about it; we need to live with that. It is like the accused and judge analogy from Section 4.7.5. In reality the accused is either guilty or not. We don’t know the truth; the best we can do is examine the evidence. After that, one judge may be inclined to declare the accused guilty while another will give him the benefit of the doubt. There is no certainty or great solution here (at least none that I know of).

In such a case some people suggest presenting both results together with the author’s conclusions and letting the readers decide for themselves. Others suggest collecting a greater sample to make sure which conclusion is right. Still others suggest that you should plan your experiment (its goals and the ways to achieve them) carefully beforehand; once you get your data, you stick to the plan even if the result is disappointing to you. So, if we had decided to compare blue vs other and failed to establish statistical significance, we ought to have stopped there. We should not go fishing for statistical significance by splitting other into green and brown.



CC BY-NC-SA 4.0 Bartlomiej Lukaszuk