6.6 Test for independence

Another way to look at the chi-squared (\(\chi^2\)) test is that it allows us to check whether the distribution of the data is independent between the rows and columns (see the assumption we made when calculating the expected counts with probsUnderH0 in Section 6.3). Let’s make this more concrete with the following example.

Previously we concerned ourselves with the mEyeColorFull table.

mEyeColorFull
3×2 Matrix{Int64}:
 220  161
 149   78
 130  242

The rows contain (top to bottom) eye colors: blue, green, and brown. The columns (left to right) are for us and uk.
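
As a quick refresher of the independence assumption from Section 6.3: under H0 the expected count of each cell is simply the product of its row total and column total divided by the grand total. Below is a minimal sketch of that calculation (the helper name getExpectedCounts is mine, for illustration only).

# expected counts under H0 (independence of rows and columns);
# getExpectedCounts is an illustrative helper, not code from earlier sections
function getExpectedCounts(m::Matrix{Int})::Matrix{Float64}
    rowTotals = sum(m, dims=2) # column vector of row totals
    colTotals = sum(m, dims=1) # row vector of column totals
    return rowTotals * colTotals ./ sum(m) # outer product / grand total
end

getExpectedCounts(mEyeColorFull)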

Interestingly enough, eye color depends on the concentration of melanin, a pigment that is also present in skin and hair and protects us from harmful UV radiation. So imagine that the columns contain data for some skin condition (left column: diseaseX, right column: noDiseaseX). Now we are interested to know whether people with a certain eye color are more exposed (more vulnerable) to the disease (if so, then some preventive measures, e.g. a stronger sunscreen, could be applied by them).

Since this is a fictitious data set in which we only changed the column labels, we already know the answer (see the reminder from Section 6.5 below).

(
    round(chi2testEyeColorFull.stat, digits = 2),
    round(chi2testEyeColorFull |> Htests.pvalue, digits = 7)
)
(64.76, 0.0)
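
For completeness, chi2testEyeColorFull was presumably obtained back in Section 6.5 along these lines (a sketch; Htests is the alias used here for HypothesisTests).

# chi-squared test on the full 3x2 contingency table
chi2testEyeColorFull = Htests.ChisqTest(mEyeColorFull)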

OK, so based on the (fictitious) data there is enough evidence to consider that the occurrence of diseaseX isn’t independent of eye color (\(p \le 0.05\)). In other words, people with some eye colors get diseaseX more often than people with other eye colors. But which eye color (blue, green, brown) carries the greater risk? Pause for a moment and think about how to answer that question.

Well, one thing we could do is to collapse some rows (if it makes sense). For instance, we could collapse green and brown into an other category (we would end up with two eye colors: blue and other). In practice we would then be answering the same question that we answered in Section 6.3 for mEyeColor (of course, here we changed the column labels to diseaseX and noDiseaseX).
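
A sketch of how mEyeColor (and the matching test object) could be built from mEyeColorFull by collapsing the last two rows (row order as above: blue, green, brown):

# collapse green (row 2) and brown (row 3) into a single 'other' row
mEyeColor = vcat(
    mEyeColorFull[1:1, :],              # blue
    sum(mEyeColorFull[2:3, :], dims=1)  # other = green + brown
)
chi2testEyeColor = Htests.ChisqTest(mEyeColor)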

rowPerc = [r[1] / sum(r) * 100 for r in eachrow(mEyeColor)] # % of diseaseX per row
rowPerc = round.(rowPerc, digits = 2)

(
    round(chi2testEyeColor.stat, digits = 2),
    round(chi2testEyeColor |> Htests.pvalue, digits = 7),
    rowPerc
)
(11.62, 0.0006538, [57.74, 46.58])

We see that roughly 57.74% of blue-eyed people got diseaseX compared to roughly 46.58% of people with the other eye color, and that the difference is statistically significant (\(p \le 0.05\)). So people with blue eyes should be more careful with exposure to the sun (of course, these are just made-up data).

Another option is to use a method analogous to the one we applied in Section 5.4 and Section 5.5. Back then we compared a continuous variable across three groups with one-way ANOVA [it controls for the overall \(\alpha\) (type 1 error)]. Then we used post-hoc tests (Student’s t-tests) to figure out which group(s) differ(s) from the other(s). Naturally, we could/should adjust the obtained p-values with a multiplicity correction (as we did in Section 5.6). This is exactly what we are going to do in the upcoming exercises (see Section 6.7.5 and Section 6.7.6); a rough outline of the idea is sketched below.
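
Without spoiling those exercises, the general shape of the solution could look something like this (a sketch, assuming MultipleTesting is imported as Mt; the names rowPairs, pvals and adjPvals are mine):

import MultipleTesting as Mt

# pairwise 2x2 comparisons: blue vs green, blue vs brown, green vs brown
rowPairs = [(1, 2), (1, 3), (2, 3)]
pvals = [
    Htests.ChisqTest(mEyeColorFull[[r1, r2], :]) |> Htests.pvalue
    for (r1, r2) in rowPairs
]
# adjust for multiple comparisons, e.g. with Benjamini-Hochberg
adjPvals = Mt.adjust(pvals, Mt.BenjaminiHochberg())

For now take some rest and click the right arrow when you’re ready.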


