6.3 Chi squared test

We finished the previous section by comparing the proportion of subjects with some feature to the reference population. For that we used Htests.BinomialTest. As we learned in Section 4.6 the word binomial means two names. Those names could be anything, like heads and tails, victory and defeat, but most generally they are called success and failure (success when an event occurred and failure when it did not). We can use a to denote individuals with the feature of interest and b to denote the individuals without that feature. In that case n is the total number of individuals (here, individuals with either a or b). That means that by doing Htests.BinomialTest we compared the sample fraction (e.g. \(\frac{a}{n}\) or equivalently \(\frac{a}{a+b}\)) with the assumed fraction of individuals with the feature of interest in the general population.

Now, imagine a different situation. You take the samples from two populations, and observe the eye color of people. You want to know if the percentage of people with blue eyes in the two populations is similar. If it is, then you may deduce they are closely related (perhaps one stems from the other). Let’s not look too far, let’s just take the population of the US and UK. Inspired by the Wikipedia’s page from the link above and supported by the random number generator in Julia I came up with the following counts.

import DataFrames as Dfs

dfEyeColor = Dfs.DataFrame(
    Dict(
        "eyeCol" => ["blue", "any"],
        "us" => [161, 481],
        "uk" => [220, 499]
    )
)
Table 6: Eye color distribution in two samples (fictitious data).
eyeCol uk us
blue 220 161
any 499 481

Here, we would like to compare if the two proportions (\(\frac{a_1}{n_1} = \frac{161}{481}\) and \(\frac{a_2}{n_2} = \frac{220}{499}\)) are roughly equal (\(H_0\): they come from the same population with some fraction of blue eyed people). Unfortunately, one look into the docs and we see that we cannot use Htests.BinomialTest (the test compares sample with a population, here we got two samples to compare). But do not despair that’s the job for Htests.ChisqTest (see also this Wikipedia’s entry). First we need to change our data slightly, because the test requires a matrix (aka array from Section 3.3.7) with the following proportions in columns: \(\frac{a_1}{b_1}\) and \(\frac{a_2}{b_2}\) (b instead of n, where n = a + b). Let’s adjust our data for that.

# subtracting eye color "blue" from eye color "any"
dfEyeColor[2, 2:3] = Vector(dfEyeColor[2, 2:3]) .-
    Vector(dfEyeColor[1, 2:3])
# renaming eye color "any" to "other" (it better reflects current content)
dfEyeColor[2, 1] = "other"
dfEyeColor

# all the elements must be of the same (numeric) type
mEyeColor = Matrix{Int}(dfEyeColor[:, 2:3])
mEyeColor
2×2 Matrix{Int64}:
 220  161
 279  320

OK, we got the necessary data structure. Here, Matrix{Int}() closed over dfEyeColor[:, 2:3] extracts the needed part of the data frame and converts it to a matrix (aka array) of integers. And now for the \(\chi^2\) (chi squared) test.

Htests.ChisqTest(mEyeColor)
Pearson's Chi-square Test
-------------------------
Population details:
    parameter of interest:   Multinomial Probabilities
    value under h_0:         [0.197958, 0.311226, 0.190817, 0.299999]
    point estimate:          [0.22449, 0.284694, 0.164286, 0.326531]
    95% confidence interval:
     [(0.193, 0.2595), (0.2501, 0.322), (0.1369, 0.196), (0.2903, 0.3649)]

Test summary:
    outcome with 95% confidence: reject h_0
    one-sided p-value:           0.0007

Details:
    Sample size:        980
    statistic:          11.616133413434031
    degrees of freedom: 1
    residuals:          [1.86677, -1.48881, -1.90138, 1.51641]
    std. residuals:     [3.40824, -3.40824, -3.40824, 3.40824]

OK, first of all we can see right away that the p-value is below the customary cutoff level of 0.05 or even 0.01. This means that the samples do not come from the same population (we reject \(H_{0}\)). More likely they came from the populations with different underlying proportions of blue eyed people. This could indicate for instance, that the population of the US stemmed from the UK (at least partially) but it has a greater admixture of other cultures, which could potentially influence the distribution of blue eyed people. Still, this is just an exemplary explanation, I’m not an anthropologist, so it may well be incorrect. Additionally, remember that the data is fictitious and was generated by me.

Anyway, I’m pretty sure You got the part with the p-value on your own, but what are some of the other outputs. Point estimates are the observed probabilities in each of the cells from mEyeColor. Observe

# total number of observations
nObsEyeColor = sum(mEyeColor)

chi2pointEstimates = [mEyeColor...] ./ nObsEyeColor
round.(chi2pointEstimates, digits = 6)
[0.22449, 0.284694, 0.164286, 0.326531]

The [mEyeColor...] flattens the 2x2 matrix (2 rows, 2 columns) to a vector (column 2 is appended to the end of column 1). The ./ nObsEyeColor divides the observations in each cell by the total number of observations.

95% confidence interval is a 95% confidence interval (who would have guessed) similar to the one explained in Section 5.2.1 for Htests.OneSampleTTest but for each of the point estimates in chi2pointEstimates. Some (over)simplify it and say that within those limits the true probability for this group of observations most likely lies.

As for the value under h_0 those are the probabilities of the observations being in a given cell of mEyeColor assuming \(H_0\) is true. But how to get that probabilities. Well, in a similar way to the method we met in Section 4.3. Back then we answered the following question: If parents got blood groups AB and O then what is the probability that a child will produce a gamete with allele A? The answer: proportion of children with allele A and then the proportion of their gametes with allele A (see Section 4.3 for details). We calculated it using the following formula

\(P(A\ in\ CG) = P(A\ in\ C) * P(A\ in\ gametes\ of\ C\ with\ A)\)

Getting back to our mEyeColor the expected probability of an observation falling into a given cell of the matrix is the probability of an observation falling into a given column times the probability of an observation falling into a given row. Observe

# cProbs - probability of a value to be found in a given column
cProbs = [sum(c) for c in eachcol(mEyeColor)] ./ nObsEyeColor
# rProbs - probability of a value to be found in a given row
rProbs = [sum(r) for r in eachrow(mEyeColor)] ./ nObsEyeColor

# probability of a value to be found in a given cell of mEyeColor
# under H_0 (the samples are from the same population)
probsUnderH0 = [cp * rp for cp in cProbs for rp in rProbs]
round.(probsUnderH0, digits = 6)
[0.197958, 0.311226, 0.190817, 0.299999]

Here, [cp * rp for cp in cProbs for rp in rProbs] is an example of nested for loops enclosed in a comprehension. Notice that in the case of this comprehension there is no comma before the second for (the comma is present in the long, non-comprehension version of nested for loops in the link above).

Anyway, note that since the calculations from Section 4.3 assumed the probability independence, then the same assumption is made here. That means that, e.g. a given person cannot be classified at the same time as the citizen of the US and UK since we would have openly violated the assumption (some countries allow double citizenship, so you should think carefully about the inclusion criteria for the categories). Moreover, the eye color also needs to be a clear cut.

Out of the remaining output we are mostly interested in the statistic, namely \(\chi^2\) (chi square) statistic. Under the null hypothesis (\(H_{0}\), both groups come from the same population with a given fraction of blue eyed individuals) the probability distribution for counts to occur is called \(\chi^2\) (chi squared) distribution. Next, we calculate \(\chi^2\) (chi squared) statistic for the observed result (from mEyeColor). Then, we obtain the probability of a statistic greater than that to occur by chance. This is similar to the F-Statistic (Section 5.4) and L-Statistic (Section 5.8.2) we met before. Let’s see this in practice

observedCounts = [mEyeColor...]
expectedCounts = probsUnderH0 .* nObsEyeColor
# the statisticians love squaring numbers, don't they
chi2Diffs = ((observedCounts .- expectedCounts) .^2) ./ expectedCounts
chi2Statistic = sum(chi2Diffs)

(
    observedCounts,
    round.(expectedCounts, digits = 4),
    round.(chi2Diffs, digits = 4),
    round(chi2Statistic, digits = 4)
)
([220, 279, 161, 320],
[193.999, 305.001, 187.001, 293.999],
[3.4848, 2.2166, 3.6152, 2.2995],
11.6161)

The code is rather self explanatory. BTW. You might have noticed that: a) statisticians love squaring numbers (differences), and b) there are some similarities to the calculations of expected values from Section 4.5. Anyway, now, we can use the \(\chi^2\) statistic to get the p-value, like so

import Distributions as Dsts

function getDf(matrix::Matrix{Int})::Int
    nRows, nCols = size(matrix)
    return (nRows - 1) * (nCols - 1)
end

# p-value
# alternative: Dsts.ccdf(Dsts.Chisq(getDf(mEyeColor)), chi2Statistic)
1 - Dsts.cdf(Dsts.Chisq(getDf(mEyeColor)), chi2Statistic) |>
    x -> round(x, digits = 4)

0.0007

So, the pattern is quite similar to what we did in the case of F-Distribution/Statistic in Section 5.7.2. First we created the distribution of interest with the appropriate number of the degrees of freedom (why only the degrees of freedom matter see the conclusion of Section 5.8.2). Then we calculated the probability of a \(\chi^2\) Statistic being greater than the observed one by chance alone and that’s it.



CC BY-NC-SA 4.0 Bartlomiej Lukaszuk