6.2 Flashback

We deal with categorical data when a variable can take a value from a small set of values, each clearly distinct from the others. For instance, the results of coin tosses or dice rolls fall into one of a few distinctive categories. As stated in Section 4 and its subsections, the number of successes in a series of coin tosses follows the binomial distribution. In line with that notion, in Exercise 3 (see Section 4.8.3 and Section 4.9.3) we calculated the probability that Peter is a better tennis player than John if he won 5 games out of 6. The two-tailed probability was roughly equal to 0.22. Once we know the logic behind the calculations (see Section 4.9.3), we can fast forward to the solution with Htests.BinomialTest like so

import HypothesisTests as Htests

Htests.BinomialTest(5, 6, 0.5)
# or just: Htests.BinomialTest(5, 6)
# since 0.5 is the default prob. for the population
Binomial test
-------------
Population details:
    parameter of interest:   Probability of success
    value under h_0:         0.5
    point estimate:          0.833333
    95% confidence interval: (0.3588, 0.9958)

Test summary:
    outcome with 95% confidence: fail to reject h_0
    two-sided p-value:           0.2188

Details:
    number of observations: 6
    number of successes:    5

Works like a charm, don’t you think? Here we got a two-tailed p-value. Oversimplifying a bit, we can say that the 95% confidence interval is an estimate of the true probability of Peter’s victory in a game (from the data it is 5/6 ≈ 0.83) and that it includes 0.5 (our probability under \(H_{0}\) = 0.5). I leave the rest of the output for you to decipher (as a mini-exercise).

In general, Htests.BinomialTest is useful when you want to compare an obtained experimental result that may fall into one of two categories (generally called: success or failure) with a theoretical binomial distribution with a known probability of success (we check whether the obtained result is compatible with that distribution). If we interpret this statement a bit more creatively, we may find other use cases for the test.
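To make the link with Section 4.9.3 explicit, here is a minimal sketch that recomputes the two-tailed p-value by hand (it assumes the Distributions package imported as Dsts, an alias not used elsewhere in this section):

import Distributions as Dsts

# theoretical distribution of wins under h_0 (6 games, prob. of success 0.5)
bd = Dsts.Binomial(6, 0.5)
# probability of a result at least as extreme as the observed 5 wins
# (i.e. 5 or 6 wins), doubled because the test is two-tailed
2 * sum(Dsts.pdf(bd, k) for k in 5:6) # ≈ 0.21875, matches the 0.2188 above

Note that simply doubling the one-tailed probability works here only because the distribution under \(H_{0}\) (probability of success = 0.5) is symmetric.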

Let’s look at an interesting example from the field of biological sciences. Imagine there is some disease that you want to study. Its prevalence in the general population is estimated to be ≈ \(\frac{10}{100}\) = 0.1 = 10%. You happened to find a human population on a desert island and noticed that 519 adults out of 3’202 suffer from the disease of interest. You run the test to see whether that differs from the general population [here success (if I may call it so) is the presence of the disease, and the theoretical distribution is the distribution of the disease in the general population].

Htests.BinomialTest(519, 3202, 0.1)
Binomial test
-------------
Population details:
    parameter of interest:   Probability of success
    value under h_0:         0.1
    point estimate:          0.162086
    95% confidence interval: (0.1495, 0.1753)

Test summary:
    outcome with 95% confidence: reject h_0
    two-sided p-value:           <1e-26

Details:
    number of observations: 3202
    number of successes:    519

And it turns out that it does. Congratulations, you discovered a local population with a different, clearly higher prevalence of the disease. Now you (or other people) can study the population more closely (e.g. gene screening) in order to find the features that trigger the onset of (or predispose to develop) the disease.
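If you prefer to work with the numbers programmatically rather than read them off the printout, the test object can be queried directly. Below is a small sketch using functions from HypothesisTests (the variable name btest is my own choice):

btest = Htests.BinomialTest(519, 3202, 0.1)

Htests.pvalue(btest) # two-sided p-value, same as in the printout above
Htests.confint(btest) # 95% confidence interval for the true proportion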

The story is not that far-fetched, since there are human populations that are of particular interest to scientists due to their unusually common occurrence of some diseases (e.g. the Akimel O’odham and their high prevalence of type 2 diabetes).


