6.7 Exercises - Comparisons of Categorical Data

Just like in the previous chapters, here you will find some exercises that you may want to solve to get as much as you can from this chapter (best option). Alternatively, you may read the task descriptions and the solutions (and try to understand them).

6.7.1 Exercise 1

In Section 6.3 and Section 6.5 we dealt with dfEyeColor and dfEyeColorFull, i.e. the data sets that were already in the form of a contingency table. Usually, this is not the case.

Imagine that you are a researcher and you want to find out if certain professions are associated with a greater risk of smoking cigarettes (perhaps as a way to alleviate stress). So you prepare a questionnaire. People answer two questions: “Q1. What is your profession?” and “Q2. Do you smoke?”. The answers to Q1 are placed in one column of a spreadsheet, the answers to Q2 are placed in another column. Example data could look like this:

import Random as Rand

Rand.seed!(321)
smoker = Rand.rand(["no", "yes"], 100)
profession = Rand.rand(["Lawyer", "Priest", "Teacher"], 100)

Write a function with the following signature:

function getContingencyTable(
    rowVect::Vector{String},
    colVect::Vector{String},
    )::Matrix{Int}

The function should take two arguments (observations as vectors of strings) and return a contingency table (Matrix{Int}) with the counts (similar to mEyeColor or mEyeColorFull). You may modify the function slightly, e.g. to return Dfs.DataFrame similar to the one produced by FreqTables.freqtable (it doesn’t have to be exact).

Test your function with the data presented above. Make sure it also works properly for smaller data sets, e.g.

Rand.seed!(321)
smokerSmall = Rand.rand(["no", "yes"], 10)
professionSmall = Rand.rand(["Lawyer", "Priest", "Teacher"], 10)

Here, the contingency table should contain zeros in some cells.
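If you are unsure what the result should look like, here is a minimal sketch of the counting idea on a tiny, made-up data set (the vectors rows and cols and all the other names are hypothetical, they are not the exercise data). It is only one possible approach, not necessarily the one used in the solutions. It starts from a table of zeros, which is why a combination that never occurs stays at 0.

```julia
# hypothetical toy data, not the data from the exercise
rows = ["no", "yes", "no", "no"]
cols = ["Lawyer", "Lawyer", "Priest", "Lawyer"]

# sorted unique labels define the order of rows and columns
rowLabels = sort(unique(rows)) # ["no", "yes"]
colLabels = sort(unique(cols)) # ["Lawyer", "Priest"]

# start from zeros, so unobserved combinations remain 0
counts = zeros(Int, length(rowLabels), length(colLabels))
for (r, c) in zip(rows, cols)
    i = findfirst(==(r), rowLabels) # row index of this answer
    j = findfirst(==(c), colLabels) # column index of this answer
    counts[i, j] += 1
end

counts # no smoking priest was observed, hence the 0 in cell [2, 2]
```

Wrapping the steps above into a function with the requested signature (and deciding how to report the labels) is left to you.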

Below you may find a list of functions that I found useful (you may check them in the docs). Of course you don’t have to use any of them. The functions are sorted alphabetically.

6.7.2 Exercise 2

In Section 6.3 we concluded that the populations of the US and UK differ with respect to eye color distribution (we used data from mEyeColor).

Still, it’s often nice to know not just the numbers themselves, but the proportions (or percentage distribution of the data in a table).

So, here is a task for you. Write the following functions:

function getColPerc(m::Matrix{Int})::Matrix{Float64}

# and

function getRowPerc(m::Matrix{Int})::Matrix{Float64}

that should work similarly to FreqTables.prop (prop(tbl2, margins=2), and prop(tbl2, margins=1)), i.e. they should return the column and row percentages of observations, respectively.

To reduce code duplication you may want to combine them into a single function, e.g. getPerc(m::Matrix{Int}, byRow::Bool)::Matrix{Float64} that returns row percentages when byRow is true, and column percentages otherwise. You may also want to round the numbers (percentages) to e.g. 2 decimal places.
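To check your results by hand, the arithmetic can be sketched with broadcasting on a small, made-up matrix (m below is hypothetical, it is not the eye color data). This is just one way to get the numbers, an alternative to explicit loops, and it assumes no row or column sums to zero.

```julia
# hypothetical counts, not the eye color data from the chapter
m = [10 30;
     30 30]

colSums = sum(m, dims=1) # 1x2 matrix with column totals
rowSums = sum(m, dims=2) # 2x1 matrix with row totals

# each cell divided by its column (or row) total, expressed in percent
colPerc = round.(m ./ colSums .* 100, digits=2)
rowPerc = round.(m ./ rowSums .* 100, digits=2)
```

A quick sanity check: every column of colPerc and every row of rowPerc should add up to (roughly) 100.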

In my solution I used nested for loops, but feel free to write it whatever way you like (as long as it works fine).

6.7.3 Exercise 3

The functions we developed previously (see Section 6.8.2) are nice and useful. Still, we might want to have a visual aid to help us with the interpretation of our data.

So here is another task for you. Using CairoMakie or your favorite plotting library, write a function that accepts a data frame like dfEyeColorFull and draws a stacked bar plot depicting column percentages (search the documentation for barplot).

You may use the functions we developed before.

If you want, you can make your function also draw row percentages (optional).

6.7.4 Exercise 4

This exercise is pretty easy and straightforward. In Section 6.4 we said that the chi squared (\(\chi^2\)) test requires the table to fulfill a few assumptions (see that section for the details).

So here is the task. Write a function with the following signature

runCategTestGetPVal(m::Matrix{Int})::Float64
# or
runCategTestGetPVal(df::Dfs.DataFrame)::Float64

The function takes a 2x2 matrix (like mEyeColor or mEyeColorSmall) or a data frame (like dfEyeColor). Then the function checks the assumptions mentioned above, runs Ht.ChisqTest or Ht.FisherExactTest on its input accordingly, and returns the obtained p-value. Feel free to use the functionalities we developed in this chapter (Section 6) and its sub-chapters.
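Assumptions of this kind are usually stated in terms of the expected counts of a table. Assuming that is the case here, a helper like the one sketched below may come in handy (getExpectedCounts and the matrix m are hypothetical names and data, they are not taken from the chapter). It relies on the standard formula: the expected count of a cell equals its row total times its column total, divided by the grand total. The thresholds to compare against are the ones from Section 6.4, so they are not hard-coded here.

```julia
# hypothetical 2x2 table of counts
m = [10 20;
     30 40]

# expected count of a cell = (row total * column total) / grand total
function getExpectedCounts(m::Matrix{Int})::Matrix{Float64}
    rowSums = sum(m, dims=2) # n x 1 matrix
    colSums = sum(m, dims=1) # 1 x k matrix
    # broadcasting n x 1 with 1 x k yields the n x k table of products
    return (rowSums .* colSums) ./ sum(m)
end

expected = getExpectedCounts(m)
```

The result has the same dimensions as the input and the same grand total, which is an easy thing to verify before you build the assumption checks on top of it.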

6.7.5 Exercise 5

In Section 6.6 we analyzed the data in dfEyeColorFull (alternatively mEyeColorFull) and concluded that the distribution of eye color between the two tested countries differed. Still, we were unable to tell which pairs of eye colors differ from each other in their distribution.

So here is the task. Write a function that accepts a matrix (or a data frame if you will) like mEyeColorFull/dfEyeColorFull (where the number of rows and/or columns with counts is greater than 2). The function should return a vector of all possible 2x2 matrices/data frames (I found getUniquePairs from Section 5.8.4 to be useful here, but you may use whatever you want).
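The splitting step could be sketched as follows for matrices. This is a minimal sketch on a made-up table; getIndexPairs and get2x2Matrices are hypothetical helpers written for this illustration (getIndexPairs merely stands in for whatever pairing function you prefer, it is not getUniquePairs from Section 5.8.4).

```julia
# hypothetical 3x2 table of counts
m = [1 2;
     3 4;
     5 6]

# all unique pairs of indices from 1 to n, e.g. (1, 2), (1, 3), (2, 3)
function getIndexPairs(n::Int)::Vector{Tuple{Int, Int}}
    return [(i, j) for i in 1:n for j in (i+1):n]
end

# every pair of rows combined with every pair of columns gives a 2x2 table
function get2x2Matrices(m::Matrix{Int})::Vector{Matrix{Int}}
    rowPairs = getIndexPairs(size(m, 1))
    colPairs = getIndexPairs(size(m, 2))
    return [m[[r1, r2], [c1, c2]]
            for (r1, r2) in rowPairs for (c1, c2) in colPairs]
end

subMatrices = get2x2Matrices(m)
```

For a 3x2 input there are three pairs of rows and one pair of columns, so the result contains three 2x2 matrices. With data frames you would additionally need to carry the row and column labels along.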

Once you have the data structure with the matrices/data frames, write another function that runs the appropriate test (runCategTestGetPVal from Section 6.7.4 above) on each of them and returns the p-values (choose the appropriate data structure).

In the last step write a function that applies a multiplicity correction (see Section 5.6) to the obtained p-values.
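To illustrate what such a correction does, here is a minimal sketch of the Bonferroni method written by hand (adjustBonferroni and the p-values are hypothetical, made up for this example). Bonferroni is only one of the available corrections, and the method and tooling from Section 5.6 may differ from it.

```julia
# hypothetical p-values, e.g. from three 2x2 comparisons
pvals = [0.01, 0.04, 0.5]

# Bonferroni: multiply each p-value by the number of tests, cap at 1
function adjustBonferroni(pvals::Vector{Float64})::Vector{Float64}
    return min.(pvals .* length(pvals), 1.0)
end

adjPvals = adjustBonferroni(pvals)
```

With three tests each p-value is tripled, so 0.04 becomes roughly 0.12 and is no longer below the customary 0.05 cutoff, whereas 0.5 is capped at 1.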

6.7.6 Exercise 6

To cool down, let’s end this chapter with something easy but potentially useful.

As you have learned by now, in programming we often end up using our old functions (or at least I do), although we tend to tweak them a little to adjust them to the ever changing needs.

In this task I want you to change the drawColPerc from Section 6.8.3 (or your own solution to Section 6.7.3). You can name the new function, e.g. drawColPerc2 (wow, how original). The new function should accept, among others, a bigger data frame (like dfEyeColorFull). Inside, it runs runCategTestsGetPVals that we developed in Section 6.8.5 (with multiplicity correction). Then it should draw the stacked barplots (one stacked barplot for each 2x2 data frame; the drawings should be set in one column, but in multiple rows, so a graph under a graph). If the difference in distribution in a data frame is statistically significant, add a stroke (strokewidth argument) to the barplot.



CC BY-NC-SA 4.0 Bartlomiej Lukaszuk