7.5 Correlation Pitfalls

The Pearson correlation coefficient is pretty useful (especially in connection with the Student’s t-test), but it shouldn’t be applied thoughtlessly.

Let’s take a look at Anscombe’s quartet.

import RDatasets as RD

anscombe = RD.dataset("datasets", "anscombe")
first(anscombe, 5)
Table 11: DataFrame for Anscombe’s quartet
X1 X2 X3 X4 Y1 Y2 Y3 Y4
10.0 10.0 10.0 8.0 8.04 9.14 7.46 6.58
8.0 8.0 8.0 8.0 6.95 8.14 6.77 5.76
13.0 13.0 13.0 8.0 7.58 8.74 12.74 7.71
9.0 9.0 9.0 8.0 8.81 8.77 7.11 8.84
11.0 11.0 11.0 8.0 8.33 9.26 7.81 8.47

The data frame is a part of RDatasets, a package that contains a collection of standard data sets used with the R programming language. This particular data frame was carefully designed to demonstrate the perils of relying blindly on correlation coefficients.
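As an aside, RDatasets lets you browse its contents programmatically via its datasets function. The snippet below is a side note rather than part of the original listing; it merely lists what R’s built-in datasets package ships.

import RDatasets as RD

# list the data sets bundled with R's built-in 'datasets' package
first(RD.datasets("datasets"), 5)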

fig = Cmk.Figure()
i = 0
for r in 1:2 # r - row
    for c in 1:2 # c - column
        i += 1
        xname = string("X", i)
        yname = string("Y", i)
        xs = anscombe[:, xname]
        ys = anscombe[:, yname]
        cor, pval = getCorAndPval(xs, ys)
        Cmk.scatter(fig[r, c], xs, ys,
            axis=(;
                title=string("Figure ", "ABCD"[i]),
                xlabel=xname, ylabel=yname,
                limits=(0, 20, 0, 15)
            ))
        Cmk.text!(fig[r, c], 9, 3,
            text="cor(x, y) = $(round(cor, digits=2))")
        Cmk.text!(fig[r, c], 9, 1,
            text="p-val = $(round(pval, digits=4))")
    end
end

fig

There’s not much to explain here. The only new part is the string function, which converts its arguments to strings (if they aren’t already) and glues them together into one long string. The rest is just plain drawing with CairoMakie.
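As a quick aside, here is string at work on a few made-up inputs (the values are illustrative only):

(
    string("X", 1),         # -> "X1"
    string("Figure ", "A"), # -> "Figure A"
    string(1, " + ", 2.5)   # -> "1 + 2.5"
)

Still, take a look at the picture below.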

Figure 29: Anscombe’s Quartet.

All the sub-figures in Figure 29 depict different types of relation between the X and Y variables, yet the correlation coefficients and p-values are the same. Two points deserve notice here. In Figure B the points lie in perfect order on a curve, so in a perfect world the correlation coefficient should be equal to 1. Yet it is not, since it only measures the spread of the points around an imaginary straight line. Moreover, correlation is sensitive to outliers. In Figure D the X and Y variables appear not to be associated at all (for X = 8, Y can take any value). Again, in a perfect world the correlation coefficient should be equal to 0. Still, the single outlier on the far right (which in real life might have arisen from a typographical error) pumps it up to 0.82, or what we would call a very strong correlation. The lesson to be learned here is: don’t trust the numbers alone, and whenever you can, draw a scatter plot to double-check them. And remember, “All models are wrong, but some are useful” (George Box).
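Returning to the outlier point for a moment, here is a minimal sketch of the effect in isolation. The numbers below are made up for the demonstration and are not part of Anscombe’s quartet:

import Distributions as Dsts
import Random as Rand
import Statistics as Stats

Rand.seed!(321)
xs = Rand.rand(Dsts.Normal(8, 0.5), 10) # a random cloud of x-coordinates
ys = Rand.rand(Dsts.Normal(7, 1.5), 10) # y-coordinates, independent of xs

(
    Stats.cor(xs, ys),              # most likely small (no real association)
    Stats.cor([xs; 19], [ys; 12.5]) # a single distant point inflates it
)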

Other pitfalls are also possible. For instance, imagine you measured the body and tail lengths of a certain species of mouse; here are your results.

# if you are in 'code_snippets' folder, then use: "./ch07/miceLengths.csv"
# if you are in 'ch07' folder, then use: "./miceLengths.csv"
miceLengths = Csv.read(
    "./code_snippets/ch07/miceLengths.csv",
    Dfs.DataFrame)
first(miceLengths, 5)
Table 12: Body and tail lengths [cm] of a certain mouse species (fictitious data).
bodyCm tailCm sex
11.3 2.55 f
11.18 2.22 f
9.42 2.54 f
9.21 2.2 f
9.97 2.63 f

You are interested to know if the tail length is associated with the body length of the animals.

getCorAndPval(miceLengths.bodyCm, miceLengths.tailCm)
(0.8899347709623199, 1.5005298337200657e-7)

Clearly it is, and very strongly at that. Or is it? Well, let’s take a look.
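The figure below could be produced with something along these lines (a sketch only; the exact styling of the original figure, including its dashed trend line, may differ):

import CairoMakie as Cmk

fig = Cmk.Figure()
ax = Cmk.Axis(fig[1, 1],
    title="Mice body length vs. tail length",
    xlabel="body [cm]", ylabel="tail [cm]")
for (sex, marker) in [("f", :circle), ("m", :utriangle)]
    rows = miceLengths[miceLengths.sex .== sex, :] # one sex at a time
    Cmk.scatter!(ax, rows.bodyCm, rows.tailCm, marker=marker, label=sex)
end
Cmk.axislegend(ax, position=:lt)
fig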

Figure 30: Mice body length vs. tail length.

It turns out that we have two clusters of points. In both of them the points seem to be randomly scattered. This could be confirmed by testing correlation coefficients for the clusters.

# fml - female mice lengths
# mml - male mice lengths
fml = miceLengths[miceLengths.sex .== "f", :] # choose only females
mml = miceLengths[miceLengths.sex .== "m", :] # choose only males

(
    getCorAndPval(fml.bodyCm, fml.tailCm),
    getCorAndPval(mml.bodyCm, mml.tailCm)
)
((-0.1593819718041706, 0.6821046994037891),
 (-0.02632446813765734, 0.9387606491398499))

Alternatively, you could read the documentation for the functionality built into DataFrames.jl to obtain the desired insight. Doing so takes time and effort, and it causes some irritation at first (trust me, I know). Still, there are no shortcuts to any place worth going. So, you may decide to use Dfs.groupby and Dfs.combine to get a similar result.

# gDf - grouped data frame
Dfs.groupby(miceLengths, :sex) |>
    gDf -> Dfs.combine(gDf, [:tailCm, :bodyCm] => Stats.cor => :r)
Table 13: Pearson correlation coefficients for miceLengths data frame.
sex r
f -0.1593819718041706
m -0.02632446813765729

Note: You could replace Stats.cor with getCorAndPval in the snippet above. This should work if you first changed the signature of the function from getCorAndPval(v1::Vector{<:Real}, v2::Vector{<:Real}) to getCorAndPval(v1::AbstractVector{<:Real}, v2::AbstractVector{<:Real}), since the columns handed over by Dfs.combine come from a grouped data frame and are views (SubArrays), not plain Vectors. A more comprehensive DataFrames tutorial can be found, e.g., here (if you don’t know what to do with *.ipynb files, then you may just click on any of them to see its content in a web browser).
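For reference, the relaxed version might look like the sketch below. The body here recomputes the two values with the HypothesisTests.jl package (imported as Ht) rather than repeating the chapter’s earlier definition verbatim, so treat it as an illustration of the signature only.

import HypothesisTests as Ht

function getCorAndPval(
    v1::AbstractVector{<:Real},
    v2::AbstractVector{<:Real})::Tuple{Float64, Float64}
    ct = Ht.CorrelationTest(v1, v2) # t-test for Pearson's correlation
    return (ct.r, Ht.pvalue(ct))
end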

Anyway, the Pearson correlation coefficients are small and not statistically significant (p > 0.05). But since the two clusters of points lie in opposite corners of the graph, the overall correlation measures their spread along the imaginary dashed line in Figure 30. This inflates the value of the coefficient (compare with the explanation for z1, z2 and jitter in Section 7.4). Therefore, it is always good to inspect a graph (scatter plot) to see if there are any clusters of points. Such clusters are usually the result of some grouping present in the data (either different experimental groups/treatments or some natural grouping). Sometimes we may be unaware of the groups in our data set. Still, if we do know about them, it is a good idea to inspect both the overall correlation and the correlation coefficient for each group separately.

As the last example let’s take a look at this data frame.

# if you are in 'code_snippets' folder, then use: "./ch07/candyBars.csv"
# if you are in 'ch07' folder, then use: "./candyBars.csv"
candyBars = Csv.read(
    "./code_snippets/ch07/candyBars.csv",
    Dfs.DataFrame)
first(candyBars, 5)
Table 14: Candy bar composition [g] (fictitious data).
total carb fat
44.49 30.23 9.67
48.39 29.31 12.48
49.83 30.95 10.58
40.51 25.22 9.89
44.51 29.45 10.15

Here we have a data set on the composition of different chocolate bars. You are interested to see whether the carbohydrate (carb) content of the bars is associated with their fat mass.

getCorAndPval(candyBars.carb, candyBars.fat)
(0.12176486958519653, 0.7375535843598793)

And it appears it is not. OK, no big deal; what about the carb content and the total mass of a candy bar?

getCorAndPval(candyBars.carb, candyBars.total)
(0.822779226943004, 0.0034638410860259317)

Now we got it. It’s big (r > 0.8) and it’s real (p ≤ 0.05). But did it really make sense to test that?

If we have a random variable aa, then it is going to be perfectly correlated with itself.

Rand.seed!(321)
aa = Rand.rand(Dsts.Normal(100, 15), 10)

getCorAndPval(aa, aa)
(1.0, 0.0)

On the other hand, it shouldn’t be correlated with another, independently drawn random variable bb.

bb = Rand.rand(Dsts.Normal(100, 15), 10)

getCorAndPval(aa, bb)
(0.19399997195558746, 0.5912393958185727)

Now, if we add the two variables together we will get the total (cc), which will be correlated with both aa and bb.

cc = aa .+ bb

(
    getCorAndPval(aa, cc),
    getCorAndPval(bb, cc)
)
((0.7813386818990972, 0.007608814877251735),
 (0.763829856046036, 0.01012008535535891))

This is because while correlating aa with cc we are partially correlating aa with itself (since cc = aa .+ bb). In general, the greater the portion of cc that aa makes up, the greater the correlation coefficient. So, although possible, it makes little logical sense to compare a part of something with its total. Therefore, in reality running getCorAndPval(candyBars.carb, candyBars.total) is pointless despite the seemingly interesting result it produces.
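A quick numeric sketch of that claim, reusing aa and bb from above (the weights are arbitrary and merely shrink bb’s share of the total):

(
    Stats.cor(aa, aa .+ bb),        # aa is roughly half of the total
    Stats.cor(aa, aa .+ 0.5 .* bb), # aa dominates the total
    Stats.cor(aa, aa .+ 0.1 .* bb)  # aa is nearly the whole total
)

The closer bb’s contribution gets to zero, the closer the coefficient should climb toward 1.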


