7.4 Correlation

Correlation is most frequently expressed in terms of the Pearson correlation coefficient, which itself relies on the covariance we met in the previous section. Its formula is pretty straightforward:

# calculates the Pearson correlation coefficient
function getCor(v1::Vector{<:Real}, v2::Vector{<:Real})::Float64
    return getCov(v1, v2) / (Stats.std(v1) * Stats.std(v2))
end
getCor (generic function with 1 method)

Note: To calculate the Pearson correlation coefficient you may also use Statistics.cor.

The correlation coefficient is just the covariance (numerator) divided by the product of the two standard deviations (denominator). The lowest absolute value possible for covariance (abs(getCov(v1, v2))) is 0. The greatest absolute value possible for covariance is equal to Stats.std(v1) * Stats.std(v2). Therefore, the correlation coefficient (often abbreviated as r) takes values from 0 to 1 for positive covariance and from 0 to -1 for negative covariance. The more tightly our points lie on an imaginary trend line, the greater abs(r) is.
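To convince ourselves about those bounds, here is a quick self-contained sketch with made-up data (I use Statistics.cor, which matches our getCor):

```julia
import Statistics as Stats

x = collect(1.0:10.0)

# points lying exactly on a rising trend line: r should be 1
rPerfect = Stats.cor(x, 2 .* x .+ 3)

# points lying exactly on a falling trend line: r should be -1
rInverse = Stats.cor(x, -x)

(rPerfect, rInverse)
```

Any data that sit exactly on a straight (non-horizontal) line give abs(r) equal to 1; adding scatter around the line pulls r towards 0.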

Let’s see how it works.

biomassCors = (
    getCor(biomass.plantAkg, biomass.rainL),
    getCor(biomass.plantAkg .* 2.205, biomass.rainL), # pounds
    getCor(biomass.plantBkg, biomass.rainL),
    getCor(biomass.plantBkg .* 2.205, biomass.rainL), # pounds
)
round.(biomassCors, digits = 2)
(0.78, 0.78, 0.53, 0.53)

Clearly, the new and improved coefficient is more useful than the old one (covariance). A large spread of points along the imaginary line in Figure 27 yields a correlation coefficient close to 0. A small spread of points, on the other hand, results in a coefficient close to -1 or 1. So now we can be fairly sure that the association between plantA and rainfall is stronger than the one between plantB and rainfall.
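Notice also that kilograms and pounds gave identical rs above. That is no accident: a linear change of units rescales the covariance and the standard deviation by the same factor, so the factor cancels out in the division. A quick sketch with made-up numbers:

```julia
import Statistics as Stats

x = [1.0, 2.0, 4.0, 7.0, 11.0]  # made-up measurements, e.g. kg
y = [2.0, 3.0, 5.0, 9.0, 12.0]  # made-up rainfall

covKg = Stats.cov(x, y)
covLb = Stats.cov(x .* 2.205, y)  # covariance scales with the units

rKg = Stats.cor(x, y)
rLb = Stats.cor(x .* 2.205, y)  # r stays the same: it is unit-free

(covKg, covLb, rKg, rLb)
```

This unit independence is one of the main practical advantages of r over the raw covariance.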

Importantly, the correlation coefficient depends not only on the scatter of points along an imaginary line, but also on the slope of the line. Observe:

import Random as Rand

Rand.seed!(321)

jitter = Rand.rand(-0.2:0.01:0.2, 10)
z1 = collect(1:10)
z2 = repeat([5], 10)
(
    getCor(z1 .+ jitter, z1), # / imaginary line
    getCor(z1, z2 .+ jitter) # - imaginary line
)
(0.9992378634323702, -0.3215268421510342)

Feel free to draw side-by-side scatter plots for the example above (remember to link the axes). In the code snippet above the spread of data points along the imaginary line is the same in both cases. Yet the correlation coefficient is much smaller in the second case. This is because of the covariance that sits in the numerator of the getCor function. The covariance is greater when the points change together in a given direction. In the second case the change is smaller and non-systematic, hence the lower correlation coefficient. You may want to keep that in mind, as it will come in handy once we talk about correlation pitfalls in Section 7.5.
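To see that it is indeed the numerator doing the work here, we can compute the covariances for the two cases (same seed and jitter as in the snippet above):

```julia
import Random as Rand
import Statistics as Stats

Rand.seed!(321)
jitter = Rand.rand(-0.2:0.01:0.2, 10)
z1 = collect(1:10)
z2 = repeat([5], 10)

# systematic change together: large covariance, hence large abs(r)
covSlope = Stats.cov(z1 .+ jitter, z1)

# non-systematic change: covariance (and therefore r) close to zero
covFlat = Stats.cov(z1, z2 .+ jitter)

(covSlope, covFlat)
```

The denominators in both cases are of comparable size, so the difference in r is driven almost entirely by the covariance.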

Anyway, the interpretation of the correlation coefficient differs between textbooks and fields of science, but in biology it is roughly approximated by the following cutoffs:

Note: The Pearson correlation coefficient is often abbreviated as r. The symbols ] and ) signify a closed and an open interval end, respectively. So, x in range [0, 1] means 0 <= x <= 1, whereas x in range [0, 1) means 0 <= x < 1.

In general, if x and y are correlated then this may mean one of a few things, the most obvious of which are:

- x influences y,
- y influences x,
- both x and y are influenced by some other (hidden) variable,
- x and y are correlated by pure chance.

We can protect ourselves (to a certain extent) against the last contingency with our good old Student's t-test (see Section 5.2). As stated on the Wikipedia page:

[…] Pearson’s correlation coefficient follows Student’s t-distribution with degrees of freedom n − 2. Specifically, if the underlying variables have a bivariate normal distribution the variable

\(t = \frac{r}{\sigma_r} = r \cdot \sqrt{\frac{n-2}{1-r^2}}\)

has a Student's t-distribution in the null case (zero correlation)

Let’s put that knowledge to good use:

# calculates the Pearson correlation coefficient and pvalue
# assumption (not tested in the function): v1 & v2 come from normal distributions
function getCorAndPval(
    v1::Vector{<:Real}, v2::Vector{<:Real})::Tuple{Float64, Float64}
    r::Float64 = Stats.cor(v1, v2) # or: getCor(v1, v2)
    n::Int = length(v1) # num of points
    df::Int = n - 2
    t::Float64 = r * sqrt(df / (1 - r^2)) # t-statistic
    leftTail::Float64 = Dsts.cdf(Dsts.TDist(df), t)
    pval::Float64 = (t > 0) ? (1 - leftTail) : leftTail
    return (r, pval * 2) # (* 2) two-tailed probability
end
getCorAndPval (generic function with 1 method)

The function is just a translation of the formula given above, plus some calculations similar to those we did in Section 5.2 to get the p-value. And now for our correlations.

biomassCorsPvals = (
    getCorAndPval(biomass.plantAkg, biomass.rainL),
    getCorAndPval(biomass.plantAkg .* 2.205, biomass.rainL), # pounds
    getCorAndPval(biomass.plantBkg, biomass.rainL),
    getCorAndPval(biomass.plantBkg .* 2.205, biomass.rainL), # pounds
)
biomassCorsPvals
((0.7820227869193526, 4.635013786202791e-5),
 (0.7820227869193522, 4.635013786202791e-5),
 (0.526545847035062, 0.017073389709765907),
 (0.5265458470350619, 0.017073389709765907))

We can see that both correlation coefficients are unlikely to have occurred by chance alone (\(p \le 0.05\)). Therefore, we can conclude that in each case the biomass is associated with the amount of water a plant receives. I don't know a formal test to compare two correlation coefficients, but based on the rs alone it appears that the biomass of plantA is more tightly related to (or perhaps even relies more on) the amount of water than that of plantB.
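For the record, one approach often described in the literature for comparing two Pearson rs (not used in this text; the sample sizes below are made up for illustration) is Fisher's z-transformation: each r is mapped with atanh, and the difference of the transformed values is compared against a standard normal distribution.

```julia
# a sketch of Fisher's z-transformation for comparing two Pearson
# correlation coefficients; n1 and n2 are the (made-up) sample sizes
function compareCors(r1::Float64, n1::Int, r2::Float64, n2::Int)::Float64
    z1, z2 = atanh(r1), atanh(r2)           # Fisher's transformation
    se = sqrt(1 / (n1 - 3) + 1 / (n2 - 3))  # std. error of z1 - z2
    return (z1 - z2) / se  # compare abs(z) with ~1.96 for alpha = 0.05
end

# rs roughly like ours, hypothetical sample sizes of 20 each
z = compareCors(0.78, 20, 0.53, 20)
```

With small samples the resulting z tends to be modest, so a visible difference between two rs does not automatically mean a statistically detectable one.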



CC BY-NC-SA 4.0 Bartlomiej Lukaszuk