Probability - theory and practice - Romeo and Julia, where Romeo is Basic Statistics

4.4 Probability - theory and practice

OK, in the previous chapter (see Section 4.3) we said that a person with blood group AB would produce gametes A and B with probability 50% (p = \(\frac{1}{2}\) = 0.5) each. A reference value for sperm count is 16’000’000 per mL or 16’000 per \(\mu L\). Given that last value, we would expect 8’000 cells (16’000 * 0.5) to contain allele A and 8’000 (16’000 * 0.5) cells to contain allele B.

Let’s put that to the test.

Wait! Hold your horses! We’re not going to take biological samples. Instead we will do a computer simulation.

import Random as Rand
Rand.seed!(321) # optional, needed for reproducibility
gametes = Rand.rand(["A", "B"], 16_000)
first(gametes, 7)

["B", "A", "B", "A", "B", "A", "A"]

First, we import a package to generate random numbers (import Random as Rand). Then we set seed to some arbitrary number (Rand.seed!(321)) in order to reproduce the results see the docs. Thanks to the above you should get the exact same result as I did (assuming you’re using the same version of Julia). Then we draw 16’000 gametes out of two available (gametes = Rand.rand(["A", "B"], 16_000)) with function rand (drawing with replacement) from Random library (imported as Rand). Finally, since looking through all 16’000 gametes is tedious we display only first 7 (first(gametes, 7)) to have a sneak peak at the result.

Let’s write a function that will calculate the number of gametes for us.

function getCounts(v::Vector{T})::Dict{T,Int} where T
    counts::Dict{T,Int} = Dict()
    for elt in v
        if haskey(counts, elt) #1
            counts[elt] = counts[elt] + 1 #2
        else #3
            counts[elt] = 1 #4
        end #5
    end
    return counts
end

Try to figure out what happened here on your own. If you need a refresher on dictionaries in Julia see Section 3.5.3 or the docs.

Briefly, first we initialize an empty dictionary (counts::Dict{T,Int} = Dict()) with keys of some type T (elements of that type compose the vector v). Next, for every element (elt) in the vector v we check if it is present in the counts (if haskey(counts, elt)). If it is we add 1 to the previous count (counts[elt] = counts[elt] + 1). If not (else) we put the key (elt) into the dictionary with count 1. In the end we return the result (return counts). The if ... else block (lines with comments #1-#5) could be replaced with one line (counts[elt] = get(counts, elt, 0) + 1), but I thought the more verbose version would be easier to understand.

Let’s test it out.

gametesCounts = getCounts(gametes)
gametesCounts

Dict{String, Int64} with 2 entries:
  "B" => 8082
  "A" => 7918

Hmm, that’s odd. We were suppose to get 8’000 gametes with allele A and 8’000 with allele B. What happened? Well, reality. After all “All models are wrong, but some are useful”. Our theoretical reasoning was only approximation of the real world and as such cannot be precise (although with greater sample sizes comes greater precision). For instance, you can imagine that a fraction of the gametes were damaged (e.g. due to some unspecified environmental factors) and underwent apoptosis (aka programmed cell death). So that’s how it is, deal with it.

OK, let’s see what are the experimental probabilities we got from our hmm… experiment.

function getProbs(counts::Dict{T, Int})::Dict{T,Float64} where T
    total::Int = sum(values(counts))
    return Dict(k => v/total for (k, v) in counts)
end

First we calculate total counts no matter the gamete category (sum(values(counts))). Then we use a dictionary comprehension, similar to the comprehension we met before (see Section 3.6.3). Briefly, for each key and value in counts (for (k,v) in counts) we create the same key in a new dictionary with a new value being the proportion of v in total (k => v/total).

And now the experimental probabilities.

gametesProbs = getProbs(gametesCounts)
gametesProbs

Dict{String, Float64} with 2 entries:
  "B" => 0.505125
  "A" => 0.494875

One last point. While writing numerous programs I figured out it is sometimes better to represent things (internally) as numbers and only in the last step present them in a more pleasant visual form to the viewer (this way may be faster computationally). In our case we could have used 0 as allele A and 1 as allele B like so.

Rand.seed!(321)
gametes = Rand.rand([0, 1], 16_000)
first(gametes, 7)

[1, 0, 1, 0, 1, 0, 0]

Then to get the counts of the alleles I could type:

alleleBCount = sum(gametes)
alleleACount = length(gametes) - alleleBCount
(alleleACount, alleleBCount)

(7918, 8082)

And to get the probabilities for the alleles I could simply type:

alleleBProb = sum(gametes) / length(gametes)
alleleAProb = 1 - alleleBProb
(round(alleleAProb, digits=6), round(alleleBProb, digits=6))

(0.494875, 0.505125)

Go ahead. Compare the numbers with those that you got previously and explain it to yourself why this second approach works. Once you’re done click the right arrow to explore probability distributions in the next section.

Note: Similar functionality to getCounts and getProbs can be found in StatsBase.jl, see: countmap and proportionmap.

4.3 Probability - properties ← → 4.5 Probability distribution

CC BY-NC-SA 4.0 Bartlomiej Lukaszuk