Solution - Build SH*T with Julia

18.2 Solution

Let’s start as we did before by checking the file size.

#in 'code_snippets' folder use "./translation/mrna_seq.txt"
#in 'transcription' folder use "./dna_seq_template_strand.txt"
filePath = "./code_snippets/translation/mrna_seq.txt"
filesize(filePath)

333

OK, it’s less than 1 KiB. Let’s read it and get a preview of the data.

mRna = open(filePath) do file
    read(file, Str)
end
mRna[1:60]

auggcccuguggaugcgccuccugccccugcuggcgcugcuggcccucuggggaccugac

It looks good, no spaces between the letters, no newline (\n) characters (we removed them previously in Section 17.2).

The only issue with mRna is that codon2aa dictionary contains the codons (keys) in uppercase letters. All we need to do is to uppercase the letters in mRna using Julia’s function of the same name.

mRna = uppercase(mRna)
mRna[1:5]

AUGGC

OK, time to start the translation. First let’s translate a triplet of nucleotide bases (aka codon) to an amino acid.

# translates codon/triplet to amino acid IUPAC
function getAA(codon::Str)::Str
    @assert length(codon) == 3 "codon must contain 3 nucleotide bases"
    aaAbbrev::Str = get(codon2aa, codon, "???")
    aaIUPAC::Str = get(aa2iupac, aaAbbrev, "???")
    return aaIUPAC
end

The function is rather simple. It accepts a codon (like "AUG"), next it translates a codon to an amino acid (abbreviated with 3 letters) using previously defined codon2aa dictionary. Then the 3-letter abbreviation is coded with a single letter recommended by IUPAC (using aa2iupac dictionary). If at any point no translation was found "???" is returned.

Now, time for translation.

function translate(mRnaSeq::Str)::Str
    len::Int = length(mRnaSeq)
    @assert len % 3 == 0 "the number of bases must be multiple of 3"
    aas::Vec{Str} = fill("", Int(len/3))
    aaInd::Int = 0
    codon::Str, aa::Str = "", ""
    for i in 1:3:len
        aaInd += 1
        codon = mRnaSeq[i:(i+2)] # variable local to for loop
        aa = getAA(codon) # variable local to for loop
        if aa == "Stop"
            break
        end
        aas[aaInd] = aa
    end
    return join(aas)
end

We begin with some checks for the sequence. Then, we define the vector aas (aas - amino acids) holding our result. We initialize it with empty strings using fill function. We will assign the appropriate amino acids to aas based on the aaInd (aaInd - amino acid index) which we increase with every iteration (aaInd += 1). In for loop we iterate over each consecutive index with which a triple begins (1:3:len will return numbers as [1, 4, 7, 10, 13, ...]). Every consecutive codon (3 nucleotic bases) is obtained from mRNASeq using indexing by adding +2 to the index that starts a triple (e.g. for i = 1, the three bases are at positions 1:3, for i = 4, those are 4:6, etc.). The codon is used to obtain the corresponding amino acid (aa) with getAA. If the aa is a so-called stop codon, then we immediately break free of the loop. Otherwise, we insert the aa to the appropriate place [aaInd] in aas. In the end we collapse the vector of strings into one long string with join. It is as if we used the string concatenation operator * on each element of aas like so aas[1] * aas[2] * aas[3] * .... Notice, that e.g. "A" * "" or "A" * "" * "" is still "A". This effectively gets rid of any empty aas elements that remained after reaching aa == "Stop" and break early.

Let’s see does it work.

protein = translate(mRna)
protein == expectedAAseq

true

Congratulations! You have successfully synthesized pre-pro-insulin, a product transformed into a hormon that saved many a live. It could be only a matter of time before you achieve something greater still.

The above (translate) is not the only possible solution. For instance, if you are a fan of functional programming paradigm you may try something like.

import Base.Iterators.takewhile as takewhile

function translate2(mRnaSeq::Str)::Str
    len::Int = length(mRnaSeq)
    @assert len % 3 == 0 "the number of bases is not multiple of 3"
    ranges::Vec{UnitRange{Int}} = map(:, 1:3:len, 3:3:len)
    codons::Vec{Str} = map(r -> mRnaSeq[r], ranges)
    aas::Vec{Str} = map(getAA, codons)
    return takewhile(aa -> aa != "Stop", aas) |> join
end

We start by defining ranges that will help us get particular codons in the next step. For that purpose you take two sequences for start and end of a codon and glue them together with :. For instance map(:, 1:3:9, 3:3:9) roughly translates into map(:, [1, 4, 7], [3, 6, 9]) which yields [1:3, 4:6, 7:9], i.e. a vector of UnitRange{Int}. A UnitRange{Int} is a range composed of Ints separated by one unit, so by 1, like in 4:6 ([4, 5, 6] after expansion) mentioned above. Next, we map over those ranges and use each one of them (r) to get (->) a respective codon (mRnaSeq[r]). Then, we map over the codons to get respective amino acids (getAA). Finally, we move from left to right through the amiono acids (aas) vector and take its elements (aa) as long as they are not equal "Stop". As the last step, we collapse the result with join to get one big string.

protein2 = translate2(mRna)
expectedAAseq == protein2

true

Works as expected. The functional solution (translate2) often has fewer lines of code (8 vs 17 for translate2 and translate, respectively). It also consists of a series of consecutive logical steps, which is quite nice. However, for a beginner (or someone that doesn’t know this paradigm well) it appears more enigmatic (and therefore off-putting). Moreover, in general it is expected to be a bit slower than the more imperative for loop version (translate). This should be evident with long sequences [try BenchmarkTools.@benchmark translate($mRna^20) vs BenchmarkTools.@benchmark translate2($mRna^20) in the REPL (type it after julia> prompt)].

Note. $ (see above) is an interpolation of global variable recommended by the package. On the other hand, ^, (again see above) replicates a string n times, e.g. "ab" ^ 3 = "ababab". Interestingly, although translate(mRna^20) and translate2(mRna^20) receive a strand of RNA 20 times longer than mRna they still return the same amino acid sequence as before. Test yourself and explain why. This will also help you realize why translate2 is slower than its counterpart.

In the said case (with mRna^20) the difference between $\approx 27\ [\mu s]$ and $\approx 140\ [\mu s]$ (on my laptop) shouldn’t be noticeable by a human ($140\ [\mu s]$ is less than 1/1’000th of a second). Therefore, if the performance is acceptable you may want to go with the functional version.

Anyway, both Section 17 and Section 18 were inspired by the lecture of this ResearchGate entry based on which the DNA sequence was obtained (dna_seq_template_strand.txt from Section 17).

18.1 Problem ← → 19 Stem and Leaf Plot

CC BY-NC-SA 4.0 Bartlomiej Lukaszuk