Let’s start as we did before by checking the file size.
#in 'code_snippets' folder use "./translation/mrna_seq.txt"
#in 'transcription' folder use "./dna_seq_template_strand.txt"
filePath = "./code_snippets/translation/mrna_seq.txt"
filesize(filePath)
333
OK, it’s less than 1 KiB. Let’s read it and get a preview of the data.
mRna = open(filePath) do file
read(file, Str)
end
mRna[1:60]
auggcccuguggaugcgccuccugccccugcuggcgcugcuggcccucuggggaccugac
It looks good, no spaces between the letters, no newline (\n
) characters (we removed them previously in Section 17.2).
The only issue with mRna
is that codon2aa
dictionary contains the codons (keys) in uppercase letters. All we need to do is to uppercase the letters in mRna
using Julia’s function of the same name.
mRna = uppercase(mRna)
mRna[1:5]
AUGGC
OK, time to start the translation. First let’s translate a triplet of nucleotide bases (aka codon) to an amino acid.
# translates codon/triplet to amino acid IUPAC
function getAA(codon::Str)::Str
@assert length(codon) == 3 "codon must contain 3 nucleotide bases"
aaAbbrev::Str = get(codon2aa, codon, "???")
aaIUPAC::Str = get(aa2iupac, aaAbbrev, "???")
return aaIUPAC
end
The function is rather simple. It accepts a codon (like "AUG"
), next it translates a codon to an amino acid (abbreviated with 3 letters) using previously defined codon2aa
dictionary. Then the 3-letter abbreviation is coded with a single letter recommended by IUPAC (using aa2iupac
dictionary). If at any point no translation was found "???"
is returned.
Now, time for translation.
function translate(mRnaSeq::Str)::Str
len::Int = length(mRnaSeq)
@assert len % 3 == 0 "the number of bases must be multiple of 3"
aas::Vec{Str} = fill("", Int(len/3))
aaInd::Int = 0
codon::Str, aa::Str = "", ""
for i in 1:3:len
aaInd += 1
codon = mRnaSeq[i:(i+2)] # variable local to for loop
aa = getAA(codon) # variable local to for loop
if aa == "Stop"
break
end
aas[aaInd] = aa
end
return join(aas)
end
We begin with some checks for the sequence. Then, we define the vector aas
(aas
- amino acids) holding our result. We initialize it with empty strings using fill
function. We will assign the appropriate amino acids to aas
based on the aaInd
(aaInd
- amino acid index) which we increase with every iteration (aaInd += 1
). In for
loop we iterate over each consecutive index with which a triple begins (1:3:len
will return numbers as [1, 4, 7, 10, 13, ...]
). Every consecutive codon
(3 nucleotic bases) is obtained from mRNASeq
using indexing by adding +2
to the index that starts a triple (e.g. for i = 1, the three bases are at positions 1:3, for i = 4, those are 4:6, etc.). The codon
is used to obtain the corresponding amino acid (aa
) with getAA
. If the aa
is a so-called stop codon, then we immediately break
free of the loop. Otherwise, we insert the aa
to the appropriate place [aaInd
] in aas
. In the end we collapse the vector of strings into one long string with join
. It is as if we used the string concatenation operator *
on each element of aas
like so aas[1] * aas[2] * aas[3] * ...
. Notice, that e.g. "A" * ""
or "A" * "" * ""
is still "A"
. This effectively gets rid of any empty aas
elements that remained after reaching aa == "Stop"
and break
early.
Let’s see does it work.
protein = translate(mRna)
protein == expectedAAseq
true
Congratulations! You have successfully synthesized pre-pro-insulin, a product transformed into a hormon that saved many a live. It could be only a matter of time before you achieve something greater still.
The above (translate
) is not the only possible solution. For instance, if you are a fan of functional programming paradigm you may try something like.
import Base.Iterators.takewhile as takewhile
function translate2(mRnaSeq::Str)::Str
len::Int = length(mRnaSeq)
@assert len % 3 == 0 "the number of bases is not multiple of 3"
ranges::Vec{UnitRange{Int}} = map(:, 1:3:len, 3:3:len)
codons::Vec{Str} = map(r -> mRnaSeq[r], ranges)
aas::Vec{Str} = map(getAA, codons)
return takewhile(aa -> aa != "Stop", aas) |> join
end
We start by defining ranges
that will help us get particular codons
in the next step. For that purpose you take two sequences for start and end of a codon and glue them together with :
. For instance map(:, 1:3:9, 3:3:9)
roughly translates into map(:, [1, 4, 7], [3, 6, 9])
which yields [1:3, 4:6, 7:9]
, i.e. a vector of UnitRange{Int}
. A UnitRange{Int}
is a range composed of Int
s separated by one unit, so by 1, like in 4:6
([4, 5, 6]
after expansion) mentioned above. Next, we map over those ranges and use each one of them (r
) to get (->
) a respective codon (mRnaSeq[r]
). Then, we map over the codons
to get respective amino acids (getAA
). Finally, we move from left to right through the amiono acids (aas
) vector and take its elements (aa
) as long as they are not equal "Stop"
. As the last step, we collapse the result with join
to get one big string.
protein2 = translate2(mRna)
expectedAAseq == protein2
true
Works as expected. The functional solution (translate2
) often has fewer lines of code (8 vs 17 for translate2
and translate
, respectively). It also consists of a series of consecutive logical steps, which is quite nice. However, for a beginner (or someone that doesn’t know this paradigm well) it appears more enigmatic (and therefore off-putting). Moreover, in general it is expected to be a bit slower than the more imperative for loop version (translate
). This should be evident with long sequences [try BenchmarkTools.@benchmark translate($mRna^20)
vs BenchmarkTools.@benchmark translate2($mRna^20)
in the REPL (type it after julia>
prompt)].
Note.
$
(see above) is an interpolation of global variable recommended by the package. On the other hand,^
, (again see above) replicates a stringn
times, e.g."ab" ^ 3
="ababab"
. Interestingly, althoughtranslate(mRna^20)
andtranslate2(mRna^20)
receive a strand of RNA 20 times longer thanmRna
they still return the same amino acid sequence as before. Test yourself and explain why. This will also help you realize whytranslate2
is slower than its counterpart.
In the said case (with mRna^20
) the difference between \(\approx 27\ [\mu s]\) and \(\approx 140\ [\mu s]\) (on my laptop) shouldn’t be noticeable by a human (\(140\ [\mu s]\) is less than 1/1’000th of a second). Therefore, if the performance is acceptable you may want to go with the functional version.
Anyway, both Section 17 and Section 18 were inspired by the lecture of this ResearchGate entry based on which the DNA sequence was obtained (dna_seq_template_strand.txt
from Section 17).