The solution is pretty straightforward if you read through Examples 1 and 3 in Section 18.1.1.
replace.(datesMMDDYYYY, r"(\d{2})\.(\d{2})\.(\d{4})" => s"\3-\1-\2")
[
"2025-01-04",
"2018-11-01",
"1999-12-31",
"2026-03-20"
]
We just go and capture the months (the first pair of digits, (\d{2}), followed by a literal dot, \.), the days (second pair of digits, (\d{2}), followed by a literal dot, \.) and the years (last four digits, (\d{4})). In the substitution string we reference them back in the appropriate order (\3, \1, and \2) separated by hyphens (-). So, we replaced the whole match ((\d{2})\.(\d{2})\.(\d{4})) with the remembered digits in the right order and with the right separators (\3-\1-\2). And that’s it. Finito.
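To see what each pair of parentheses remembers, it may help to look at the captures directly. The snippet below is a minimal sketch. The sample date is an assumption, reconstructed from the first element of the output above, and the variable names are only illustrative.

```julia
# hypothetical input in MM.DD.YYYY format (reconstructed from the output above)
date = "01.04.2025"

# match returns a RegexMatch; its captures field holds the remembered groups
m = match(r"(\d{2})\.(\d{2})\.(\d{4})", date)
month, day, year = m.captures

# the same reordering that s"\3-\1-\2" performs, done by hand
isoDate = string(year, "-", month, "-", day)
```

Here month, day and year correspond to \1, \2 and \3 in the substitution string.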
txt = getTxtFromFile("./loremMail.txt")
"<<< " * txt[1:200] * " ... >>>"
<<< This is a lorem ipsum text from: https://en.wikipedia.org/wiki/Lorem_ipsum it contains a randomly placed fake e-mail addresses (kid of). It is used for educational purpose only.
Lorem ipsum dolor sit … >>>
Surprisingly, it seems that a proper regex for e-mail validation is pretty complex (see here). Still, we can go a long way with a much simpler one, which in our particular case should do the trick:
getAllMatches(eachmatch(r"[A-z0-9._\-]+@[A-z0-9._\-]+", txt)) |>
unique
[
"tom@write2me.com",
"potential_contact@hello.pl",
"another.contact@yyy.es",
"other-potential-contact@hello.pl",
"eve@write2me.com",
"eve2@write2me.com"
]
The regex is composed of a few parts, but mostly of [A-z0-9._\-]. It searches for:
letters (A-z, an email may contain a capital letter, although in general, they are case-insensitive), or
digits (0-9), or
a dot (. inside a positive character class is just a dot, although in general inside a regex it stands for any character except for newline), or
an underscore (_), or
a hyphen (\-, likely we didn’t have to precede it with \ since it wasn’t between other 2 characters).
This positive character class must be repeated at least one time (+) before the @ symbol. On the other hand, the @ symbol must be followed by at least one (+) character from the class that we already discussed ([A-z0-9._\-]). Notice that there is no need to add ? after the + to make a non-greedy match. That is because the email addresses are separated by one or more spaces and the positive character class ([A-z0-9._\-]) does not include spaces.
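The remark about greediness can be checked on a short string. In the sketch below the sample text and the variable names are made up for illustration, and a plain comprehension stands in for getAllMatches (an assumption about what that helper returns). The last two lines illustrate a side effect of the A-z range that is worth knowing about: in ASCII it also covers the few punctuation characters located between Z and a.

```julia
# made-up sample text with two space-separated addresses
sample = "write to tom@write2me.com or eve@write2me.com today"

# greedy quantifiers (the version used above)
greedy = [m.match for m in eachmatch(r"[A-z0-9._\-]+@[A-z0-9._\-]+", sample)]

# lazy quantifiers (+?); the part after @ stops as early as it can
lazy = [m.match for m in eachmatch(r"[A-z0-9._\-]+?@[A-z0-9._\-]+?", sample)]

# A-z spans more than letters (e.g. ^), A-Za-z does not
caretInAz = occursin(r"^[A-z]$", "^")
caretInAZaz = occursin(r"^[A-Za-z]$", "^")
```

Here greedy holds both full addresses, whereas lazy is cut down to "tom@w" and "eve@w", so the non-greedy variant would not only be unnecessary, it would break the result. caretInAz is true and caretInAZaz is false, so A-Za-z is the stricter choice when that matters.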
Swapping the names is a piece of cake. We just use capture groups ((...)) that contain one or more (+) letters ([A-z]) per word and are separated by a white-space character. In the substitution string (s"") we use back-references in reversed order (\2 and \1) and separate them with a comma (","):
# random names
names = ["Mary Johnson", "Eve Smith", "Tom Brown"]
replace.(names, r"([A-z]+) ([A-z]+)" => s"\2, \1")
[
"Johnson, Mary",
"Smith, Eve",
"Brown, Tom"
]
Once we got the names formatted, sorting them shouldn’t be a problem either:
replace.(names, r"([A-z]+) ([A-z]+)" => s"\2, \1") |> sort
[
"Brown, Tom",
"Johnson, Mary",
"Smith, Eve"
]
OK, time for some more complicated names:
# random names
names = [
"Jane Johnson",
"Mary Jane Doe",
"Peter Smith",
"Adam Tom Brown"
]
Let’s build our regex step by step. We start by matching a middle name (if there is one).
eachmatch.(r" [A-z]+ ", names) .|> getAllMatches
[
[],
[" Jane "],
[],
[" Tom "]
]
Here we search for a word between two spaces, more specifically: a white-space character, at least one letter ([A-z]+) and a white-space character.
Time to abbreviate the middle name:
replace.(names, r" ([A-Z])[a-z]+ " => s" \1. ")
[
"Jane Johnson",
"Mary J. Doe",
"Peter Smith",
"Adam T. Brown"
]
For that we modified the previous regex. This time we looked for a white-space character, one capital letter ([A-Z]), at least one small letter ([a-z]+) and a white-space character. Out of the whole match (" ([A-Z])[a-z]+ ") we captured (()) and remembered only the capital letter, which we used in the substitution string (s"") followed by a literal dot (\1.). Therefore, we replaced the whole match (" ([A-Z])[a-z]+ ") by its first capture group that we memorized and referred back to with (\1).
Now, time for the swap:
replace.(names, r" ([A-Z])[a-z]+ " => s" \1. ") |>
abbrevNames -> replace.(abbrevNames, r"([A-z .]+) ([A-z]+)" => s"\2, \1")
[
"Johnson, Jane",
"Doe, Mary J.",
"Smith, Peter",
"Brown, Adam T."
]
Here, instead of being clever and building one complicated regex, we just passed the result of one replace function as an input to another replace function. The second regex looks for at least one letter, space or literal dot ([A-z .]+, this captures as many consecutive words as it can because of the greediness) followed by one word ([A-z]+, one or more letters). We captured the words with () and swapped them with back-references (\2 and \1), while putting a comma (,) between them.
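The greediness of the first group can be inspected with match. A small sketch (the variable names are only for illustration):

```julia
# one of the already abbreviated names from the previous step
abbrevName = "Mary J. Doe"

# the first group grabs as much as it can, then gives characters back
# until a space followed by a final word is available
m = match(r"([A-z .]+) ([A-z]+)", abbrevName)
firstPart, lastWord = m.captures

swapped = replace(abbrevName, r"([A-z .]+) ([A-z]+)" => s"\2, \1")
```

The first group ends up with "Mary J." and the second one with "Doe", which is why the swap yields "Doe, Mary J.".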
OK, now for the last step, sorting:
replace.(names, r" ([A-Z])[a-z]+ " => s" \1. ") |>
abbrevNames -> replace.(abbrevNames, r"([A-z .]+) ([A-z]+)" => s"\2, \1") |>
sort
[
"Brown, Adam T.",
"Doe, Mary J.",
"Johnson, Jane",
"Smith, Peter"
]
And we’re done.
OK, time for a tough challenge. Let’s properly format nums using only the regex techniques we learned so far (see Section 18.1.1) + some built-in Julia functions. My first try would look something like:
nums = [0, 1, 12, 123, 1234, 12345,
123456, 1234567, 12345678, 123456789]
replace.(string.(nums), r"(\d{3})" => s"\1,")
# no commas separating elts of vector, to make it more legible
[
"0"
"1"
"12"
"123,"
"123,4"
"123,45"
"123,456,"
"123,456,7"
"123,456,78"
"123,456,789,"
]
Overall, we did pretty well. First, we changed the integers (nums) into strings (by using the string function). Next, we said: while moving left to right (default direction for a regex engine) match exactly three digits (\d{3}) and remember them ((...)). Finally, insert the remembered digits followed by a comma (s"\1,"). There is a small problem, though. The triplets are matched starting from the left side instead of the right (which we would prefer). Not a problem, we’ll just reverse the string before the transformation (putting commas).
replace.(reverse.(string.(nums)), r"(\d{3})" => s"\1,")
# no commas separating elts of vector, to make it more legible
[
"0"
"1"
"21"
"321,"
"432,1"
"543,21"
"654,321,"
"765,432,1"
"876,543,21"
"987,654,321,"
]
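To see which triplets the engine actually picks in each direction, we can list the matches themselves. This is just a sketch with a made-up input and illustrative variable names:

```julia
numStr = "1234567"

# triplets found while scanning the original string left to right
leftToRight = [m.match for m in eachmatch(r"\d{3}", numStr)]

# triplets found in the reversed string, flipped back for readability
rightToLeft = [reverse(m.match) for m in eachmatch(r"\d{3}", reverse(numStr))]
```

In the first case the leftover digit is the last one (7), in the second case it is the first one (1), which is what we want for thousands separators.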
OK, the commas are placed every three digits from the right (if you consider the original numbers). Now we would like to remove the stray comma at the end of some lines (r",$" => "") and reverse the string again (to restore the original order):
replace.(reverse.(string.(nums)), r"(\d{3})" => s"\1,") |>
reversedNums -> reverse.(replace.(reversedNums, r",$" => ""))
# no commas separating elts of vector, to make it more legible
[
"0"
"1"
"12"
"123"
"1,234"
"12,345"
"123,456"
"1,234,567"
"12,345,678"
"123,456,789"
]
To make it slightly more elegant we can enclose the entire procedure into a function:
function fmtMoney(n::Int)::Str
@assert n >= 0 "n must be >= 0"
result::Str = replace(reverse(string(n)), r"(\d{3})" => s"\1,")
return replace(result, r",$" => "") |> reverse
end
And use it for money formatting:
fmtMoney.(nums) .* " USD"
[
"0 USD",
"1 USD",
"12 USD",
"123 USD",
"1,234 USD",
"12,345 USD",
"123,456 USD",
"1,234,567 USD",
"12,345,678 USD",
"123,456,789 USD"
]
The above fmtMoney is a five-line regex equivalent of the fifteen-line getFormattedMoney from Section 27.1.4. Likely, it could be shortened even more by applying lookahead assertions (we didn’t use them, since they had not been discussed in Section 18.1.1).
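For the curious, here is a minimal sketch of what such a lookahead version might look like. The name fmtMoneyLookahead is hypothetical, and the (?=...) construct goes beyond what Section 18.1.1 covers, so treat it as one possible approach rather than the recommended one.

```julia
# put a comma after every digit that is followed by
# one or more complete triplets of digits up to the end of the string
function fmtMoneyLookahead(n::Int)::String
    @assert n >= 0 "n must be >= 0"
    return replace(string(n), r"(\d)(?=(\d{3})+$)" => s"\1,")
end

formatted = fmtMoneyLookahead.([0, 123, 1234, 1234567])
```

The lookahead only checks what follows the digit and consumes nothing, so the digits of each triplet stay available for the subsequent matches and no reversing is needed.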
Now, let’s go one step further and try to also format decimals. For that, we’ll split a number into dollars (integers) and pennies (two digits after the decimal point, rounded if necessary).
function getDollarsPennies(money::Flt)::Tuple{Int, Int}
@assert money >= 0 "money must be >= 0"
integralPart::Int = floor(Int, money)
decimalPart::Flt = money % 1
return (integralPart, round(Int, decimalPart*100))
end
Once we got it, we will use fmtMoney(n::Int)::Str to format the dollars to which we’ll append the pennies:
function fmtMoney(n::Flt)::Str
@assert n >= 0 "n must be >= 0"
dollars::Int, pennies::Int = getDollarsPennies(n)
result::Str = fmtMoney(dollars)
return string(result, ".", pennies)
end
Time to test it out:
nums = [0, 0.1, 1, 1.2, 12., 12.34, 123.456,
1234, 12345, 12345.67, 123456.7, 1234567.89]
fmtMoney.(nums) .* " USD"
[
"0.0 USD",
"0.10 USD",
"1.0 USD",
"1.20 USD",
"12.0 USD",
"12.34 USD",
"123.46 USD",
"1,234.0 USD",
"12,345.0 USD",
"12,345.67 USD",
"123,456.70 USD",
"1,234,567.89 USD"
]
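One thing to keep in mind: fmtMoney(n::Flt) appends the pennies as a plain integer, which is why the whole numbers above end in ".0" and why, as far as I can tell, an amount like 12.05 would come out as "12.5". If a fixed two-digit fraction is wanted, something along these lines could be used. It is only a sketch: the names fmtInt and fmtMoneyPadded are hypothetical, String and Float64 are used in place of Str and Flt to keep the snippet self-contained, and the rounding is done on the total number of pennies (an assumption about the desired behavior), so that 0.999 becomes 1.00.

```julia
# thousands separators, same reverse-and-replace technique as above
function fmtInt(n::Int)::String
    result = replace(reverse(string(n)), r"(\d{3})" => s"\1,")
    return reverse(replace(result, r",$" => ""))
end

function fmtMoneyPadded(money::Float64)::String
    @assert money >= 0 "money must be >= 0"
    # round once, on the total number of pennies
    totalPennies = round(Int, money * 100)
    dollars, pennies = divrem(totalPennies, 100)
    # lpad stringifies pennies and pads it with zeros to two characters
    return string(fmtInt(dollars), ".", lpad(pennies, 2, '0'))
end

padded = fmtMoneyPadded.([0.0, 12.05, 0.999, 1234567.89])
```

With this variant every amount carries exactly two digits after the dot.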
Looks like we finished the job. Regular expressions are worth learning even at a relatively basic level. Sometimes they can really speed things up or reduce the amount of code.