18.1 Problem

18.1.1 Regex Intro

Note: This subsection provides a short description of regular expressions. You may skip it if you know what a regex is. In that case go to the task specification right away (see Section 18.1.2).

Imagine you work at a police station that happened to arrest a John Smith who is a suspect in a certain case. In your country the identity of an accused person is to be protected from the public, so your job is to obfuscate any mention of him in the press release.

function getTxtFromFile(filePath::Str)::Str
    fileTxt::Str = ""
    try
        fileTxt = open(filePath) do file
            read(file, Str)
        end
    catch
        fileTxt = "Can't read '$filePath'. Make sure it exists."
    end
    return fileTxt
end

txt = getTxtFromFile("./loremJohnSmith.txt")
"<<< " * txt[1:200] * " ... >>>"

<<< This is a lorem ipsum text from: https://en.wikipedia.org/wiki/Lorem_ipsum it contains a randomly placed name John Smith. It is used for educational purpose only.

Lorem ipsum dolor sit John Smith ame … >>>

This could be done, e.g. by replacing his last name with its first letter, but it’s kind of tedious and boring to do this manually while reading the text. It may be sped up with a word processing program in which Ctrl+F is a shortcut for a find command. In Julia this could be done with eachmatch like so:

function getAllMatches(rmi::Base.RegexMatchIterator)::Vec{Str}
    allMatches::Vec{RegexMatch} = collect(rmi)
    return isempty(allMatches) ? [] :
        [regMatch.match for regMatch in allMatches]
end

getAllMatches(eachmatch(r"John Smith", txt))[1:2]
["John Smith", "John Smith"]

Here we defined a little function (getAllMatches) that will help us to extract the matches as a vector of strings, which is easier to read than the default structure returned by eachmatch. Notice the r"John Smith" argument in eachmatch. The r indicates that the following characters compose no ordinary string, but a special one that is called a regular expression (or regex). It may not seem like much right now, but we’ll see its potential in a moment.

Once we have confirmed that the phrase exists, we may wish to obfuscate it. Again, in a word processing program this would likely be done with Ctrl+H, which stands for the find and replace command. In Julia, we would do it with something like:

# in Julia strings are immutable
# to make changes permanent write it to `txt` and/or to a file
txt = replace(txt, r"John Smith" => "John S")
eachmatch(r"John Smith", txt) |> getAllMatches
String[]

There, we did our job, the identity of an accused person is protected. We may now write the file to disk and send the press report. I imagine now you’re wondering what’s the big deal with those regexes anyway. For a person with basic computer literacy what we’ve done doesn’t seem particularly advanced. Well, you’re right. It is not. That’s because in order to have a regex we need to use some meta-characters, i.e. special symbols that are interpreted beyond their literal meaning. On the other hand, as a general rule, any letter or digit in regex (like r"JohnSmith") stands for itself. Overall, the list of meta-characters is rather long, but as stated in the docs it may be found at the PCRE2 syntax manpage.

Instead of going through all the meta-characters (admittedly an impossible task for a short book chapter) let me just demonstrate a few of the more important ones with some illustrative examples.

18.1.1.1 Example 1

txt = getTxtFromFile("./loremDates.txt")
"<<< " * txt[1:200] * " ... >>>"

<<< This is a lorem ipsum text from: https://en.wikipedia.org/wiki/Lorem_ipsum it contains a randomly placed dates (years). It is used for educational purpose only.

Lorem ipsum dolor sit 2000 amet consec … >>>

This time, since I study for an exam, my txt contains a passage from a history book. I would like to extract the dates from it to make sure I know them all. Let’s say that the dates cover years between 1000 AD and the present. Doing a standard string search is no good, after all I would have to check like a thousand numbers. But wait, a simple regex can save me a lot of work. Observe:

eachmatch(r"[0-9][0-9][0-9][0-9]", txt) |> getAllMatches
["2000", "1989", "1517", "1492", "1410", "1918", "1969", "1776", "2001"]

This returned all 4-digit sets in the order they appear in the text (left to right, top to bottom).

In the regex (r"..."), the [...] is a positive character class that matches any of the enclosed characters. Therefore, [0123456789] would mean match any character used to represent a digit (0 or 1 or 2 or …). In general the contents of a positive character class are interpreted literally with the exception of \, ^ at the beginning, and - between two characters. In the last case, the hyphen (-) means any character within a range. Typically it’s used in the following configurations: [0-9], [a-z], [A-Z], [A-z], or [A-z0-9]. The range is determined by the underlying character codes (e.g. ASCII), so it must run from a lower code to a higher one. Since capital letters come before small letters in ASCII, [A-z] is a valid range, whereas [a-Z] is not. Be aware, though, that [A-z] also covers the six punctuation characters that sit between Z and a ([, \, ], ^, _ and `), so to match letters only it is safer to write [A-Za-z]. Anyway, in our case a regex of the form r"[0-9][0-9][0-9][0-9]" means match a digit ([0-9]) followed by a digit ([0-9]), followed by a digit ([0-9]), followed by a digit ([0-9]) (exactly 4 digits in a row).
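To see how a range follows the underlying character codes, here is a minimal sketch. It builds a string out of the six ASCII characters located between Z (code 90) and a (code 97) and checks which character class matches them. The variable names are made up for this illustration only.

```julia
# characters whose ASCII codes lie between 'Z' (90) and 'a' (97),
# i.e. [, \, ], ^, _ and `
inBetween = join(Char.(91:96))

# [A-z] spans codes 65 to 122, so it matches all of them
matchedByAz = [m.match for m in eachmatch(r"[A-z]", inBetween)]

# [A-Za-z] spans only the letters, so it matches none of them
matchedByAZaz = [m.match for m in eachmatch(r"[A-Za-z]", inBetween)]

(length(matchedByAz), length(matchedByAZaz)) # (6, 0)
```

So if an underscore or a backslash should not count as a letter, the two explicit ranges are the safer choice.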

Interestingly, we could save ourselves even more typing by using other meta-characters for this problem, i.e.

eachmatch(r"[0-9]{4}", txt) |> getAllMatches
["2000", "1989", "1517", "1492", "1410", "1918", "1969", "1776", "2001"]

The {4} means exactly 4 repetitions of a preceding character class (which is [0-9], so a digit).

Some newer regex engines allow us to shorten it even more:

eachmatch(r"\d{4}", txt) |> getAllMatches
["2000", "1989", "1517", "1492", "1410", "1918", "1969", "1776", "2001"]

Where \d means any digit (in general \ gives a special meaning to the following ordinary character) and {4} still designates exactly 4 repetitions of a previous token.

18.1.1.2 Example 2

txt = getTxtFromFile("./loremDollarsDates.txt")
"<<< " * txt[1:200] * " ... >>>"

<<< This is a lorem ipsum text from: https://en.wikipedia.org/wiki/Lorem_ipsum it contains a randomly placed amounts of money and years. It is used for educational purpose only.

Lorem ipsum dolor sit $11 … >>>

This time, we got a text that contains both dollar amounts (in $123 format) and dates, but we’re interested only in the former. Let’s say we want to add them up to find out how much we need to pay.

First, we’ll try to get the numbers out. If we assume for a moment that the amount of money is at least 3 digits long then for our first try we might go with:

eachmatch(r"\d.+\d", txt) |> getAllMatches
[
"112 amet consectetur 1234",
"200",
"173 exercitation ullamco 1492",
"1180 Duis aute irure dolor in $122",
"113 cillum dolore eu $3333",
"444 sint occaecat cupidatat non $212",
"534"
]

Here \d means a digit, . is any character (except for newline), and + stands for one or more of the preceding tokens (so match a digit followed by one or more characters, followed by a digit). There is a small problem though: we caught more than we wanted. That’s because by default quantifiers are greedy (they match as much as they can, here usually until a line ends). If we want to make the match more temperate, we need to follow .+ with ? (one or more characters, but as few as possible to fulfill the condition).

eachmatch(r"\d.+?\d", txt) |> getAllMatches
[
"112",
"123",
"200",
"173",
"149",
"118",
"0 Duis aute irure dolor in \$1",
"113",
"333",
"444",
"212",
"534"
]

An improvement, but we’re still not there. Let’s try again.

eachmatch(r"\d{1,}", txt) |> getAllMatches
[
"112",
"1234",
"200",
"173",
"1492",
"1180",
"122",
"113",
"3333",
"444",
"212",
"534"
]

Pretty good. Here {i,j} means between i and j (inclusive - inclusive) occurrences of the previous token (\d). {,j} stands for 0 to j and {i,} stands for i or more. Therefore, we only match 1 or more digits in a row, so it seems that we are finally there. Well, not quite: right now we have no way to tell which digits denote money and which denote years (they’re from the file loremDollarsDates.txt).

Note: You need to be precise while typing the quantifiers. Typing eachmatch(r"\d{1, }", txt) |> getAllMatches (it contains an extra space in \d{1, }) will give you no matches (empty vector).
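To see the bounds of a quantifier in action, here is a small sketch on a made-up string (someDigits is not one of the chapter’s files). Since the quantifier is greedy, each match takes as many digits as the upper bound allows, and the search continues right after it.

```julia
someDigits = "1 22 333 4444 55555"

# between 2 and 3 digits in a row
twoToThree = [m.match for m in eachmatch(r"\d{2,3}", someDigits)]
# ["22", "333", "444", "555", "55"]

# at least 3 digits in a row
threeOrMore = [m.match for m in eachmatch(r"\d{3,}", someDigits)]
# ["333", "4444", "55555"]
```

Notice that with {2,3} the lone "1" and the leftover "4" (from "4444") are too short to match, whereas "55555" is split into "555" and "55".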

Let’s try to change our regex a bit to extract only dollars.

eachmatch(r"$\d{1,}", txt) |> getAllMatches
String[]

Hmm, we wanted to extract a dollar symbol $ with all the following digits. Oddly enough that seems to have failed. That’s because $ is a meta-character that denotes end of a subject (usually end of a line or end of a string). So, actually what we said with $\d{1,} was: find digits after the end of a string. An impossible task, hence the empty vector as a result. If we want $ to be interpreted as a regular dollar symbol we need to precede it with \ (\ gives a special meaning to an ordinary character, like in \d, and strips it away from a special character like $).

eachmatch(r"\$\d{1,}", txt) |> getAllMatches
[
"$112",
"$200",
"$173",
"$1180",
"$122",
"$113",
"$3333",
"$444",
"$212",
"$534"
]
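For contrast, here is a small sketch of where an unescaped $ comes in handy, namely for anchoring a pattern to the end of a string. The string below is made up for the illustration. Notice that in an ordinary Julia string (unlike in r"") a dollar sign has to be written as \$.

```julia
note = "pay \$25 by 2030"

# digits located right before the end of the string
match(r"\d{1,}$", note).match # "2030"

# digits located after the end of the string, impossible
match(r"$\d{1,}", note) === nothing # true

# a literal dollar symbol followed by digits
match(r"\$\d{1,}", note).match # "\$25"
```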

Finally, we can add it up using, e.g. this few-liner:

eachmatch(r"\$\d{1,}", txt) |> getAllMatches |>
vecStrDollars -> replace.(vecStrDollars, "\$" => "") |>
vecStrNumbers -> parse.(Int, vecStrNumbers) |>
sum
6423

And voila, we’re done. Notice, however, that the regex isn’t perfect. For example, it doesn’t correctly handle amounts of money that contain a fractional part (or negative amounts). If that were a requirement, we would have to improve upon it.
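One possible improvement for amounts with cents could look like the sketch below. It relies only on the elements we have met so far: \. is a literal dot, and (...){0,1} makes the whole fractional part optional. The bill string is made up, and negative amounts or thousands separators are still not covered.

```julia
bill = "coffee \$3.50, cake \$12, tip \$0.75, year 2024"

# a dollar symbol, 1 or more digits, and optionally a dot with 1 or more digits
amounts = [
    m.match for m in eachmatch(r"\$\d{1,}(\.\d{1,}){0,1}", bill)
]
# ["\$3.50", "\$12", "\$0.75"]

# strip the dollar symbols and add the numbers up
total = sum(parse.(Float64, replace.(amounts, "\$" => "")))
# 16.25
```

The year (2024) is left out, since it is not preceded by a dollar symbol.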

18.1.1.3 Example 3

This time we got a few random telephone numbers.

Rnd.seed!(9)
telNums = [join(Rnd.rand(string.(0:9), 9)) for _ in 1:3]
["304039945", "545946090", "818309467"]

Our task is to convert them into more readable form, i.e. xxx-xxx-xxx.

replace.(telNums, r"(\d{3})(\d{3})(\d{3})" => s"\1-\2-\3")
[
    "304-039-945",
    "545-946-090",
    "818-309-467"
]

The new elements here are () and \number which are capture groups and back-references, respectively. Therefore, (\d{3}) in a regex (r"") means capture any three digits in a row and remember them, whereas \1 in the substitution (s"" - denotes a substitution string that may use meta-characters) means: use the first captured and remembered group (by analogy \2 is for the second captured group and \3 is for the third).
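The captured groups may also be inspected directly, without any substitution. Below is a minimal sketch that uses Julia’s match function on the first of the above telephone numbers (the variable name m is arbitrary).

```julia
m = match(r"(\d{3})(\d{3})(\d{3})", "304039945")

m.match    # the whole match: "304039945"
m.captures # the three remembered groups: ["304", "039", "945"]
m[2]       # the second group alone: "039"
```

The order of the groups in m.captures corresponds to \1, \2, and \3 in the substitution string.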

18.1.1.4 Example 4

In Section 4 we dealt with two-way text transformations between camelCase and snake_case.

Let’s do this with regexes. We’ll start with camelCasedWords:

camelCasedWords = [
    "helloWorld", "niceToMeetYou", "translateToEnglish"
]

eachmatch.(r"([A-Z])", camelCasedWords) .|> getAllMatches
[
    ["W"],
    ["T", "M", "Y"],
    ["T", "E"]
]

First, we capture (()) any capital letter ([A-Z]) in a string. Now we would like to lowercase it. Per the PCRE2 syntax manual we should be able to do this using the \l escape sequence (it means lowercase the next character), but for whatever reason the following snippet throws an error:

replace.(camelCasedWords, r"([A-Z])" => s"\l\1")

No biggie. Julia allows the second argument of replace to be a pair of the form regex => function that operates on a matched string. We can use that to our advantage:

replace.(camelCasedWords, r"([A-Z])" => lowercase)
[
"helloworld",
"nicetomeetyou",
"translatetoenglish"
]

Almost there, we just need to precede the lower-cased letter with _. This could be done with an anonymous function, e.g. like this:

replace.(camelCasedWords, r"([A-Z])" => AtoZ -> "_" * lowercase(AtoZ))
[
"hello_world",
"nice_to_meet_you",
"translate_to_english"
]

or like that (here we use string interpolation):

replace.(camelCasedWords, r"([A-Z])" => AtoZ -> "_$(lowercase(AtoZ))")
[
"hello_world",
"nice_to_meet_you",
"translate_to_english"
]

Nice.

Now, it’s time for the opposite transformation.

snakeCasedWords = ["hello_world",
    "nice_to_meet_you", "translate_to_english"
]

replace.(snakeCasedWords, r"_[a-z]" => _atoz -> uppercase(_atoz[2:end]))
# or
replace.(snakeCasedWords,
    r"_[a-z]" => _atoz -> uppercase(strip(_atoz, '_')))
[
"helloWorld",
"niceToMeetYou",
"translateToEnglish"
]

Wow, that felt like a breeze.

Overall, the two lines of code (replace.(camelCasedWords, etc.) and replace.(snakeCasedWords, etc.)) are the equivalent of roughly 20 lines of code in Section 4.2. And that’s how it usually is: regexes are more succinct than traditional functions, although they’re not necessarily faster to write (especially if this is your first encounter with the subject).

18.1.1.5 Summary

Here’s a quick reminder of what we learned about regexes and meta-characters:

  1. in general any letter or digit that occurs in regex stands for itself;
  2. a positive character class is denoted by square brackets and is often used to capture a range of characters, like [a-z], [A-Z], [0-9], [A-Za-z], or [A-Za-z0-9];
  3. {} is a quantifier that specifies the quantity of the previous token, where {i}, {i,j}, {i,}, and {,j} mean: exactly i, between i and j, at least i, and up to j previous tokens, respectively;
  4. \ bestows a special meaning on an ordinary character (\d denotes any digit), or strips it away from a special character ($ - is end of a string, whereas \$ is a dollar symbol);
  5. (sth) inside of r"" stands for capture and remember, whereas \1 in s"" denotes back-reference to the first capture;
  6. the second argument of replace is usually a pair of the form: r"" => "" (regex => regular string), r"" => s"" (regex => substitution string), r"" => function (regex => function that accepts a string and returns a string possibly applying some transformations on the way).
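The three forms of a pair from the last point may be compared side by side on one made-up string (the variable name contact is arbitrary):

```julia
contact = "tel: 304039945"

# regex => regular string
replace(contact, r"\d{9}" => "xxx")
# "tel: xxx"

# regex => substitution string (with back-references)
replace(contact, r"(\d{3})(\d{3})(\d{3})" => s"\1-\2-\3")
# "tel: 304-039-945"

# regex => function (it receives the matched string)
replace(contact, r"[a-z]{3}" => uppercase)
# "TEL: 304039945"
```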

18.1.2 Regex Tasks

OK, time to put what you’ve learned to good use. If, while solving the tasks, you need a visual assistant that helps you with regular expressions, then you may try e.g. regex101.

18.1.2.1 Regex Task 1

You got a series of dates in the US format “MM.DD.YYYY”:

datesMMDDYYYY = ["01.04.2025", "11.01.2018", "12.31.1999", "03.20.2026"]

The format is confusing to some (e.g. European) people. Change it to a less ambiguous “YYYY-MM-DD” configuration.

18.1.2.2 Regex Task 2

Read the contents of loremMail.txt that is to be found in the code snippets. It contains 8 random e-mail addresses (with repetitions). Use Julia to list the unique e-mail addresses found in the text.

18.1.2.3 Regex Task 3

Here’s a vector of random names:

# random names
names = ["Mary Johnson", "Eve Smith", "Tom Brown"]

Swap the order of the names with a regex (“Adam Smith” should become “Smith, Adam”) and sort them alphabetically in ascending order.

Can you do the same while accounting for possible middle names?

# random names
names = ["Jane Johnson", "Mary Jane Doe", "Peter Smith", "Adam Tom Brown"]

To add a small tweak, I want you to swap the names, abbreviate the middle name (“John Daniel Smith” should become “Smith, John D.”, whereas “Adam Smith” should become “Smith, Adam”) and then sort them alphabetically in ascending order.

18.1.2.4 Regex Task 4

In Section 28.2 we wrote a fmt function to format numbers to something like: “123,456 USD”.

Write a program (possibly a regex or regexes + some extra code) that will convert those numbers to the desired form (place , after every three digits counting from the right).

nums = [0, 1, 12, 123, 1234, 12345,
    123456, 1234567, 12345678, 123456789]

Can you modify your program so that it handles the following numbers correctly as well (e.g. 12345.678 should become “12,345.68 USD”):

nums = [0, 0.1, 1, 1.2, 12., 12.34, 123.456,
    1234, 12345, 12345.67, 123456.7, 1234567.89]

Well, let’s find out. Good luck.



CC BY-NC-SA 4.0 Bartlomiej Lukaszuk