Tuesday, 13 November 2018

For Regex Purists: Can Formal Regular Expressions Be Fun?

Pre Ramble


Can formal regular expressions be fun? This is a question I asked myself recently. Modern regex engines have strayed a ways away from regular expressions as they are formally defined, offering a vast playground of powerful new features. It’s almost as though they have departed from the Chomsky hierarchy and entered into a sort of Chomsky anarchy.

But can we still have fun within the shackles of Type-3 grammars? By "fun", I do of course mean "painful, difficult, and laborious exercise".

I believe the answer is yes!

Consider the exercise of negating a word, ie. matching a string that does not contain a particular word. It is true that most regex engines provide negative lookaheads for this sort of purpose, and regex implementations generally allow you to exclude matches or otherwise handle the result of a matching failure. However, if we restrict ourselves to the domain of formal regular expressions, in which the only objects are character symbols and the only permissible operations are concatenation, alternation, and quantification via the "Kleene star" (‘*’), the exercise becomes much more challenging and therefore interesting.

A ridiculous example


Enough foreplay. I know you came here to see a long and complicated regex, and so I scoured the dictionary (using regex, naturally) looking for words that would be most challenging to negate and found the perfect one: "phosphoribosylpyrophosphate". Technically two words, but that dictionary listed it as one. Here is its negation written in formal regular expressions (wherein start and end anchors are implied):
([^p]|p(p|h(os(pho(ribosylpyrophospho)*s)*pho(ribosylpyrophospho)*ribosylph)*(|o(|s(pho(ribosylpyrophospho)*s)*p(|h(|o(ribosylpyrophospho)*(|r(|i(|b(|o(|s(|y(|lp(|y(|r(|op(|h(|o(|sp(|h(|a(|t))))))))))))))))))))p)*([^ph]|h(os(pho(ribosylpyrophospho)*s)*pho(ribosylpyrophospho)*ribosylph)*([^po]|o([^ps]|s(pho(ribosylpyrophospho)*s)*([^p]|p([^ph]|h([^po]|o(ribosylpyrophospho)*([^prs]|r([^pi]|i([^pb]|b([^po]|o([^ps]|s([^py]|y([^pl]|l([^p]|p([^pyh]|y([^pr]|r([^po]|o([^p]|p([^ph]|h([^po]|o([^ps]|s([^p]|p([^ph]|h([^pao]|a([^pt]|t[^pe]))))))))))))))))))))))))))*(|p(p|h(os(pho(ribosylpyrophospho)*s)*pho(ribosylpyrophospho)*ribosylph)*(|o(|s(pho(ribosylpyrophospho)*s)*p(|h(|o(ribosylpyrophospho)*(|r(|i(|b(|o(|s(|y(|lp(|y(|r(|op(|h(|o(|sp(|h(|a(|t))))))))))))))))))))p)*(|h(os(pho(ribosylpyrophospho)*s)*pho(ribosylpyrophospho)*ribosylph)*(|o(|s(pho(ribosylpyrophospho)*s)*(|p(|h(|o(ribosylpyrophospho)*(|r(|i(|b(|o(|s(|y(|l(|p(|y(|r(|o(|p(|h(|o(|s(|p(|h(|a(|t))))))))))))))))))))))))))
I used negated character classes since we haven’t defined a formal alphabet, but if instead you’d like to visualize each one as a group of character alternations, you’re more than welcome.

Why is this difficult?


Obviously, the word is long. But this makes the task laborious, not necessarily difficult. What makes it difficult is the potential for many different "cycles"; that is, we can observe that partial matches of the word exist at several positions throughout the word itself: "phosph" appears twice; "pho" appears three times; etc.

To see why this sucks for us, let us consider a simpler example. Match a string that doesn’t contain "abc". How would you write a formal regular expression for this?

One good first attempt could be the following: ^([^a]|a([^b]|b([^c]|\$)|\$))*\$

(This is a shortened form of and equivalent to ^([^a]|a([^b]|\$)|ab([^c]|\$))*\$, another valiant attempt. But character classes aside, this is not technically formal since `\$` is not a component of formal regular expressions. We will ignore this for now in the interests of simplifying our investigation, but know that there is an elegant way to eliminate these end anchors that is too large to fit in the margin will be introduced later in the post)

At first glance, this may look passable. Clearly it matches all strings that don’t contain "abc". But does it only match those strings? Unfortunately not. It hiccups when either `[^b]` or `[^c]` matches ‘a’, such as in "aabc". Then the outer group quantified by `*` iterates, at which point `[^a]` is free to match ‘b’ and then ‘c’ after that. See here.

How can we fix this? If the problem is that we’re matching too much, perhaps we need to make the expression more restrictive. If `[^b]` and `[^c]` should not match ‘a’, let us simply add ‘a’ to those negated character classes:
 ^([^a]|a([^ab]|b([^ac]|\$)|\$))*\$
Have we fixed it? Not yet. Sorry. I’m afraid your regular expression is in another castle.

We’ve actually made it too restrictive: it successfully doesn’t match all strings that contain "abc", but now it also fails to match a few that don’t (such as "aba"). This is the opposite problem to before. The good news is that we’re getting closer to the correct solution.

Let’s think about this.

When we first entered the outer group, at the start of the string, no characters had been matched yet. More importantly, no partial occurrences of "abc" had been matched yet. If we enter the outer group after having matched "a" or "ab", then we leave ourselves open to matching the rest of "abc". This means that all partial matches of "abc" (ie. "a" and "ab") need to be fully matched by each iteration of the outer group.

This subtle yet important detail is the reason why our intuition falls short of being able to write these regular expressions quickly and easily. We must pay special attention to the potential for these partial matches. Here is what the correct solution looks like in this case:
 ^([^a]|a(a|ba)*([^b]|b([^c]|\$)|\$))*\$
`(a|ba)` is what I referred to earlier as a "cycle". In that position, "a" has been matched and "bc" is yet to be matched (or, indeed, not matched). If another "a" is matched, it is the same: "a" is matched and "bc" is yet to be matched. If "ba" is matched, the same still. In fact, we can continue in this fashion indefinitely, matching either "a" or "ba", because each time this happens we are simply cycling back to the very same position we were in when that first "a" was matched.

But not all cycles are this simple. A string may contain the potential for more than one cycle. Some cycles may take you back to positions past the first character. There may even be cycles within cycles. If we are to stand a chance of effectively tailoring these regular expressions to exclude any arbitrary string, we are going to need an algorithm.

JT’s Algorithm


I spent some time examining a wide variety of strings and building regular expressions to negate them, and I believe I’ve come up with the easiest and most direct method for doing so. I don’t know if I have any business labeling it an "algorithm" as I haven’t rigorously proven its effectiveness. However, I’m quite certain it works. Very scientific, I know.

The formal definition of the algorithm is presented below. It does a great job of making it look immensely complicated, but I assure you that even "phosphoribosylpyrophosphate" requires much less work to handle than it would seem.

If you are not interested in the technicalities and simply want to learn how to create your own string-negating regular expressions, skim the text here for technique then skip ahead to the examples.

JT’s Algorithm: Formal Definition


Let $s$ = $c_{1}c_{2}c_{3}...c_{n}$ be a string of characters, and let the negated character class $[$^$abc...]$ be a shorthand form of writing the group of alternations of all characters in the defined alphabet not contained in $abc...$. Then, using only the operations defined by formal regular expressions, you can follow the below procedure to write an expression that describes the language of all strings that do not contain $s$ at any position.

1. Create an upside-down word staircase by writing the full string on the first line and removing one character from the end on each subsequent line:
$l_{n}$ = $c_{1}c_{2} ... c_{n-1}c_{n}$
$l_{n-1}$ = $c_{1}c_{2}...c_{n-1}$
...
$l_{3}$ =$c_{1}c_{2}c_{3}$
$l_{2}$ = $c_{1}c_{2}$
$l_{1}$ = $c_{1}$
2. Identify the cycles that may follow the substring on each line. For a given $l_{i}$, a cycle is any string of characters $o_{1}o_{2}...o_{m}$ such that:
  • For $k$ = 1 to $m-1$: $o_{k}$ = $c_{i+k}$
  • $o_{m}$ = $c_{i}$
  • $o_{m}$ != $c_{i+m}$
(Note that if $o_{m}$ = $c_{i+m}$ then you have not encountered a cycle, but rather a progression through $s$)

Look at the lines above $l_{i}$ and consider which of them, when followed by $c_{i}$ instead of the next character in the string $s$, would result in a partial match of $s$, wherein $c_{1}...c_{i}$ is matched and $c_{i+1}...c_{n}$ is yet to be matched. There could be multiple cycles on one line, or none at all. Write them out in the following way, combining them with the alternation operator '|' in a group quantified by the Kleene star `*`:
$l_{n}$ = $c_{1}c_{2}...c_{n-1}c_{n}$
$l_{n-1}$ = $c_{1}c_{2}...c_{n-1}(c_{n-1})*$
...
$l_{i}$ = $c_{1}c_{2}...c_{i}(c_{i+1}c_{i+2}...c_{i}|c_{i+1}c_{i+2}c_{i+3}...c_{i}|...)*$
...
$l_{3}$ = $c_{1}c_{2}c_{3}(c_{4}c_{5}...c_{3}|c_{4}c_{5}c_{6}...c_{3}|...)*$
$l_{2}$ = $c_{1}c_{2}(c_{3}c_{4}...c_{2}|c_{3}c_{4}c_{5}...c_{2}|...)*$
$l_{1}$ = $c_{1}(c_{2}c_{3}...c_{1}|c_{2}c_{3}c_{4}...c_{1}|...)*$
(Note, save a copy of this formulation for step 5)

3. Combine the cycles in each group using the fact that $(c_{i+1}c_{i+2}...c_{i}|c_{i+1}c_{i+2}c_{i+3}...c_{i})$ = $(c_{i+1}c_{i+2}(|c_{i+3})...c_{i})$. Aim to eliminate multiple occurrences of any particular character (except $c_{1}$) from each cycle group. This will be useful for the next step.
$l_{n}$ = $c_{1}c_{2}...c_{n-1}c_{n}$
$l_{n-1}$ = $c_{1}c_{2}...c_{n-1}(c_{n-1})*$
...
$l_{i}$ = $c_{1}c_{2}...c_{i}(c_{i+1}c_{i+2}(|c_{i+3})...c_{i})*$
...
$l_{3}$ = $c_{1}c_{2}c_{3}(c_{4}c_{5}(|c_{6})...c_{3})*$
$l_{2}$ = $c_{1}c_{2}(c_{3}c_{4}(|c_{5})...c_{2})*$
$l_{1}$ = $c_{1}(c_{2}c_{3}(|c_{4})...c_{1})*$
4. Starting from the top, recursively copy the cycle group for each line $l_{i}$ (along with its Kleene star) to just after wherever $c_{i}$ appears in the cycles below it.
$l_{n}$ = $c_{1}c_{2}...c_{n-1}c_{n}$
$l_{n-1}$ = $c_{1}c_{2}...c_{n-1}(c_{n-1})*$
...
$l_{i}$ = $c_{1}c_{2}...c_{i}(c_{i+1}c_{i+2}(|c_{i+3})...c_{i})*$
...
$l_{3}$ = $c_{1}c_{2}c_{3}(c_{4}c_{5}(|c_{6})...c_{3})*$
$l_{2}$ = $c_{1}c_{2}(c_{3}(c_{4}c_{5}(|c_{6})...c_{3})*c_{4}(|c_{5})...c_{2})*$
$l_{1}$ = $c_{1}(c_{2}(c_{3}(c_{4}c_{5}(|c_{6})...c_{3})*c_{4}(|c_{5})...c_{2})*c_{3}(c_{4}c_{5}(|c_{6})...c_{3})*(|c_{4})...c_{1})*$
5. Now we can start building our expression. Let $g_{i}$ be the cycle group with star on line $l_{i}$. Then:
  • For $i$ = 1 to $n-1$: concatenate "$([$^$c_{1}c_{i}]|c_{i}g_{i}$"
  • Then concatenate "$[$^$c_{1}c_{n}]$"
  • Then for $i$ = 1 to $n-1$: concatenate ")"
The expression may then look something like this:
$([$^$c_{1}c_{1}]|c_{1}(c_{2}(c_{3}(c_{4}c_{5}(|c_{6})...c_{3})*c_{4}(|c_{5})...c_{2})*c_{3}(c_{4}c_{5}(|c_{6})...c_{3})*(|c_{4})...c_{1})*([$^$c_{1}c_{2}]|c_{2}(c_{3}(c_{4}c_{5}(|c_{6})...c_{3})*c_{4}(|c_{5})...c_{2})*([$^$c_{1}c_{3}]|c_{3}(c_{4}c_{5}(|c_{6})...c_{3})* ... ([$^$c_{1}c_{n-1}]|c_{n-1}(c_{n-1})*[$^$c_{1}c_{n}]) ... )))$
6. Some of those negated character classes may need extra characters added to them to prevent cycles from being matched down that path in the expression. The easiest way to do this is to go back to the unsimplified cycles in step 2 and:
For $i$ = 2 to $n-1$: if $c_{i}$ != $c_{1}$ then: for each cycle $o$ with length $m$: add $c_{i}$ to the character class "$[$^$c_{1}c_{i+m}]$".
6b. Simplify the negated character classes, knowing that "$[$^$cc]$" = "$[$^$c]$"

7. Remember earlier in the article I mentioned that there is an elegant way to overcome our lack of the end anchor `\$`? Here it is. Concatenate a copy of the entire pattern and put a Kleene star in between to quantify the first group:
$([$^$c_{1}c_{1}]|c_{1}(c_{2}(c_{3}(c_{4}c_{5}(|c_{6})...c_{3})*c_{4}(|c_{5})...c_{2})*c_{3}(c_{4}c_{5}(|c_{6})...c_{3})*(|c_{4})...c_{1})*([$^$c_{1}c_{2}]|c_{2}(c_{3}(c_{4}c_{5}(|c_{6})...c_{3})*c_{4}(|c_{5})...c_{2})*([$^$c_{1}c_{3}]|c_{3}(c_{4}c_{5}(|c_{6})...c_{3})* ... ([$^$c_{1}c_{n-1}]|c_{n-1}(c_{n-1})*[$^$c_{1}c_{n}]) ... )))$$*$$([$^$c_{1}c_{1}]|c_{1}(c_{2}(c_{3}(c_{4}c_{5}(|c_{6})...c_{3})*c_{4}(|c_{5})...c_{2})*c_{3}(c_{4}c_{5}(|c_{6})...c_{3})*(|c_{4})...c_{1})*([$^$c_{1}c_{2}]|c_{2}(c_{3}(c_{4}c_{5}(|c_{6})...c_{3})*c_{4}(|c_{5})...c_{2})*([$^$c_{1}c_{3}]|c_{3}(c_{4}c_{5}(|c_{6})...c_{3})* ... ([$^$c_{1}c_{n-1}]|c_{n-1}(c_{n-1})*[$^$c_{1}c_{n}]) ... )))$
Now go through the copy and delete all negated character classes from it:
$([$^$c_{1}c_{1}]|c_{1}(c_{2}(c_{3}(c_{4}c_{5}(|c_{6})...c_{3})*c_{4}(|c_{5})...c_{2})*c_{3}(c_{4}c_{5}(|c_{6})...c_{3})*(|c_{4})...c_{1})*([$^$c_{1}c_{2}]|c_{2}(c_{3}(c_{4}c_{5}(|c_{6})...c_{3})*c_{4}(|c_{5})...c_{2})*([$^$c_{1}c_{3}]|c_{3}(c_{4}c_{5}(|c_{6})...c_{3})* ... ([$^$c_{1}c_{n-1}]|c_{n-1}(c_{n-1})*[$^$c_{1}c_{n}]) ... )))*(|c_{1}(c_{2}(c_{3}(c_{4}c_{5}(|c_{6})...c_{3})*c_{4}(|c_{5})...c_{2})*c_{3}(c_{4}c_{5}(|c_{6})...c_{3})*(|c_{4})...c_{1})*(|c_{2}(c_{3}(c_{4}c_{5}(|c_{6})...c_{3})*c_{4}(|c_{5})...c_{2})*(|c_{3}(c_{4}c_{5}(|c_{6})...c_{3})* ... (|c_{n-1}(c_{n-1})*) ... )))$
And you're done.

This is considerably less complicated in practice, as the examples below show.

JT's Algorithm: Easy Example


Let s = "mimic". Let's start off nice and gently.

1. Create the upside-down word staircase:
mimic
mimi
mim
mi
m
2. Write out the cycles:
mimic
mimi
mim(im)*
mi
m(m|imm)*
We first observe that "mimim" cycles back to "mim". Likewise, "mm" and "mimm" cycle back to simply "m". However, "mimi" does not cycle back to "mi" because it is itself a partial match of "mimic".

3. Combine cycles to eliminate repeated characters in groups.

No significant combining to be done.  `((im)?m)*` is no more useful to us than `(m|imm)*`

4. Recursively insert cycle groups into lower-order cycles:
mimic
mimi
mim(im)*mi
m(m|im(im)*m)*
The cycle for "mim" simply needed to be inserted at the position where "mim" is matched in `m(m|imm)*`

5. Build the expression.

We loop and concatenate "([^mm]|m(m|im(im)*m)*" with "([^mi]|i" then with "([^mm]|m(im)*" then "([^mi]|i" then finally "[^mc]", and then add all the closing parentheses:
([^mm]|m(m|im(im)*m)*([^mi]|i([^mm]|m(im)*([^mi]|i[^mc]))))
6. Add characters to negated character classes that lead to cycles.

Looking back at step 2, both sets of cycles result from matching 'm', which is the first letter of "mimic". Therefore nothing needs to be added to the negated character classes.

6b. Simplify character classes.

Although we didn't add anything in the previous step, we can still simplify the one character class containing a superfluous 'm':
([^m]|m(m|im(im)*m)*([^mi]|i([^m]|m(im)*([^mi]|i[^mc]))))
7. And finally, a bit of magic.

Copy the expression and add it to itself with a `*` in between, then remove all the negated character classes in the copy:
([^m]|m(m|im(im)*m)*([^mi]|i([^m]|m(im)*([^mi]|i[^mc]))))*([^m]|m(m|im(im)*m)*([^mi]|i([^m]|m(im)*([^mi]|i[^mc]))))
Becomes:
([^m]|m(m|im(im)*m)*([^mi]|i([^m]|m(im)*([^mi]|i[^mc]))))*(|m(m|im(im)*m)*(|i(|m(im)*(|i))))
And there you have it: a formal regular expression to match any string that does not contain "mimic".


JT's Algorithm: Average Example


Let s = "peppers". Getting a little trickier here.

1. Create the upside-down word staircase:
peppers
pepper
peppe
pepp
pep
pe
p
2. Write out the cycles:
peppers
pepper
peppe
pepp
pep(pep)*
pe(pe)*
p(p|eppp|epperp)*
As with most of the strings you're likely to encounter, the last line contains the majority of the cycles.

We observe that "peppep" cycles back to "pep"; "pepe" cycles back to "pe"; and matching a "p" after any of "p", "pepp", or "pepper" cycles back to simply "p".

3. Combine cycles to eliminate repeated characters in groups.

Here we can combine the cycles in the last group to help us with the next step:
peppers
pepper
peppe
pepp
pep(pep)*
pe(pe)*
p(p|epp(|er)p)*
4. Recursively insert cycle groups into lower-order cycles.

Let's do this one in steps. First, we insert the cycle for "pep" into the appropriate positions in the cycles for "pe" and "p". Recall that this is done exactly where "pep" is matched in those lines:
peppers
pepper
peppe
pepp
pep(pep)*
pe(p(pep)*e)*
p(p|ep(pep)*p(|er)p)*
Then, insert the cycle you just formed for "pe" into its position in the cycle for "p":
peppers
pepper
peppe
pepp
pep(pep)*
pe(p(pep)*e)*
p(p|e(p(pep)*e)*p(pep)*p(|er)p)*
5. Build the expression.

Tedious, but what to do.
([^pp]|p(p|e(p(pep)*e)*p(pep)*p(|er)p)*([^pe]|e(p(pep)*e)*([^pp]|p(pep)*([^pp]|p([^pe]|e([^pr]|r[^ps]))))))
6. Add characters to negated character classes that lead to cycles.

In this example, there is only one line with cycles whose corresponding character is not equal to the first character ('p'), and that is line two, the cycle line for "pe". Its only cycle has length 2, so we need to add $c2$ ('e') to the negated character class "[^$c_{1}c_{2+2}$]". $c_{4}$ = 'p', which means we are adding to the second occurrence of "[^pp]":
([^pp]|p(p|e(p(pep)*e)*p(pep)*p(er)?p)*([^pe]|e(p(pep)*e)*([^pp]|p(pep)*([^ppe]|p([^pe]|e([^pr]|r[^ps]))))))
6b. Simplify those negated character classes:
([^p]|p(p|e(p(pep)*e)*p(pep)*p(|er)p)*([^pe]|e(p(pep)*e)*([^p]|p(pep)*([^pe]|p([^pe]|e([^pr]|r[^ps]))))))
7. My favourite step.

Copy the expression and add `*` in between, then remove those negated character classes in the second half:
([^p]|p(p|e(p(pep)*e)*p(pep)*p(|er)p)*([^pe]|e(p(pep)*e)*([^p]|p(pep)*([^pe]|p([^pe]|e([^pr]|r[^ps]))))))*(|p(p|e(p(pep)*e)*p(pep)*p(|er)p)*(|e(p(pep)*e)*(|p(pep)*(|p(|e(|r))))))
And we're done; you won't see any "peppers" getting through this formal regular expression.


JT's Algorithm: Difficult Example


Let s = "abaaabaababc". Grab a drink for this one.

1. Create the upside-down word staircase:
abaaabaababc
abaaabaabab
abaaabaaba
abaaabaab
abaaabaa
abaaaba
abaaab
abaaa
abaa
aba
ab
a
2. Write out the cycles.

This is tricky, but it requires nothing more than the same trusty logic that served us well earlier:
abaaabaababc
abaaabaabab
abaaabaaba
abaaabaab
abaaabaa
abaaaba
abaaab
abaaa(baaa)*
abaa(abaabaa)*
aba(aabaababa)*
ab(ab|aab|aaabab)*
a(a|baaaa)*
If, for example, I'm looking for the cycles of "abaaa", I look for all lines above it that end with "abaa" (that is, "abaaa" without the trailing 'a') then see if the next character is not 'a'. This is only the case with the line "abaaabaa": it ends with "abaa" and the next character in the line above is 'b'.

Apply a similar analysis to "abaa" and you'll find "abaaabaabaa" cycles back to it. Then you'll see "abaaabaababa" cycles back to "aba"; "abab", "abaab", and "abaaabab" all cycle back to "ab"; and, finally, both "aa" and "abaaaa" cycle back to simply "a".

The last line is surprisingly lacking in cycles since they are mostly dominated by the lines above it. Again, this is not typical; most words and phrases in practice do not contain significantly many partial matches of themselves.

3. Combine cycles to eliminate repeated characters in groups.

Here, we're only interested in the "ab" cycle group:
abaaabaababc
abaaabaabab
abaaabaaba
abaaabaab
abaaabaa
abaaaba
abaaab
abaaa(baaa)*
abaa(abaabaa)*
aba(aabaababa)*
ab(a(|a(|aba))b)*
a(a|baaaa)*
You will be very thankful you did this.

4. Recursively insert cycle groups into lower-order cycles.

Let's do it one line at a time. First the cycle for "abaaa":
abaaabaababc
abaaabaabab
abaaabaaba
abaaabaab
abaaabaa
abaaaba
abaaab
abaaa(baaa)*abaa(a(baaa)*baabaa)*
aba(aa(baaa)*baababa)*
ab(a(|a(|a(baaa)*ba))b)*
a(a|baaa(baaa)*a)*
Then the cycle for "abaa":
abaaabaababc
abaaabaabab
abaaabaaba
abaaabaab
abaaabaa
abaaaba
abaaab
abaaa(baaa)* abaa(a(baaa)*baabaa)*aba(a(a(baaa)*baabaa)*a(baaa)*baababa)* ab(a(|a(a(baaa)*baabaa)*(|a(baaa)*ba))b)*
a(a|baa(a(baaa)*baabaa)*a(baaa)*a)*
Now "aba":
abaaabaababc
abaaabaabab
abaaabaaba
abaaabaab
abaaabaa
abaaaba
abaaab
abaaa(baaa)*abaa(a(baaa)*baabaa)*aba(a(a(baaa)*baabaa)*a(baaa)*baababa)*ab(a(a(a(baaa)*baabaa)*a(baaa)*baababa)*(|a(a(baaa)*baabaa)*(|a(baaa)*ba))b)*
a(a|ba(a(a(baaa)*baabaa)*a(baaa)*baababa)*a(a(baaa)*baabaa)*a(baaa)*a)*
And, finally, "ab":
abaaabaababc
abaaabaabab
abaaabaaba
abaaabaab
abaaabaa
abaaaba
abaaab
abaaa(baaa)*abaa(a(baaa)*baabaa)*aba(a(a(baaa)*baabaa)*a(baaa)*baababa)*ab(a(a(a(baaa)*baabaa)*a(baaa)*baababa)*(|a(a(baaa)*baabaa)*(|a(baaa)*ba))b)*a(a|b(a(a(a(baaa)*baabaa)*a(baaa)*baababa)*(|a(a(baaa)*baabaa)*(|a(baaa)*ba))b)*a(a(a(baaa)*baabaa)*a(baaa)*baababa)*a(a(baaa)*baabaa)*a(baaa)*a)*
5. Now the fun part.. building the expression!

Eleven painstaking iterations. Have you built a script to assist you with this yet? I haven't.
([^aa]|a(a|b(a(a(a(baaa)*baabaa)*a(baaa)*baababa)*(|a(a(baaa)*baabaa)*(|a(baaa)*ba))b)*a(a(a(baaa)*baabaa)*a(baaa)*baababa)*a(a(baaa)*baabaa)*a(baaa)*a)*([^ab]|b(a(a(a(baaa)*baabaa)*a(baaa)*baababa)*(|a(a(baaa)*baabaa)*(|a(baaa)*ba))b)*([^aa]|a(a(a(baaa)*baabaa)*a(baaa)*baababa)*([^aa]|a(a(baaa)*baabaa)*([^aa]|a(baaa)*([^ab]|b([^aa]|a([^aa]|a([^ab]|b([^aa]|a([^ab]|b[^ac])))))))))))
Check your work by scanning over the second characters of each of those negated character classes in sequence and making sure they spell "abaaabaababc".

6. Add characters to negated character classes that lead to cycles.

Look back at step 2 and identify the one line with cycles whose corresponding character is not 'a'. This is the line "ab(ab|aab|aaabab)*".

This means we need to add 'b' to the negated character classes associated with "abaa", "abaaa", and "abaaabaa".
([^aa]|a(a|b(a(a(a(baaa)*baabaa)*a(baaa)*baababa)*(|a(a(baaa)*baabaa)*(|a(baaa)*ba))b)*a(a(a(baaa)*baabaa)*a(baaa)*baababa)*a(a(baaa)*baabaa)*a(baaa)*a)*([^ab]|b(a(a(a(baaa)*baabaa)*a(baaa)*baababa)*(|a(a(baaa)*baabaa)*(|a(baaa)*ba))b)*([^aa]|a(a(a(baaa)*baabaa)*a(baaa)*baababa)*([^aab]|a(a(baaa)*baabaa)*([^aab]|a(baaa)*([^ab]|b([^aa]|a([^aab]|a([^ab]|b([^aa]|a([^ab]|b[^ac])))))))))))
6b. Simplify occurrences of "aa" in the negated character classes:
([^a]|a(a|b(a(a(a(baaa)*baabaa)*a(baaa)*baababa)*(|a(a(baaa)*baabaa)*(|a(baaa)*ba))b)*a(a(a(baaa)*baabaa)*a(baaa)*baababa)*a(a(baaa)*baabaa)*a(baaa)*a)*([^ab]|b(a(a(a(baaa)*baabaa)*a(baaa)*baababa)*(|a(a(baaa)*baabaa)*(|a(baaa)*ba))b)*([^a]|a(a(a(baaa)*baabaa)*a(baaa)*baababa)*([^ab]|a(a(baaa)*baabaa)*([^ab]|a(baaa)*([^ab]|b([^a]|a([^ab]|a([^ab]|b([^a]|a([^ab]|b[^ac])))))))))))
7. The final step.

Almost there. Now we do our little copy and paste trick. I don't need to tell you to delete the character classes from the second half of it by now:
([^a]|a(a|b(a(a(a(baaa)*baabaa)*a(baaa)*baababa)*(|a(a(baaa)*baabaa)*(|a(baaa)*ba))b)*a(a(a(baaa)*baabaa)*a(baaa)*baababa)*a(a(baaa)*baabaa)*a(baaa)*a)*([^ab]|b(a(a(a(baaa)*baabaa)*a(baaa)*baababa)*(|a(a(baaa)*baabaa)*(|a(baaa)*ba))b)*([^a]|a(a(a(baaa)*baabaa)*a(baaa)*baababa)*([^ab]|a(a(baaa)*baabaa)*([^ab]|a(baaa)*([^ab]|b([^a]|a([^ab]|a([^ab]|b([^a]|a([^ab]|b[^ac])))))))))))*(|a(a|b(a(a(a(baaa)*baabaa)*a(baaa)*baababa)*(|a(a(baaa)*baabaa)*(|a(baaa)*ba))b)*a(a(a(baaa)*baabaa)*a(baaa)*baababa)*a(a(baaa)*baabaa)*a(baaa)*a)*(|b(a(a(a(baaa)*baabaa)*a(baaa)*baababa)*(|a(a(baaa)*baabaa)*(|a(baaa)*ba))b)*(|a(a(a(baaa)*baabaa)*a(baaa)*baababa)*(|a(a(baaa)*baabaa)*(|a(baaa)*(|b(|a(|a(|b(|a(|b)))))))))))
And hey presto, we're done. Wasn't that easy? Well, a damn sight easier than doing it by trial and error, at least. If you ever to negate the word "abaaabaababc" using formal regular expressions, you now have a recipe for doing so.

The End


This was my attempt at filling in a gap of information that exists where formal regular expressions are concerned. I hope, dear reader, that you've found some of it interesting.

Don't think that I've run out of creative material in the domain of regex just yet. There will be more, so I invite you to stay tuned by following me on Twitter (@DrRegex).

Thank you for reading.

2 comments: