Wednesday 13 February 2019

Variable-Length Lookbehinds: actually possible in Perl/PCRE!

Get strapped in, because this is as close as you're going to get to breaking news in the world of regular expressions.

You know what the best thing is about being involved in regex circles? It's not the money or the fame or the women.. no no; it's that, on occasion, you get to witness incredible discoveries unfold before your very eyes. This is exactly what happened in a small StackExchange chatroom earlier, when fellow regex tinkerer Grimy cracked a problem that previously seemed insoluble. In doing so, he uncovered a technique for emulating a feature that has been coveted for years by Perl and PCRE users alike: variable-length lookbehinds.

Now, forget \K. That doesn't come close to having the expressive power we want. I'm talking about truly emulating variable-length lookbehinds; that is, doing so for all conceivable scenarios.

The Revelation

In response to a challenge to match an arbitrary string (in the domain of ^.*$) that contains no unique characters, a task I once dismissed as impossible in the general case, Grimy ingeniously came up with the following:

Wonderfully elegant and simple in principle. Let us begin by examining the core concept.

The Concept

((?<=(?= ... |(?3)).))
The trick is to remember that, while lookbehinds can only peek a fixed number of characters behind the current matching point, the lookaheads they contain are not subject to any such restriction. If \1 is a single character, (?<=(?=\1).) succeeds where (?<=\1) fails because in the former case you are telling the regex engine that it needs to go back precisely one character to look for \1.

Because recursive algorithms are tricky to follow, let us expand this subexpression a couple of levels deep to observe what's happening:
((?<=(?= ... |(?<=(?= ... |(?<=(?= ... |(?3)).)).)).))
Notice how we're moving back one character each step of the way. So if (?= ... ) fails to match, move back one character and try again. Thus, it is equivalent in essence to using (?<=(?= ...).*) if actual variable-length lookbehinds were possible.

The Nanny

If you've read and understood everything so far, you may have thought to look back at Grimy's original expression to try and understand it in full. What, then, you ask, is the purpose of \n in the lookahead? After all, \n, which matches a new-line character, should not exist in the domain of ^.*$ since `.` does not match a new-line by default. So why is it there?

Regex engines often have a bunch of nannies in place to prevent you from hurting yourself. One of these employed by PCRE is a compile-time check for recursive calls that appear, from a cursory static inspection, as though they may not terminate. The simplest example of this is the expression (?R), which yields the compile-time error:
40  recursive call could loop indefinitely
With Grimy's expression, the PCRE engine doesn't realize that \2 is guaranteed to match a single character. To work around this, we need to give the engine at least one non-empty path to follow. \n is a contradiction in the domain of ^.*$ and, more importantly, it functions as a non-empty alternative that convinces the nanny that we know what we're doing.

Note that in the more general domain of ^[\s\S]*$, which may include new-line characters, x^ serves as a non-empty contradiction.

The Emulation

Perspicacious readers may have observed that the method above involves an unbounded lookahead, one that may peer far past the position at which we wish to place our emulated lookbehind. This is fine for the problem it aimed to solve, but if we want to expand the method to work in the general case, this issue needs to be addressed.

And so, for a variable-length subexpression X, (?<=X) can be implemented in both Perl and PCRE as:
See it in action on regex101

To negate it and emulate (?<!X), simply enclose the whole thing in a negative lookahead:
Aside from the fact that I've used two capturing groups, that snippet is basically plug-and-play: you may insert it anywhere in your expression where a variable-length lookbehind is desired.

The main difference here is that a new group "a" has been added to capture the portion of the string after the current matching point, so that when we travel backwards and look ahead for X we know where to stop. This solves the issue mentioned above.

Next, X has been moved to the outer group "b" as it would fail to match if X were zero-width since the lookahead only begins at the position one character behind the current matching point, and only moves backwards thereafter.

Here is a breakdown of the various components:
(?=(?'a'[\s\S]*))  # Capture the rest of the string in "a"
  X(?=\k'a'\z)     # Match X followed by the contents of "a" to ensure
                   # the emulated lookbehind stops at the correct point.
  |                # OR
  (?<=             # Look behind (one character) match either:
      x^           # A contradiction; non-empty to appease the nanny
      |            # OR
      (?&b)        # Recurse (match X OR look behind (one character)) etc..
    [\s\S]         # How far we go back each step: one single character

Note that by emulating a positive lookbehind in this way, you lose the luxury of having capture groups in X that return their values to the caller. But are you complaining?

The End

There you have it: another impossibility made possible by the creative application of old tools. I hope you enjoy using this new toy as much as I certainly will. Big shout out once again to Grimy for his stroke of genius.

As usual, I welcome you to follow me on twitter to stay apprised of more regex wackiness in the future.

The Bonus

What about problems that require comparing the substring at the current matching point to multiple substrings at later positions in the string - can we now handle those? Turns out we can! By iterating forwards with a lookahead and using our new toy to look back to the original matching point, we can make short work of any problem of that nature. For example, "match the longest word in a line" can now be solved in the general case:
Figuring out how that works is left as an exercise to the reader :) Suffice it to say, there is now a whole new set of problems for which we can irresponsibly abuse regex.