You know what the best thing is about being involved in regex circles? It's not the money or the fame or the women.. no no; it's that, on occasion, you get to witness incredible discoveries unfold before your very eyes. This is exactly what happened in a small StackExchange chatroom earlier, when fellow regex tinkerer Grimy cracked a problem that previously seemed insoluble. In doing so, he uncovered a technique for emulating a feature that has been coveted for years by Perl and PCRE users alike: variable-length lookbehinds.

Now, forget \K. That doesn't come close to having the expressive power we want. I'm talking about truly emulating variable-length lookbehinds; that is, doing so for all conceivable scenarios.

### The Revelation

Wonderfully elegant and simple in principle. Let us begin by examining the core concept.

### The Concept

**((?<=(?= ... |(?3)).))**

Because recursive algorithms are tricky to follow, let us expand this subexpression a couple of levels deep to observe what's happening:

**((?<=(?= ... |(?<=(?= ... |(?<=(?= ... |(?3)).)).)).))**

### The Nanny

If you've read and understood everything so far, you may have thought to look back at Grimy's original expression to try and understand it in full. What, then, you ask, is the purpose of \n in the lookahead? After all, \n, which matches a new-line character, should not exist in the domain of ^.*$ since `.` does not match a new-line by default. So why is it there?

Regex engines often have a bunch of nannies in place to prevent you from hurting yourself. One of these employed by PCRE is a compile-time check for recursive calls that appear, from a cursory static inspection, as though they may not terminate. The simplest example of this is the expression (?R), which yields the compile-time error:

With Grimy's expression, the PCRE engine doesn't realize that \2 is guaranteed to match a single character. To work around this, we need to give the engine at least one non-empty path to follow. \n is a contradiction in the domain of ^.*$ and, more importantly, it functions as a non-empty alternative that convinces the nanny that we know what we're doing.40 recursive call could loop indefinitely

Note that in the more general domain of ^[\s\S]*$, which may include new-line characters, x^ serves as a non-empty contradiction.

### The Emulation

Perspicacious readers may have observed that the method above involves an unbounded lookahead, one that may peer far past the position at which we wish to place our emulated lookbehind. This is fine for the problem it aimed to solve, but if we want to expand the method to work in the general case, this issue needs to be addressed.

**And so, for a variable-length subexpression X, (?<=X) can be implemented in both Perl and PCRE as:**

**(?=(?'a'[\s\S]*))(?'b'X(?=\k'a'\z)|(?<=(?=x^|(?&b))[\s\S]))**

To negate it and emulate (?<!X), simply enclose the whole thing in a negative lookahead:

**(?!(?=(?'a'[\s\S]*))(?'b'X(?=\k'a'\z)|(?<=(?=x^|(?&b))[\s\S])))**

The main difference here is that a new group "a" has been added to capture the portion of the string after the current matching point, so that when we travel backwards and look ahead for X we know where to stop. This solves the issue mentioned above.

Next, X has been moved to the outer group "b" as it would fail to match if X were zero-width since the lookahead only begins at the position one character behind the current matching point, and only moves backwards thereafter.

Here is a breakdown of the various components:

**(?=(?'a'[\s\S]*)) # Capture the rest of the string in "a"
(?'b'
X(?=\k'a'\z) # Match X followed by the contents of "a" to ensure
# the emulated lookbehind stops at the correct point.
| # OR
(?<= # Look behind (one character) match either:
(?=
x^ # A contradiction; non-empty to appease the nanny
| # OR
(?&b) # Recurse (match X OR look behind (one character)) etc..
)
[\s\S] # How far we go back each step: one single character
)
) **

### Note that by emulating a positive lookbehind in this way, you lose the luxury of having capture groups in X that return their values to the caller. But are you complaining?

### The End

As usual, I welcome you to follow me on twitter to stay apprised of more regex wackiness in the future.

### The Bonus

What about problems that require comparing the substring at the current matching point to multiple substrings at later positions in the string - can we now handle those? Turns out we can! By iterating forwards with a lookahead and using our new toy to look back to the original matching point, we can make short work of any problem of that nature. For example, "match the longest word in a line" can now be solved in the general case:

**\b(\w++)(?=(.*))(?!(.*\W)\b((?<=(?=(?=\1\2$)(?:(?=\w*+\3(\5?+\w))\w)++\b|(?4)).)))**

This comment has been removed by the author.

ReplyDelete