Wednesday, 13 February 2019

Variable-Length Lookbehinds: actually possible in Perl/PCRE!

Get strapped in, because this is as close as you're going to get to breaking news in the world of regular expressions.



You know what the best thing is about being involved in regex circles? It's not the money or the fame or the women.. no no; it's that, on occasion, you get to witness incredible discoveries unfold before your very eyes. This is exactly what happened in a small StackExchange chatroom earlier, when fellow regex tinkerer Grimy cracked a problem that previously seemed insoluble. In doing so, he uncovered a technique for emulating a feature that has been coveted for years by Perl and PCRE users alike: variable-length lookbehinds.

Now, forget \K. That doesn't come close to having the expressive power we want. I'm talking about truly emulating variable-length lookbehinds; that is, doing so for all conceivable scenarios.

The Revelation


In response to a challenge to match an arbitrary string (in the domain of ^.*$) that contains no unique characters, a task I once dismissed as impossible in the general case, Grimy ingeniously came up with the following:

Wonderfully elegant and simple in principle. Let us begin by examining the core concept.

The Concept

((?<=(?= ... |(?3)).))
The trick is to remember that, while lookbehinds can only peek a fixed number of characters behind the current matching point, the lookaheads they contain are not subject to any such restriction. If \1 is a single character, (?<=(?=\1).) succeeds where (?<=\1) fails because in the former case you are telling the regex engine that it needs to go back precisely one character to look for \1.

Because recursive algorithms are tricky to follow, let us expand this subexpression a couple of levels deep to observe what's happening:
((?<=(?= ... |(?<=(?= ... |(?<=(?= ... |(?3)).)).)).))
Notice how we're moving back one character each step of the way. So if (?= ... ) fails to match, move back one character and try again. Thus, it is equivalent in essence to using (?<=(?= ...).*) if actual variable-length lookbehinds were possible.

The Nanny


If you've read and understood everything so far, you may have thought to look back at Grimy's original expression to try and understand it in full. What, then, you ask, is the purpose of \n in the lookahead? After all, \n, which matches a new-line character, should not exist in the domain of ^.*$ since `.` does not match a new-line by default. So why is it there?

Regex engines often have a bunch of nannies in place to prevent you from hurting yourself. One of these employed by PCRE is a compile-time check for recursive calls that appear, from a cursory static inspection, as though they may not terminate. The simplest example of this is the expression (?R), which yields the compile-time error:
40  recursive call could loop indefinitely
With Grimy's expression, the PCRE engine doesn't realize that \2 is guaranteed to match a single character. To work around this, we need to give the engine at least one non-empty path to follow. \n is a contradiction in the domain of ^.*$ and, more importantly, it functions as a non-empty alternative that convinces the nanny that we know what we're doing.

Note that in the more general domain of ^[\s\S]*$, which may include new-line characters, x^ serves as a non-empty contradiction.

The Emulation


Perspicacious readers may have observed that the method above involves an unbounded lookahead, one that may peer far past the position at which we wish to place our emulated lookbehind. This is fine for the problem it aimed to solve, but if we want to expand the method to work in the general case, this issue needs to be addressed.

And so, for a variable-length subexpression X, (?<=X) can be implemented in both Perl and PCRE as:
(?=(?'a'[\s\S]*))(?'b'X(?=\k'a'\z)|(?<=(?=x^|(?&b))[\s\S]))
See it in action on regex101

To negate it and emulate (?<!X), simply enclose the whole thing in a negative lookahead:
(?!(?=(?'a'[\s\S]*))(?'b'X(?=\k'a'\z)|(?<=(?=x^|(?&b))[\s\S])))
Aside from the fact that I've used two capturing groups, that snippet is basically plug-and-play: you may insert it anywhere in your expression where a variable-length lookbehind is desired.

The main difference here is that a new group "a" has been added to capture the portion of the string after the current matching point, so that when we travel backwards and look ahead for X we know where to stop. This solves the issue mentioned above.

Next, X has been moved to the outer group "b" as it would fail to match if X were zero-width since the lookahead only begins at the position one character behind the current matching point, and only moves backwards thereafter.

Here is a breakdown of the various components:
(?=(?'a'[\s\S]*))  # Capture the rest of the string in "a"
(?'b'
  X(?=\k'a'\z)     # Match X followed by the contents of "a" to ensure
                   # the emulated lookbehind stops at the correct point.
  |                # OR
  (?<=             # Look behind (one character) match either:
    (?=
      x^           # A contradiction; non-empty to appease the nanny
      |            # OR
      (?&b)        # Recurse (match X OR look behind (one character)) etc..
    )
    [\s\S]         # How far we go back each step: one single character
  )
) 

Note that by emulating a positive lookbehind in this way, you lose the luxury of having capture groups in X that return their values to the caller. But are you complaining?

The End


There you have it: another impossibility made possible by the creative application of old tools. I hope you enjoy using this new toy as much as I certainly will. Big shout out once again to Grimy for his stroke of genius.

As usual, I welcome you to follow me on twitter to stay apprised of more regex wackiness in the future.

The Bonus


What about problems that require comparing the substring at the current matching point to multiple substrings at later positions in the string - can we now handle those? Turns out we can! By iterating forwards with a lookahead and using our new toy to look back to the original matching point, we can make short work of any problem of that nature. For example, "match the longest word in a line" can now be solved in the general case:
\b(\w++)(?=(.*))(?!(.*\W)\b((?<=(?=(?=\1\2$)(?:(?=\w*+\3(\5?+\w))\w)++\b|(?4)).)))
Figuring out how that works is left as an exercise to the reader :) Suffice it to say, there is now a whole new set of problems for which we can irresponsibly abuse regex.

37 comments:

  1. Excellent Blog. I really want to admire the quality of this post. I like the way of your presentation of ideas, views and valuable content. No doubt you are doing great work. I’ll be waiting for your next post. Thanks .Keep it up! Kindly visit us @Luxury Boxes
    Premium Packaging
    Luxury Candles Box
    Earphone Packaging Box
    Wireless Headphone Box
    Innovative Packaging Boxes
    Wedding gift box
    Leather Bag Packaging Box
    Cosmetics Packaging Box
    Luxury Chocolate Boxes

    ReplyDelete
  2. One of the most prominent issues the students have to deal with while writing assignments is plagiarism. Hence, they extensively use plagiarism checker to check if there are any copied content in the paper.
    Another major reason for using plagiarism checkers is that universities do not accept plagiarized content. Plagiarism is a serious offence. Hence, if found, the students are suspended or might even lose the grades.
    Due to these limitations, it is evident that the plagiarism checking & word counter tool are not at all effective to check plagiarism. The term plagiarism is actually very broad. It is merely not coping with words. But these tools, unfortunately, detect words but not ideas. Hence, the chances of plagiarized papers remain.

    ReplyDelete
  3. Well-explained blog! Writer has described the complete information of Assignment Help in detailed manner. Find all details regarding expert help at greatassignmenthelp to achieve high marks and to boost your academic performance.
    Assignment Help Online
    Online Assignment Help
    Assignment Help Online Services
    Assignment Helper
    Assignment Assistance
    Assignment Help Experts
    Online Assignment Help Services
    Assignment Writing Help

    ReplyDelete
  4. A very high level post with a knowledgeable information .thanks you for giving me such a nice information. If you need any college level Assignment Help at reliable quality with better work. Kindly visit Ideal assignment help.

    ReplyDelete
  5. I like this blog very much, Its a very nice billet to read and incur Info. Also, check our sites..!!

    123.hp.com , 123.hp.com/setup , 123 hp com , 123 hp com setup, 123 hp setup , HP com setup

    ReplyDelete
  6. thanks for your details It's very nice and amazing.I got the informationweb design company in velachery

    ReplyDelete
  7. thanks for your information really good and very nice web design company in velachery

    ReplyDelete