Sunday 5 November 2017

Lookahead Quantification: An utterly loopy trick

Sufficiently adventurous readers may have come across the technique of forward or nested references to do some tricky things with regex, such as non-recursively match a palindrome, match anbn, and even match an isomorphic pair of words. The technique works well in cases where you can consume one character at a time while simultaneously performing tests that call for simple examination of only the remainder of the subject at the next matching point. But what of problems that require greater visibility of the subject at each pivotal position? We don't have variable-length lookbehinds in PCRE; how, then, are we to perform feats of absurdity? I'll cut to the chase in a second, but first allow me to put to you a challenge.

The Challenge


This is an example of a regex task that is simply stated, appears easy at first glance, but is impossible to solve in the general case given PCRE's current functionality:
"Match a character that appears only once in the subject."
That's it. Just find and match a single character, and it can be any character as long as it makes but one appearance.

Matched case-sensitively, this sentence contains 12 such characters.

And this one contains 8.

How hard could it be to match any one of them, right? I'll give you a few minutes to shake your head, scoff, minimize this article, and attempt this challenge in your favourite regex environment. 

Welcome back. Were you close? Did you try to test for uniqueness using a negative lookahead, only to realize that everything to the left of the current matching point is entirely inaccessible? Or did you instead run into shortcomings while trying to recursively match sets of repeated characters? I feel your pain. No matter what trick one pulls out of the bag, the features of PCRE seem to fall just short of providing the means to solve this deceptively difficult problem.

The (Partial) Answer


And so, I would like to introduce the method of lookahead quantification: a method that paves the way for partial solutions to problems of this type. This is not a method you can assume will chime with your every requirement; it is a method that, should the stars align in your favour, you would be able to merely get away with. It relies on quantifying (a limited number of times) a group consisting solely of a positive lookahead assertion, containing accumulating nested references, in order to iterate through the subject without altering the matching point. An example to partially solve the problem at hand:
^(?:(?=(\1.|)(.))){1,850}?(?!.*\2.*\2)\1\K\2
See it in action at: https://regex101.com/r/T0gJWO/1

I say "partially" because it's clearly limited by the magic number "850", the reason for which I'll explain in a second. But for now let's go through the full expression. A description of what's happening can be summarized as follows:

"Examine one character at a time, without ever advancing the matching point from the start of the subject, checking to see if the current character doesn't occur twice (or more)."
^(?:            # Anchor to start. All checks must take place here for full visibility.
 (?=(\1.|)(.))  # Look ahead and either add a single character to \1, or initialize it to "". Then \2 becomes the next character to be tested.
){1,850}?       # The lazy '?', in essence, forces the engine to start checking characters for uniqueness from the beginning of the string. It therefore iterates as few times as possible until a character satisfying the remainder of the expression is found.
(?!.*\2.*\2)    # Test the current value of \2 for uniqueness in the whole subject.
\1\K\2          # If found, match up to \2 by first consuming \1, resetting the match with \K, and then matching \2.
First of all: why is the group necessary, ie. why do we not just quantify the lookahead? Because, dear reader, we are not allowed. Or, rather, no matter how we attempt to quantify a lookahead, it can only match at most once. This is understandable. In general, it is not expected that a zero-width assertion needs to match more than once, because it consumes no characters. Even though the case above is a perfectly sensible scenario for a hypothetical quantified lookahead, we must accept that such tactics are off-limits to us and proceed with resourcefulness by enclosing it in a non-capturing group.

After the first iteration of the group, \1 = "" and \2 = <the first character in the subject>. Because the group has been quantified with '?' to invoke laziness, the engine will relax at this point and move on to matching the rest of the expression. This is important. If you remove the '?' and make the match greedy, the engine will go through as many of the 850 iterations as possible to the point where \2 ends up being either the 850th character or the last character in the subject, whichever comes first. Only then will it move on to the rest of the expression and test \2 for uniqueness. If unsuccessful, the engine will not backtrack and try again with \2 as the previous character; because the group matched an empty string, it simply aborts. This is why the lazy quantifier is crucial, not to mention a more natural analogy of the conventional 'for' loops that are typically used to iterate through strings.

So, where does "850" come from, you ask? Well, you may verify that if you increase it - say, to "851" - regex101 will throw a compile error at you: "regular expression is too large". This is because the PCRE compiler handles range quantification by making copies of the quantified element. With a sufficiently large range, the total length of the copies will exceed the hard limit imposed by the compiler on the overall length of an expression (which, by default, is 64 kilobytes). It is possible to predict this magic number by calculation once you deduce the memory consumption of each of the subexpression's individual components, but I won't get into that in this article. If this technique interests you and you do use it for some practical purpose (intelligently, while heeding all of my implicit warnings), I recommend you try to discover these limits yourself using a simple binary search, knowing the upper limit is somewhere on the order of 1000s.

Can I make it less limited?


Not really. Well, sort of. If you wish to use this method for whatever reason but find that it is too limited for your particular purpose, you may be able to overcome this by applying additional slightly amended expressions to continue match attempts where you left off. For example, if the first expression I gave you only checks the first 850 characters for uniqueness, you can use a second expression to iterate through each character in the next portion of the subject:
^(?:(?=(\1.|.{850})(.))){1,779}?(?!.*\2.*\2)\1\K\2
Unfortunately, the inner expression has grown a little, so the new magic number is "779". This means that the earlier expression coupled with this one, used in succession where necessary, will allow you to test the first 1,629 characters of a string for uniqueness. If this is still not enough, then you can keep amending the expression and reapplying in a similar fashion to check the next however many characters, changing the magic numbers as appropriate. Or, y'know, you could forget all this nonsense and use your programming language to implement this in a much more efficient and maintainable manner. But.. where's the fun in that?

Where else can I use this method?


Now that you've seen and, with any luck, understood the earlier example, you may be wondering what other problems this method may lend itself to solving.

To speak broadly, any regex problem that seems impossible at first thought because it requires iteration and comparison of a variable number of substrings either in front of or, perish the thought, behind the current matching point could probably benefit from application of this method.

Note that you need not limit yourself to iterating over individual characters; any describable series of contiguous substrings is perfectly acceptable. Here is an example that loops through whole words in the subject:

Match the longest word in a string
\b(?!(?:(?=\S+((?(1)\1 \S+|)))){1,760}?(?:(?=\S++\1 (\2?+\S))\S)++\b)(\S+)
This time, it is in our best interests to wrap most of the expression in a negative lookahead. Why? Think about the difference between these two statements: "find a word such that all words ahead of it are no larger" vs. "find a word such that there does not exist a word ahead of it that's larger". Each of these describes a valid way to tackle the problem, and each lends itself to its own solution involving this loopy trick. If we choose to interpret it the first way, the result will be an expression that resembles the following:
\b(?:(?=\S+(\1 \S+|))(?!(?:(?=\S++\1 (\2?+\S))\S)++\b)){1,387}?(\S++)(?=\1$)
This is very similar in operation to the expression above, but the magic number "387" (which, by the way, is now a limit on a number of words not characters) is almost half of the other. By thinking about the problem slightly differently, we can move some of the logic from inside the non-capturing group, thereby increasing its permissible range.

Note to self: this method of cascading negation is useful and might be worth a blog post at some point in the future.

Match the character that occurs with the greatest frequency
(.)(?!(?:(?=((?(2)\2.)\1*+)(.))){1,680}?((?=\1).(?4)*?\3|(?=\3).(?4)*?\1|(?(?!\1|\3).)+)*+\3)
Getting a little crazier now. Once again, it is to our benefit to place the meat of the logic in a negative lookahead. Then for each character \3 following the first character \1, we recursively match character pairs in order to see which occurs with greater frequency. If \3 is more common than \1, then after eating through all the non-\3s, non-\1s, and \1+\3 pairs, we will be left with one or more \3s.

There is a lot happening in this expression, and I leave it as an exercise to you to determine what. Mainly because I am too lazy to write anymore.

In Closing


I hope you've enjoyed not fallen asleep reading the above article, and can now leave this page a little smarter than when you came in. This is my first post, so I'd love to hear any suggestions on how I can make this kind of material more interesting or otherwise improve its presentation.

Thank you for reading.

7 comments:

  1. Very interesting. I previously thought that this was impossible in PCRE due to the lack of either variable-length lookbehind or non-atomic lookahead. It didn't occur to me to intentionally limit the maximum length of strings that can be tested. Incidentally, I'm now even more horrified by PCRE's design that I was before.

    Note that the maximum position of the located unique character can be increased from 850 to 977 with this optimization:

    ^(?:(?=(\1?(.)))){1,977}?(?!.*\2.*\2)\1(?<=\K.)

    In my regex engine, I implemented molecular lookahead as (?*), which makes this problem solvable with unbounded string length:

    ^(?*(.*?)(.))(?!.*\2.*\2)\1\K\2

    $ ./regex -x all -o '^(?*(.*?)(.))(?!.*\2.*\2)\1\K\2'
    p9oS4QjstDSQKz6WVn8EPfLVDZcCdOxrGbHUpz9B3gwhawAFe6u7kmiqedGub4alBxMFiH80jXoyK1gMEWvT5JUYvNYrt2q07l2Zyhk3RO1JCPTIILcm5nfsRA
    X

    ReplyDelete
  2. A big flaw in this regex so far is that if it doesn't find a unique character, it always loops the maximum number of times (977 in the case of the optimized version above), regardless of input length. This doesn't change its final result, but does make it needlessly slow. Here's a fix for that:

    ^(?>(?:(?=(\1?(.?)))){1,961}?(?:\2^|(?!.*\2.*\2)))(?!\2^)\1(?<=\K.)

    When constructing the above fix, I found what I'd consider to be yet another bug / design flaw in PCRE:

    ^(?>(?:(?=(\1?(.)?))){1,962}?(?(2)(?!.*\2.*\2)))(?(2)\1(?<=\K.)|$.)

    The above should work, but PCRE refuses to set \2 into a non-participating capture group if it participated in the previous iteration of the loop (so it just keeps its value from the previous iteration). I don't see any upside of it being designed this way, so if it was a deliberate choice, it's a very bad one. Here's a simpler test case showing this flaw:

    ^(?:(z)?(?(1)aa|a)){2}

    Give this the input "zaaaa" and by all rights, it should match "zaaa", but instead, it matches "zaaaa".

    ReplyDelete
  3. The method you sketched above of augmenting the maximum length would work a little bit, but here's one that uses nesting to square the maximum:

    ^(?:(?=(\1.{518}|))){1,519}?(?:(?=\1(\2?(.)))){1,518}?(?!.*\3.*\3)\1\2(?<=\K.)

    This allows up to 268842 characters to be tested for uniqueness. Here's the version that doesn't keep looping through the same portions of a string multiple times:

    ^(?>(?:(?=((?(1)(?:\1.{376}|()))))){1,375}?(?(2)|(?>(?:(?=\1(\3?(.?)))){1,376}?(?:\4^|(?!.*\4.*\4)))(?!\4^)\1\3(?<=\K.)))(?(2).^)

    This allows up to 141000 characters to be tested for uniqueness.

    Extending this to cubing helps even more:

    ^(?>(?:(?=((?(1)(?:\1.{51984}|()))))){1,228}?(?(2)|(?>(?:(?=((?(3)(?:\3.{228}|())|\1)))){1,228}?(?(4)|(?>(?:(?=\3(\5?(.?)))){1,228}?(?:\6^|(?!.*\6.*\6)))(?!\6^)\3\5(?<=\K.)))(?(4).^)))(?(2).^)

    This allows up to 11852352 characters to be tested for uniqueness. regex101.com complains about catastrophic backtracking, but I tested it in the command-line version of pcregrep and it worked fine.

    Cubing is the sweet spot. An attempt to extend it to fourth power starts hurting the maximum length, because PCRE puts a {65535} limit on quantifiers:

    ^(?>(?:(?=((?(1)(?:\1.(?:.{44217}){3}|()))))){1,51}?(?(2)|(?>(?:(?=((?(3)(?:\3(?:.{51}){51}|())|\1)))){1,51}?(?(4)|(?>(?:(?=((?(5)(?:\5.{51}|())|\3)))){1,51}?(?(6)|(?>(?:(?=\5(\7?(.?)))){1,51}?(?:\8^|(?!.*\8.*\8)))(?!\8^)\5\7(?<=\K.)))(?(6).^)))(?(4).^)))(?(2).^)

    This only allows up to 6765201 characters to be tested for uniqueness.

    ReplyDelete
  4. Golfed down the cubing version:

    ^(?>(?:(?=(\1.{55696}|((?=\1))|))){1,236}?(?(2)|(?>(?:(?=(\3.{236}|((?=\3))|\1))){1,236}?(?(4)|(?>(?:(?=\3(\5?(.?)))){1,236}?(?:\6^|(?!.*\6.*\6)))(?!\6^)\3\5(?<=\K.)))))(?!\2|\4)

    This allows up to 13144256 characters to be tested for uniqueness. And as another upshot, regex101.com will test longer strings before complaining of catastrophic backtracking.

    And a better optimization of the original, simple regex:

    ^(?:(?=(\1.|))){1,1109}?(?=\1(.))(?!.*\2.*\2)\1\K\2

    ReplyDelete
  5. Further optimization allows fourth power to have the advantage:

    ^(?>(?:(?=(\1(?:.{60500}){22}|))){1,110}?(?:(?=(\2.{12100}|\1))){1,110}?(?:(?=(\3.{110}|\2))){1,110}?(?:(?=(\4.|\3))){1,110}?(?=\4(.)|())(?(6)|(?!.*\5.*\5)\4\K\5))(?!\6)

    Slightly more efficient using a backtracking verb:

    ^(?>(?=(\1(?:.{60500}){22}|))){1,110}?(?>(?=(\2.{12100}|\1))){1,110}?(?>(?=(\3.{110}|\2))){1,110}?(?>(?=(\4.|\3))){1,110}?(?=\4(.)|(*COMMIT)(?!))(?!.*\5.*\5)\4\K\5

    This allows up to 146,410,000 characters to be tested for uniqueness. Should be enough, right?

    The cubic version only does 34,645,976 characters:

    ^(?:(?=(\1.{53138}.{53138}|))){1,326}?(?:(?=(\2.{326}|\1))){1,326}?(?:(?=(\3.|\2))){1,326}?(?=\3(.)|(*COMMIT)(?!))(?!.*\4.*\4)\3\K\4

    ReplyDelete
  6. It doesn't even end there. Benefits are to be had going up to sixth power, and the most-significant upper quantifier can also be increased.

    "Portable" version:

    1,107 characters: ^(?>(?:(?=(\1.|))){1,1107}?(?=\1(.)|())(?(3)|(?!.*\2.*\2)\1\K\2))(?!\3)
    268,842 characters: ^(?>(?>(?=(\1.{518}|))){1,519}?(?>(?=(\2.|\1))){1,518}?(?=\2(.)|())(?(4)|(?!.*\3.*\3)\2\K\3))(?!\4)
    34,858,528 characters: ^(?>(?>(?=(\1.{53138}.{53138}|))){1,328}?(?>(?=(\2.{326}|\1))){1,326}?(?>(?=(\3.|\2))){1,326}?(?=\3(.)|())(?(5)|(?!.*\4.*\4)\3\K\4))(?!\5)
    149,072,000 characters: ^(?>(?>(?=(\1(?:.{60500}){22}|))){1,112}?(?>(?=(\2.{12100}|\1))){1,110}?(?>(?=(\3.{110}|\2))){1,110}?(?>(?=(\4.|\3))){1,110}?(?=\4(.)|())(?(6)|(?!.*\5.*\5)\4\K\5))(?!\6)
    332,800,000 characters: ^(?>(?>(?=(\1(?:.{64000}.{64000}.{64000}.{64000}){10}|))){1,130}?(?>(?=(\2.{64000}|\1))){1,40}?(?>(?=(\3.{1600}|\2))){1,40}?(?>(?=(\4.{40}|\3))){1,40}?(?>(?=(\5.|\4))){1,40}?(?=\5(.)|())(?(7)|(?!.*\6.*\6)\5\K\6))(?!\7)
    342,392,832 characters: ^(?>(?>(?=(\1(?:.{62208}.{62208}.{62208}.{62208}){32}|))){1,43}?(?>(?=(\2(?:.{55296}.{55296}.{55296}.{55296}){6}|\1))){1,24}?(?>(?=(\3.{13824}|\2))){1,24}?(?>(?=(\4.{576}|\3))){1,24}?(?>(?=(\5.{24}|\4))){1,24}?(?>(?=(\6.|\5))){1,24}?(?=\6(.)|())(?(8)|(?!.*\7.*\7)\6\K\7))(?!\8)

    PCRE-only version, with backtracking verbs:

    1,108 characters: ^(?:(?=(\1.|))){1,1108}?(?=\1(.)|(*COMMIT)(?!))(?!.*\2.*\2)\1\K\2
    269,361 characters: ^(?>(?=(\1.{519}|))){1,519}?(?>(?=(\2.|\1))){1,519}?(?=\2(.)|(*COMMIT)(?!))(?!.*\3.*\3)\2\K\3
    34,964,804 characters: ^(?>(?=(\1.{53138}.{53138}|))){1,329}?(?>(?=(\2.{326}|\1))){1,326}?(?>(?=(\3.|\2))){1,326}?(?=\3(.)|(*COMMIT)(?!))(?!.*\4.*\4)\3\K\4
    149,072,000 characters: ^(?>(?=(\1(?:.{60500}){22}|))){1,112}?(?>(?=(\2.{12100}|\1))){1,110}?(?>(?=(\3.{110}|\2))){1,110}?(?>(?=(\4.|\3))){1,110}?(?=\4(.)|(*COMMIT)(?!))(?!.*\5.*\5)\4\K\5
    335,360,000 characters: ^(?>(?=(\1(?:.{64000}.{64000}.{64000}.{64000}){10}|))){1,131}?(?>(?=(\2.{64000}|\1))){1,40}?(?>(?=(\3.{1600}|\2))){1,40}?(?>(?=(\4.{40}|\3))){1,40}?(?>(?=(\5.|\4))){1,40}?(?=\5(.)|(*COMMIT)(?!))(?!.*\6.*\6)\5\K\6
    342,392,832 characters: ^(?>(?=(\1(?:.{62208}.{62208}.{62208}.{62208}){32}|))){1,43}?(?>(?=(\2(?:.{55296}.{55296}.{55296}.{55296}){6}|\1))){1,24}?(?>(?=(\3.{13824}|\2))){1,24}?(?>(?=(\4.{576}|\3))){1,24}?(?>(?=(\5.{24}|\4))){1,24}?(?>(?=(\6.|\5))){1,24}?(?=\6(.)|(*COMMIT)(?!))(?!.*\7.*\7)\6\K\7

    ReplyDelete
  7. It turns out that this trick is not needed, and the task is solvable with PCRE in the general case. Grimy has discovered that recursion and fixed-length lookbehind can be used to emulate variable-length lookbehind:

    (.)(?!((?<=(?=.^|\1.*\1|(?2)).)))

    ReplyDelete