tag:blogger.com,1999:blog-6179667219199982633.post1917234792370348652..comments2024-01-31T12:50:21.810+04:00Comments on Dr. Regex: Lookahead Quantification: An utterly loopy trickJohn Tattarakishttp://www.blogger.com/profile/03164148578519350285noreply@blogger.comBlogger7125tag:blogger.com,1999:blog-6179667219199982633.post-70969708911010013322019-02-12T21:46:17.289+04:002019-02-12T21:46:17.289+04:00It turns out that this trick is not needed, and th...It turns out that this trick is not needed, and the task is solvable with PCRE in the general case. <a href="https://codegolf.stackexchange.com/users/6484/grimy" rel="nofollow">Grimy</a> has discovered that recursion and fixed-length lookbehind can be used to <a href="https://chat.stackexchange.com/transcript/message/49012521#49012521" rel="nofollow">emulate variable-length lookbehind</a>:<br /><br /><b>(.)(?!((?<=(?=.^|\1.*\1|(?2)).)))</b>Davidebyzerohttps://www.blogger.com/profile/06591232839057305169noreply@blogger.comtag:blogger.com,1999:blog-6179667219199982633.post-38675573151995312222019-01-05T23:18:22.104+04:002019-01-05T23:18:22.104+04:00It doesn't even end there. Benefits are to be ...It doesn't even end there. Benefits are to be had going up to sixth power, and the most-significant upper quantifier can also be increased.<br /><br />"Portable" version:<br /><br />1,107 characters: ^(?>(?:(?=(\1.|))){1,1107}?(?=\1(.)|())(?(3)|(?!.*\2.*\2)\1\K\2))(?!\3)<br />268,842 characters: ^(?>(?>(?=(\1.{518}|))){1,519}?(?>(?=(\2.|\1))){1,518}?(?=\2(.)|())(?(4)|(?!.*\3.*\3)\2\K\3))(?!\4)<br />34,858,528 characters: ^(?>(?>(?=(\1.{53138}.{53138}|))){1,328}?(?>(?=(\2.{326}|\1))){1,326}?(?>(?=(\3.|\2))){1,326}?(?=\3(.)|())(?(5)|(?!.*\4.*\4)\3\K\4))(?!\5)<br />149,072,000 characters: ^(?>(?>(?=(\1(?:.{60500}){22}|))){1,112}?(?>(?=(\2.{12100}|\1))){1,110}?(?>(?=(\3.{110}|\2))){1,110}?(?>(?=(\4.|\3))){1,110}?(?=\4(.)|())(?(6)|(?!.*\5.*\5)\4\K\5))(?!\6)<br />332,800,000 characters: ^(?>(?>(?=(\1(?:.{64000}.{64000}.{64000}.{64000}){10}|))){1,130}?(?>(?=(\2.{64000}|\1))){1,40}?(?>(?=(\3.{1600}|\2))){1,40}?(?>(?=(\4.{40}|\3))){1,40}?(?>(?=(\5.|\4))){1,40}?(?=\5(.)|())(?(7)|(?!.*\6.*\6)\5\K\6))(?!\7)<br />342,392,832 characters: ^(?>(?>(?=(\1(?:.{62208}.{62208}.{62208}.{62208}){32}|))){1,43}?(?>(?=(\2(?:.{55296}.{55296}.{55296}.{55296}){6}|\1))){1,24}?(?>(?=(\3.{13824}|\2))){1,24}?(?>(?=(\4.{576}|\3))){1,24}?(?>(?=(\5.{24}|\4))){1,24}?(?>(?=(\6.|\5))){1,24}?(?=\6(.)|())(?(8)|(?!.*\7.*\7)\6\K\7))(?!\8)<br /><br />PCRE-only version, with backtracking verbs:<br /><br />1,108 characters: ^(?:(?=(\1.|))){1,1108}?(?=\1(.)|(*COMMIT)(?!))(?!.*\2.*\2)\1\K\2<br />269,361 characters: ^(?>(?=(\1.{519}|))){1,519}?(?>(?=(\2.|\1))){1,519}?(?=\2(.)|(*COMMIT)(?!))(?!.*\3.*\3)\2\K\3<br />34,964,804 characters: ^(?>(?=(\1.{53138}.{53138}|))){1,329}?(?>(?=(\2.{326}|\1))){1,326}?(?>(?=(\3.|\2))){1,326}?(?=\3(.)|(*COMMIT)(?!))(?!.*\4.*\4)\3\K\4<br />149,072,000 characters: ^(?>(?=(\1(?:.{60500}){22}|))){1,112}?(?>(?=(\2.{12100}|\1))){1,110}?(?>(?=(\3.{110}|\2))){1,110}?(?>(?=(\4.|\3))){1,110}?(?=\4(.)|(*COMMIT)(?!))(?!.*\5.*\5)\4\K\5<br />335,360,000 characters: ^(?>(?=(\1(?:.{64000}.{64000}.{64000}.{64000}){10}|))){1,131}?(?>(?=(\2.{64000}|\1))){1,40}?(?>(?=(\3.{1600}|\2))){1,40}?(?>(?=(\4.{40}|\3))){1,40}?(?>(?=(\5.|\4))){1,40}?(?=\5(.)|(*COMMIT)(?!))(?!.*\6.*\6)\5\K\6<br />342,392,832 characters: ^(?>(?=(\1(?:.{62208}.{62208}.{62208}.{62208}){32}|))){1,43}?(?>(?=(\2(?:.{55296}.{55296}.{55296}.{55296}){6}|\1))){1,24}?(?>(?=(\3.{13824}|\2))){1,24}?(?>(?=(\4.{576}|\3))){1,24}?(?>(?=(\5.{24}|\4))){1,24}?(?>(?=(\6.|\5))){1,24}?(?=\6(.)|(*COMMIT)(?!))(?!.*\7.*\7)\6\K\7<br />Davidebyzerohttps://www.blogger.com/profile/06591232839057305169noreply@blogger.comtag:blogger.com,1999:blog-6179667219199982633.post-81646746020623922812019-01-05T16:49:12.558+04:002019-01-05T16:49:12.558+04:00Further optimization allows fourth power to have t...Further optimization allows fourth power to have the advantage:<br /><br /><b>^(?>(?:(?=(\1(?:.{60500}){22}|))){1,110}?(?:(?=(\2.{12100}|\1))){1,110}?(?:(?=(\3.{110}|\2))){1,110}?(?:(?=(\4.|\3))){1,110}?(?=\4(.)|())(?(6)|(?!.*\5.*\5)\4\K\5))(?!\6)</b><br /><br />Slightly more efficient using a backtracking verb:<br /><br /><b>^(?>(?=(\1(?:.{60500}){22}|))){1,110}?(?>(?=(\2.{12100}|\1))){1,110}?(?>(?=(\3.{110}|\2))){1,110}?(?>(?=(\4.|\3))){1,110}?(?=\4(.)|(*COMMIT)(?!))(?!.*\5.*\5)\4\K\5</b><br /><br />This allows up to <b>146,410,000</b> characters to be tested for uniqueness. Should be enough, right?<br /><br />The cubic version only does 34,645,976 characters:<br /><br />^(?:(?=(\1.{53138}.{53138}|))){1,326}?(?:(?=(\2.{326}|\1))){1,326}?(?:(?=(\3.|\2))){1,326}?(?=\3(.)|(*COMMIT)(?!))(?!.*\4.*\4)\3\K\4Davidebyzerohttps://www.blogger.com/profile/06591232839057305169noreply@blogger.comtag:blogger.com,1999:blog-6179667219199982633.post-81776353437099071802019-01-05T14:18:18.726+04:002019-01-05T14:18:18.726+04:00Golfed down the cubing version:
^(?>(?:(?=(\1....Golfed down the cubing version:<br /><br /><b>^(?>(?:(?=(\1.{55696}|((?=\1))|))){1,236}?(?(2)|(?>(?:(?=(\3.{236}|((?=\3))|\1))){1,236}?(?(4)|(?>(?:(?=\3(\5?(.?)))){1,236}?(?:\6^|(?!.*\6.*\6)))(?!\6^)\3\5(?<=\K.)))))(?!\2|\4)</b><br /><br />This allows up to <b>13144256 characters</b> to be tested for uniqueness. And as another upshot, regex101.com will test longer strings before complaining of catastrophic backtracking.<br /><br />And a better optimization of the original, simple regex:<br /><br />^(?:(?=(\1.|))){1,1109}?(?=\1(.))(?!.*\2.*\2)\1\K\2Davidebyzerohttps://www.blogger.com/profile/06591232839057305169noreply@blogger.comtag:blogger.com,1999:blog-6179667219199982633.post-89464348095999790662019-01-04T10:39:27.346+04:002019-01-04T10:39:27.346+04:00The method you sketched above of augmenting the ma...The method you sketched above of augmenting the maximum length would work a little bit, but here's one that uses nesting to <i>square</i> the maximum:<br /><br />^(?:(?=(\1.{518}|))){1,519}?(?:(?=\1(\2?(.)))){1,518}?(?!.*\3.*\3)\1\2(?<=\K.)<br /><br />This allows up to 268842 characters to be tested for uniqueness. Here's the version that doesn't keep looping through the same portions of a string multiple times:<br /><br />^(?>(?:(?=((?(1)(?:\1.{376}|()))))){1,375}?(?(2)|(?>(?:(?=\1(\3?(.?)))){1,376}?(?:\4^|(?!.*\4.*\4)))(?!\4^)\1\3(?<=\K.)))(?(2).^)<br /><br />This allows up to 141000 characters to be tested for uniqueness.<br /><br />Extending this to cubing helps even more:<br /><br /><b>^(?>(?:(?=((?(1)(?:\1.{51984}|()))))){1,228}?(?(2)|(?>(?:(?=((?(3)(?:\3.{228}|())|\1)))){1,228}?(?(4)|(?>(?:(?=\3(\5?(.?)))){1,228}?(?:\6^|(?!.*\6.*\6)))(?!\6^)\3\5(?<=\K.)))(?(4).^)))(?(2).^)</b><br /><br />This allows up to <b>11852352 characters</b> to be tested for uniqueness. regex101.com complains about catastrophic backtracking, but I tested it in the command-line version of pcregrep and it worked fine.<br /><br />Cubing is the sweet spot. An attempt to extend it to fourth power starts hurting the maximum length, because PCRE puts a {65535} limit on quantifiers:<br /><br />^(?>(?:(?=((?(1)(?:\1.(?:.{44217}){3}|()))))){1,51}?(?(2)|(?>(?:(?=((?(3)(?:\3(?:.{51}){51}|())|\1)))){1,51}?(?(4)|(?>(?:(?=((?(5)(?:\5.{51}|())|\3)))){1,51}?(?(6)|(?>(?:(?=\5(\7?(.?)))){1,51}?(?:\8^|(?!.*\8.*\8)))(?!\8^)\5\7(?<=\K.)))(?(6).^)))(?(4).^)))(?(2).^)<br /><br />This only allows up to 6765201 characters to be tested for uniqueness.Davidebyzerohttps://www.blogger.com/profile/06591232839057305169noreply@blogger.comtag:blogger.com,1999:blog-6179667219199982633.post-27167373717868487322019-01-04T07:38:02.869+04:002019-01-04T07:38:02.869+04:00A big flaw in this regex so far is that if it does...A big flaw in this regex so far is that if it doesn't find a unique character, it always loops the maximum number of times (977 in the case of the optimized version above), regardless of input length. This doesn't change its final result, but does make it needlessly slow. Here's a fix for that:<br /><br />^(?>(?:(?=(\1?(.?)))){1,961}?(?:\2^|(?!.*\2.*\2)))(?!\2^)\1(?<=\K.)<br /><br />When constructing the above fix, I found what I'd consider to be yet another bug / design flaw in PCRE:<br /><br />^(?>(?:(?=(\1?(.)?))){1,962}?(?(2)(?!.*\2.*\2)))(?(2)\1(?<=\K.)|$.)<br /><br />The above should work, but PCRE refuses to set \2 into a non-participating capture group if it participated in the previous iteration of the loop (so it just keeps its value from the previous iteration). I don't see any upside of it being designed this way, so if it was a deliberate choice, it's a very bad one. Here's a simpler test case showing this flaw:<br /><br />^(?:(z)?(?(1)aa|a)){2}<br /><br />Give this the input "zaaaa" and by all rights, it should match "zaaa", but instead, it matches "zaaaa".Davidebyzerohttps://www.blogger.com/profile/06591232839057305169noreply@blogger.comtag:blogger.com,1999:blog-6179667219199982633.post-36317150287958994792019-01-04T05:49:42.556+04:002019-01-04T05:49:42.556+04:00Very interesting. I previously thought that this w...Very interesting. I previously thought that this was impossible in PCRE due to the lack of either variable-length lookbehind or non-atomic lookahead. It didn't occur to me to intentionally limit the maximum length of strings that can be tested. Incidentally, I'm now even more horrified by PCRE's design that I was before.<br /><br />Note that the maximum position of the located unique character can be increased from 850 to 977 with this optimization:<br /><br />^(?:(?=(\1?(.)))){1,977}?(?!.*\2.*\2)\1(?<=\K.)<br /><br />In <a href="https://github.com/Davidebyzero/RegexMathEngine" rel="nofollow">my regex engine</a>, I implemented molecular lookahead as (?*), which makes this problem solvable with unbounded string length:<br /><br />^(?*(.*?)(.))(?!.*\2.*\2)\1\K\2<br /><br />$ ./regex -x all -o '^(?*(.*?)(.))(?!.*\2.*\2)\1\K\2'<br />p9oS4QjstDSQKz6WVn8EPfLVDZcCdOxrGbHUpz9B3gwhawAFe6u7kmiqedGub4alBxMFiH80jXoyK1gMEWvT5JUYvNYrt2q07l2Zyhk3RO1JCPTIILcm5nfsRA<br />X<br />Davidebyzerohttps://www.blogger.com/profile/06591232839057305169noreply@blogger.com