Thursday, 6 September 2018

How to Match "A B C" where A+B=C: The Beast Tamed

A regex I submitted to Reddit recently climbed to the top of /r/programming and made quite a few heads explode in the process. As delightful as this was, I couldn't help but feel a little guilty for subjecting tens of thousands of people to this disgraceful pile of electronic fecal matter. Absolutely zero effort was put into making it something that even remotely resembled a useful, constructive demonstration. Instead, I lured you all into my own private lemon party and left you shocked, bewildered, and fearing for your lives. And for this I apologize.

At the risk of taking myself too seriously, I went ahead and added comments to the regex as well as tidied it up, re-wrote a couple of parts for greater clarity, and corrected a few glaring oversights. Some of you have levels of curiosity that far outweigh your better judgement, and so I invite you to read on.

First, it may be worth outlining the general method of handling addition when you're restricted to matching from left to right:

  1. First, compare the excess part of A or B with the excess part of C. They will either be equal, or C's will be greater by 1.
  2. Next, iterate through digits in A and match corresponding digits in B with their sums in C. Again, there may be differences of 1 depending upon the rest of the digits in A and B.
These potential differences of 1 ("carrying") are determined by moving through pairs of digits that sum to 9 until a pair is found whose sum exceeds 9.
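To make the carrying rule concrete, here is a minimal Python sketch of the same left-to-right idea. It is not a translation of the regex: the function names are made up for illustration, and it zero-pads A and B to a common width rather than handling the "excess" digits separately as the regex does. The carry into each position is decided the way the paragraph above describes, by skipping pairs that sum to 9 until a pair decides the matter.

```python
def carries_into(a, b, i):
    """True if adding a[i:] and b[i:] (equal-length digit strings)
    produces a carry out of position i, scanning left to right."""
    for x, y in zip(a[i:], b[i:]):
        pair_sum = int(x) + int(y)
        if pair_sum > 9:
            return True       # found a pair whose sum exceeds 9
        if pair_sum < 9:
            return False      # a carry cannot propagate past this pair
        # pair_sum == 9: undecided, keep looking to the right
    return False


def is_valid_sum(line):
    """Check a line of the form "A B C" for A + B = C, left to right."""
    a, b, c = line.split(" ")
    width = max(len(a), len(b))
    # Assumption of this sketch: right-align by zero-padding.
    a, b = a.zfill(width), b.zfill(width)
    c = c.lstrip("0").zfill(width + 1)
    if len(c) != width + 1:
        return False
    # The leading digit of C is just the carry out of the whole sum.
    if c[0] != ("1" if carries_into(a, b, 0) else "0"):
        return False
    for i in range(width):
        carry = 1 if carries_into(a, b, i + 1) else 0
        if int(c[i + 1]) != (int(a[i]) + int(b[i]) + carry) % 10:
            return False
    return True


print(is_valid_sum("12345 678 13023"))
print(is_valid_sum("92 8 110"))
```

The point of the sketch is only that each digit of C can be verified without ever looking leftwards, which is what makes the problem tractable for a regex at all.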

# I wrapped the entire expression in (?!(?! )) just to do away with all
# captured substrings and cleanly match a verified line.


# Here we essentially right-align A, B, and C, ignoring leading zeros, 
# and populate backreferences with components that will be useful later.
#   \1    \2             \3    \4               \5    \6
(?=(\d*?)((?:(?=\d+\ 0*+(\d*?)(\d(?(4)\4))\ 0*+(\d*?)(\d(?(6)\6))$)\d)++)\ )
# Taking "12345 678 13023" as an example:
# \1 = "12", ie. the extra digits in A if A is longer than B. Empty otherwise.
# \2 = "345", ie. the rest of the digits in A that match up with those in B and C.
# \3 = "", ie. the extra digits in B if B is longer than A. Empty otherwise.
# \4 = "678", ie. the rest of the digits in B that match up with those in A.
# \5 = "13", ie. the extra digits in C that match up with the longer of A and B.
# \6 = "023", ie. the rest of the digits in C.

# This next part checks the extra digit portions to make sure everything is in order.
# There are two main paths to take:
# Easy:  Adding \2 to \4 results in no "carrying"; the length stays the same.
#        \5 should then exactly equal either \1 or \3, whichever was longer.
#        An example of this is when matching "5123 456 5579", since 123+456=579.
#        Then \5 = \1 = "5".
# OR
# Hard:  Adding \2 to \4 results in "carrying"; the length increases by 1. In this case,
#        \5 should equal 1 more than either \1 or \3 (which is non-empty).
#        This is the case we need to handle for our example of "12345 678 13023".
#        Here, \5 = "13" and \1 = "12", and so we need to verify \5 = \1 + 1. 

 # First thing to check is whether \2 + \4 results in carrying.
 # To do this, we must inspect \2 and \4 from the left and match
 # optional pairs of digits that sum to 9 until we find a pair that
 # sum to > 9. 
 # In our example, "345" and "678", we find that '3' and '6' sum to 9,
 # then '4' and '7' sum to > 9. Therefore we have carrying.
  # Consume the extra digits in A; they're not important here.
  # Move through all pairs of digits that sum to 9.
   # Collect the next digit of interest in B.
   (?=\d+\ 0*+\3((\g{-2}?+)\d))
   # This lookahead is used to set up a backreference that goes from one digit
   # of interest to the next, in the interests of simplifying the long check ahead.
   (?=\d(\d*\ 0*+\3\g{-2}))
   # Now to use that backreference to match pairs of digits that sum to 9.   
   # Consume a digit so we can move forward.
  # Now that we've gone through all pairs that sum to 9, let's try to match one
  # that sums to > 9.
  # First set up our backreference of convenience.
  (?=\d(\d*\ 0*+\3\g{-3}?+)) 
  # Now find a pair that sums to > 9. 
 # The above was a negative lookahead, so if it matched successfully then there is no
 # carrying and it's smooth sailing ahead.
 # Since either \1 or \3 (or both) is empty, the concatenation of the two will produce
 # what we need to match at the front of C. Then, \6 is the rest of C. 
 (?=\d+\ \d+\ 0*+\1\3\6$)
 # Carrying. This is where it gets complicated.

 # First let's move forward to the extra digits of interest.
 # ".*+" matches up to the end of the line with no backtracking. The only way
 # \3 can be found at that position is if \3 = "".
 # So if the negative lookahead succeeds, \3 isn't empty and B contains the
 # extra digits of interest, so we consume A and a space in that case. 
 (?(?!.*+\3)\d+\ )
 # More declarations for convenience.
 #       \11    \12
 (?=\d*?(\2|\4)(\ .*?0*+)\d+$)
 # \11 = the rest of the digits in A or B, \2 or \4, depending on where we're at.
 #  This anchor is important so we know where to stop matching the extra digits.
 # \12 = The part between the end of A/B and the beginning of C.
 # Another decision tree. Are the extra digits of interest composed solely of '9's,
 # such as in the example "999123 878 1000001"?
 # If so, the strategy is somewhat simplified.
 # This also handles zero '9's, when A and B are of equal length. 
 (?(?=9*\11\ )
  # If the extra digits of interest are composed solely of '9's, all we need
  # to do is pair '9's in A/B with '0's in C, and match a '1' at the start of C.

  # So, start pairing '9's and '0's.
  # Stop when we exhaust the extra digits of interest.
  (?=\11\ )
  # Now verify C actually starts with a '1', then match the '0's we've collected,
  # and also make sure all that follows is \6 (the rest of C).
  # Now the trickier path. We need to add 1 to extra digits in A/B and match it to C.
  # Because we know these extra digits are not composed solely of '9's, we know the
  # extra digits in C will be the same length.
  # How do you check if a number is 1 more than another given they're equal length?
  # First, iterate through the digits and match pairs of equivalent digits.
  # When you reach a position where they differ, it must be the case that C's
  # digit is 1 greater than A/B's. After this point, you need to pair '9's in A/B
  # with '0's in C until you exhaust the extra digits of interest. 
  # To see why this last part is necessary, consider the example "4129990 10 4130000".
  # When we compare "41299" to "41300", we first match '4' in A to '4' in C, then '1'
  # in A to '1' in C. Then we find the point where the next digit in C is 1 greater 
  # than the next one in A, and pair '2' in A with '3' in C. There can only be exactly
  # one such point like this. Afterwards, the only thing that could possibly follow is
  # a series of '9's in A and '0's in C until we exhaust the extra digits of interest. 
  # The first part, consume all equivalent digits.
  # Now we prepare for the next check by setting up a backreference that contains
  # everything in between the two digits of interest, for simplicity.

  # Match pairs that differ by 1 in favour of C.
  # Now consume any and all additional '9's, pairing them with '0's in C.
  # Stop when we exhaust the extra digits of interest.
  (?=\11\ )

  # Now verify C by checking it contains all extra digits shared with A/B, followed by
  # the lone digit that was found to be 1 greater than the corresponding one in A/B,
  # then any '0's that followed, and finally \6, the rest of its digits.

# At this point, we've managed to successfully verify the extra digits in A, B, and C.
# We have examined the "12" and "13" in our example of "12345 678 13023" and found them 
# to be sound. We would have rejected examples such as "11 1 22" and "92 8 110" by now,
# since their extra digits don't compute.
# The rest of the logic examines the equi-length portions of A and B (saved as \2 and \4
# respectively). This is actually simpler since we don't have to fuss around with things
# being different lengths; we took care of all that earlier.
# At this point, we can simply match pairs of digits in A and B to their sum in C.
# There are, however, still some considerations to be made as to carrying. We're iterating
# through digits from left to right, after all, and the sum of every pair of digits we
# sum in A and B may be found in C as either A+B(mod 10) or 1+A+B(mod 10) depending on
# whether carrying occurs to the right.

# Consume any extra digits in A; they're no longer important.

# Iterate through A, B, and C one digit at a time, from the left.
 # Here we set up backreferences to useful portions of the subject, ignoring any
 # leading '0's, and also ignoring those extra digits from before, \3 and \5.
 #   18           19  20                   21  22
 (?=(\d)\d*\ 0*+\3((\g{-2}?+)\d)\d*\ 0*+\5((\g{-2}?+)\d)) 
 # These values update as we iterate through A, but on the first run,
 # using our example of "12345 678 13023":
 # \18 = "3", ie. the next digit to inspect in A.
 # \19 = "6", ie. what we've examined in B including the next digit.
 # \20 = "", ie. what we've examined in B excluding the next digit.
 # \21 = "0", ie. what we've examined in C including the next digit.
 # \22 = "", ie. what we've examined in C excluding the next digit.
 # Like before, we must proceed in one of two directions based on whether or not we
 # encounter carrying. 
 # Similar to the first part, in order to determine this, we need to look at the parts
 # of A and B that follow our current digits of interest. And, as before, we sift through
 # any pairs of digits that total 9 until we find a pair whose sum is > 9.  
   # Consume the current digit of interest in A.
   # Then start matching pairs of digits in A and B whose sum is 9.
    # Use nested references to remember how far into B we are.
    (?=\d+\ 0*+\3\19((\g{-2}?+)\d))
    # Set up a backreference for our simple pair matching.
    (?=\d(\d*\ 0*+\3\19\g{-2}))
    # Match pairs of digits that sum to 9.
    # Consume a digit to move forward.
   # All that's left is to check if the next pair of digits in A and B has
   # a sum exceeding 9. Set up our convenient back reference and check.
   (?=\d(\d*\ 0*+\3\19\g{-3}?+)) 
   # Now test for a combination of digits whose sum is > 9.
 # The above negative lookahead succeeded, so fortunately we don't have to contend
 # with carrying in the first branch. We need to match pairs of digits in A and B
 # and their sum in C. I don't think there's a more clever way to do this in PCRE
 # than tabulating all the combinations.
  # First set up convenient backreferences.
  (?=\d(\d*\ 0*+\3\20)\d(\d*\ 0*+\5\22))
  # Now the ugly part.  
  |0\g{-2}(\d)\g{-2}\g{-1} # At least we can handle zeros
  |(\d)\g{-4}0\g{-3}\g{-1} # with a bit of intelligence.
 # And in the else branch, we have to deal with carrying. So we'll match pairs of
 # digits like up there, but this time we'll match pairs of digits in A and B with
 # their sum + 1 (mod 10) in C.

  # Convenient backreferences.
  (?=\d(\d*\ 0*+\3\20)\d(\d*\ 0*+\5\22))  
  # Almost done, let's just get through this last bit of ugliness.
 # Whew. It's over. Consume a digit so we can move forward and restart the fun.


# At the end of all this, if we can match a space then we have succeeded in matching all
# of A, and hence all of B and C. 


# I tripped up on these edge cases involving zeros the first time I made this.
^0+\ 0*(\d+)\ 0*\g{-1}$


# I'm not going to make that mistake again!
^0*(\d+)\ 0+\ 0*\g{-1}$


Demo on regex101 (comments removed)

See, it's not so scary after all!
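For anyone who got lost in the "Hard" branch, the check on the extra digits boils down to "is this digit string exactly one more than that one?". Below is a rough Python sketch of the procedure described in the comments (all '9's, or equal prefix, then one digit that is 1 greater, then '9's paired with '0's). The function name is hypothetical and the code only mirrors the comments; it is not extracted from the regex itself.

```python
def is_successor(x, y):
    """True if digit string y equals x + 1, examined left to right.

    x plays the role of the extra digits in A or B, y the extra digits in C.
    """
    # All '9's (including no digits at all): y must be '1' followed by '0's.
    if all(d == "9" for d in x):
        return y == "1" + "0" * len(x)
    # Otherwise the lengths must agree.
    if len(x) != len(y):
        return False
    # Consume all equivalent leading digits.
    i = 0
    while i < len(x) and x[i] == y[i]:
        i += 1
    # Exactly one position where y's digit is 1 greater than x's.
    if i == len(x) or int(y[i]) != int(x[i]) + 1:
        return False
    # Everything after that point must be '9's in x paired with '0's in y.
    return all(p == "9" and q == "0" for p, q in zip(x[i + 1:], y[i + 1:]))


print(is_successor("41299", "41300"))
print(is_successor("11", "22"))
```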

In case anyone is still wondering in earnest why I did this, the simple answer is: it was fun! Once I saw this challenge on StackOverflow's codegolf section and realized it may be possible, I just couldn't stop thinking about it until I had a working solution.

If anyone out there shares my zany passion for things of this nature, I invite you to subscribe to this blog and/or follow me on Twitter. I do have plenty more madness lined up to share with the world.

Also, please know that I do actually spend time creating and helping others create regular expressions that are useful and serve a practical purpose. You are welcome to fire up your favourite IRC client and pop by Freenode's #regex if you ever need any advice or just want to shoot the shit. We've got a great team there who are always happy to help you conquer your regex woes.

Thanks for reading.

Tuesday, 7 November 2017

Match Nested Brackets with Regex: A new approach

My first blog post was a bit of a snoozefest, so I feel I ought to make this one a little shorter and more to the point. I'm going to show you how to do something with regular expressions that's long been thought impossible. Get excited.

The problem

You want to match a full outer group of arbitrarily nested parentheses with regex but you're using a flavour such as Java's java.util.regex that supports neither recursion nor balancing groups.

The solution


Proof: Java Regex or PCRE on regex101 (look at the full matches on the right)

Et voila; there you go. That right there matches a full group of nested parentheses from start to end. Two substrings per match are necessarily captured and saved; these are useless to you. Just focus on the results of the main match.

No, there is no limit on depth. No, there are no recursive constructs hidden in there. Just plain ol' lookarounds, with a splash of forward referencing. If your flavour does not support forward references (I'm looking at you, JavaScript), then I'm sorry. I really am. I wish I could help you, but I'm not a freakin' miracle worker.

That's great and all, but I want to match inner groups too!

OK, here's the deal. The reason we were able to match those outer groups is that they are non-overlapping. As soon as the matches we desire begin to overlap, we must tweak our strategy somewhat. We can still inspect the subject for correctly-balanced groups of parentheses. However, instead of outright matching them, we need to save them with a capturing group like so:


Exactly the same as the previous expression, except I've wrapped the bulk of it in a lookahead to avoid consuming characters, added a capturing group, and tweaked the backreference indices so they play nice with their new friend. Now the expression matches at the position just before the next parenthetical group, and the substring of interest is saved as \1.

So... how the hell does this actually work?

I'm glad you asked. The general method is quite simple: iterate through characters one at a time while simultaneously matching the next occurrences of '(' and ')', capturing the rest of the string in each case so as to establish positions from which to resume searching in the next iteration. Let me break it down piece by piece:


Make sure '(' follows before doing any hard work.

Start of group used to iterate through the string, so the following lookaheads match repeatedly.
Handle '('
This lookahead deals with finding the next '('.
Match up until the next '(' that is not followed by \1. Below, you'll see that \1 is filled with the entire part of the string following the last '(' matched. So "(?!.*?\1)" ensures we don't match the same '(' again.
Fill \1 with the rest of the string. At the same time, check that there is at least another occurrence of ')'. This is a PCRE band-aid used to overcome this bug.

Handle ')'
This lookahead deals with finding the next ')'.
Match up until the next ')' that is not followed by \2. Like the earlier '(' match, this forces matching of a ')' that hasn't been matched before.
Fill \2 with the rest of the string. The above-mentioned bug is not applicable here, so this simple expression is sufficient.

Consume a single character so that the group can continue matching. It is safe to consume a character here because neither occurrence of the next '(' or ')' could possibly exist before the new matching point.

Match as few times as possible until a balanced group has been found. This is validated by the following check.
Final validation
Match up to and including the last '(' found.
Then match up until the position where the last ')' was found, making sure we don't encounter another '(' along the way (which would imply an unbalanced group).
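If the breakdown above is easier to follow as a loop, here is a small Python sketch of the same idea. It is an illustration under my own naming (match_outer_group is hypothetical), not a port of the expression: on each iteration it finds the next '(' and the next ')' that have not been used yet, and it stops as soon as no further '(' sits between the last '(' found and the last ')' found.

```python
def match_outer_group(subject, start=0):
    """Return the balanced group beginning at subject[start], or None."""
    # Make sure '(' follows before doing any hard work.
    if start >= len(subject) or subject[start] != "(":
        return None
    open_pos = start - 1    # index of the last '(' found so far
    close_pos = start - 1   # index of the last ')' found so far
    while True:
        # Find the next '(' and ')' that have not been matched before.
        open_pos = subject.find("(", open_pos + 1)
        close_pos = subject.find(")", close_pos + 1)
        if open_pos == -1 or close_pos == -1:
            return None
        # Final validation: no other '(' between the last '(' and last ')'.
        if open_pos < close_pos and "(" not in subject[open_pos + 1:close_pos]:
            return subject[start:close_pos + 1]


print(match_outer_group("x(a(b)c)y", 1))
```

Calling the function at the index of every '(' in the subject would give the overlapping inner groups as well, which is roughly what the capturing variant does from inside a lookahead.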


So, there you have it. A way to match balanced nested structures using forward references coupled with standard (extended) regex features - no recursion or balancing groups. It's not efficient, and it certainly isn't pretty, but it is possible. And it's never been done before. That, to me, is quite exciting.

If you share my excitement for things of this nature then I encourage you to follow this blog, as I have a few more pearls of regex wisdom to offer in good time. Also in the cards is a regex quiz adventure that will test your skills in a variety of interesting and challenging ways. So please do stay tuned.

I'd like to thank Me-me on Freenode for inspiring this discovery with a clever attempt at one of my #regex challenges. Thank you for being man enough to attempt them!

Bonus: Optimization!

Just for fun, I managed to improve the performance of this technique from "disgustingly, horrendously awful" to just "awful". Using the example in the proofs above:

Original (16,257 steps):


Optimized (4,205 steps):


And just so you can get an idea of how poor this method truly is, here's how the conventional recursive method measures up:

Recursive (445 steps):


A whole order of magnitude more efficient.

Sunday, 5 November 2017

Lookahead Quantification: An utterly loopy trick

Sufficiently adventurous readers may have come across the technique of forward or nested references to do some tricky things with regex, such as non-recursively match a palindrome, match a^n b^n, match simple recursive acronyms, and even match an isomorphic pair of words. The technique works well in cases where you can consume one character at a time while simultaneously performing tests that call for simple examination of only the remainder of the subject at the next matching point. But what of problems that require greater visibility of the subject at each pivotal position? We don't have variable-length lookbehinds in PCRE; how, then, are we to perform feats of absurdity? I'll cut to the chase in a second, but first allow me to put to you a challenge.

The Challenge

This is an example of a regex task that is simply stated, appears easy at first glance, but is impossible to solve in the general case given PCRE's current functionality:
"Match a character that appears only once in the subject."
That's it. Just find and match a single character, and it can be any character as long as it makes but one appearance.

Matched case-sensitively, this sentence contains 12 such characters.

And this one contains 8.

How hard could it be to match any one of them, right? I'll give you a few minutes to shake your head, scoff, minimize this article, and attempt this challenge in your favourite regex environment. 

Welcome back. Were you close? Did you try to test for uniqueness using a negative lookahead, only to realize that everything to the left of the current matching point is entirely inaccessible? Or did you instead run into shortcomings while trying to recursively match sets of repeated characters? I feel your pain. No matter what trick one pulls out of the bag, the features of PCRE seem to fall just short of providing the means to solve this deceptively difficult problem.

The (Partial) Answer

And so, I would like to introduce the method of lookahead quantification: a method that paves the way for partial solutions to problems of this type. This is not a method you can assume will chime with your every requirement; it is a method that, should the stars align in your favour, you would be able to merely get away with. It relies on quantifying (a limited number of times) a group consisting solely of a positive lookahead assertion, containing accumulating nested references, in order to iterate through the subject without altering the matching point. An example to partially solve the problem at hand:
See it in action at:

I say "partially" because it's clearly limited by the magic number "850", the reason for which I'll explain in a second. But for now let's go through the full expression. A description of what's happening can be summarized as follows:

"Examine one character at a time, without ever advancing the matching point from the start of the subject, checking to see if the current character doesn't occur twice (or more)."
^(?:            # Anchor to start. All checks must take place here for full visibility.
 (?=(\1.|)(.))  # Look ahead and either add a single character to \1, or initialize it to "". Then \2 becomes the next character to be tested.
){1,850}?       # The lazy '?', in essence, forces the engine to start checking characters for uniqueness from the beginning of the string. It therefore iterates as few times as possible until a character satisfying the remainder of the expression is found.
(?!.*\2.*\2)    # Test the current value of \2 for uniqueness in the whole subject.
\1\K\2          # If found, match up to \2 by first consuming \1, resetting the match with \K, and then matching \2.
First of all: why is the group necessary, ie. why do we not just quantify the lookahead? Because, dear reader, we are not allowed. Or, rather, no matter how we attempt to quantify a lookahead, it can only match at most once. This is understandable. In general, it is not expected that a zero-width assertion needs to match more than once, because it consumes no characters. Even though the case above is a perfectly sensible scenario for a hypothetical quantified lookahead, we must accept that such tactics are off-limits to us and proceed with resourcefulness by enclosing it in a non-capturing group.

After the first iteration of the group, \1 = "" and \2 = <the first character in the subject>. Because the group has been quantified with '?' to invoke laziness, the engine will relax at this point and move on to matching the rest of the expression. This is important. If you remove the '?' and make the match greedy, the engine will go through as many of the 850 iterations as possible to the point where \2 ends up being either the 850th character or the last character in the subject, whichever comes first. Only then will it move on to the rest of the expression and test \2 for uniqueness. If unsuccessful, the engine will not backtrack and try again with \2 as the previous character; because the group matched an empty string, it simply aborts. This is why the lazy quantifier is crucial, not to mention a more natural analogue of the conventional 'for' loops that are typically used to iterate through strings.
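Since the lazy quantifier is being compared to a 'for' loop, here is what that loop might look like in Python. This is only a sketch of the behaviour described above: the function name and the max_iterations parameter are my own, the returned pair stands in for \1 and \2, and it ignores details such as '.' not matching newlines.

```python
def first_unique_char(subject, max_iterations=850):
    """Return (prefix, char) for the first character occurring exactly once,
    looking at no more than max_iterations characters, or None."""
    for i in range(min(max_iterations, len(subject))):
        prefix, candidate = subject[:i], subject[i]   # roughly \1 and \2
        # Test the candidate for uniqueness in the whole subject.
        if subject.count(candidate) == 1:
            return prefix, candidate
    return None


print(first_unique_char("aabbc"))
```

The cap on the loop plays the part of the magic number: characters beyond it are never examined, which is why the regex is only a partial solution.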

So, where does "850" come from, you ask? Well, you may verify that if you increase it - say, to "851" - regex101 will throw a compile error at you: "regular expression is too large". This is because the PCRE compiler handles range quantification by making copies of the quantified element. With a sufficiently large range, the total length of the copies will exceed the hard limit imposed by the compiler on the overall length of an expression (which, by default, is 64 kilobytes). It is possible to predict this magic number by calculation once you deduce the memory consumption of each of the subexpression's individual components, but I won't get into that in this article. If this technique interests you and you do use it for some practical purpose (intelligently, while heeding all of my implicit warnings), I recommend you try to discover these limits yourself using a simple binary search, knowing the upper limit is somewhere on the order of 1000s.
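The binary search itself is nothing special. A minimal sketch in Python follows, assuming you can supply some predicate that reports whether the expression still compiles with a given upper bound. That predicate is entirely hypothetical here (the lambda below is a stand-in, not a real PCRE call), and the sketch assumes the predicate is true up to some threshold and false beyond it.

```python
def largest_accepted(low, high, accepts):
    """Largest n in [low, high] for which accepts(n) is True.

    Assumes accepts(n) is True up to some threshold and False afterwards.
    Returns None if even accepts(low) is False.
    """
    if not accepts(low):
        return None
    while low < high:
        mid = (low + high + 1) // 2   # round up so the loop always shrinks
        if accepts(mid):
            low = mid
        else:
            high = mid - 1
    return low


# Stand-in predicate for "the pattern with {1,n} still compiles".
print(largest_accepted(1, 10000, lambda n: n <= 850))
```

In practice the predicate would build the pattern with the candidate bound and attempt to compile it in whichever PCRE environment you are targeting, since the limit depends on how that build was configured.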

Can I make it less limited?

Not really. Well, sort of. If you wish to use this method for whatever reason but find that it is too limited for your particular purpose, you may be able to overcome this by applying additional slightly amended expressions to continue match attempts where you left off. For example, if the first expression I gave you only checks the first 850 characters for uniqueness, you can use a second expression to iterate through each character in the next portion of the subject:
Unfortunately, the inner expression has grown a little, so the new magic number is "779". This means that the earlier expression coupled with this one, used in succession where necessary, will allow you to test the first 1,629 characters of a string for uniqueness. If this is still not enough, then you can keep amending the expression and reapplying in a similar fashion to check the next however many characters, changing the magic numbers as appropriate. Or, y'know, you could forget all this nonsense and use your programming language to implement this in a much more efficient and maintainable manner. But... where's the fun in that?

Where else can I use this method?

Now that you've seen and, with any luck, understood the earlier example, you may be wondering what other problems this method may lend itself to solving.

To speak broadly, any regex problem that seems impossible at first thought because it requires iteration and comparison of a variable number of substrings either in front of or, perish the thought, behind the current matching point could probably benefit from application of this method.

Note that you need not limit yourself to iterating over individual characters; any describable series of contiguous substrings is perfectly acceptable. Here is an example that loops through whole words in the subject:

Match the longest word in a string
\b(?!(?:(?=\S+((?(1)\1 \S+|)))){1,760}?(?:(?=\S++\1 (\2?+\S))\S)++\b)(\S+)
This time, it is in our best interests to wrap most of the expression in a negative lookahead. Why? Think about the difference between these two statements: "find a word such that all words ahead of it are no larger" vs. "find a word such that there does not exist a word ahead of it that's larger". Each of these describes a valid way to tackle the problem, and each lends itself to its own solution involving this loopy trick. If we choose to interpret it the first way, the result will be an expression that resembles the following:
\b(?:(?=\S+(\1 \S+|))(?!(?:(?=\S++\1 (\2?+\S))\S)++\b)){1,387}?(\S++)(?=\1$)
This is very similar in operation to the expression above, but the magic number "387" (which, by the way, is now a limit on a number of words not characters) is almost half of the other. By thinking about the problem slightly differently, we can move some of the logic from inside the non-capturing group, thereby increasing its permissible range.
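The two phrasings of the problem can be put side by side in ordinary code. The Python sketch below uses hypothetical function names, splits on whitespace, and treats "larger" as strictly longer, which is an assumption about tie-breaking rather than a claim about what either regex does with equal-length words. The first function searches for a counterexample (the negative-lookahead style); the second insists on checking every word ahead (the style with the smaller magic number).

```python
def longest_word_no_larger_exists(text):
    """First word for which there does not exist a larger word ahead."""
    words = text.split()
    for i, word in enumerate(words):
        if not any(len(later) > len(word) for later in words[i + 1:]):
            return word
    return None


def longest_word_all_no_larger(text):
    """First word for which all words ahead are no larger."""
    words = text.split()
    for i, word in enumerate(words):
        if all(len(later) <= len(word) for later in words[i + 1:]):
            return word
    return None


sentence = "the quick brown foxes jumped over"
print(longest_word_no_larger_exists(sentence))
print(longest_word_all_no_larger(sentence))
```

In a programming language the two are interchangeable; in the regex, the choice decides how much machinery ends up inside the quantified group.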

Note to self: this method of cascading negation is useful and might be worth a blog post at some point in the future.

Match the character that occurs with the greatest frequency
Getting a little crazier now. Once again, it is to our benefit to place the meat of the logic in a negative lookahead. Then for each character \3 following the first character \1, we recursively match character pairs in order to see which occurs with greater frequency. If \3 is more common than \1, then after eating through all the non-\3s, non-\1s, and \1+\3 pairs, we will be left with one or more \3s.
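The pairing idea may be easier to see outside of regex. Here is a rough Python sketch with made-up function names: to compare two characters it throws away everything else, cancels them against each other one pair at a time, and looks at what is left over. It mirrors the description above rather than the expression itself, and how ties are broken is simply whatever falls out of the loop.

```python
def more_common(subject, first, other):
    """True if `other` occurs more often than `first`, by cancelling pairs."""
    # Eat through everything that is neither character.
    remaining = [ch for ch in subject if ch in (first, other)]
    # Cancel one `first` against one `other` for as long as both are present.
    while first in remaining and other in remaining:
        remaining.remove(first)
        remaining.remove(other)
    # Any leftover `other` means it was the more frequent of the two.
    return other in remaining


def most_frequent_char(subject):
    """First character that no character following it occurs more often than."""
    for i, candidate in enumerate(subject):
        rivals = set(subject[i + 1:]) - {candidate}
        if not any(more_common(subject, candidate, rival) for rival in rivals):
            return candidate
    return None


print(most_frequent_char("abracadabra"))
```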

There is a lot happening in this expression, and I leave it as an exercise to you to determine what. Mainly because I am too lazy to write anymore.

In Closing

I hope you've not fallen asleep reading the above article, and can now leave this page a little smarter than when you came in. This is my first post, so I'd love to hear any suggestions on how I can make this kind of material more interesting or otherwise improve its presentation.

Thank you for reading.