## Thursday, 6 September 2018

### How to Match "A B C" where A+B=C: The Beast Tamed

A regex I submitted to Reddit recently climbed to the top of /r/programming and made quite a few heads explode in the process. As delightful as this was, I couldn't help but feel a little guilty for subjecting tens of thousands of people to this disgraceful pile of electronic fecal matter. Absolutely zero effort was put into making it something that even remotely resembled a useful, constructive demonstration. Instead, I lured you all into my own private lemon party and left you shocked, bewildered, and fearing for your lives. And for this I apologize.

At the risk of taking myself too seriously, I went ahead and added comments to the regex as well as tidied it up, re-wrote a couple of parts for greater clarity, and corrected a few glaring oversights. Some of you have levels of curiosity that far outweigh your better judgement, and so I invite you to read on.

First, it may be worth outlining the general method of handling addition when you're restricted to matching from left to right:

1. First, compare the excess part of A or B with the excess part of C. They will either be equal, or C's will be greater by 1.
2. Next, iterate through digits in A and match corresponding digits in B with their sums in C. Again, there may be differences of 1 depending upon the rest of the digits in A and B.
These potential differences of 1 ("carrying") are determined by moving through pairs of digits that sum to 9 until a pair is found whose sum exceeds 9.

# I wrapped the entire expression in (?!(?! )) just to do away with all
# captured substrings and cleanly match a verified line.
(?!(?!

^0*+

# Here we essentially right-align A, B, and C, ignoring leading zeros,
# and populate backreferences with components that will be useful later.
#
#   \1    \2             \3    \4               \5    \6
(?=(\d*?)((?:(?=\d+\ 0*+(\d*?)(\d(?(4)\4))\ 0*+(\d*?)(\d(?(6)\6))$)\d)++)\ ) # # Taking "12345 678 13023" as an example: # # \1 = "12", ie. the extra digits in A if A is longer than B. Empty otherwise. # \2 = "345", ie. the rest of the digits in A that match up with those in B and C. # \3 = "", ie. the extra digits in B if B is longer than A. Empty otherwise. # \4 = "678", ie. the rest of the digits in B that match up with those in A. # \5 = "13", ie. the extra digits in C that match up with the longer of A and B. # \6 = "023", ie. the rest of the digits in C. # This next part checks the extra digit portions to make sure everything is in order. # # There are two main paths to take: # Easy: Adding \2 to \4 results in no "carrying"; the length stays the same. # \5 should then exactly equal either \1 or \3, whichever was longer. # An example of this is when matching "5123 456 5579", since 123+456=579. # Then \5 = \1 = "5". # OR # Hard: Adding \2 to \4 results in "carrying"; the length increases by 1. In this case, # \5 should equal 1 more than either \1 or \3 (which is non-empty). # This is the case we need to handle for our example of "12345 678 13023". # Here, \5 = "13" and \1 = "12", and so we need to verify \5 = \1 + 1. (?=(?(?! # First thing to check is whether \2 + \4 results in carrying. # To do this, we must inspect \2 and \4 from the left and match # optional pairs of digits that sum to 9 until we find a pair that # sum to > 9. # # In our example, "345" and "678", we find that '3' and '6' sum to 9, # then '4' and '7' sum to > 9. Therefore we have carrying. # Consume the extra digits in A; they're not important here. \1 # Move through all pairs of digits that sum to 9. (?: # Collect the next digit of interest in B. (?=\d+\ 0*+\3((\g{-2}?+)\d)) # This lookahead is used to set up a backreference that goes from one digit # of interest to the next, in the interests of simplifying the long check ahead. (?=\d(\d*\ 0*+\3\g{-2})) # Now to use that backreference to match pairs of digits that sum to 9. (?=0\g{-1}9|1\g{-1}8|2\g{-1}7|3\g{-1}6|4\g{-1}5 |5\g{-1}4|6\g{-1}3|7\g{-1}2|8\g{-1}1|9\g{-1}0) # Consume a digit so we can move forward. \d )*+ # Now that we've gone through all pairs that sum to 9, let's try to match one # that sums to > 9. # First set up our backreference of convenience. (?=\d(\d*\ 0*+\3\g{-3}?+)) # Now find a pair that sums to > 9. (?=[5-9]\g{-1}[5-9]|1\g{-1}9|2\g{-1}|3\g{-1}[7-9] |4\g{-1}[6-9]|6\g{-1}4|7\g{-1}|8\g{-1}[2-4]|9\g{-1}[1-4]) ) # The above was a negative lookahead, so if it matched successfully then there is no # carrying and it's smooth sailing ahead. # Since either \1 or \3 (or both) is empty, the concatenation of the two will produce # what we need to match at the front of C. Then, \6 is the rest of C. (?=\d+\ \d+\ 0*+\1\3\6$)

|

# Carrying. This is where it gets complicated.

# First let's move forward to the extra digits of interest.
# ".*+" matches up to the end of the line with no backtracking. The only way
# \3 can be found at that position is if \3 = "".
# So if the negative lookahead succeeds, \3 isn't empty and B contains the
# extra digits of interest, so we consume A and a space in that case.
(?(?!.*+\3)\d+\ )

# More declarations for convenience.
#       \11    \12
(?=\d*?(\2|\4)(\ .*?0*+)\d+$) # \11 = the rest of the digits in A or B, \2 or \4, depending on where we're at. # This anchor is important so we know where to stop matching the extra digits. # \12 = The part between the end of A/B and the beginning of C. # Another decision tree. Are the extra digits of interest composed solely of '9's, # such as in the example "999123 878 1000001"? # If so, the strategy is somewhat simplified. # This also handles zero '9's, when A and B are of equal length. (?(?=9*\11\ ) # If the extra digits of interest are composed solely of '9's, all we need # to do is pair '9's in A/B with '0's in C, and match a '1' at the start of C. # So, start pairing '9's and '0's. (?:9(?=\d*?\12(\g{-1}?+0)))*? # Stop when we exhaust the extra digits of interest. (?=\11\ ) # Now verify C actually starts with a '1', then match the '0's we've collected, # and also make sure all that follows is \6 (the rest of C). \11\12\g{-1}?+\6$

|

# Now the trickier path. We need to add 1 to extra digits in A/B and match it to C.
# Because we know these extra digits are not composed solely of '9's, we know the
# extra digits in C will be the same length.
#
# How do you check if a number is 1 more than another given they're equal length?
# First, iterate through the digits and match pairs of equivalent digits.
# When you reach a position where they differ, it must be the case that C's
# digit is 1 greater than A/B's. After this point, you need to pair '9's in A/B
# with '0's in C until you exhaust the extra digits of interest.
#
# To see why this last part is necessary, consider the example "4129990 10 4130000".
# When we compare "41299" to "41300", we first match '4' in A to '4' in C, then '1'
# in A to '1' in C. Then we find the point where the next digit in C is 1 greater
# than the next one in A, and pair '2' in A with '3' in C. There can only be exactly
# one such point like this. Afterwards, the only thing that could possibly follow is
# a series of '9's in A and '0's in C until we exhaust the extra digits of interest.

# The first part, consume all equivalent digits.
(?:(\d)(?=\d*\12(\g{-1}?+\g{-2})))*+

# Now we prepare for the next check by setting up a backreference that contains
# everything in between the two digits of interest, for simplicity.
(?=\d(\d*\12\g{-2}?+))

# Match pairs that differ by 1 in favour of C.
(?=0\g{-1}1|1\g{-1}2|2\g{-1}3|3\g{-1}4|4\g{-1}5|5\g{-1}6|6\g{-1}7|7\g{-1}8|8\g{-1}9)\d

# Now consume any and all additional '9's, pairing them with '0's in C.
(?:9(?=\d*\12\g{-2}?+\d(\g{-1}?+0)))*?

# Stop when we exhaust the extra digits of interest.
(?=\11\ )

# Now verify C by checking it contains all extra digits shared with A/B, followed by
# the lone digit that was found to be 1 greater than the corresponding one in A/B,
# then any '0's that followed, and finally \6, the rest of its digits.
\11\12\g{-3}?+\d\g{-1}?+\6$) )) # At this point, we've managed to successfully verify the extra digits in A, B, and C. # We have examined the "12" and "13" in our example of "12345 678 13023" and found them # to be sound. We would have rejected examples such as "11 1 22" and "92 8 110" by now, # since their extra digits don't compute. # # The rest of the logic examines the equi-length portions of A and B (saved as \2 and \4 # respectively). This is actually simpler since we don't have to fuss around with things # being different lengths; we took care of all that earlier. # # At this point, we can simply match pairs of digits in A and B to their sum in C. # There is, however, still some considerations to be made as to carrying. We're iterating # through digits from left to right, after all, and the sum of every pair of digits we # sum in A and B may be found in C as either A+B(mod 10) or 1+A+B(mod 10) depending on # whether carrying occurs to the right. # Consume any extra digits in A; they're no longer important. \1 # Iterate through A, B, and C one digit at a time, from the left. (?: # Here we set up backreferences to useful portions of the subject, ignoring any # leading '0's, and also ignoring those extra digits from before, \3 and \5. # # 18 19 20 21 22 (?=(\d)\d*\ 0*+\3((\g{-2}?+)\d)\d*\ 0*+\5((\g{-2}?+)\d)) # # These values update as we iterate through A, but on the first run, # using our example of "12345 678 13023": # # \18 = "3", ie. the next digit to inspect in A. # \19 = "6", ie. what we've examined in B including the next digit. # \20 = "", ie. what we've examined in B excluding the next digit. # \21 = "0", ie. what we've examined in C including the next digit. # \22 = "", ie. what we've examined in C excluding the next digit. # Like before, we must proceed in one of two directions based on whether or not we # encounter carrying. # # Similar to the first part, in order to determine this, we need to look at the parts # of A and B that follow our current digits of interest. And, as before, we sift through # any pairs of digits that total 9 until we find a pair whose sum is > 9. (?(?! # Consume the current digit of interest in A. \18 # Then start matching pairs of digits in A and B whose sum is 9. (?: # Use nested references to remember how far into B we are. (?=\d+\ 0*+\3\19((\g{-2}?+)\d)) # Set up a backreference for our simple pair matching. (?=\d(\d*\ 0*+\3\19\g{-2})) # Match pairs of digits that sum to 9. (?=0\g{-1}9|1\g{-1}8|2\g{-1}7|3\g{-1}6|4\g{-1}5 |5\g{-1}4|6\g{-1}3|7\g{-1}2|8\g{-1}1|9\g{-1}0) # Consume a digit to move forward. \d )*+ # All that's left is to check if the next pair of digits in A and B has # a sum exceeding 9. Set up our convenient back reference and check. (?=\d(\d*\ 0*+\3\19\g{-3}?+)) # Now test for a combination of digits whose sum is > 9. (?=[5-9]\g{-1}[5-9]|1\g{-1}9|2\g{-1}|3\g{-1}[7-9]|4\g{-1}[6-9] |6\g{-1}4|7\g{-1}|8\g{-1}[2-4]|9\g{-1}[1-4]) ) # The above negative lookahead succeeded, so fortunately we don't have to contend # with carrying in the first branch. We need to match pairs of digits in A and B # and their sum in C. I don't think there's a more clever way to do this in PCRE # than tabulating all the combinations. # First set up convenient backreferences. (?=\d(\d*\ 0*+\3\20)\d(\d*\ 0*+\5\22)) # Now the ugly part. (?= 1\g{-2}(?:1\g{-1}2|2\g{-1}3|3\g{-1}4|4\g{-1}5|5\g{-1}6|6\g{-1}7|7\g{-1}8|8\g{-1}9|9\g{-1}0) |2\g{-2}(?:1\g{-1}3|2\g{-1}4|3\g{-1}5|4\g{-1}6|5\g{-1}7|6\g{-1}8|7\g{-1}9|8\g{-1}0|9\g{-1}1) |3\g{-2}(?:1\g{-1}4|2\g{-1}5|3\g{-1}6|4\g{-1}7|5\g{-1}8|6\g{-1}9|7\g{-1}0|8\g{-1}1|9\g{-1}2) |4\g{-2}(?:1\g{-1}5|2\g{-1}6|3\g{-1}7|4\g{-1}8|5\g{-1}9|6\g{-1}0|7\g{-1}1|8\g{-1}2|9\g{-1}3) |5\g{-2}(?:1\g{-1}6|2\g{-1}7|3\g{-1}8|4\g{-1}9|5\g{-1}0|6\g{-1}1|7\g{-1}2|8\g{-1}3|9\g{-1}4) |6\g{-2}(?:1\g{-1}7|2\g{-1}8|3\g{-1}9|4\g{-1}0|5\g{-1}1|6\g{-1}2|7\g{-1}3|8\g{-1}4|9\g{-1}5) |7\g{-2}(?:1\g{-1}8|2\g{-1}9|3\g{-1}0|4\g{-1}1|5\g{-1}2|6\g{-1}3|7\g{-1}4|8\g{-1}5|9\g{-1}6) |8\g{-2}(?:1\g{-1}9|2\g{-1}0|3\g{-1}1|4\g{-1}2|5\g{-1}3|6\g{-1}4|7\g{-1}5|8\g{-1}6|9\g{-1}7) |9\g{-2}(?:1\g{-1}0|2\g{-1}1|3\g{-1}2|4\g{-1}3|5\g{-1}4|6\g{-1}5|7\g{-1}6|8\g{-1}7|9\g{-1}8) |0\g{-2}(\d)\g{-2}\g{-1} # At least we can handle zeros |(\d)\g{-4}0\g{-3}\g{-1} # with a bit of intelligence. ) | # And in the else branch, we have to deal with carrying. So we'll match pairs of # digits like up there, but this time we'll match pairs of digits in A and B with # their sum + 1 (mod 10) in C. # Convenient backreferences. (?=\d(\d*\ 0*+\3\20)\d(\d*\ 0*+\5\22)) # Almost done, let's just get through this last bit of ugliness. (?= 1\g{-2}(?:0\g{-1}2|1\g{-1}3|2\g{-1}4|3\g{-1}5|4\g{-1}6|5\g{-1}7|6\g{-1}8|7\g{-1}9|8\g{-1}0|9\g{-1}1) |2\g{-2}(?:0\g{-1}3|1\g{-1}4|2\g{-1}5|3\g{-1}6|4\g{-1}7|5\g{-1}8|6\g{-1}9|7\g{-1}0|8\g{-1}1|9\g{-1}2) |3\g{-2}(?:0\g{-1}4|1\g{-1}5|2\g{-1}6|3\g{-1}7|4\g{-1}8|5\g{-1}9|6\g{-1}0|7\g{-1}1|8\g{-1}2|9\g{-1}3) |4\g{-2}(?:0\g{-1}5|1\g{-1}6|2\g{-1}7|3\g{-1}8|4\g{-1}9|5\g{-1}0|6\g{-1}1|7\g{-1}2|8\g{-1}3|9\g{-1}4) |5\g{-2}(?:0\g{-1}6|1\g{-1}7|2\g{-1}8|3\g{-1}9|4\g{-1}0|5\g{-1}1|6\g{-1}2|7\g{-1}3|8\g{-1}4|9\g{-1}5) |6\g{-2}(?:0\g{-1}7|1\g{-1}8|2\g{-1}9|3\g{-1}0|4\g{-1}1|5\g{-1}2|6\g{-1}3|7\g{-1}4|8\g{-1}5|9\g{-1}6) |7\g{-2}(?:0\g{-1}8|1\g{-1}9|2\g{-1}0|3\g{-1}1|4\g{-1}2|5\g{-1}3|6\g{-1}4|7\g{-1}5|8\g{-1}6|9\g{-1}7) |8\g{-2}(?:0\g{-1}9|1\g{-1}0|2\g{-1}1|3\g{-1}2|4\g{-1}3|5\g{-1}4|6\g{-1}5|7\g{-1}6|8\g{-1}7|9\g{-1}8) |9\g{-2}(?:0\g{-1}0|1\g{-1}1|2\g{-1}2|3\g{-1}3|4\g{-1}4|5\g{-1}5|6\g{-1}6|7\g{-1}7|8\g{-1}8|9\g{-1}9) |0\g{-2}(?:0\g{-1}1|1\g{-1}2|2\g{-1}3|3\g{-1}4|4\g{-1}5|5\g{-1}6|6\g{-1}7|7\g{-1}8|8\g{-1}9|9\g{-1}0) ) ) # Whew. It's over. Consume a digit so we can move forward and restart the fun. \d )+ # At the end of all this, if we can match a space then we have succeeded in matching all # of A, and hence all of B and C. \s | # I tripped up on these edge cases involving zeros the first time I made this. ^0+\ 0*(\d+)\ 0*\g{-1}$

|

# I'm not going to make that mistake again!
^0*(\d+)\ 0+\ 0*\g{-1}\$

)).+


See, it's not so scary after all!

In case anyone is still wondering in earnest why I did this, the simple answer is: it was fun! Once I saw this challenge on StackOverflow's codegolf section and realized it may be possible, I just couldn't stop thinking about it until I had a working solution.

If anyone out there shares my zany passion for things of this nature, I invite you to subscribe to this blog and/or follow me on Twitter. I do have plenty more madness lined up to share with the world.

Also, please know that I do actually spend time creating and helping others create regular expressions that are useful and serve a practical purpose. You are welcome to fire up your favourite IRC client and pop by Freenode's #regex if you ever need any advice or just want to shoot the shit. We've got a great team there who are always happy to help you conquer your regex woes.

1. 1. 2. 