Tuesday 7 November 2017

Match Nested Brackets with Regex: A new approach

My first blog post was a bit of a snoozefest, so I feel I ought to make this one a little shorter and more to the point. I'm going to show you how to do something with regular expressions that's long been thought impossible. Get excited.

The problem


You want to match a full outer group of arbitrarily nested parentheses with regex but you're using a flavour such as Java's java.util.regex that supports neither recursion nor balancing groups.

The solution


(?=\()(?:(?=.*?\((?!.*?\1)(.*\)(?!.*\2).*))(?=.*?\)(?!.*?\2)(.*)).)+?.*?(?=\1)[^(]*(?=\2$)

Proof: Java Regex or PCRE on regex101 (look at the full matches on the right)

Et voila; there you go. That right there matches a full group of nested parentheses from start to end. Two substrings per match are necessarily captured and saved; these are useless to you. Just focus on the results of the main match.

No, there is no limit on depth. No, there are no recursive constructs hidden in there. Just plain ol' lookarounds, with a splash of forward referencing. If your flavour does not support forward references (I'm looking at you, JavaScript), then I'm sorry. I really am. I wish I could help you, but I'm not a freakin' miracle worker.

That's great and all, but I want to match inner groups too!


OK, here's the deal. The reason we were able to match those outer groups is because they are non-overlapping. As soon as the matches we desire begin to overlap, we must tweak our strategy somewhat. We can still inspect the subject for correctly-balanced groups of parentheses. However, instead of outright matching them, we need to save them with a capturing group like so:

(?=\()(?=((?:(?=.*?\((?!.*?\2)(.*\)(?!.*\3).*))(?=.*?\)(?!.*?\3)(.*)).)+?.*?(?=\2)[^(]*(?=\3$))) 

Exactly the same as the previous expression, except I've wrapped the bulk of it in a lookahead to avoid consuming characters, added a capturing group, and tweaked the backreference indices so they play nice with their new friend. Now the expression matches at the position just before the next parenthetical group, and the substring of interest is saved as \1.

So... how the hell does this actually work?


I'm glad you asked. The general method is quite simple: iterate through characters one at a time while simultaneously matching the next occurrences of '(' and ')', capturing the rest of the string in each case so as to establish positions from which to resume searching in the next iteration. Let me break it down piece by piece:

Component
Description




(?=\()
Make sure '(' follows before doing any hard work.

(?:
Start of group used to iterate through the string, so the following lookaheads match repeatedly.
Handle '('
(?=
This lookahead deals with finding the next '('.
.*?\((?!.*?\1)
Match up until the next '(' that is not followed by \1. Below, you'll see that \1 is filled with the entire part of the string following the last '(' matched. So "(?!.*?\1)" ensures we don't match the same '(' again.
(.*\)(?!.*\2).*)
Fill \1 with the rest of the string. At the same time, check that there is at least another occurrence of ')'. This is a PCRE band-aid used to overcome this bug.
)

Handle ')'
(?=
This lookahead deals with finding the next ')'.
.*?\)(?!.*?\2)
Match up until the next ')' that is not followed by \2. Like the earlier '(' match, this forces matching of a ')' that hasn't been matched before.
(.*)
Fill \2 with the rest of the string. The above-mentioned bug is not applicable here, so this simple expression is sufficient.
)


.
Consume a single character so that the group can continue matching. It is safe to consume a character here because neither occurrence of the next '(' or ')' could possibly exist before the new matching point.

)+?
Match as few times as possible until a balanced group has been found. This is validated by the following check.
Final validation
.*?(?=\1)
Match up to and including the last '(' found.
[^(]*(?=\2$)
Then match up until the position where the last ')' was found, making sure we don't encounter another '(' along the way (which would imply an unbalanced group).


Conclusion


So, there you have it. A way to match balanced nested structures using forward references coupled with standard (extended) regex features - no recursion or balancing groups. It's not efficient, and it certainly isn't pretty, but it is possible. And it's never been done before. That, to me, is quite exciting.

If you share my excitement for things of this nature then I encourage you to follow this blog, as I have a few more pearls of regex wisdom to offer in good time. Also in the cards is a regex quiz adventure that will test your skills in a variety of interesting and challenging ways. So please do stay tuned.

I'd like to thank Me-me on Freenode for inspiring this discovery with a clever attempt at one of my #regex challenges. Thank you for being man enough to attempt them!

Bonus: Optimization!


Just for fun, I managed to improve the performance of this technique from "disgustingly, horrendously awful" to just "awful". Using the example in the proofs above:

Original (16,257 steps):

(?=\()(?:(?=.*?\((?!.*?\1)(.*\)(?!.*\2).*))(?=.*?\)(?!.*?\2)(.*)).)+?.*?(?=\1)[^(]*(?=\2$)

Optimized (4,205 steps):

(?=\()(?:(?=(?(1).*?(?=\1)).*?\((.*))(?=(?(2).*?(?=\2)).*?\)(.*)).)+?(?>.*?(?=\1))[^(]*?(?=\2$)

And just so you can get an idea of how poor this method truly is, here's how the conventional recursive method measures up:

Recursive (445 steps):

\((?:[^()]+|(?R))*+\)

A whole order of magnitude more efficient.

5 comments:

  1. Holy shit(2). Indeed it works

    ReplyDelete
  2. Is it possible to implement this new approach with JavaScript RegEx?

    ReplyDelete
  3. how to get it to work with a leading (?im) (case-insensitive, multi-line)?

    Yes, I'm using it to grok code out of sys.sql_modules...

    ReplyDelete
  4. The "regular expressions" packages that allow back-references and other enhanced tools are in fact way more expressive than regular expressions in the narrow computer-science sense. For instance, the search pattern `(\w+)(\1)+`, which matches two or more repetitions of a word of arbitrary length, cannot be expressed as a narrow-sense regular expression, or even as a context-free grammar. There is some theory out there that goes into more details, such as https://researchspace.csir.co.za/dspace/bitstream/handle/10204/10837/Berglund_22117_2017.pdf?sequence=1

    ReplyDelete