RegexBuddy—The best regular expression debugger!

Runaway Regular Expressions: Too Many Repetitions

When a regular expression contains a repeated group such as ^(one|two)*done$ then it has two alternatives to try for each repetition of the group. In theory this regex should match a string with an arbitrarily long sequence of oneoneone…done. In practice, a backtracking regex engine will have to give up at some point.

If the regex engine uses a recursive algorithm then each repetition of the group adds a call to the engine’s call stack. The engine will give up or even crash when the number of repetitions it actually needs to make exceeds the available stack space.

Even if the engine is non-recursive, it still has to keep track of where the group’s match attempt started with each repetition. This is needed so the engine can backtrack if the remainder of the regex fails to match. When trying to match oneoneone the engine repeats the group 3 times. It stores a backtracking position for the quantifier before each repetition. The alternation operator also stores a backtracking position at each attempt. After the 3 repetitions done fails to match at the end of the string. Now the engine goes back through the 6 backtracking positions in reverse order. It attempts two at each backtracking position of the alternation operator. It attempts done at each backtracking position of the quantifier. The process is entirely linear. Even with a string that repeats one thousands of times the regex will match or fail to match instantly, depending on whether there’s a done at the end of the string.

But if you use this regex on a string that repeats one millions of times then you may run into limitations of the regex engine designed to stop it from running out of memory trying to remember all those backtracking positions. This can prevent the engine from finding extremely long matches.

The above example makes matters worse by using a capturing group. At the end of a successful match, the capturing group will hold the last occurrence of one or two in the string. But during the match attempt, it also holds the most recent match of one or two with each iteration. This causes extra work for the regex engine with each iteration of the group and each time the group is backtracked into.

Optimize with Non-Capturing and Atomic Groups

We can optimize this regex to reduce the amount of work the regex engine has to do. These are good techniques to apply to all your regexes. Even if you’ll only use your regexes on shorter strings where the performance difference is hardly measurable, you should treat them as good coding habits.

First of all, only use capturing groups if you really want to capture part of the regex match. Otherwise you can always use a non-capturing group. Turning a capturing group that doesn’t have a backreference into a non-capturing group never changes what the regex is supposed to match. So ^(?:one|two)*done$ is our first optimization.

In this case, the two alternatives are mutually exclusive. two can never match at a position where one has already matched. So all that backtracking is unnecessary. We can tell the regex engine that by using an atomic group: ^(?>one|two)*done$. Now the regex engine throws away the backtracking position of the alternation operator each time it repeats the group. It no longer attempts two at a position where one has already matched.

Optimize with Possessive Quantifiers

But the quantifier * still backtracks. ^(?>one|two)*done$ still attempts done at every position where one has matched as the quantifier backtracks. To stop this we can make the quantifier possessive: ^(?>one|two)*+done$. This regex does not backtrack at all.

Whether this regex can really match an arbitrarily long sequence of oneoneone…done depends on how the regex engine implements possessive quantifiers. If the possessive quantifier does not store backtracking positions at all, then it can. But in some regex flavors the possessive quantifier is another way of writing ^(?>(?>one|two)*)done$. In that case, the quantifier still stores all its backtracking positions, only to throw them away when the regex engine exits the outer atomic group. This does improve performance when done fails to match. But it doesn’t allow longer successful matches if the regex engine is limited by the number of backtracking positions that the quantifier can store.

Eliminate Needless Groups

Even better than turning capturing groups into non-capturing or atomic groups is to eliminate unnecessary groups. People sometimes needlessly group regex tokens because they do not understand the precedence of operators in a regex. The alternation operator has the lowest precedence of all. It alternates between everything to the left of it and everything to the right of it within the regex or group that contains it. A quantifier has high precedence. It repeats just the token or group in front of it.

So one|two* would match one, tw, two, twoo, twooo, etc. We needed the group to repeat the alternation instead of just the final o. But we don’t need extra groups around the alternatives. The two nested groups in (?:(?:one)|(?:two))* are unnecessary.

(?:[ab])+ also has a needless group. The character class is treated as a single regex token. The quantifier can repeat it directly: [ab]+.

How much an impact these unnecessary groups have depends on the regex engine. If you followed the advice to use non-capturing groups then the engine may be able to optimize away the unnecessary groups. But it can’t do that if you have unnecessary capturing groups. The regex engine can’t know whether you’ll need to retrieve the text matched by the capturing groups afterwards, so it can’t remove them.

Repeat Single Tokens

In theory, ^(?:a|b|c)+$ and ^[abc]+$ are the same. In practice, most backtracking engines execute the latter much faster. The character class can attempt both characters at the same time. It doesn’t need to backtrack at all to try the other characters like the alternation operator does. Each iteration of the character class matches exactly one character. While the quantifier may need to backtrack, it doesn’t need backtracking positions to do so. It just needs to remember the number of iterations. It can backtrack simply by stepping backwards one character in the string and decrementing the number of iterations. This enables ^[abc]+$ to match a string of any length.

"(?:[^"]|\\.)*" is a simplistic solution to match a double-quoted string that may contain double quotes if they are escaped by a backslash. We allow line breaks and assume the dot matches them.

The above regex is correct in the sense that it matches all double-quoted strings and nothing else. But it is simplistic because it performs poorly. We could use an atomic group if we flipped the two alternatives (remember the regex engine is eager). But unless the regex engine has possessive quantifiers that don’t store backtracking positions at all, we’re not going to be able to make this regex match double-quoted strings that are millions of characters long as long as we’re repeating the group for each character in the string.

To optimize this regex we need to repeat the negated character class: "(?:[^"\\]+|\\.)*". This allows the regex to quickly match runs of non-escaped characters within the string. Since those are far more common than escaped characters, this significantly reduces the number of backtracking positions the regex engine needs to remember. The outer group only repeats once for each run of non-escaped characters and once for each escaped character. The two quantifiers will still backtrack if the closing quote fails to match. But most iterations will be of the inner quantifier which can backtrack much faster.

Note that the negated character class now includes the backslash. This ensures the two alternatives are mutually exclusive. This is essential. If you paid attention to the catastrophic backtracking topic then you’ll notice a similar pattern of nested quantifiers. Though our regex will backtrack when the closing quote fails to match, it does so linearly because the second alternative can never match a character that was matched by the first alternative.

We can take this optimization one step further. We don’t need to repeat the group for runs of non-escaped characters and we don’t need it to alternate it between escaped and non-escaped characters. We only need the group to handle escaped characters. "[^"\\]*(?:\\.[^"\\]*)*" treats a double-quoted string as a series of zero or more non-escaped characters followed by zero or more escaped characters that are each followed by zero or more non-escaped characters. Now the group only remembers one backtracking position for each non-escaped character in the string. This enables the regex to match strings of pretty much any length. It’ll only run into regex engine limitations if a string should contain millions of escaped characters. It will backtrack if the closing quote fails to match. But all the backtracking attempts will immediately fail because the group starts with \\. which is mutually exclusive with [^"\\]*.

If the regex engine does support atomic grouping or possessive quantifiers then we can put the icing on the cake with "(?>[^"\\]*(?>\\.[^"\\]*)*)" or "[^"\\]*+(?:\\.[^"\\]*+)*+". Both these regexes throw away all backtracking positions when attempting to match the closing double quote. So they never backtrack at all.