Regular Expression Backtracking Control Verbs

Backtracking control verbs are only supported by Perl, PCRE 7.3 and later, PCRE2, and Boost 1.60 and later. You can also use them in Delphi, R, and PHP 5.2.5 and later as their regex support is based on PCRE or PCRE2.

We call them backtracking control verbs because they change how the regex engine backtracks. In particular, they make the regex engine do less backtracking. If you have a deep understanding of how the regex engine works then you can use them to optimize certain regular expressions by making them fail faster. So for every control verb, this topic will walk you through the regex engine internals of a sample regex.

The verbs can also change the matches found by the regex. Eliminating backtracking also eliminates the matches that could be found by backtracking.

Backtracking control verbs are case sensitive. (*fail) is a syntax error.

FAIL or F

(*FAIL) and (*F) are synonyms. When encountered during the matching process, they tell the regex engine to fail to match. You can use them instead of the empty negative lookahead (?!) as a token that always fails to match. The regex engine backtracks normally, just like it does after any token that fails to match at the position or character that the regex engine has reached in the string.

a(*F)b|c matches only c when applied to abc. First a matches a. But then (*F) fails to match. The regex engine backtracks and attempts the next alternative c. This fails to match a. The engine advances one character in the string. a fails to match b. The engine backtracks. c also fails to match b. The engine advances another character in the string. a fails to match c. The engine backtracks. c matches c.
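Python's standard re module does not support backtracking control verbs, but it does accept the empty negative lookahead. So the walkthrough above can be reproduced with (?!) standing in for (*F). This is a minimal sketch; the choice of Python as the test bed is an assumption of this example, not something the tutorial prescribes.

```python
import re

# (?!) is the empty negative lookahead: a token that always fails.
# It stands in for (*F), which Python's re module does not support.
pattern = re.compile(r"a(?!)b|c")

match = pattern.search("abc")
print(match.group(0), match.start())  # c 2

# The first alternative can never succeed, so a string without c has no match.
print(pattern.search("ab"))  # None
```

The match starts at offset 2 because the attempts at offsets 0 and 1 both fail before c finally matches c.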

This tutorial uses (?!) when explaining how balancing groups can be used to match properly nested pairs. The conditional (?(open)(?!)) at the end of ^(?'open'o)+(?'-open'c)+(?(open)(?!))$ checks whether the group open has any captures left. If it does then (?!) forces the regex to fail because we don’t want it to match an unbalanced pair of o and c.

If .NET had supported backtracking control verbs then we could have written this regex as ^(?'open'o)+(?'-open'c)+(?(open)(*FAIL))$ to make it slightly more legible.

ACCEPT

When (*ACCEPT) is encountered during the matching process, it tells the regex engine to abort the match attempt and accept the match so far as the overall match. ab(*ACCEPT)cd matches ab in the strings ab, abcd, and abxy.

When this regex is applied to any of these strings, ab matches ab at the start of the string as usual. The engine then encounters (*ACCEPT) which tells the regex engine that time’s up and the result it’s got so far will have to do. ab becomes the overall match. The remainder of the regex is never evaluated.

The regex engine can continue with another match attempt if you want it to find all matches such as by adding the /g flag in Perl. ab(*ACCEPT)cd matches all 3 occurrences of ab in the string ababcdabxy.

When (*ACCEPT) is encountered inside a capturing group then that group also captures the text it has matched so far. When a(b(*ACCEPT)c)d matches ab the capturing group holds b. PCRE 7.3 to 7.9 did not do this. They left the capturing group empty.

COMMIT

When (*COMMIT) is encountered while backtracking, it tells the regex engine to commit to the failure by aborting the entire matching process. The engine discards all backtracking positions and stops advancing through the string entirely.

Whereas (*FAIL) and (*ACCEPT) take effect immediately when they are encountered during the normal matching process, (*COMMIT) only takes effect while backtracking. All following control verbs also take effect only when they’re backtracked into. xy|ab(*COMMIT)cd matches xy and abcd normally. When applied to the latter string, x fails to match a. The engine backtracks. ab matches ab. Then the regex engine encounters (*COMMIT). It does nothing with it other than remember it as a backtracking position in case the remainder of the regex fails. cd then matches cd. The engine has reached the end of the regex. It returns abcd as the overall match. There is no need to backtrack.

xy|ab(*COMMIT)cd fails to match ababcdxy. First, x fails to match a at the start of the string. The engine backtracks. ab matches ab. Then the regex engine encounters (*COMMIT). Again, the engine does nothing with it other than remember this as a backtracking position in case the remainder of the regex fails. Then c does fail to match the second a. Now the engine backtracks. The last remembered backtracking position was (*COMMIT). This now takes effect. The engine commits to failure and declares that the regex can’t match the string at all. It does not advance through the string.

ab+(*COMMIT)d+|.{2,3} fails to match abbbbcccc. You might think that .{2,3} would match any 2 or 3 characters. But no, (*COMMIT) prevents that. First, a matches a. Then b+ matches bbbb. (*COMMIT) is reached and pushed onto the backtracking stack. Then d fails to match c. The engine backtracks, popping (*COMMIT) off the backtracking stack. The engine discards all backtracking positions. It never tries the second alternative in the regex. It also stops advancing through the string. It is done. No match found.

This regex would match the strings xyz and ax. If the first alternative fails without ever reaching (*COMMIT) then the regex backtracks normally and attempts the second alternative.

SKIP

When (*SKIP) is encountered while backtracking, it tells the regex engine to skip ahead in the subject string by restarting the matching process at the position where (*SKIP) was reached.

ab+(*SKIP)d+|.{2,3} matches ccc in abbbbcccc. Again, you might think that .{2,3} would match any 2 or 3 characters. But no, (*SKIP) forces the regex engine to skip ahead.

a still matches a and b+ still matches bbbb. But with this regex, (*SKIP) is reached and pushed onto the backtracking stack. The position that the regex engine has reached in the string, between the last b and the first c, is noted along with that backtracking entry. Then d fails to match c. The engine backtracks, popping (*SKIP) along with its remembered position off the backtracking stack. The engine discards all backtracking positions. It does not try the second alternative in the regex. It does restart the matching process at the beginning, after advancing through the string. But instead of advancing by a single character, as the engine normally does after a failed match attempt, it starts the next match attempt at the position that was remembered by (*SKIP). In this example, that is between the last b and the first c. Restarting the match attempt there, a fails to match c. The engine backtracks, trying the second alternative. .{2,3} matches ccc, which is the overall match.

Note that this match will be found even if you do not specify the /g flag in Perl. The previous paragraph describes only one match attempt. It’s no different than a simple regex like c+ advancing through the string 5 times to find its first and only match cccc in the same string.
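The comparison with c+ can be checked in any regex flavor. Here is a small sketch using Python's re module (an assumption of this example; the tutorial itself refers to Perl). A single search call returns the match at offset 5, which means the attempts at offsets 0 through 4 failed first. No "find all" option is involved.

```python
import re

subject = "abbbbcccc"

# One call to search() covers every match attempt needed to find the first match.
match = re.search(r"c+", subject)

print(match.start(), match.group(0))  # 5 cccc
```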

PRUNE

When (*PRUNE) is encountered while backtracking, it tells the regex engine to prune the backtracking tree. But the matching process continues. With all backtracking positions pruned, the only thing left to do is to advance one character in the string and restart the regex from the beginning.

ab+(*PRUNE)d+|.{2,3} matches bbb in abbbbcccc. Once again, a still matches a and b+ still matches bbbb. This time, (*PRUNE) is reached and pushed onto the backtracking stack. Then d fails to match c. The engine pops (*PRUNE) off the backtracking stack which forces it to erase the entire stack, forgetting the backtracking positions of b+ and the alternation. With nothing to backtrack to, the regex engine concludes that the regex cannot find a match beginning at the start of the string. But it can advance through the string. It does so by a single character, as it normally does when all permutations of a regex have failed. Now a fails to match the first b. The engine backtracks to the second alternative. .{2,3} matches bbb which becomes the overall match.

THEN (Not Boost)

When (*THEN) is encountered while backtracking, it tells the regex engine to discard the backtracking positions of the current alternative and then proceed with the next alternative. When (*THEN) is inside a group, it operates on the alternatives inside the innermost parent group that has alternation. If it is in the last alternative of that group then it discards the backtracking positions of that group and proceeds with the part of the regex after the group. When no parent groups have alternation it operates on the alternatives of the overall regex. If the overall regex has no alternatives or (*THEN) is in the last alternative then it acts like (*PRUNE).

A simpler way of putting all this is to say that “if we end up backtracking in this alternative then skip ahead to the next alternative”. At least, that’s how it works in Perl, PCRE, and PCRE2.

ab+(*THEN)d+|.{2,3} matches abb in abbbbcccc. Once more, a matches a and b+ matches bbbb. (*THEN) is reached and pushed onto the backtracking stack. d fails to match c. The engine backtracks, popping (*THEN) off the backtracking stack. This tells it to discard the backtracking positions of the current alternative, but not the alternation itself. Thus b+ is not backtracked. But the alternation itself is. The engine proceeds with .{2,3} which matches abb.

We could have achieved the same with ab++d+|.{2,3}. The possessive quantifier b++ would not have stored the backtracking positions in the first place. In this case, neither (*THEN) nor the possessive quantifier change the matches found by the regex, because b and d are mutually exclusive. They only provide a slight performance improvement, preventing b+ from backtracking unnecessarily.
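The four walkthroughs of ab+(*VERB)d+|.{2,3} can be compared side by side with a hand-written emulation. The sketch below is not a regex engine and is not how Perl or PCRE implement the verbs. It hard-codes this one pattern and follows the Perl and PCRE behavior described above. The function names attempt and search are hypothetical, and the subject is assumed to contain no line breaks so that the dot matches every character.

```python
def attempt(subject, start, verb):
    """Emulate one match attempt of ab+(*VERB)d+|.{2,3} at offset start.

    Returns ("match", text), ("commit", None), ("skip", offset),
    or ("fail", None).
    """
    n = len(subject)
    # First alternative: a b+ (*VERB) d+
    if subject.startswith("a", start):
        b_end = start + 1
        while b_end < n and subject[b_end] == "b":
            b_end += 1
        # Greedy b+ tries its longest match first, then gives back one b at a time.
        for verb_pos in range(b_end, start + 1, -1):
            d_end = verb_pos
            while d_end < n and subject[d_end] == "d":
                d_end += 1
            if d_end > verb_pos:
                return ("match", subject[start:d_end])
            # d+ failed, so the engine backtracks into the verb.
            if verb == "COMMIT":
                return ("commit", None)      # abort the whole matching process
            if verb == "SKIP":
                return ("skip", verb_pos)    # restart where the verb was reached
            if verb == "PRUNE":
                return ("fail", None)        # no alternatives left at this offset
            if verb == "THEN":
                break                        # jump straight to the next alternative
            # No verb: fall through so that b+ gives back a character.
    # Second alternative: .{2,3}, greedy
    for length in (3, 2):
        if start + length <= n:
            return ("match", subject[start:start + length])
    return ("fail", None)


def search(subject, verb=None):
    """Advance through the subject the way the engine does after failed attempts."""
    start = 0
    while start <= len(subject):
        outcome, value = attempt(subject, start, verb)
        if outcome == "match":
            return value
        if outcome == "commit":
            return None
        start = value if outcome == "skip" else start + 1
    return None


for verb in (None, "COMMIT", "SKIP", "PRUNE", "THEN"):
    print(verb, search("abbbbcccc", verb))
# None abb
# COMMIT None
# SKIP ccc
# PRUNE bbb
# THEN abb
```

With the same subject, only the verb changes the result: no verb and (*THEN) both yield abb, (*PRUNE) yields bbb, (*SKIP) yields ccc, and (*COMMIT) yields no match at all.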

THEN (Boost)

When Boost encounters (*THEN) during backtracking, it discards all the backtracking positions of the innermost parent group that has alternation, including the alternation itself. So instead of continuing with the next alternative in that group like Perl and PCRE do, Boost continues with the part of the regex after that group, even if the group has alternatives following (*THEN). If there is no such parent group then Boost handles (*THEN) in the same way as (*PRUNE), discarding all backtracking positions.

Atomic Grouping

When (*ACCEPT) is encountered inside an atomic group, Perl and Boost only make the atomic group accept its match. The remainder of the regex is then attempted normally. a(?>b(*ACCEPT)c)d matches abd. First, a matches a. The engine enters the atomic group. b matches b. The engine encounters (*ACCEPT). The atomic group accepts b as its overall match. The engine continues with the remainder of the regex after the group. d matches d. abd becomes the overall match. The same regex cannot match ab or abc because d fails to match after the atomic group has accepted b.

The regex a(?>b(*ACCEPT)c)d|.. does match ab in the string abc. After the atomic group is made to accept b as its match, d fails to match c. Because (*ACCEPT) is confined to the atomic group, the engine has not discarded the backtracking positions outside the atomic group. The engine backtracks, going back to the start of the string and attempting . which matches a. The second . matches b. ab is the overall match.

PCRE 7.3 also confined (*ACCEPT) to atomic groups. But PCRE 8.10 changed this, making its behavior differ from Perl's. PCRE2 has kept this new behavior. With PCRE 8.10 and later and PCRE2, there is no difference between a(?>b(*ACCEPT)c)d and ab(*ACCEPT)cd. Both match ab. (*ACCEPT) forces the whole regex to accept its match even when (*ACCEPT) is inside an atomic group.

The other control verbs affect the overall regex even if they are inside an atomic group. However, if the engine exits the atomic group then it discards all the backtracking positions in the atomic group, including those of any control verbs. So the atomic group can prevent the control verb from having any effect at all.

ab+(?>(*COMMIT)d+)|.{2,3} fails to match abbbbcccc. When d fails to match c the regex engine is still inside the atomic group. So (*COMMIT) is still on the backtracking stack. It is backtracked into and the engine commits to failing the entire regex.

But a(?>b+(*COMMIT))d+|.{2,3} matches abb, bbc, and ccc in abbbbcccc. After b+ matches bbbb, (*COMMIT) is pushed onto the backtracking stack. But then the regex engine exits the atomic group. It discards all backtracking positions remembered inside the group. In this example, that is the position for (*COMMIT) and all the positions for b+. When d fails to match c the engine backtracks. The only position still left on the backtracking stack is the one for the alternation. Thus the regex engine attempts .{2,3} which matches abb. When finding all matches, such as with the /g flag in Perl, the engine starts the second match attempt at the end of this match. a fails to match the 3rd b in the string. The engine backtracks. The only position that ever made it onto the stack during the second match attempt is the alternation. .{2,3} matches bbc. This process repeats for the 3rd match.
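Because the verb is discarded when the engine exits the atomic group, this regex should find the same matches as a(?>b+)d+|.{2,3} without any verb. That can be checked in a flavor that lacks control verbs. The sketch below uses Python's re module, which is an assumption of this example. Atomic groups were only added to re in Python 3.11, so the sketch swaps in the common lookahead-plus-backreference emulation (?=(b+))\1 to run on older versions too.

```python
import re

# (?=(b+))\1 emulates the atomic group (?>b+): the lookahead captures the
# longest run of b without keeping backtracking positions, and \1 then
# consumes exactly that text.
pattern = re.compile(r"a(?=(b+))\1d+|.{2,3}")

# group(0) is the overall match; group 1 only exists to support the emulation.
matches = [m.group(0) for m in pattern.finditer("abbbbcccc")]
print(matches)  # ['abb', 'bbc', 'ccc']
```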

On the topic of atomic grouping, (*atomic:group) is not a backtracking control verb. It is an atomic group using control verb syntax. Its behavior is identical to (?>group). This syntax is supported by Perl 5.28 and later and PCRE2 10.33 and later.

Recursion and Lookaround

When you use backtracking control verbs inside recursion and lookaround, their behavior depends on exactly how recursion and lookaround are implemented by each regex engine. That is beyond the scope of this regex tutorial. These behaviors are undocumented, are possibly subject to change, and are inconsistent between the regex engines that support backtracking control verbs.

Lookaround too has alternative syntax that looks like control verb syntax in Perl 5.28 and PCRE2 10.33.