Quick Start
Tutorial
Tools & Languages
Examples
Reference
Book Reviews
RegexBuddy Easily create and understand regular expressions today.
Compose and analyze regex patterns with RegexBuddy's easy-to-grasp regex blocks and intuitive regex tree, instead of or in combination with the traditional regex syntax. Developed by the author of this website, RegexBuddy makes learning and using regular expressions easier than ever. Get your own copy of RegexBuddy now

Regular Expression Recursion

Perl 5.10, PCRE 4.0, Ruby 2.0, and all later versions of these three, support regular expression recursion. Perl uses the syntax (?R) with (?0) as a synonym. Ruby 2.0 uses \g<0>. PCRE supports all three as of version 7.7. Earlier versions supported only the Perl syntax (which Perl actually copied from PCRE). Recent versions of Delphi, PHP, and R also support all three, as their regex functions are based on PCRE. While Ruby 1.9 does not have any syntax for regex recursion, it does support capturing group recursion. So you could recurse the whole regex in Ruby 1.9 if you wrap the whole regex in a capturing group. .NET does not support recursion, but it supports balancing groups that can be used instead of recursion to match balanced constructs.

The regexes a(?R)?z, a(?0)?z, and a\g<0>z all match one or more letters a followed by exactly the same number of letters z. Since these regexes are functionally identical, we'll use the syntax with R for recursion to see how this regex matches the string aaazzz.

First, a matches the first a in the string. Then the regex engine reaches (?R). This tells the engine to attempt the whole regex again at the present position in the string. Now, a matches the second a in the string. The engine reaches (?R) again. On the second recursion, a matches the third a. On the third recursion, a fails to match the first z in the string. This causes (?R) to fail. But the regex uses a quantifier to make (?R) optional. So the engine continues with z which matches the first z in the string.

Now, the regex engine has reached the end of the regex. But since it's two levels deep in recursion, it hasn't found an overall match yet. It only has found a match for (?R). Exiting the recursion after a successful match, the engine also reaches z. It now matches the second z in the string. The engine is still one level deep in recursion, from which it exists with a successful match. Finally, z matches the third z in the string. The engine is again at the end of the regex. This time, it's not inside any recursion. Thus, it returns aaazzz as the overall regex match.

Quantifiers On Recursion

The quantifier ? makes the preceding token optional. In other words, it repeats the token between zero or one times. In a(?R)?z the (?R) is made optional by the ? that follows it. You may wonder why the regex attempted the recursion three times, instead of once or not at all.

The reason is that upon recursion, the regex engine takes a fresh start in attempting the whole regex. All quantifiers and alternatives behave as if the matching process prior to the recursion had never happened at all, other than that the engine advanced through the string. The regex engine restores the states of all quantifiers and alternatives when it exits from a recursion, whether the recursion matched or failed. Basically, the matching process continues normally as if the recursion never happened, other than that the engine advanced through the string.

If you're familiar with procedural programming languages, regex recursion is basically a recursive function call and the quantifiers are local variables in the function. Each recursion of the function gets its own set of local variables that don't affect and aren't affected by the same local variables in recursions higher up the stack.

This means that the regex a(?R)z without the quantifier would forever try to match another a after each a. This regex never finds any matches, because (?R) always fails to match after the last a in the string was matched. For your regex to actually find a match, each instance of (?R) must be optional. You can achieve this with a quantifier on the (?R) or on the group that it contains. You can also achieve this by adding an alternative that does not contain (?R).

You can use all quantifiers on recursion. Let's see how a(?R){3}z|q behaves. The simplest possible match is q, found by the second alternative in the regex.

The simplest match in which the first alternative matches is aqqqz. After a is matches, the regex engine begins a recursion. a fails to match q. Still inside the recursion, the engine attempts the second alternative. q matches q. The engine exits from the recursion with a successful match. The engine now notes that the quantifier {3} has successfully repeated once. It needs two more repetitions, so the engine begins another recursion. It again matches q. On the third iteration of the quantifier, the third recursion matches q. Finally, z matches z and an overall match is found.

This regex does not match aqqz or aqqqqz. aqqz fails because during the third iteration of the quantifier, the recursion fails to match z. aqqqqz fails because after a(?R){3} has matched aqqq, z fails to match the fourth q.

The regex can match longer strings such as aqaqqqzqz. With this string, during the second iteration of the quantifier, the recursion matches aqqqz. Since each recursion tracks the quantifier separately, the recursion needs three consecutive recursions of its own to satisfy its own instance of the quantifier. This can lead to arbitrarily long matches such as aaaqqaqqqzzaqqqzqzqaqqaaqqqzqqzzz.

Matching Balanced Constructs

The main purpose of recursion is to match balanced constructs or nested constructs. The generic regex is b(?:m|(?R))*e where b is what begins the construct, m is what can occur in the middle of the construct, and e is what can occur at the end of the construct. For correct results, no two of b, m, and e should be able to match the same text. You can use an atomic group instead of the non-capturing group for improved performance: b(?>m|(?R))*e.

A common real-world use is to match a balanced set of parentheses. \((?>[^()]|(?R))*\) matches a single pair of parentheses with any text in between, including an unlimited number of parentheses, as long as they are all properly paired. If the subject string contains unbalanced parentheses, then the first regex match is the leftmost pair of balanced parentheses, which may occur after unbalanced opening parentheses. If you want a regex that does not find any matches in a string that contains unbalanced parentheses, then you need to use a subroutine call instead of recursion. If you want to find a sequence of multiple pairs of balanced parentheses as a single match, then you also need a subroutine call.

Make a Donation

Did this website just save you a trip to the bookstore? Please make a donation to support this site, and you'll get a lifetime of advertisement-free access to this site!

Regex Tutorial
Introduction
Table of Contents
Special Characters
Non-Printable Characters
Regex Engine Internals
Character Classes
Character Class Subtraction
Character Class Intersection
Shorthand Character Classes
Dot
Anchors
Word Boundaries
Alternation
Optional Items
Repetition
Grouping & Capturing
Backreferences
Backreferences, part 2
Named Groups
Branch Reset Groups
Free-Spacing & Comments
Unicode
Mode Modifiers
Atomic Grouping
Possessive Quantifiers
Lookahead & Lookbehind
Lookaround, part 2
Keep Text out of The Match
Conditionals
Balancing Groups
Recursion
Subroutines
Recursion & Capturing
Recursion & Backreferences
Recursion & Backtracking
POSIX Bracket Expressions
Zero-Length Matches
Continuing Matches
More on This Site
Introduction
Regular Expressions Quick Start
Regular Expressions Tutorial
Replacement Strings Tutorial
Applications and Languages
Regular Expressions Examples
Regular Expressions Reference
Replacement Strings Reference
Book Reviews
Printable PDF
About This Site
RSS Feed & Blog
PowerGREP 4
PowerGREP PowerGREP is probably the most powerful regex-based text processing tool available today. A knowledge worker's Swiss army knife for searching through, extracting information from, and updating piles of files.
Use regular expressions to search through large numbers of text and binary files. Quickly find the files you are looking for, or extract the information you need. Look through just a handful of files or folders, or scan entire drives and network shares.
Search and replace using text, binary data or one or more regular expressions to automate repetitive editing tasks. Preview replacements before modifying files, and stay safe with flexible backup and undo options.
Use regular expressions to rename files, copy files, or merge and split the contents of files. Work with plain text files, Unicode files, binary files, compressed files, and files in proprietary formats such as MS Office, OpenOffice, and PDF. Runs on Windows 2000, XP, Vista, 7, 8, and 8.1.
More information
Download PowerGREP now