Quick Start
Tutorial
Tools & Languages
Examples
Reference
Book Reviews
RegexBuddy Easily create and understand regular expressions today.
Compose and analyze regex patterns with RegexBuddy's easy-to-grasp regex blocks and intuitive regex tree, instead of or in combination with the traditional regex syntax. Developed by the author of this website, RegexBuddy makes learning and using regular expressions easier than ever. Get your own copy of RegexBuddy now

Named Capturing Groups

Nearly all modern regular expression engines support numbered capturing groups and numbered backreferences. Long regular expressions with lots of groups and backreferences may be hard to read. They can be particularly difficult to maintain as adding or removing a capturing group in the middle of the regex upsets the numbers of all the groups that follow the added or removed group.

Python's re module was the first to offer a solution: named capturing groups and named backreferences. (?P<name>group) captures the match of group into the backreference "name". name must be an alphanumeric sequence starting with a letter. group can be any regular expression. You can reference the contents of the group with the named backreference (?P=name). The question mark, P, angle brackets, and equals signs are all part of the syntax. Though the syntax for the named backreference uses parentheses, it's just a backreference that doesn't do any capturing or grouping. The HTML tags example can be written as <(?P<tag>[A-Z][A-Z0-9]*)\b[^>]*>.*?</(?P=tag)>.

When doing a search-and-replace in Python, you can use \g<name> in the replacement text to insert the text matched by the named capturing group.

The .NET framework also support named capture. Microsoft's developers invented their own syntax, rather than follow the one pioneered by Python and copied by PCRE (the only two regex engines that supported named capture at that time). (?<name>group) or (?'name'group) captures the match of group into the backreference "name". The named backreference is \k<name> or \k'name'. Compared with Python, there is no P in the syntax for named groups. The syntax for named backreferences is more similar to that of numbered backreferences than to what Python uses. You can use single quotes or angle brackets around the name. This makes absolutely no difference in the regex. You can use both styles interchangeably. The syntax using angle brackets is preferable in programming languages that use single quotes to delimit strings, while the syntax using single quotes is preferable when adding your regex to an XML file, as this minimizes the amount of escaping you have to do to format your regex as a literal string or as XML content.

When doing a search-and-replace in .NET, you can use ${name} in the replacement text to insert the text matched by the named capturing group.

Because Python and .NET introduced their own syntax, we refer to both variants as the "Python syntax" and the ".NET syntax" for named capture and named backreferences. Today, many other regex flavors have copied this syntax.

Perl 5.10 added support for both the Python and .NET syntax for named capture and backreferences. It also adds two more syntactic variants for named backreferences: \k{one} and \g{two}. There's no difference between the five syntaxes for named backreferences in Perl. All can be used interchangeably. In the replacement text, you can interpolate the variable $+{name} to insert the text matched by a named capturing group.

PCRE 7.2 and later support all the syntax for named capture and backreferences that Perl 5.10 supports. Old versions of PCRE supported the Python syntax, even though that was not "Perl-compatible" at the time. Languages like PHP, Delphi, and R that implement their regex support using PCRE also support all this syntax. Unfortunately, neither PHP or R support named references in the replacement text. You'll have to use numbered references to the named groups. PCRE does not support search-and-replace at all.

Java 7 and XRegExp copied the .NET syntax, but only the variant with angle brackets. Ruby 1.9 supports both variants of the .NET syntax. The JGsoft flavor supports both the Python and .NET syntax.

Numbers for Named Capturing Groups

Mixing named and numbered capturing groups is not recommended because flavors are inconsistent in how the groups are numbered. If a group doesn't need to have a name, make it non-capturing using the (?:group) syntax. In .NET you can make all unnamed groups non-capturing by setting RegexOptions.ExplicitCapture. With XRegExp, use the "n" flag. With PCRE, set PCRE_NO_AUTO_CAPTURE. If you do that, you can skip this section and save yourself a headache.

Most flavors number both named and unnamed capturing groups by counting their opening parentheses from left to right. Adding a named capturing group to an existing regex still upsets the numbers of the unnamed groups. The only exceptions are the .NET and JGsoft flavors. In .NET, unnamed capturing groups are assigned numbers first, counting their opening parentheses from left to right, skipping all named groups. After that, named groups are assigned the numbers that follow by counting the opening parentheses of the named groups from left to right.

The JGsoft regex engine copied the Python and the .NET syntax at a time when only Python and PCRE used the Python syntax, and only .NET used the .NET syntax. Therefore it also copied the numbering behavior of both Python and .NET, so that regexes intended for Python and .NET would keep their behavior. It numbers Python-style named groups along unnamed ones, like Python does. It numbers .NET-style named groups afterward, like .NET does. These rules apply even when you mix both styles in the same regex.

As an example, the regex (a)(?P<x>b)(c)(?P<y>d) matches abcd as expected. If you do a search-and-replace with this regex and the replacement \1\2\3\4 or $1$2$3$4 (depending on the flavor), you will get abcd. All four groups were numbered from left to right, from one till four.

Things are a bit more complicated with the .NET framework. The regex (a)(?<x>b)(c)(?<y>d) again matches abcd. However, if you do a search-and-replace with $1$2$3$4 as the replacement, you will get acbd. First, the unnamed groups (a) and (c) got the numbers 1 and 2. Then the named groups "x" and "y" got the numbers 3 and 4.

In all other flavors that copied the .NET syntax the regex (a)(?<x>b)(c)(?<y>d) still matches abcd. But in all those flavors, except the JGsoft flavor, the replacement \1\2\3\4 or $1$2$3$4 (depending on the flavor) gets you abcd. All four groups were numbered from left to right.

In PowerGREP, which uses the JGsoft flavor, named capturing groups play a special role. Groups with the same name are shared between all regular expressions and replacement texts in the same PowerGREP action. This allows captured by a named capturing group in one part of the action to be referenced in a later part of the action. Because of this, PowerGREP does not allow numbered references to named capturing groups at all. When mixing named and numbered groups in a regex, the numbered groups are still numbered following the Python and .NET rules, like the JGsoft flavor always does.

Multiple Groups with The Same Name

The .NET framework and the JGsoft flavor allow multiple groups in the regular expression to have the same name. All groups with the same name share the same storage for the text they match. Thus, a backreference to that name matches the text that was matched by the group with that name that most recently captured something. A reference to the name in the replacement text inserts the text matched by the group with that name that was the last one to capture something.

Perl and Ruby also allow groups with the same name. But these flavors only use smoke and mirrors to make it look like the all the groups with the same name act as one. In reality, the groups are separate. In Perl, a backreference matches the text captured by the leftmost group in the regex with that name that matched something. In Ruby, a backreference matches the text captured by any of the groups with that name. Backtracking makes Ruby try all the groups.

So in Perl and Ruby, you can only meaningfully use groups with the same name if they are in separate alternatives in the regex, so that only one of the groups with that name could ever capture any text. Then backreferences to that group sensibly match the text captured by the group.

For example, if you want to match "a" followed by a digit 0..5, or "b" followed by a digit 4..7, and you only care about the digit, you could use the regex a(?<digit>[0-5])|b(?<digit>[4-7]). In these four flavors, the group named "digit" will then give you the digit 0..7 that was matched, regardless of the letter. If you want this match to be followed by c and the exact same digit, you could use (?:a(?<digit>[0-5])|b(?<digit>[4-7]))c\k<digit>

Python, Java, and XRegExp do not allow multiple groups to use the same name. Doing so will give a regex compilation error. PCRE and PHP also treat duplicate named groups as an error by default. They have an option to enable duplicate named groups, but it isn't very useful because you can only have backreferences to the first group in the regex with that name.

In Perl and PCRE, it is best to use a branch reset group when you want groups in different alternatives to have the same name, as in (?|a(?<digit>[0-5])|b(?<digit>[4-7]))c\k<digit>. With this special syntax—group opened with (?| instead of (?:—the two groups named "digit" really are one and the same group.

Make a Donation

Did this website just save you a trip to the bookstore? Please make a donation to support this site, and you'll get a lifetime of advertisement-free access to this site!

Regex Tutorial
Introduction
Table of Contents
Special Characters
Non-Printable Characters
Regex Engine Internals
Character Classes
Character Class Subtraction
Character Class Intersection
Shorthand Character Classes
Dot
Anchors
Word Boundaries
Alternation
Optional Items
Repetition
Grouping & Capturing
Backreferences
Backreferences, part 2
Named Groups
Branch Reset Groups
Free-Spacing & Comments
Unicode
Mode Modifiers
Atomic Grouping
Possessive Quantifiers
Lookahead & Lookbehind
Lookaround, part 2
Keep Text out of The Match
Conditionals
Balancing Groups
Recursion
Subroutines
Recursion & Capturing
Recursion & Backreferences
Recursion & Backtracking
POSIX Bracket Expressions
Zero-Length Matches
Continuing Matches
More on This Site
Introduction
Regular Expressions Quick Start
Regular Expressions Tutorial
Replacement Strings Tutorial
Applications and Languages
Regular Expressions Examples
Regular Expressions Reference
Replacement Strings Reference
Book Reviews
Printable PDF
About This Site
RSS Feed & Blog
PowerGREP 4
PowerGREP PowerGREP is probably the most powerful regex-based text processing tool available today. A knowledge worker's Swiss army knife for searching through, extracting information from, and updating piles of files.
Use regular expressions to search through large numbers of text and binary files. Quickly find the files you are looking for, or extract the information you need. Look through just a handful of files or folders, or scan entire drives and network shares.
Search and replace using text, binary data or one or more regular expressions to automate repetitive editing tasks. Preview replacements before modifying files, and stay safe with flexible backup and undo options.
Use regular expressions to rename files, copy files, or merge and split the contents of files. Work with plain text files, Unicode files, binary files, compressed files, and files in proprietary formats such as MS Office, OpenOffice, and PDF. Runs on Windows 2000, XP, Vista, 7, 8, and 8.1.
More information
Download PowerGREP now