Quick Start
Tutorial
Search & Replace
Tools & Languages
Examples
Reference
Regex Tutorial
Introduction
Table of Contents
Special Characters
Non-Printable Characters
Regex Engine Internals
Character Classes
Character Class Subtraction
Character Class Intersection
Shorthand Character Classes
Dot
Anchors
Word Boundaries
Alternation
Optional Items
Repetition
Grouping & Capturing
Backreferences
Backreferences, part 2
Named Groups
Relative Backreferences
Branch Reset Groups
Free-Spacing & Comments
Unicode Characters & Properties
Mode Modifiers
Atomic Grouping
Possessive Quantifiers
Lookahead & Lookbehind
Lookaround, part 2
Lookbehind Limitations
(Non-)Atomic Lookaround
Keep Text out of The Match
Conditionals
Balancing Groups
Recursion and Subroutines
POSIX Bracket Expressions
Zero-Length Matches
Continuing Matches
Backtracking Control Verbs
Control Verb Arguments
More on This Site
Introduction
Regular Expressions Quick Start
Regular Expressions Tutorial
Replacement Strings Tutorial
Applications and Languages
Regular Expressions Examples
Regular Expressions Reference
Replacement Strings Reference
Book Reviews
Printable PDF
About This Site
RSS Feed & Blog
RegexBuddy—Better than a regular expression tutorial!

Lookahead and Lookbehind Zero-Length Assertions

Lookahead and lookbehind, collectively called “lookaround”, are zero-length assertions just like the start and end of line, and start and end of word anchors explained earlier in this tutorial. The difference is that lookaround actually matches characters, but then gives up the match, returning only the result: match or no match. That is why they are called “assertions”. They do not consume characters in the string, but only assert whether a match is possible or not. Lookaround allows you to create regular expressions that are impossible to create without them, or that would get very longwinded without them.

Positive and Negative Lookahead

Negative lookahead is indispensable if you want to match something not followed by something else. When explaining character classes, this tutorial explained why you cannot use a negated character class to match a q not followed by a u. Negative lookahead provides the solution: q(?!u). The negative lookahead construct is the pair of parentheses, with the opening parenthesis followed by a question mark and an exclamation point. Inside the lookahead, we have the trivial regex u.

Positive lookahead works just the same. q(?=u) matches a q that is followed by a u, without making the u part of the match. The positive lookahead construct is a pair of parentheses, with the opening parenthesis followed by a question mark and an equals sign.

You can use any regular expression inside the lookahead (but not lookbehind, as explained below). Any valid regular expression can be used inside the lookahead. If it contains capturing groups then those groups capture as normal and backreferences to them work normally, even outside the lookahead. (The only exception is Tcl, which treats all groups inside lookahead as non-capturing.) The lookahead itself is not a capturing group. It is not included in the count towards numbering the backreferences. If you want to store the match of the regex inside a lookahead, you have to put capturing parentheses around the regex inside the lookahead, like this: (?=(regex)). The other way around doesn’t work, because the lookahead has already discarded the regex match by the time the capturing group gets to store its match.

Control Verb Syntax

Perl 5.28 and PCRE2 10.33 introduced new syntax for lookaround in an attempt to make regexes with lookaround easier to read. PHP 7.4.0 and R 4.0.0 gained this syntax from PCRE2. The syntax is similar to that of backtracking control verbs. But otherwise it has nothing to do with control verbs. Lookaround with the new syntax works exactly the same as the classic syntax. All other flavors discussed below only support the classic syntax.

Long SyntaxShort SyntaxClassic Syntax
(*positive_lookahead:regex)(*pla:regex)(?=regex)
(*negative_lookahead:regex)(*nla:regex)(?!regex)
(*positive_lookbehind:regex)(*plb:regex)(?<=regex)
(*negative_lookbehind:regex)(*nlb:regex)(?<!regex)

Regex Engine Internals

First, let’s see how the engine applies q(?!u) to the string Iraq. The first token in the regex is the literal q. As we already know, this causes the engine to traverse the string until the q in the string is matched. The position in the string is now the void after the string. The next token is the lookahead. The engine takes note that it is inside a lookahead construct now, and begins matching the regex inside the lookahead. So the next token is u. This does not match the void after the string. The engine notes that the regex inside the lookahead failed. Because the lookahead is negative, this means that the lookahead has successfully matched at the current position. At this point, the entire regex has matched, and q is returned as the match.

Let’s try applying the same regex to quit. q matches q. The next token is the u inside the lookahead. The next character is the u. These match. The engine advances to the next character: i. However, it is done with the regex inside the lookahead. The engine notes success, and discards the regex match. This causes the engine to step back in the string to u.

Because the lookahead is negative, the successful match inside it causes the lookahead to fail. Since there are no other permutations of this regex, the engine has to start again at the beginning. Since q cannot match anywhere else, the engine reports failure.

Let’s take one more look inside, to make sure you understand the implications of the lookahead. Let’s apply q(?=u)i to quit. The lookahead is now positive and is followed by another token. Again, q matches q and u matches u. Again, the match from the lookahead must be discarded, so the engine steps back from i in the string to u. The lookahead was successful, so the engine continues with i. But i cannot match u. So this match attempt fails. All remaining attempts fail as well, because there are no more q’s in the string.

The regex q(?=u)i can never match anything. It tries to match u and i at the same position. If there is a u immediately after the q then the lookahead succeeds but then i fails to match u. If there is anything other than a u immediately after the q then the lookahead fails.

Positive and Negative Lookbehind

Lookbehind has the same effect, but works backwards. It tells the regex engine to temporarily step backwards in the string, to check if the text inside the lookbehind can be matched there. (?<!a)b matches a “b” that is not preceded by an “a”, using negative lookbehind. It doesn’t match cab, but matches the b (and only the b) in bed or debt. (?<=a)b (positive lookbehind) matches the b (and only the b) in cab, but does not match bed or debt.

The construct for positive lookbehind is (?<=text): a pair of parentheses, with the opening parenthesis followed by a question mark, less than sign, and an equals sign. Negative lookbehind is written as (?<!text), using an exclamation point instead of an equals sign.

More Regex Engine Internals

Let’s apply (?<=a)b to thingamabob. The engine starts with the lookbehind and the first character in the string. In this case, the lookbehind tells the engine to step back one character, and see if a can be matched there. The engine cannot step back one character because there are no characters before the t. So the lookbehind fails, and the engine starts again at the next character, the h. (Note that a negative lookbehind would have succeeded here.) Again, the engine temporarily steps back one character to check if an “a” can be found there. It finds a t, so the positive lookbehind fails again.

The lookbehind continues to fail until the regex reaches the m in the string. The engine again steps back one character, and notices that the a can be matched there. The positive lookbehind matches. Because it is zero-length, the current position in the string remains at the m. The next token is b, which cannot match here. The next character is the second a in the string. The engine steps back, and finds out that the m does not match a.

The next character is the first b in the string. The engine steps back and finds out that a satisfies the lookbehind. b matches b, and the entire regex has been matched successfully. It matches one character: the first b in the string.