Unicode Characters Beyond U+FFFF

Unicode was originally designed as a 16-bit character set. But that turned out to be insufficient if we want Unicode to support every character from every script from all of history and the future. So Unicode was extended to allow code points up to U+10FFFF. Unicode 3.1.0 assigned the first code points beyond U+FFFF way back in 2001. This included the 𝔪𝔞𝔱𝔥 𝓼𝔂𝓶𝓫𝓸𝓵𝓼 that are often used for more fanciful purposes and over 40,000 ideographs that may be important to your users in the Far East. Later versions of Unicode added many more characters, including everybody’s favorite emoji 😺🐶 in Unicode 6.0.0.

Unicode code points are organized into 17 planes numbered from 0 to 16. Only 7 of them are defined at the moment. It doesn’t really matter which plane a character is in beyond whether it is in plane 0 or in a higher plane. Characters in higher planes are collectively referred to as “astral” characters.

Plane   Name                                 Number  Code points
BMP     Basic Multilingual Plane             0       U+0000–U+FFFF
SMP     Supplementary Multilingual Plane     1       U+10000–U+1FFFF
SIP     Supplementary Ideographic Plane      2       U+20000–U+2FFFF
TIP     Tertiary Ideographic Plane           3       U+30000–U+3FFFF
SSP     Supplementary Special-purpose Plane  14      U+E0000–U+EFFFF
SPUA-A  Supplementary Private Use Area-A     15      U+F0000–U+FFFFF
SPUA-B  Supplementary Private Use Area-B     16      U+100000–U+10FFFF

But a lot of software, including Windows itself, is still designed around 16-bit characters. The UTF-16 encoding is designed to enable such software to handle astral characters. Code points U+D800–DBFF are reserved as “high surrogates” and code points U+DC00–DFFF are reserved as “low surrogates”. These code points should never appear as characters in Unicode files. UTF-16 uses them to encode astral characters as surrogate pairs consisting of one high surrogate followed by one low surrogate. Code point U+1F989 which represents the owl emoji 🦉, for example, is encoded as 0xD83E 0xDD89.
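
The surrogate pair arithmetic can be sketched in a few lines of Python. This is only an illustration of the standard UTF-16 calculation; the function name to_surrogate_pair is made up for this example.

```python
def to_surrogate_pair(code_point):
    """Split an astral code point into its UTF-16 high and low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points beyond U+FFFF use surrogate pairs")
    offset = code_point - 0x10000          # 20 bits remain after removing the BMP
    high = 0xD800 + (offset >> 10)         # top 10 bits select the high surrogate
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits select the low surrogate
    return high, low

high, low = to_surrogate_pair(0x1F989)     # the owl emoji
print(f"0x{high:04X} 0x{low:04X}")         # prints 0xD83E 0xDD89
```

Python's own UTF-16 encoder produces the same two code units for the owl.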

With most regex engines, you don’t need to worry about this. They only ever see astral characters as individual characters, regardless of whether they are encoded as UTF-16 using surrogate pairs or as UTF-8 or UTF-32, which don’t use surrogates.

But some regex engines operate entirely on 16-bit code points. They see 🦉 as two separate 16-bit code points U+D83E U+DD89. This has significant implications which are explained in this topic. All other topics in this tutorial assume that your regex engine either sees 🦉 as a single code point U+1F989 or that the issue is irrelevant because your subject string does not contain any astral characters at all. The reality is that matching astral characters with a regex engine based on 16-bit characters is problematic.

.NET, std::wregex, and boost::wregex always operate on 16-bit code points. JavaScript operates on 16-bit code points without the /u flag but handles astral characters with the /u flag. The re module in Python 3.3 and later supports astral characters, while in Python 3.2 and prior it operated on 16-bit code points. All other flavors discussed in this tutorial either handle astral characters as such, or are 8-bit engines that can never encounter astral characters.

Dot

If the regex engine operates on 16-bit code points then the dot matches high and low surrogates separately. With such an engine you need two dots .. or a repeated dot .+ to properly match the owl emoji. With an engine that treats code points beyond U+FFFF as single characters (regardless of how they are encoded internally), . matches the owl emoji and .. fails to match 🦉 because the string has only one character.

A quick way to test the application you’re using is to try the regex ^.. on the string 🦉🦉. If it matches both owls then the application’s regex engine supports astral characters, with each dot matching one of them. If it matches only the first owl then it operates on 16-bit code points, with the two dots matching the surrogate pair representing the first owl.
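
Here is how that test might look in Python 3, whose re module sees astral characters as single characters. To show the other outcome, the sketch also simulates a 16-bit engine by turning the subject into a string with one character per UTF-16 code unit. The helper utf16_units is invented for this illustration and is not part of any library.

```python
import re

def utf16_units(text):
    """Return a str with one character per UTF-16 code unit of text (a simulation)."""
    data = text.encode("utf-16-le", "surrogatepass")
    return "".join(chr(int.from_bytes(data[i:i + 2], "little"))
                   for i in range(0, len(data), 2))

owls = "\U0001F989\U0001F989"                  # two owl emoji

# Engine that supports astral characters: each dot matches one owl.
aware = re.match(r"^..", owls)
print(len(aware.group()))                      # prints 2 (both owls)

# Simulated 16-bit engine: the two dots match the surrogates of the first owl.
legacy = re.match(r"^..", utf16_units(owls))
print([hex(ord(u)) for u in legacy.group()])   # prints ['0xd83e', '0xdd89']
```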

This issue is important if you’re trying to use the dot to match a single character or a specific number of characters. You can use .*? to skip over any number of characters regardless of your regex engine. You can use [^\uD800-\uDFFF]|[\uD800-\uDBFF][\uDC00-\uDFFF] to match one character in a file that may contain characters outside the BMP with an engine that operates on UTF-16 code points. Use (?:[^\uD800-\uDFFF]|[\uD800-\uDBFF][\uDC00-\uDFFF]){3} to match 3 characters, for example.
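
A rough Python sketch of the difference follows. It assumes the same kind of simulation of a 16-bit engine: the helper utf16_units (a name made up here) produces a string with one character per UTF-16 code unit, and the regex is then applied to that string.

```python
import re

def utf16_units(text):
    """Return a str with one character per UTF-16 code unit of text (a simulation)."""
    data = text.encode("utf-16-le", "surrogatepass")
    return "".join(chr(int.from_bytes(data[i:i + 2], "little"))
                   for i in range(0, len(data), 2))

# One Unicode character as seen by a 16-bit engine:
# either a single non-surrogate unit, or a high surrogate followed by a low one.
ONE_CHAR = r"(?:[^\uD800-\uDFFF]|[\uD800-\uDBFF][\uDC00-\uDFFF])"

subject = utf16_units("a\U0001F989b\U0001F989")    # 6 code units, 4 characters

naive = re.match(r".{3}", subject)                 # counts code units
careful = re.match(ONE_CHAR + r"{3}", subject)     # counts characters

print(len(naive.group()))     # prints 3: "a" plus both halves of the first owl
print(len(careful.group()))   # prints 4: "a", the first owl (2 units), and "b"
```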

Shorthand Classes

Boost and .NET match Unicode BMP characters with shorthand character classes. But they cannot match astral characters with shorthands because they see those as surrogate pairs, and the shorthands don’t match surrogates. So while U+1D400 represents a letter 𝐀 it is not matched by \w in .NET or Boost. The negated shorthand \W, however, does match each of the surrogates representing 𝐀. \W\W matches the full surrogate pair U+D835 U+DC00. All this also applies to std::wregex which only matches ASCII characters with shorthands.
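
The effect can be imitated in Python. Python's own re module sees 𝐀 as one character and matches it with \w. Feeding it a string with one character per UTF-16 code unit gives a rough simulation of what a 16-bit engine sees. The helper utf16_units is an invention of this sketch, and Python's \w is only standing in for the shorthands of .NET and Boost.

```python
import re

def utf16_units(text):
    """Return a str with one character per UTF-16 code unit of text (a simulation)."""
    data = text.encode("utf-16-le", "surrogatepass")
    return "".join(chr(int.from_bytes(data[i:i + 2], "little"))
                   for i in range(0, len(data), 2))

bold_a = "\U0001D400"                 # MATHEMATICAL BOLD CAPITAL A

# Astral-aware view: one character, and it is a word character.
print(bool(re.fullmatch(r"\w", bold_a)))       # prints True

# 16-bit view: two surrogates, neither of which is a word character.
units = utf16_units(bold_a)                    # U+D835 U+DC00
print(re.search(r"\w", units))                 # prints None
print(bool(re.fullmatch(r"\W\W", units)))      # prints True
```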

JavaScript only matches ASCII characters with \d and \w, but Unicode characters with \s, regardless of whether you add the /u flag to the regexp. But \s never includes any astral characters in any regex flavor because Unicode does not categorize any astral characters as whitespace. So none of these shorthands match astral characters in JavaScript. Their negated counterparts \D and \W and \S match astral characters as a whole when you use the /u flag. They match individual surrogates when you do not use the /u flag.

Word Boundaries

Because .NET, Boost, and std::wregex don’t see any astral characters as word characters, word boundaries don’t work with astral characters either. \b matches before and after the ASCII A in A𝐀 where the second 𝐀 is the character U+1D400 which these applications see as the surrogate pair U+D835 U+DC00. Worse is that \B matches between the two surrogates as well as at the end of the string, because each surrogate is a non-word character.
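
The boundary positions can be listed with a small Python sketch. As an assumption of this example, a 16-bit engine is simulated by applying Python's re module to a string with one character per UTF-16 code unit. The helper names utf16_units and positions are made up for the illustration.

```python
import re

def utf16_units(text):
    """Return a str with one character per UTF-16 code unit of text (a simulation)."""
    data = text.encode("utf-16-le", "surrogatepass")
    return "".join(chr(int.from_bytes(data[i:i + 2], "little"))
                   for i in range(0, len(data), 2))

def positions(pattern, subject):
    """List the offsets at which a zero-length pattern matches."""
    return [m.start() for m in re.finditer(pattern, subject)]

subject = "A\U0001D400"               # ASCII A followed by astral bold A
units = utf16_units(subject)          # A, U+D835, U+DC00

# Astral-aware view: boundaries only at the start and end of the two letters.
print(positions(r"\b", subject))      # prints [0, 2]

# 16-bit view: \b before and after the ASCII A,
# \B between the two surrogates and at the end of the string.
print(positions(r"\b", units))        # prints [0, 1]
print(positions(r"\B", units))        # prints [2, 3]
```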

Just as JavaScript only matches ASCII characters with \w, it only matches \b at a position that is either preceded by or followed by (but not both) an ASCII letter, digit, or underscore. So \b matches before and after the ASCII A in A𝐀, again regardless of /u. But /u does affect the non-boundary \B. Without /u it can match between the two surrogates of a pair. With /u it can match before the surrogate pair only if the pair is not preceded by an ASCII word character, and after the pair only if it is not followed by one, but never between the two surrogates.

We haven’t talked about Java thus far because for the most part it handles astral characters just fine. If you compile your regex with Pattern.UNICODE_CHARACTER_CLASS then \w matches 𝐀 and \b matches at the start and end of A𝐀 but not between the two letters (the latter being an astral character). But Java does have a bug that causes \B to be able to match in the middle of a surrogate pair, regardless of whether the astral character represented by the surrogate pair is a word character or not.

Unicode Properties

Boost and std::wregex don’t support any Unicode properties. But .NET does support Unicode categories and blocks.

Unicode categories can contain a mixture of BMP and astral characters. The 🦉 emoji U+1F989 is in the “other symbol” category. But .NET does not match it with \p{So}. Because the .NET regex engine operates on 16-bit code points, it instead sees two surrogates that are part of the “surrogate” category. Thus the negated property \P{So} or the property \p{Cs} match each of these two surrogates in .NET. You can use \p{Cs}+ to match a string of astral characters.
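
Python's re module does not support \p{So} or \p{Cs}, so the following sketch uses the unicodedata module to show the categories involved, and the character class [\uD800-\uDFFF] as a stand-in for \p{Cs}. The helper utf16_units is made up for this example. It simulates the 16-bit view by producing one character per UTF-16 code unit.

```python
import re
import unicodedata

def utf16_units(text):
    """Return a str with one character per UTF-16 code unit of text (a simulation)."""
    data = text.encode("utf-16-le", "surrogatepass")
    return "".join(chr(int.from_bytes(data[i:i + 2], "little"))
                   for i in range(0, len(data), 2))

owl = "\U0001F989"

# Seen as one code point, the owl is an "other symbol".
print(unicodedata.category(owl))                            # prints So

# Seen as 16-bit code points, it is two characters in the "surrogate" category.
print([unicodedata.category(u) for u in utf16_units(owl)])  # prints ['Cs', 'Cs']

# Stand-in for \p{Cs}+ matching a run of astral characters in the 16-bit view.
run = re.fullmatch(r"[\uD800-\uDFFF]+", utf16_units(owl + owl))
print(len(run.group()))                                     # prints 4 (two pairs)
```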

Unicode blocks never span across planes. So a block is either completely within the BMP or completely outside the BMP. The .NET regex flavor simply does not recognize the names of blocks that are outside the BMP. It treats \p{IsMusicalSymbols} as a syntax error even though that block existed in Unicode 4.0.1 (the version that .NET seems forever stuck at for Unicode blocks). Unicode does define two blocks for the surrogates, which .NET supports. \p{IsHighSurrogates}\p{IsLowSurrogates} matches a single astral character. (?:\p{IsHighSurrogates}\p{IsLowSurrogates}){3} matches 3 astral characters. (?:\P{Cs}|\p{IsHighSurrogates}\p{IsLowSurrogates}){3} matches 3 Unicode characters, correctly counting both BMP and astral characters.

Other regex flavors that support Unicode categories and blocks do treat \p{Cs} and \p{InHighSurrogates}\p{InLowSurrogates} as valid syntax. But they will not match surrogate pairs in UTF-16 files because those are interpreted as astral characters that are in their own categories and blocks. But these regexes could find matches in invalid Unicode files. You could use \p{Cs} to detect files with invalid surrogates. UTF-8 and UTF-32 files are not supposed to contain surrogates at all. UTF-16 files should only contain surrogates as pairs. \p{Cs} will match any surrogate in UTF-8 and UTF-32 files and will match surrogates that aren’t part of a surrogate pair in UTF-16 files.
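
A check along these lines could be sketched in Python. Since Python's re module has no \p{Cs}, the class [\uD800-\uDFFF] stands in for it. The function name has_invalid_surrogates is hypothetical. The sketch relies on the surrogatepass error handler so that stray surrogates survive decoding instead of raising an error.

```python
import re

# Stand-in for \p{Cs}: any code point in the surrogate range.
SURROGATE = re.compile(r"[\uD800-\uDFFF]")

def has_invalid_surrogates(data, encoding):
    """Report whether the decoded bytes contain a surrogate that is not part of a valid pair."""
    # Valid UTF-16 pairs decode to single astral characters, so only
    # stray surrogates remain in the decoded text for the regex to find.
    text = data.decode(encoding, errors="surrogatepass")
    return SURROGATE.search(text) is not None

owl = "\U0001F989"
valid_utf16 = owl.encode("utf-16-le")                          # proper surrogate pair
broken_utf16 = b"\x3e\xd8" + "a".encode("utf-16-le")           # lone high surrogate, then "a"
broken_utf8 = "\ud83e".encode("utf-8", errors="surrogatepass") # surrogate forced into UTF-8

print(has_invalid_surrogates(valid_utf16, "utf-16-le"))    # prints False
print(has_invalid_surrogates(broken_utf16, "utf-16-le"))   # prints True
print(has_invalid_surrogates(broken_utf8, "utf-8"))        # prints True
```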
