| Unicode Regexes |
| Introduction |
| Astral Characters |
| Code Points and Graphemes |
| Unicode Categories |
| Unicode Scripts |
| Unicode Blocks |
| Unicode Binary Properties |
| Unicode Property Sets |
| Unicode Script Runs |
| Unicode Boundaries |
Unicode was originally designed as a 16-bit character set. But that turned out to be insufficient if we want Unicode to support every character from every script from all of history and the future. So Unicode was extended to allow code points up to U+10FFFF. Unicode 3.1.0 assigned the first code points beyond U+FFFF way back in 2001. This included the 𝔪𝔞𝔱𝔥 𝓼𝔂𝓶𝓫𝓸𝓵𝓼 that are often used for more fanciful purposes and over 40,000 ideographs that may be important to your users in the Far East. Later versions of Unicode added many more characters, including everybody’s favorite emoji 😺🐶 in Unicode 6.0.0.
Unicode code points are organized into 17 planes numbered from 0 to 16. Only 7 of them are defined at the moment. It doesn’t really matter which plane a character is in beyond whether it is in plane 0 or in a higher plane. Characters in higher planes are collectively referred to as “astral” characters.
| Abbreviation | Name | Plane number | Code points |
|---|---|---|---|
| BMP | Basic Multilingual Plane | 0 | U+0000–U+FFFF |
| SMP | Supplementary Multilingual Plane | 1 | U+10000–U+1FFFF |
| SIP | Supplementary Ideographic Plane | 2 | U+20000–U+2FFFF |
| TIP | Tertiary Ideographic Plane | 3 | U+30000–U+3FFFF |
| SSP | Supplementary Special-purpose Plane | 14 | U+E0000–U+EFFFF |
| SPUA-A | Supplementary Private Use Area-A | 15 | U+F0000–U+FFFFF |
| SPUA-B | Supplementary Private Use Area-B | 16 | U+100000–U+10FFFF |
But a lot of software, including Windows itself, is still designed around 16-bit characters. The UTF-16 encoding is designed to enable such software to handle astral characters. Code points U+D800–U+DBFF are reserved as “high surrogates” and code points U+DC00–U+DFFF are reserved as “low surrogates”. These code points should never appear as characters in Unicode files. UTF-16 uses them to encode astral characters as surrogate pairs consisting of one high surrogate followed by one low surrogate. For example, code point U+1F989, which represents the owl emoji 🦉, is encoded as 0xD83E 0xDD89.
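The arithmetic behind a surrogate pair can be sketched in a few lines of Python. This is only an illustration: the function name `to_surrogate_pair` is made up for this example, and the result is cross-checked against Python's built-in UTF-16 encoder.

```python
def to_surrogate_pair(code_point):
    """Return the (high, low) UTF-16 surrogates for an astral code point.

    Hypothetical helper written for this illustration.
    """
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not an astral code point")
    offset = code_point - 0x10000        # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits select the high surrogate
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits select the low surrogate
    return high, low


high, low = to_surrogate_pair(0x1F989)   # U+1F989 is the owl emoji
print(hex(high), hex(low))               # 0xd83e 0xdd89

# Cross-check with Python's own encoder (big-endian, no byte order mark).
encoded = "\U0001F989".encode("utf-16-be")
print(encoded.hex())                     # d83edd89
```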
With most regex engines, you don’t need to worry about this. They only ever see astral characters as individual characters, regardless of whether they are encoded as UTF-16 using surrogate pairs or as UTF-8 or UTF-32, which don’t use surrogates.
But some regex engines operate entirely on 16-bit code points. They see 🦉 as two separate 16-bit code points U+D83E U+DD89. This has significant implications which are explained in this topic. All other topics in this tutorial assume that your regex engine either sees 🦉 as a single code point U+1F989 or that the issue is irrelevant because your subject string does not contain any astral characters at all. The reality is that matching astral characters with a regex engine based on 16-bit characters is problematic.
.NET, std::wregex, and boost::wregex always operate on 16-bit code points. JavaScript operates on 16-bit code points without the /u flag but handles astral characters with the /u flag. The re module in Python 3.3 and later supports astral characters, while in Python 3.2 and prior it operated on 16-bit code points. All other flavors discussed in this tutorial either handle astral characters as such, or are 8-bit engines that can never encounter astral characters.
If the regex engine operates on 16-bit code points then the dot matches high and low surrogates separately. With such an engine you need two dots .. or a repeated dot .+ to properly match the owl emoji. With an engine that treats code points beyond U+FFFF as single characters (regardless of how they are encoded internally), . matches the owl emoji and .. fails to match 🦉 because the string has only one character.
A quick way to test the application you’re using is to try the regex ^.. on the string 🦉🦉. If it matches both owls then the application’s regex engine supports astral characters, with each dot matching one of them. If it matches only the first owl then it operates on 16-bit code points, with the two dots matching the surrogate pair representing the first owl.
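Here is a rough sketch of that test in Python 3.3 or later, whose re module sees astral characters as single code points. To show the other outcome, the sketch also imitates a 16-bit engine by splitting the string into lone surrogates. The helper `to_code_units` is invented for this illustration, and the imitation only approximates what .NET or JavaScript without /u would do.

```python
import re


def to_code_units(text):
    """Split text into 16-bit UTF-16 code units, one character per unit.

    Hypothetical helper: astral characters become two lone surrogates,
    which is roughly how a 16-bit regex engine sees the string.
    """
    data = text.encode("utf-16-le", "surrogatepass")
    return "".join(
        chr(int.from_bytes(data[i:i + 2], "little"))
        for i in range(0, len(data), 2)
    )


owls = "\U0001F989\U0001F989"            # two owl emoji

# Native Python: each dot matches one whole owl.
native = re.match(r"^..", owls)
print(native.group() == owls)            # True

# Imitated 16-bit engine: the two dots consume only the first surrogate pair.
units = to_code_units(owls)
simulated = re.match(r"^..", units)
print(len(units))                                   # 4
print([hex(ord(u)) for u in simulated.group()])     # ['0xd83e', '0xdd89']
```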
This issue is important if you’re trying to use the dot to match a single character or a specific number of characters. You can use .*? to skip over any number of characters regardless of your regex engine. You can use [^\uD800-\uDFFF]|
Boost and .NET match Unicode BMP characters with shorthand character classes. But they cannot match astral characters with shorthands because they see those as surrogate pairs, and the shorthands don’t match surrogates. So while U+1D400 represents a letter 𝐀 it is not matched by \w in .NET or Boost. The negated shorthand \W, however, does match each of the surrogates representing 𝐀. \W\W matches the full surrogate pair U+D835 U+DC00. All this also applies to std::wregex which only matches ASCII characters with shorthands.
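The effect on shorthands can be imitated in Python. This is a sketch, not a reproduction of .NET or Boost: Python's own re module matches 𝐀 with \w, so the 16-bit view is approximated by writing the letter as two lone surrogates.

```python
import re

bold_a = "\U0001D400"        # U+1D400 as a single code point
as_units = "\ud835\udc00"    # the same letter written as two lone surrogates

# An engine that sees code points treats the letter as a word character.
native_word = re.fullmatch(r"\w", bold_a) is not None
print(native_word)           # True

# Seen as surrogates, neither half is a word character...
unit_words = re.findall(r"\w", as_units)
print(unit_words)            # []

# ...so the negated shorthand matches each half instead.
unit_nonwords = re.fullmatch(r"\W\W", as_units) is not None
print(unit_nonwords)         # True
```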
JavaScript only matches ASCII characters with \d and \w, but Unicode characters with \s, regardless of whether you add the /u flag to the regexp. But \s never includes any astral characters in any regex flavor because Unicode does not categorize any astral characters as whitespace. So none of these shorthands match astral characters in JavaScript. Their negated counterparts \D and \W and \S match astral characters as a whole when you use the /u flag. They match individual surrogates when you do not use the /u flag.
Because .NET, Boost, and std::wregex don’t see any astral characters as word characters, word boundaries don’t work with astral characters either. \b matches before and after the ASCII A in A𝐀 where the second 𝐀 is the character U+1D400 which these applications see as the surrogate pair U+D835 U+DC00. Worse is that \B matches between the two surrogates as well as at the end of the string, because each surrogate is a non-word character.
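The boundary positions described above can be listed with a small Python sketch. As before, the 16-bit view is only imitated by writing the astral letter as two lone surrogates, and `positions` is a helper made up for this example.

```python
import re


def positions(pattern, text):
    """Return the start offsets of all matches (hypothetical helper)."""
    return [m.start() for m in re.finditer(pattern, text)]


native = "A\U0001D400"       # ASCII A followed by U+1D400 as one code point
units = "A\ud835\udc00"      # ASCII A followed by the two surrogates

# Engine that sees code points: both letters form a single word.
print(positions(r"\b", native))   # [0, 2]
print(positions(r"\B", native))   # [1]

# Imitated 16-bit view: the word ends right after the ASCII A, and \B
# matches between the surrogates and at the end of the string.
print(positions(r"\b", units))    # [0, 1]
print(positions(r"\B", units))    # [2, 3]
```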
Just as JavaScript only matches ASCII characters with \w, it only matches \b at a position that is either preceded by or followed by (but not both) an ASCII letter, digit, or underscore. So \b matches before and after the ASCII A in A𝐀, again regardless of /u. But /u does affect the non-boundary \B. Without /u it can match between the two halves of a surrogate pair. With /u it can match only before the pair (if the pair is not preceded by an ASCII word character) and after the pair (if the pair is not followed by one).
We haven’t talked about Java thus far because for the most part it handles astral characters just fine. If you compile your regex with Pattern.UNICODE_CHARACTER_CLASS then \w matches 𝐀 and \b matches at the start and end of A𝐀 but not between the two letters (the latter being an astral character). But Java does have a bug that allows \B to match in the middle of a surrogate pair, regardless of whether the astral character represented by the surrogate pair is a word character or not.
Boost and std::wregex don’t support any Unicode properties. But .NET does support Unicode categories and blocks.
Unicode categories can contain a mixture of BMP and astral characters. The 🦉 emoji U+1F989 is in the “other symbol” category. But .NET does not match it with \p{So}. Because the .NET regex engine operates on 16-bit code points, it instead sees two surrogates that are part of the “surrogate” category. Thus both the negated property \P{So} and the property \p{Cs} match each of these two surrogates in .NET. You can use \p{Cs}+ to match a string of astral characters.
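The category difference can be checked with Python's unicodedata module. This sketch makes two assumptions: Python 3.6 or later, so that the character database knows U+1F989, and a stand-in for \p{Cs}+, because Python's re module does not support \p{...}. The stand-in is the explicit surrogate range [\uD800-\uDFFF]+.

```python
import re
import unicodedata

owl = "\U0001F989"           # the owl emoji as one code point
owl_units = "\ud83e\udd89"   # the same emoji written as two lone surrogates

# As a code point, the owl is in the "other symbol" category.
print(unicodedata.category(owl))                       # So

# As 16-bit units, each half is in the "surrogate" category.
print([unicodedata.category(u) for u in owl_units])    # ['Cs', 'Cs']

# Stand-in for \p{Cs}+ : a run of code units in the surrogate range.
surrogate_run = re.compile("[\ud800-\udfff]+")
print(surrogate_run.fullmatch(owl_units) is not None)  # True
print(surrogate_run.search(owl) is None)               # True
```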
Unicode blocks never span across planes. So a block is either completely within the BMP or completely outside the BMP. The .NET regex flavor simply does not recognize the names of blocks that are outside the BMP. It treats \p{IsMusicalSymbols} as a syntax error even though that block existed in Unicode 4.0.1 (the version that .NET seems forever stuck at for Unicode blocks). Unicode does define two blocks for the surrogates, which .NET supports. \p{IsHighSurrogates}
Other regex flavors that support Unicode categories and blocks do treat \p{Cs} and \p{InHighSurrogates}
Page URL: https://www.regular-expressions.info/unicodeastral.html
Page last updated: 16 June 2025
Site last updated: 29 October 2025
Copyright © 2003-2025 Jan Goyvaerts. All rights reserved.