The Unicode standard divides the Unicode character map into different blocks. Each block consists of a multiple of 16 code points. Many blocks have unassigned code points that may or may not be assigned in future Unicode versions. The blocks themselves are stable as of Unicode 3.2.0. New blocks may be added to cover ranges that are not yet covered by any block. But existing blocks will never be changed.
Because of this stability rule and how Unicode has grown over the decades, the arrangement of the blocks has become quite haphazard. If you need to match certain kinds of characters, you are usually much better off working with Unicode scripts, Unicode categories, or other Unicode properties.
For example, the Currency_Symbols block does not include the dollar and yen symbols. Those are found in the Basic_Latin and Latin-1_Supplement blocks instead, even though both are currency symbols, and the yen symbol is not a Latin character. This is for historical reasons: the ASCII standard includes the dollar sign, and the ISO-8859 standard includes the yen sign. The Unicode category \p{Sc} or \p{Currency_Symbol} is a better choice than the Unicode block \p{Block=Currency_Symbols} when trying to find all currency symbols.
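The difference between the category and the block can be sketched in Java. This is a minimal illustration, assuming Java 7 or later; the class and method names are made up for this example. Java spells the block name without the underscore, so the block token is written \p{InCurrencySymbols} here.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CurrencyBlockVsCategory {
    // General category Sc: currency symbols, regardless of which block holds them.
    static final Pattern CATEGORY = Pattern.compile("\\p{Sc}");
    // Block Currency_Symbols, U+20A0 to U+20CF.
    static final Pattern BLOCK = Pattern.compile("\\p{InCurrencySymbols}");

    // Counts how many matches the pattern finds in the text.
    static int count(Pattern p, String text) {
        Matcher m = p.matcher(text);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        // Dollar sign (U+0024), yen sign (U+00A5), euro sign (U+20AC).
        String text = "\u0024 \u00A5 \u20AC";
        System.out.println("category: " + count(CATEGORY, text)); // 3
        System.out.println("block:    " + count(BLOCK, text));    // 1
    }
}
```

Only the euro sign lies inside the block, while the category finds all three symbols.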
To further illustrate this, look at all the blocks intended for Latin characters. The Unicode block reference has the complete list of Unicode blocks along with the flavors that support them.
| Block Name | Age | Code Points |
|---|---|---|
| Basic_Latin | Unicode 1.1.0 | U+0000–U+007F |
| Latin-1_Supplement | Unicode 1.1.0 | U+0080–U+00FF |
| Latin_Extended-A | Unicode 1.1.0 | U+0100–U+017F |
| Latin_Extended-B | Unicode 1.1.0 | U+0180–U+024F |
| Latin_Extended_Additional | Unicode 1.1.0 | U+1E00–U+1EFF |
| Latin_Extended-C | Unicode 5.0.0 | U+2C60–U+2C7F |
| Latin_Extended-D | Unicode 5.0.0 | U+A720–U+A7FF |
| Latin_Extended-E | Unicode 7.0.0 | U+AB30–U+AB6F |
| Latin_Extended-F | Unicode 14.0.0 | U+10780–U+107BF |
| Latin_Extended-G | Unicode 14.0.0 | U+1DF00–U+1DFFF |
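To see how Latin letters are scattered across these blocks, the following Java sketch looks up the block of a few sample code points and checks them against the Latin script. It is only an illustration, assuming Java 7 or later. The class name and the choice of sample characters are my own, and the samples are limited to blocks that older JDKs already know.

```java
import java.util.regex.Pattern;

class LatinBlocks {
    // In Java, the Is prefix with a script name matches by script.
    static final Pattern LATIN_SCRIPT = Pattern.compile("\\p{IsLatin}");

    // Name of the block containing the code point, as Java reports it.
    static String blockOf(int codePoint) {
        return String.valueOf(Character.UnicodeBlock.of(codePoint));
    }

    // True if the single code point belongs to the Latin script.
    static boolean isLatinScript(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return LATIN_SCRIPT.matcher(s).matches();
    }

    public static void main(String[] args) {
        // One letter from each of five different Latin blocks.
        int[] samples = {0x0041, 0x00E9, 0x0101, 0x1E00, 0x2C60};
        for (int cp : samples) {
            System.out.printf("U+%04X block=%s latinScript=%b%n",
                    cp, blockOf(cp), isLatinScript(cp));
        }
    }
}
```

Each sample sits in a different block, yet a single script token covers all of them. Matching the same letters by block would need one token per block.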
ICU, Perl, and Java 7 and later match one code point in a Unicode block with \p{blk=BlockName} and \p{Block=BlockName}. \p{Block=Arrows} matches one code point between U+2190 and U+21FF.
These 3 flavors and also Ruby and the JGsoft flavor support \p{InBlockName}. \p{InArrows} matches one code point between U+2190 and U+21FF. The In prefix ensures the name is interpreted as a block name and not as a script name (see below).
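As a quick check of these spellings in one of the flavors mentioned, the Java sketch below tries all three tokens against code points at and just outside the edges of the Arrows block. It assumes Java 7 or later, and the class and method names are invented for this example.

```java
import java.util.regex.Pattern;

class ArrowsBlockSyntax {
    // Three spellings of the same block token.
    static final String[] TOKENS = {
        "\\p{blk=Arrows}", "\\p{Block=Arrows}", "\\p{InArrows}"
    };

    // True if the token matches the given code point on its own.
    static boolean matchesOne(String token, int codePoint) {
        return Pattern.matches(token, new String(Character.toChars(codePoint)));
    }

    public static void main(String[] args) {
        // The first and last code points of the block, then their outside neighbors.
        int[] samples = {0x2190, 0x21FF, 0x218F, 0x2200};
        for (String token : TOKENS) {
            for (int cp : samples) {
                System.out.printf("%s U+%04X %b%n", token, cp, matchesOne(token, cp));
            }
        }
    }
}
```

All three tokens should agree: U+2190 and U+21FF match, while U+218F and U+2200 belong to neighboring blocks and do not.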
XML Schema, XPath, and .NET match block names with \p{IsBlockName}. \p{IsArrows} matches one code point between U+2190 and U+21FF. These 3 flavors do not support Unicode scripts at all and do not support any other syntax for blocks. So with these flavors there is no confusion between blocks and scripts when using the \p{IsBlockName} syntax.
Some Unicode scripts have exactly the same names as Unicode blocks. But they do not match the same characters. \p{Block=Georgian} matches any code point between U+10A0 and U+10FF, including unassigned code points such as U+10C8 and U+10CF. \p{Script=Georgian} does not match those unassigned code points and does not match U+10FB, because that is a punctuation character in the Common script. But the script does match all the letters in the block Georgian_Supplement.
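The Georgian example can be tried out in Java, which supports both blocks and scripts. This is a minimal sketch assuming Java 7 or later, with made-up class and method names. The exact results depend on the Unicode version your JDK ships with.

```java
import java.util.regex.Pattern;

class GeorgianBlockVsScript {
    static final Pattern BLOCK = Pattern.compile("\\p{InGeorgian}");
    static final Pattern SCRIPT = Pattern.compile("\\p{script=Georgian}");

    // True if the pattern matches the given code point on its own.
    static boolean matches(Pattern p, int codePoint) {
        return p.matcher(new String(Character.toChars(codePoint))).matches();
    }

    public static void main(String[] args) {
        int[] samples = {
            0x10D0, // Georgian letter inside the Georgian block
            0x10C8, // unassigned code point inside the Georgian block
            0x10FB, // Georgian paragraph separator, Common script
            0x2D00  // Georgian letter in the Georgian_Supplement block
        };
        for (int cp : samples) {
            System.out.printf("U+%04X block=%b script=%b%n",
                    cp, matches(BLOCK, cp), matches(SCRIPT, cp));
        }
    }
}
```

Per the description above, only U+10D0 should match both tokens. U+10C8 and U+10FB should match the block alone, and U+2D00 the script alone.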
The JGsoft and Perl regex flavors support both \p{Block_Name} and \p{Script_Name}. They also support both \p{IsBlock_Name} and \p{IsScript_Name}. When a name could refer to either a block or a script they always interpret it as a script name.
No other flavors support \p{Block_Name} without any prefix. It is best not to use this syntax.
The canonical names for many Unicode blocks include underscores and/or hyphens. Java, .NET, and the XML flavors require you to specify block names with hyphens but without underscores. So to match a code point in the block Latin_Extended-A you need to use \p{InLatinExtended-A} in Java and \p{IsLatinExtended-A} in .NET and the XML flavors. The other flavors that support Unicode blocks don’t care about underscores or hyphens.
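If you are unsure which spelling your Java version accepts, you can simply try to compile the token. The helper below is a hypothetical sketch, not part of any library. It assumes Java 7 or later and reports whether a regex compiles, so you can compare the spelling with the hyphen only against the canonical spelling with the underscore.

```java
import java.util.regex.Pattern;

class BlockNameSpelling {
    // True if Java accepts the regex. An unknown block name makes
    // Pattern.compile throw PatternSyntaxException, which is a
    // subclass of IllegalArgumentException.
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (IllegalArgumentException e) {
            return false;
        }
    }

    // True if the token matches the given code point on its own.
    static boolean matchesOne(String token, int codePoint) {
        return Pattern.matches(token, new String(Character.toChars(codePoint)));
    }

    public static void main(String[] args) {
        String[] candidates = {
            "\\p{InLatinExtended-A}",    // hyphen kept, underscore dropped
            "\\p{Block=LatinExtended-A}",
            "\\p{InLatin_Extended-A}"    // canonical spelling with underscore
        };
        for (String regex : candidates) {
            System.out.println(regex + " compiles: " + compiles(regex));
        }
        // U+0100 is the first code point of the Latin_Extended-A block.
        System.out.println(matchesOne("\\p{InLatinExtended-A}", 0x0100));
    }
}
```

Checking at startup like this is cheaper than discovering a rejected block name when a pattern is first used.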
Page URL: https://www.regular-expressions.info/unicodeblock.html
Page last updated: 16 June 2025
Site last updated: 29 October 2025
Copyright © 2003-2025 Jan Goyvaerts. All rights reserved.