
Unicode Blocks

The Unicode standard divides the Unicode character map into different blocks. Each block consists of a multiple of 16 code points. Many blocks have unassigned code points that may or may not be assigned in future Unicode versions. The blocks themselves are stable as of Unicode 3.2.0. New blocks may be added to cover ranges that are not yet covered by any block. But existing blocks will never be changed.

Because of this stability rule and how Unicode has grown over the decades, the arrangement of the blocks has become quite haphazard. If you need to match certain kinds of characters, you are usually much better off working with Unicode scripts, Unicode categories, or other Unicode properties.

For example, the Currency_Symbols block does not include the dollar and yen symbols. Those are found in the Basic_Latin and Latin-1_Supplement blocks instead, even though both are currency symbols and the yen symbol is not a Latin character. The reasons are historical: the ASCII standard includes the dollar sign, and the ISO-8859 standard includes the yen sign. The Unicode category \p{Sc} or \p{Currency_Symbol} is a better choice than the Unicode block \p{Block=Currency_Symbols} when trying to find all currency symbols.
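The difference between the category and the block can be sketched in Java, which spells the block as \p{InCurrencySymbols}. This is only an illustration: the class name, the helper methods, and the three sample characters are assumptions chosen for the example and do not come from any library.

```java
import java.util.regex.Pattern;

class CurrencyMatchDemo {
    // General category Sc (Currency_Symbol), regardless of which block holds the character.
    static final Pattern CATEGORY = Pattern.compile("\\p{Sc}");
    // Only code points that lie inside the Currency_Symbols block.
    static final Pattern BLOCK = Pattern.compile("\\p{InCurrencySymbols}");

    static boolean inCategory(String s) {
        return CATEGORY.matcher(s).matches();
    }

    static boolean inBlock(String s) {
        return BLOCK.matcher(s).matches();
    }

    public static void main(String[] args) {
        // Dollar sign, yen sign (U+00A5), euro sign (U+20AC).
        String[] samples = { "$", "\u00A5", "\u20AC" };
        for (String s : samples) {
            System.out.printf("U+%04X category=%b block=%b%n",
                    s.codePointAt(0), inCategory(s), inBlock(s));
        }
    }
}
```

The dollar and yen signs match only the category, while the euro sign matches both the category and the block.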

To further illustrate this, look at all the blocks intended for Latin characters. The Unicode block reference has the complete list of Unicode blocks along with the flavors that support them.

| Block Name | Age | Code Points |
|---|---|---|
| Basic_Latin | Unicode 1.1.0 | U+0000–U+007F |
| Latin-1_Supplement | Unicode 1.1.0 | U+0080–U+00FF |
| Latin_Extended-A | Unicode 1.1.0 | U+0100–U+017F |
| Latin_Extended-B | Unicode 1.1.0 | U+0180–U+024F |
| Latin_Extended_Additional | Unicode 1.1.0 | U+1E00–U+1EFF |
| Latin_Extended-C | Unicode 5.0.0 | U+2C60–U+2C7F |
| Latin_Extended-D | Unicode 5.0.0 | U+A720–U+A7FF |
| Latin_Extended-E | Unicode 7.0.0 | U+AB30–U+AB6F |
| Latin_Extended-F | Unicode 14.0.0 | U+10780–U+107BF |
| Latin_Extended-G | Unicode 14.0.0 | U+1DF00–U+1DFFF |
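As a rough Java illustration of how Latin letters are scattered over these blocks, the sketch below looks up the block of a few sample letters with Character.UnicodeBlock.of and checks each against the Latin script with \p{IsLatin}. The class name, helper names, and sample code points are assumptions made for this example. Only blocks from the older rows of the table are used, so the result does not depend on how recent the JDK's Unicode data is.

```java
import java.util.regex.Pattern;

class LatinBlocksDemo {
    // In Java, the Is prefix followed by a script name selects a Unicode script.
    static final Pattern LATIN_SCRIPT = Pattern.compile("\\p{IsLatin}");

    static Character.UnicodeBlock blockOf(int codePoint) {
        return Character.UnicodeBlock.of(codePoint);
    }

    static boolean isLatinScript(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return LATIN_SCRIPT.matcher(s).matches();
    }

    public static void main(String[] args) {
        // One Latin letter from each of five different blocks.
        int[] samples = { 0x0061, 0x00E9, 0x0101, 0x0180, 0x1E00 };
        for (int cp : samples) {
            System.out.printf("U+%04X block=%s latinScript=%b%n",
                    cp, blockOf(cp), isLatinScript(cp));
        }
    }
}
```

A single script test covers all five letters, whereas a block-based pattern would have to list every Latin block separately.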

Regex Syntax for Unicode Blocks

ICU, Perl, Java 7 and later match one code point in a Unicode block with \p{blk=BlockName} and \p{Block=BlockName}. \p{Block=Arrows} matches one code point between U+2190 and U+21FF.

These 3 flavors and also Ruby and the JGsoft flavor support \p{InBlockName}. \p{InArrows} matches one code point between U+2190 and U+21FF. The In prefix ensures the name is interpreted as a block name and not as a script name (see below).
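The three spellings that Java accepts can be compared with a small sketch. The class and method names are hypothetical, and the code points on either side of the block are included only to show where the block's range ends.

```java
import java.util.regex.Pattern;

class ArrowsBlockDemo {
    // Three ways to name the Arrows block in Java 7 and later.
    static final String[] SYNTAXES = {
        "\\p{Block=Arrows}", "\\p{blk=Arrows}", "\\p{InArrows}"
    };

    // True if the single code point matches the given regex in full.
    static boolean matches(String regex, int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return Pattern.matches(regex, s);
    }

    public static void main(String[] args) {
        // Just before the block, first and last code point of the block, just after it.
        int[] samples = { 0x218F, 0x2190, 0x21FF, 0x2200 };
        for (String regex : SYNTAXES) {
            for (int cp : samples) {
                System.out.printf("%s U+%04X -> %b%n", regex, cp, matches(regex, cp));
            }
        }
    }
}
```

All three spellings should agree: U+2190 and U+21FF match, while U+218F and U+2200 do not.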

XML Schema, XPath, and .NET match block names with the \p{IsBlockName} syntax. \p{IsArrows} matches one code point between U+2190 and U+21FF. These 3 flavors do not support Unicode scripts at all and do not support any other syntax for blocks. So with these flavors the \p{IsBlockName} syntax cannot be confused with a script name.

Identical Script and Block Names

Some Unicode scripts have exactly the same names as Unicode blocks, but they do not match the same characters. \p{Block=Georgian} matches any code point between U+10A0 and U+10FF, including unassigned code points such as U+10C8 and U+10CF. \p{Script=Georgian} does not match those unassigned code points and does not match U+10FB, because that is a punctuation character that belongs to the Common script. But the script does match all the letters in the block Georgian_Supplement.
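The Georgian example can be checked with a short Java sketch. Java writes the block as \p{InGeorgian} and the script as \p{IsGeorgian}. The class and helper names are assumptions for this illustration, and U+10D0 and U+2D00 are extra sample letters chosen here, one from the Georgian block and one from Georgian_Supplement.

```java
import java.util.regex.Pattern;

class GeorgianBlockVsScriptDemo {
    static final Pattern BLOCK = Pattern.compile("\\p{InGeorgian}");
    static final Pattern SCRIPT = Pattern.compile("\\p{IsGeorgian}");

    static String asString(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    static boolean inBlock(int codePoint) {
        return BLOCK.matcher(asString(codePoint)).matches();
    }

    static boolean inScript(int codePoint) {
        return SCRIPT.matcher(asString(codePoint)).matches();
    }

    public static void main(String[] args) {
        // A letter in the block, an unassigned code point in the block,
        // the punctuation mark U+10FB, and a letter from Georgian_Supplement.
        int[] samples = { 0x10D0, 0x10C8, 0x10FB, 0x2D00 };
        for (int cp : samples) {
            System.out.printf("U+%04X block=%b script=%b%n",
                    cp, inBlock(cp), inScript(cp));
        }
    }
}
```

Only U+10D0 should match both patterns. U+10C8 and U+10FB match the block alone, and U+2D00 matches the script alone.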

The JGsoft and Perl regex flavors support both \p{Block_Name} and \p{Script_Name}. They also support both \p{IsBlock_Name} and \p{IsScript_Name}. When a name could refer to either a block or a script they always interpret it as a script name.

No other flavors support \p{Block_Name} without any prefix. It is best not to use this syntax.

Spaces, Hyphens, and Underscores

The canonical names for many Unicode blocks include underscores and/or hyphens. Java, .NET, and the XML flavors require you to specify block names with hyphens but without underscores. So you need to use \p{InLatinExtended-A} to match a code point in the block Latin_Extended-A. The other flavors that support Unicode blocks don’t care about underscores or hyphens.
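For Java, the spelling rule can be probed by compiling both variants of the name. This is a sketch under assumptions: the class and helper names are hypothetical, and the rejection of the underscore spelling is the behavior expected from the rule above rather than something guaranteed for every JDK release.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class BlockNameSpellingDemo {
    // True if the regex compiles, false if Java rejects it as a syntax error.
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    // True if the single code point matches the given regex in full.
    static boolean matches(String regex, int codePoint) {
        return Pattern.matches(regex, new String(Character.toChars(codePoint)));
    }

    public static void main(String[] args) {
        // Hyphen kept, underscore dropped: the form Java expects.
        System.out.println(compiles("\\p{InLatinExtended-A}"));
        // Canonical Unicode spelling with the underscore left in.
        System.out.println(compiles("\\p{InLatin_Extended-A}"));
        // U+0100 is the first code point of Latin_Extended-A.
        System.out.println(matches("\\p{InLatinExtended-A}", 0x0100));
    }
}
```

Wrapping the compile step in a helper like this is one way to validate a block name coming from configuration or user input before building a larger pattern around it.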
