Unicode Scripts and Script Extensions

The Unicode standard places each assigned code point (character) into one script. A script is a group of code points used by a particular human writing system. Some scripts like Thai correspond with a single human language. Other scripts like Latin span multiple languages.

Some languages are composed of multiple scripts. There is no Japanese Unicode script. Instead, Unicode offers the Hiragana, Katakana, Han, and Latin scripts that Japanese documents are usually composed of.

ICU, Perl, Java 7 and later, and JavaScript with /u match one code point in a Unicode script with \p{sc=Script_Name} and \p{Script=Script_Name}. \p{Script=Canadian_Aboriginal} matches ᙯ (U+166F), for example. PCRE2 supports this notation starting with version 10.40. Everything said below about PCRE2 10.40 also applies to PHP as of version 8.2.0 and R as of version 4.2.2.
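As a minimal sketch of this notation in Java, one of the flavors listed above, the snippet below tests single code points against a script. The class and helper names are made up for illustration, and a Java 7 or later runtime is assumed.

```java
import java.util.regex.Pattern;

class ScriptPropertyDemo {
    // True if the regex matches a string consisting of just this one code point.
    static boolean matchesCodePoint(String regex, int codePoint) {
        String input = new String(Character.toChars(codePoint));
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        int qai = 0x166F; // the Canadian Aboriginal code point used as the example above
        System.out.println(matchesCodePoint("\\p{Script=Canadian_Aboriginal}", qai)); // true
        System.out.println(matchesCodePoint("\\p{sc=Canadian_Aboriginal}", qai));     // true
        System.out.println(matchesCodePoint("\\p{sc=Canadian_Aboriginal}", 'A'));     // false: A is Latin
    }
}
```

Building the input with Character.toChars keeps the helper correct for code points outside the Basic Multilingual Plane too.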

Two special scripts are the Common and Inherited scripts. The Common script contains all sorts of characters that are common to a wide range of scripts. It includes all sorts of punctuation, whitespace, and miscellaneous symbols. The Inherited script contains mostly combining characters that should take on the script of the base character they’re combined with during script analysis. But regex engines don’t do that. \p{Script=Latin} matches only the a in à encoded as U+0061 U+0300. To match both characters you’d need to use \p{Script=Latin}\p{Script=Inherited}. To match a string of Latin characters with combining diacritics you could use [\p{Script=Latin}\p{Script=Inherited}]+.
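The à example can be checked with a short Java sketch. The helper name firstMatch is hypothetical; the two patterns are the ones from the paragraph above, with a + added to the first so both are greedy.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class InheritedScriptDemo {
    // Returns the first match of the regex in the input, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Decomposed a-grave: letter a followed by the combining grave accent U+0300.
        String aGrave = "a\u0300";

        // Latin alone stops before the combining mark, which is in the Inherited script.
        System.out.println(firstMatch("\\p{Script=Latin}+", aGrave).length()); // 1

        // Adding Inherited to the character class picks up the diacritic as well.
        System.out.println(
            firstMatch("[\\p{Script=Latin}\\p{Script=Inherited}]+", aGrave).length()); // 2
    }
}
```

The sketch prints match lengths rather than the matched text because a lone combining mark is hard to read in console output.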

All assigned Unicode code points (those matched by \P{Cn}) are part of exactly one Unicode script. All unassigned Unicode code points (those matched by \p{Cn}) are either not part of any Unicode script at all or are part of the Unknown script, depending on the implementation. All aforementioned flavors support \p{Script=Unknown}. Ruby 1.9 and PCRE2 10.33 support \p{Unknown}, a syntax we’ll discuss below.
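A rough Java sketch of how an unassigned code point behaves follows. It uses U+10C8 and assumes the runtime's Unicode tables still leave that code point unassigned; the class and method names are illustrative only.

```java
import java.util.regex.Pattern;

class UnknownScriptDemo {
    // True if the regex matches a string consisting of just this one code point.
    static boolean matchesCodePoint(String regex, int codePoint) {
        return Pattern.matches(regex, new String(Character.toChars(codePoint)));
    }

    public static void main(String[] args) {
        int unassigned = 0x10C8; // assumption: unassigned in the runtime's Unicode version

        // Unassigned code points have general category Cn.
        System.out.println(matchesCodePoint("\\p{Cn}", unassigned));
        // Java places them in the Unknown script rather than in no script at all.
        System.out.println(matchesCodePoint("\\p{Script=Unknown}", unassigned));
        System.out.println(Character.UnicodeScript.of(unassigned));
    }
}
```

If all three lines report an unassigned, Unknown code point, the runtime follows the "part of the Unknown script" interpretation described above.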

Forcing each code point to be part of exactly one script does not work so well, particularly with tools like regex engines that don’t do script analysis. To alleviate this, Unicode 6.0.0 introduced the Script_Extensions property. This property uses the exact same script names as the Script property. All flavors that support \p{Script=Script_Name}, except Java, also support \p{scx=Script_Name} and \p{Script_Extensions=Script_Name}.

Every code point that has a Script property value other than Common or Inherited also has that same value for Script_Extensions. So \p{Script_Extensions=Canadian_Aboriginal} also matches ᙯ (U+166F). A code point can have additional values for Script_Extensions. The Devanagari digit 9 (U+096F) is matched by both \p{Script=Devanagari} and \p{Script_Extensions=Devanagari}. It is also matched by \p{Script_Extensions=Kaithi} (since Unicode 6.3.0), \p{Script_Extensions=Mahajani} (since Unicode 7.0.0), and \p{Script_Extensions=Dogra} (since Unicode 11.0.0). But it is never matched by \p{Script=Kaithi}, \p{Script=Mahajani}, or \p{Script=Dogra}.

Code points that have Script=Common or Script=Inherited either have the same value, and only that value, for Script_Extensions, or they have multiple values for Script_Extensions that do not include Common or Inherited. For example, the ASCII digit 9 is matched by both \p{Script=Common} and \p{Script_Extensions=Common}. It does not have any other values for Script_Extensions. The ditto mark 〃 (U+3003) is matched by \p{Script=Common} but not by \p{Script_Extensions=Common}. Instead, it is matched by \p{Script_Extensions=Bopomofo}, \p{Script_Extensions=Han}, \p{Script_Extensions=Hangul}, \p{Script_Extensions=Hiragana}, and \p{Script_Extensions=Katakana}.

PCRE2, unfortunately, does not correctly implement \p{Script_Extensions=Common} and \p{Script_Extensions=Inherited}. It treats them as equivalent to \p{Script=Common} and \p{Script=Inherited}. Thus in PCRE2 10.40 and later, \p{Script_Extensions=Common} does match 〃 (U+3003).

Unicode Script Value Only

The Unicode standard suggests that regular expression flavors should support \p{Script_Name}. But then the question becomes whether this should be based on the Script property or on the Script_Extensions property. Traditionally, it has been based on the Script property. This is the case in ICU, RE2, Ruby 1.9, PCRE 6.5, Delphi, and the JGsoft flavor. With these flavors, \p{Han} matches 䀀 (U+4000) but not 〃 (U+3003).

Other flavors changed their mind. Perl 5.26 and PCRE2 10.40 changed \p{Script_Name} to be based on the Script_Extensions property, while it was based on Script in older versions. So in Perl 5.26 and PCRE2 10.40, \p{Han} matches both 䀀 (U+4000) and 〃 (U+3003).

Java supports \p{IsScript_Name} with an extra Is prefix and implements it using the Script property. In Java, \p{IsHan} matches 䀀 (U+4000) but not 〃 (U+3003). Perl and the JGsoft applications support this syntax too. Perl 5.26 changes the implementation from Script to Script_Extensions. ICU 63 and later support this syntax based on Script_Extensions. ICU 62 and prior did not allow the Is prefix.
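The Java behavior described here can be sketched as follows. The class and helper names are hypothetical; the code points are the two used in the examples above.

```java
import java.util.regex.Pattern;

class IsPrefixDemo {
    // True if the regex matches a string consisting of just this one code point.
    static boolean matchesCodePoint(String regex, int codePoint) {
        return Pattern.matches(regex, new String(Character.toChars(codePoint)));
    }

    public static void main(String[] args) {
        int hanIdeograph = 0x4000; // Script=Han
        int dittoMark = 0x3003;    // Script=Common, Han only via Script_Extensions

        System.out.println(matchesCodePoint("\\p{IsHan}", hanIdeograph)); // true
        System.out.println(matchesCodePoint("\\p{IsHan}", dittoMark));    // false
        // The Script property of the ditto mark explains the second result.
        System.out.println(Character.UnicodeScript.of(dittoMark));        // COMMON
    }
}
```

Because Java's standard regex package has no Script_Extensions support, a pattern that should also accept the ditto mark has to add it explicitly, for instance with [\p{IsHan}\x{3003}].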

Short Script Names

Every Unicode script has both a full name and a 4-letter code. Java, ICU, Perl, Ruby, and PCRE2 10.40 let you use the 4-letter codes in addition to the full names. So \p{sc=Cans} or \p{Cans} is short for \p{sc=Canadian_Aboriginal} or \p{Canadian_Aboriginal}. For a few scripts the 4-letter code is actually longer than the full name. \p{sc=Han} uses the script name Han while \p{sc=Hani} uses the equivalent 4-letter code Hani.
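A small Java sketch can confirm that full names and 4-letter codes resolve to the same script. It relies on Character.UnicodeScript.forName for the lookup; the class and helper names are illustrative.

```java
import java.lang.Character.UnicodeScript;
import java.util.regex.Pattern;

class ShortScriptNameDemo {
    // Resolves a full script name or a 4-letter code to Java's script enum.
    static UnicodeScript resolve(String name) {
        return UnicodeScript.forName(name);
    }

    // True if the regex matches a string consisting of just this one code point.
    static boolean matchesCodePoint(String regex, int codePoint) {
        return Pattern.matches(regex, new String(Character.toChars(codePoint)));
    }

    public static void main(String[] args) {
        System.out.println(resolve("Cans") == resolve("Canadian_Aboriginal")); // true
        System.out.println(resolve("Hani") == resolve("Han"));                 // true

        // The same aliases work inside a regex.
        System.out.println(matchesCodePoint("\\p{sc=Cans}", 0x166F));          // true
        System.out.println(matchesCodePoint("\\p{sc=Hani}", 0x4000));          // true
    }
}
```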

Identical Script and Block Names

Some Unicode scripts have the exact same names as Unicode blocks. But they do not match the same characters. \p{Block=Georgian} matches any code point from U+10A0 through U+10FF, including unassigned code points such as U+10C8 and U+10CF. \p{Script=Georgian} does not match those unassigned code points and does not match U+10FB because that is a punctuation character that is in the Common script. But the script does match all the letters in the block Georgian_Supplement.
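The difference between the Georgian block and the Georgian script can be illustrated with a Java sketch. Java's \p{Block=...} syntax is used here as a stand-in for the block notation above; the helper names are hypothetical, and U+10C8 is assumed to be unassigned in the runtime's Unicode version.

```java
import java.util.regex.Pattern;

class BlockVersusScriptDemo {
    // True if the regex matches a string consisting of just this one code point.
    static boolean matchesCodePoint(String regex, int codePoint) {
        return Pattern.matches(regex, new String(Character.toChars(codePoint)));
    }

    // Prints block and script membership for one code point.
    static void report(String label, int codePoint) {
        boolean inBlock = matchesCodePoint("\\p{Block=Georgian}", codePoint);
        boolean inScript = matchesCodePoint("\\p{Script=Georgian}", codePoint);
        System.out.println(label + ": block=" + inBlock + " script=" + inScript);
    }

    public static void main(String[] args) {
        report("U+10D0 letter in the Georgian block", 0x10D0);      // block and script
        report("U+10C8 unassigned (assumed)", 0x10C8);              // block only
        report("U+10FB punctuation in Common", 0x10FB);             // block only
        report("U+2D00 letter in Georgian_Supplement", 0x2D00);     // script only
    }
}
```

A block is a fixed code point range, so membership never depends on whether a code point is assigned. Script membership is a per-character property, which is why it follows the letters into the supplement block.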

The JGsoft and Perl regex flavors support both \p{Block_Name} and \p{Script_Name}. They also support both \p{IsBlock_Name} and \p{IsScript_Name}. When a name could refer to either a block or a script they always interpret it as a script name.

Full List of Script Names

Check the Unicode script reference for a complete list of all Unicode script names.
