| Unicode Regexes |
| Introduction |
| Astral Characters |
| Code Points and Graphemes |
| Unicode Categories |
| Unicode Scripts |
| Unicode Blocks |
| Unicode Binary Properties |
| Unicode Property Sets |
| Unicode Script Runs |
| Unicode Boundaries |
The Unicode standard places each assigned code point (character) into one script. A script is a group of code points used by a particular human writing system. Some scripts like Thai correspond with a single human language. Other scripts like Latin span multiple languages.
Some languages are composed of multiple scripts. There is no Japanese Unicode script. Instead, Unicode offers the Hiragana, Katakana, Han, and Latin scripts that Japanese documents are usually composed of.
ICU, Perl, Java 7 and later, and JavaScript with /u match one code point in a Unicode script with \p{sc=Script_Name} and \p{Script=Script_Name}. \p{Script=Canadian_Aboriginal} matches ᙯ (U+166F), for example. PCRE2 supports this notation starting with version 10.40. Everything said below about PCRE2 10.40 also applies to PHP as of version 8.2.0 and R as of version 4.2.2.
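As a rough illustration of this notation, the TypeScript sketch below uses the JavaScript flavor with the u flag. It tests one code point against a script and then tallies the scripts found in a short Japanese string. The helper name scriptOf and the sample text are made up for this example. The patterns are built with new RegExp only to keep the snippet independent of compiler settings.

```typescript
// Hypothetical helper for illustration: returns the first script in `candidates`
// whose \p{Script=...} class matches the single BMP character `ch`.
function scriptOf(ch: string, candidates: string[]): string {
  for (let i = 0; i < candidates.length; i++) {
    const re = new RegExp("^\\p{Script=" + candidates[i] + "}$", "u");
    if (re.test(ch)) return candidates[i];
  }
  return "other";
}

// The long and short property names are interchangeable.
const qaiLong = new RegExp("\\p{Script=Canadian_Aboriginal}", "u").test("\u166F");
const qaiShort = new RegExp("\\p{sc=Canadian_Aboriginal}", "u").test("\u166F");

// Japanese text mixes several Unicode scripts.
const japaneseScripts = ["Hiragana", "Katakana", "Han", "Latin"];
const sample = "日本語のテキストABC";
const tally: { [script: string]: number } = {};
const sampleChars = sample.split(""); // safe here: every character is in the BMP
for (let i = 0; i < sampleChars.length; i++) {
  const name = scriptOf(sampleChars[i], japaneseScripts);
  tally[name] = (tally[name] || 0) + 1;
}

console.log(qaiLong, qaiShort); // true true
console.log(tally);             // Han: 3, Hiragana: 1, Katakana: 4, Latin: 3
```

Splitting the string with split("") only works because the sample contains no astral characters. Text outside the BMP would need to be iterated by code point instead.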
Two special scripts are the Common and Inherited scripts. The Common script contains characters that are shared by a wide range of scripts. It includes punctuation, whitespace, and miscellaneous symbols. The Inherited script contains mostly combining characters that should take on the script of the base character they’re combined with during script analysis. But regex engines don’t do that. \p{Script=Latin} matches only the a in à encoded as U+0061 U+0300. To match both characters you’d need to use \p{Script=Latin}\p{Script=Inherited}. To match a string of Latin characters with combining diacritics you could use [\p{Script=Latin}\p{Script=Inherited}]+.
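The following TypeScript sketch shows the difference on the decomposed à. A character class that combines the Latin and Inherited scripts is one way to keep base letters and their combining marks together. The variable names are made up for this example.

```typescript
// "à" in decomposed form: U+0061 (Script=Latin) followed by U+0300 (Script=Inherited).
const decomposedA = "a\u0300";

const latinOnly = new RegExp("\\p{Script=Latin}+", "u");
const latinWithMarks = new RegExp("[\\p{Script=Latin}\\p{Script=Inherited}]+", "u");

const baseOnly = latinOnly.exec(decomposedA);
const withMark = latinWithMarks.exec(decomposedA);

// Length of each match in UTF-16 code units.
const baseOnlyLength = baseOnly ? baseOnly[0].length : 0;
const withMarkLength = withMark ? withMark[0].length : 0;

console.log(baseOnlyLength); // 1: only the letter a
console.log(withMarkLength); // 2: the letter and the combining grave accent
```

Note that the class does not check that a combining mark actually follows a Latin base letter. It simply accepts any run of characters from either script.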
All assigned Unicode code points (those matched by \P{Cn}) are part of exactly one Unicode script. All unassigned Unicode code points (those matched by \p{Cn}) are either not part of any Unicode script at all or are part of the Unknown script, depending on the implementation. All aforementioned flavors support \p{Script=Unknown}. Ruby 1.9 and PCRE2 10.33 support \p{Unknown}, a syntax we’ll discuss below.
Forcing each code point to be part of exactly one script does not work so well, particularly with tools like regex engines that don’t do script analysis. To alleviate this, Unicode 6.0.0 introduced the Script_Extensions property. This property uses the exact same script names as the Script property. All flavors that support \p{Script=Script_Name}, except Java, also support \p{scx=Script_Name} and \p{Script_Extensions=Script_Name}.
Every code point that has a Script property value other than Common or Inherited also has the same Script_Extensions property. So \p{Script_Extensions=Canadian_Aboriginal} also matches ᙯ (U+166F). A code point can have additional values for Script_Extensions. The Devanagari digit 9 ९ (U+096F) is matched by both \p{Script=Devanagari} and \p{Script_Extensions=Devanagari}. It is also matched by \p{Script_Extensions=Kaithi} (since Unicode 6.3.0), \p{Script_Extensions=Mahajani} (since Unicode 7.0.0), and \p{Script_Extensions=Dogra} (since Unicode 11.0.0). But it is never matched by \p{Script=Kaithi}, \p{Script=Mahajani}, or \p{Script=Dogra}.
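To see the two properties side by side, here is a small TypeScript sketch that checks the Devanagari digit 9 against four scripts. The helper matchesProperty is a hypothetical name introduced for this example. The results assume a JavaScript engine whose Unicode data is at version 11.0.0 or later, since older data does not know the Dogra script.

```typescript
// Hypothetical helper: does `ch` match \p{property=script} in the JavaScript flavor?
function matchesProperty(property: string, script: string, ch: string): boolean {
  return new RegExp("^\\p{" + property + "=" + script + "}$", "u").test(ch);
}

const devanagariNine = "\u096F"; // DEVANAGARI DIGIT NINE

const nineByScript: { [script: string]: { sc: boolean; scx: boolean } } = {};
const scriptsToCheck = ["Devanagari", "Kaithi", "Mahajani", "Dogra"];
for (let i = 0; i < scriptsToCheck.length; i++) {
  const name = scriptsToCheck[i];
  nineByScript[name] = {
    sc: matchesProperty("Script", name, devanagariNine),
    scx: matchesProperty("Script_Extensions", name, devanagariNine),
  };
}

// Devanagari: sc true, scx true. The other three: sc false, scx true.
console.log(nineByScript);
```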
Code points that have Script=Common or Script=Inherited either have the same value, and only that value, for Script_Extensions, or they have multiple values for Script_Extensions that do not include Common or Inherited. For example, the ASCII digit 9 is matched by both \p{Script=Common} and \p{Script_Extensions=Common}. It does not have any other values for Script_Extensions. The ditto mark 〃 (U+3003) is matched by \p{Script=Common} but not by \p{Script_Extensions=Common}. Instead, it is matched by \p{Script_Extensions=Bopomofo}, \p{Script_Extensions=Han}, \p{Script_Extensions=Hangul}, \p{Script_Extensions=Hiragana}, and \p{Script_Extensions=Katakana}.
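The two cases for Common code points can be checked with a short TypeScript sketch in the JavaScript flavor. The helper name commonCheck and the variable names are assumptions made for this example.

```typescript
// Hypothetical helper: test one character against \p{sc=...} or \p{scx=...}.
function commonCheck(shortProperty: string, script: string, ch: string): boolean {
  return new RegExp("^\\p{" + shortProperty + "=" + script + "}$", "u").test(ch);
}

const asciiNine = "9";
const dittoMark = "\u3003"; // DITTO MARK

// The ASCII digit has Common as its only Script_Extensions value.
const asciiNineScCommon = commonCheck("sc", "Common", asciiNine);   // true
const asciiNineScxCommon = commonCheck("scx", "Common", asciiNine); // true

// The ditto mark is Common by Script, but its Script_Extensions list real scripts instead.
const dittoScCommon = commonCheck("sc", "Common", dittoMark);       // true
const dittoScxCommon = commonCheck("scx", "Common", dittoMark);     // false
const dittoScxHan = commonCheck("scx", "Han", dittoMark);           // true
const dittoScxHiragana = commonCheck("scx", "Hiragana", dittoMark); // true

console.log(asciiNineScCommon, asciiNineScxCommon);
console.log(dittoScCommon, dittoScxCommon, dittoScxHan, dittoScxHiragana);
```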
PCRE2, unfortunately, does not correctly implement \p{Script_Extensions=Common} and \p{Script_Extensions=Inherited}. It treats them as equivalent to \p{Script=Common} and \p{Script=Inherited}. Thus, in PCRE2 10.40 and later, \p{Script_Extensions=Common} does match 〃 (U+3003).
The Unicode standard suggests that regular expression flavors should support \p{Script_Name}. But then the question becomes whether this should be based on the Script property or on the Script_Extensions property. Traditionally, it has been based on the Script property. This is the case in ICU, RE2, Ruby 1.9, PCRE 6.5, Delphi, and the JGsoft flavor. With these flavors, \p{Han} matches 䀀 (U+4000) but not 〃 (U+3003).
Other flavors changed their mind. Perl 5.26 and PCRE2 10.40 changed \p{Script_Name} to be based on the Script_Extensions property, while it was based on Script in older versions. So in Perl 5.26 and PCRE2 10.40, \p{Han} matches both 䀀 (U+4000) and 〃 (U+3003).
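When the meaning of the bare \p{Script_Name} syntax depends on the flavor and version, naming the property explicitly removes the ambiguity. The TypeScript sketch below shows both explicit forms for Han. It also shows that the JavaScript flavor rejects the bare \p{Han} form altogether, because it only accepts general categories and binary properties without a property name. The variable names are made up for this example.

```typescript
const hanIdeograph = "\u4000"; // a CJK ideograph with Script=Han
const dittoSign = "\u3003";    // DITTO MARK, Script=Common

const hanByScript = new RegExp("^\\p{Script=Han}$", "u");
const hanByExtensions = new RegExp("^\\p{Script_Extensions=Han}$", "u");

// Equivalent to the traditional \p{Han} behavior described above.
const scriptResults = [hanByScript.test(hanIdeograph), hanByScript.test(dittoSign)];
// Equivalent to \p{Han} in Perl 5.26 and PCRE2 10.40.
const extensionResults = [hanByExtensions.test(hanIdeograph), hanByExtensions.test(dittoSign)];

// The bare form is a syntax error in JavaScript with the u flag.
let bareFormAccepted = true;
try {
  new RegExp("\\p{Han}", "u");
} catch (e) {
  bareFormAccepted = false;
}

console.log(scriptResults);    // [true, false]
console.log(extensionResults); // [true, true]
console.log(bareFormAccepted); // false
```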
Java supports \p{IsScript_Name} with an extra Is prefix and implements it using the Script property. In Java, \p{IsHan} matches 䀀 (U+4000) but not 〃 (U+3003). Perl and the JGsoft applications support this syntax too. Perl 5.26 changes the implementation from Script to Script_Extensions. ICU 63 and later support this syntax based on Script_Extensions. ICU 62 and prior did not allow the Is prefix.
Every Unicode script has both a full name and a 4-letter code. Java, ICU, Perl, Ruby, and PCRE2 10.40 let you use the 4-letter codes in addition to the full names. So \p{sc=Cans} or \p{Cans} is short for \p{sc=Canadian_Aboriginal} or \p{Canadian_Aboriginal}. For a few scripts the 4-letter code is actually longer than the full name. \p{sc=Han} uses the script name Han while \p{sc=Hani} uses the equivalent 4-letter code Hani.
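The JavaScript flavor also accepts the 4-letter codes after sc= and scx=. A brief TypeScript sketch, with variable names invented for the example:

```typescript
const canadianQai = "\u166F";
const hanSample = "\u4000";

// Full name and 4-letter code select the same script.
const cansFullName = new RegExp("^\\p{sc=Canadian_Aboriginal}$", "u").test(canadianQai);
const cansCode = new RegExp("^\\p{sc=Cans}$", "u").test(canadianQai);

// For Han, the "short" code Hani is longer than the full name.
const hanFullName = new RegExp("^\\p{sc=Han}$", "u").test(hanSample);
const hanCode = new RegExp("^\\p{sc=Hani}$", "u").test(hanSample);

console.log(cansFullName, cansCode); // true true
console.log(hanFullName, hanCode);   // true true
```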
Some Unicode scripts have the exact same names as Unicode blocks. But they do not match the same characters. \p{Block=Georgian} matches any code point between U+10A0 and U+10FF, including unassigned code points such as U+10C8 and U+10CF. \p{Script=Georgian} does not match those unassigned code points and does not match U+10FB because that is a punctuation character that is in the Common script. But the script does match all the letters in the block Georgian_Supplement.
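The JavaScript flavor has no \p{Block=...} syntax, so the TypeScript sketch below stands in for the Georgian block with the explicit range [\u10A0-\u10FF] and compares it with \p{Script=Georgian}. The range and the variable names are assumptions made for this example, not a block property. U+2D00 is used as a sample letter from the Georgian Supplement block.

```typescript
// Stand-in for the Georgian block: the plain code point range U+10A0..U+10FF.
const georgianBlockRange = new RegExp("^[\\u10A0-\\u10FF]$", "u");
const georgianScript = new RegExp("^\\p{Script=Georgian}$", "u");

const georgianSamples: { [label: string]: string } = {
  letter: "\u10D0",           // GEORGIAN LETTER AN
  unassigned: "\u10C8",       // no character assigned here
  punctuation: "\u10FB",      // GEORGIAN PARAGRAPH SEPARATOR, Script=Common
  supplementLetter: "\u2D00", // GEORGIAN SMALL LETTER AN, outside the range
};

const georgianResults: { [label: string]: { inRange: boolean; inScript: boolean } } = {};
const sampleLabels = Object.keys(georgianSamples);
for (let i = 0; i < sampleLabels.length; i++) {
  const label = sampleLabels[i];
  georgianResults[label] = {
    inRange: georgianBlockRange.test(georgianSamples[label]),
    inScript: georgianScript.test(georgianSamples[label]),
  };
}

// letter: both true; unassigned and punctuation: range only; supplementLetter: script only.
console.log(georgianResults);
```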
The JGsoft and Perl regex flavors support both \p{Block_Name} and \p{Script_Name}. They also support both \p{IsBlock_Name} and \p{IsScript_Name}. When a name could refer to either a block or a script they always interpret it as a script name.
Check the Unicode script reference for a complete list of all Unicode script names.
Page URL: https://www.regular-expressions.info/unicodescript.html
Page last updated: 19 June 2025
Site last updated: 29 October 2025
Copyright © 2003-2025 Jan Goyvaerts. All rights reserved.