| Unicode Regexes |
| Introduction |
| Astral Characters |
| Code Points and Graphemes |
| Unicode Categories |
| Unicode Scripts |
| Unicode Blocks |
| Unicode Binary Properties |
| Unicode Property Sets |
| Unicode Script Runs |
| Unicode Boundaries |
The topic about Unicode scripts explains how Unicode assigns each character to a script and how we can match characters from a specific script with a regular expression. \p{Script=Greek}+ matches a run of Greek characters. This works fine if we only have one script to deal with.
But what if we want to match two words separated by whitespace and both words must be from the same script? Using the script property, we'd have to add a separate alternative to the regex for each script that we want to support. That’s unwieldy.
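To see how quickly the per-script approach grows, here is a small Python sketch that only builds the pattern text. The function name and the list of scripts are our own choices for illustration. Python's built-in re module does not support \p{Script=...}, so the resulting string is meant for an engine such as Perl or PCRE2.

```python
def two_words_same_script(scripts):
    """Build one alternative per script: two words of that script separated by whitespace."""
    alternatives = [
        rf"\p{{Script={name}}}+\s+\p{{Script={name}}}+" for name in scripts
    ]
    return "(?:" + "|".join(alternatives) + ")"

# Three scripts already produce a long pattern; Unicode defines well over a hundred.
pattern = two_words_same_script(["Greek", "Cyrillic", "Latin"])
print(pattern)
```

Every additional script adds another full copy of the two-word pattern, which is the unwieldiness that script runs avoid.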
To make this easier, Perl 5.28 introduced a new regular expression concept called a script run. PCRE2 10.33 adopted this feature, which in turn brought it to PHP 7.4.0 and R 4.0.0.
A regular expression script run is a special group that essentially forces its contents to match characters that are all from the same Unicode script. (*sr:regex) and (*script_run:regex) are synonyms and regex represents any regular expression. (*script_run:\w+\s+\w+) matches two words separated by whitespace. The words and the whitespace must all be part of a single script according to the following rules.
When the regex engine encounters the script run, it first tries to match regex normally, without any regard to Unicode scripts. If the regex fails then the script run fails, just like a non-capturing group would. If the regex finds a zero-length match or if it matches a single character then the script run accepts that as a valid match, even if that single character is an unassigned or private use code point.
The script run applies its extra logic only when the regex matches two or more characters. If any one of those characters can be matched by \p{Script=Unknown} then the script run fails. This means a script run cannot contain unassigned code points, private use code points, or surrogates.
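The length threshold and the first rule can be approximated in Python with the standard unicodedata module. This is only a sketch under an assumption: Python has no Script property, so we treat the general categories Cn (unassigned), Co (private use), and Cs (surrogate) as a stand-in for \p{Script=Unknown}. The function name is ours, and the result depends on the Unicode version that ships with your Python.

```python
import unicodedata

# Assumption: these general categories stand in for \p{Script=Unknown}.
# Cn = unassigned (including noncharacters), Co = private use, Cs = surrogate.
UNKNOWN_CATEGORIES = {"Cn", "Co", "Cs"}

def passes_unknown_rule(matched):
    """Return True if the matched text survives the first script run rule."""
    # Zero-length and single-character matches are accepted without any checks.
    if len(matched) < 2:
        return True
    return not any(unicodedata.category(ch) in UNKNOWN_CATEGORIES for ch in matched)

print(passes_unknown_rule("\ue000"))   # a single private use character
print(passes_unknown_rule("a\ue000"))  # two characters, one of them private use
```

The first call passes because a single character is never checked. The second fails because the rule does apply to a match of two characters.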
If the first rule is met then the script run checks the Script_Extensions property of each character. We’ll use the abbreviation scx here. Characters that can be matched by [\p{scx=Common}\p{scx=Inherited}] are ignored. This class matches most whitespace and punctuation, so these don’t influence script runs. If it matches all characters then the second rule is satisfied. All other characters need to share at least one value for Script_Extensions.
(ЖЯ) satisfies this rule because the ASCII parentheses are matched by \p{scx=Common} and the letters are matched by \p{scx=Cyrillic}. Ж᠅ fails it because the first character has Cyrillic as its only scx value while the second has Mongolian and Phags_Pa as its two scx values. Remember that script runs are based on Script_Extensions, not on Script. The fact that \p{Script=Common} matches ᠅ is irrelevant because the scx values override it. ᠠ᠅ is a valid script run. Both characters can be matched by \p{scx=Mongolian}, and one shared value for scx is sufficient. The fact that \p{scx=Phags_Pa} matches only the second character is irrelevant.
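The second rule boils down to a set intersection. The Python sketch below illustrates it with a tiny hand-written table holding only the scx values quoted above. The table and the function name are assumptions made for this example. Python's standard library does not expose Script_Extensions, so a real implementation would need the full Unicode data.

```python
# Hand-copied scx values for the characters discussed above (not a complete table).
SCX = {
    "(": {"Common"},
    ")": {"Common"},
    "\u0416": {"Cyrillic"},               # Ж
    "\u042f": {"Cyrillic"},               # Я
    "\u1820": {"Mongolian"},              # Mongolian letter A
    "\u1805": {"Mongolian", "Phags_Pa"},  # Mongolian four dots
}
IGNORED = {"Common", "Inherited"}

def shares_a_script(text):
    """Return True if all non-ignored characters have at least one scx value in common."""
    shared = None
    for ch in text:
        scripts = SCX[ch]
        if scripts & IGNORED:
            continue  # Common and Inherited characters don't influence the run
        shared = set(scripts) if shared is None else shared & scripts
        if not shared:
            return False
    return True

print(shares_a_script("(\u0416\u042f)"))  # parentheses ignored, letters share Cyrillic
print(shares_a_script("\u0416\u1805"))    # Cyrillic versus Mongolian/Phags_Pa
print(shares_a_script("\u1820\u1805"))    # both characters have Mongolian
```

Only the middle call fails, matching the three examples in the text.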
Perl and PCRE2 take Unicode Technical Standard 39 section 5.1 into account when determining the set of scx values for each character. This mechanism adds 3 new values to the scx property for certain characters. These values do not otherwise exist in Unicode. Hanb (Han with Bopomofo) is added to characters that have Han and/or Bopomofo among their scx values. Jpan (Japanese) is added when scx contains Han, Hiragana, and/or Katakana. Kore (Korean) is added when scx contains Han and/or Hangul. ねガ is a valid script run. Though the first character only has scx=Hiragana and the second only has scx=Katakana, UTS 39 5.1 adds scx=Jpan to both characters. Thus they have Jpan in common and form a valid script run.
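The augmentation described above can be sketched as a small Python function that takes a character's scx values as a set and adds the three extra values. The function name is ours, and this only mirrors the description in this topic. It is not the code that Perl or PCRE2 actually run.

```python
def augment(scx):
    """Add the UTS 39 section 5.1 values Hanb, Jpan, and Kore to a set of scx values."""
    out = set(scx)
    if scx & {"Han", "Bopomofo"}:
        out.add("Hanb")
    if scx & {"Han", "Hiragana", "Katakana"}:
        out.add("Jpan")
    if scx & {"Han", "Hangul"}:
        out.add("Kore")
    return out

hiragana_ne = augment({"Hiragana"})  # scx values of ね after augmentation
katakana_ga = augment({"Katakana"})  # scx values of ガ after augmentation
print(hiragana_ne & katakana_ga)     # the values the two characters now share
```

Before augmentation the two sets have nothing in common. Afterwards they share Jpan, which is enough for the second rule.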
The final rule applies if any of the characters in the run are decimal digits. These characters are in the category \p{Nd}. Unicode defines all such characters in ranges of 10 consecutive code points, one for each digit from 0 to 9. The ASCII digits 0 to 9 occupy code points U+0030 to U+0039. The mathematical double-struck digits 𝟘 to 𝟡 occupy code points U+1D7D8 to U+1D7E1. All of these are matched by \p{scx=Common}. Yet 123𝟙𝟚𝟛 is not a valid script run because a script run can only contain digits from one contiguous range of 10 code points. Even 𝟗𝟘 is not valid. Though these are adjacent code points U+1D7D7 and U+1D7D8, they are not part of the same range of 10 digits. The mathematical bold digits 𝟎 to 𝟗 form a separate range of 10 digits from U+1D7CE to U+1D7D7.
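Because every range of decimal digits is 10 consecutive code points, we can find the start of a digit's range by subtracting its numeric value from its code point. The Python sketch below uses that to approximate the digit rule with the standard unicodedata module. The function name is ours, and the digit data comes from whichever Unicode version your Python ships with.

```python
import unicodedata

def digits_from_one_range(text):
    """Return True if all decimal digits (category Nd) in text share one range of 10."""
    bases = set()
    for ch in text:
        if unicodedata.category(ch) == "Nd":
            # The range starts at the digit's code point minus its numeric value.
            bases.add(ord(ch) - unicodedata.decimal(ch))
    return len(bases) <= 1

print(digits_from_one_range("123"))                               # ASCII digits only
print(digits_from_one_range("123\U0001D7D9\U0001D7DA\U0001D7DB"))  # ASCII plus double-struck
print(digits_from_one_range("\U0001D7D7\U0001D7D8"))               # bold 9 plus double-struck 0
```

Only the first call reports a single range. The last one fails even though the two code points are adjacent, because their ranges start at U+1D7CE and U+1D7D8.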
If the regex engine determines that the text matched by the regex inside (*script_run:regex) is not a valid script run then the group itself fails. The contents of the group are allowed to backtrack, though. When (*script_run:\d+) is applied to the string 123𝟙𝟚𝟛, \d+ initially matches the whole string. But this is not a valid script run. The group fails, but \d+ can backtrack. It gives up one character. 123𝟙𝟚 is still not a valid script run. Backtracking two more times, \d+ reduces its match to 123. This is a valid script run. The group matches and an overall regex match is found.
This may not be desirable. If you want to check whether a run of digits is from a single range of 10 digits then you may not want the regex to split a run of 6 digits into two runs of 3 digits. In this simple case we could solve this by making the quantifier possessive: (*script_run:\d++). But when the regex inside the group is more complex, we can make the script run itself enforce this by making it atomic: (*atomic_script_run:\d+) or its synonym (*asr:\d+). Now, when \d+ has matched all 6 characters, the atomic script run first throws away all backtracking positions that \d+ stored. Then, it checks whether the match is a valid script run. It is not, so the group fails without any backtracking.
The regex can still advance through the string, however. The whole process repeats starting at the second character in the string, where 23𝟙𝟚𝟛 is found not to be a valid script run. Advancing one more character, 3𝟙𝟚𝟛 also fails the test. But then, advancing one more character, 𝟙𝟚𝟛 is matched successfully. If we want to check whether a string consists entirely of digits from the same range then we need to add anchors: ^(*script_run:\d+)$.
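The difference between the two kinds of group can be simulated in Python. This is a simplified model that only covers \d+ and the digit rule. It is not how Perl or PCRE2 implement script runs, and all function names are ours. The non-atomic variant shortens the greedy match one character at a time. The atomic variant only ever tests the greedy match and then moves on to the next starting position.

```python
import unicodedata

def valid_digit_run(text):
    """Digit rule for a string of Nd characters; matches shorter than 2 always pass."""
    if len(text) < 2:
        return True
    bases = {ord(ch) - unicodedata.decimal(ch) for ch in text}
    return len(bases) == 1

def digit_run_end(text, start):
    """Greedy match: index just past the longest run of Nd characters from start."""
    end = start
    while end < len(text) and unicodedata.category(text[end]) == "Nd":
        end += 1
    return end

def find_run(text, atomic):
    """Simulate (*script_run:\\d+) or, with atomic=True, (*atomic_script_run:\\d+)."""
    for start in range(len(text)):
        longest = digit_run_end(text, start)
        # An atomic group only tests the greedy match; otherwise the match may shrink.
        shortest = longest if atomic else start + 1
        for end in range(longest, shortest - 1, -1):
            if end > start and valid_digit_run(text[start:end]):
                return text[start:end]
    return None

subject = "123\U0001D7D9\U0001D7DA\U0001D7DB"  # 123 followed by double-struck 1, 2, 3
print(find_run(subject, atomic=False))  # backtracks down to the ASCII digits
print(find_run(subject, atomic=True))   # fails at offsets 0, 1, 2, then matches at 3
```

The first call returns the three ASCII digits and the second returns the three double-struck digits, as in the walkthrough above.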
For the same reasons, \b(*atomic_script_run:\w+\s+\w+) is a better solution for finding two words from the same script delimited by whitespace. The word boundary at the start makes sure we don’t begin the match in the middle of a word. The atomic script run ensures the second \w+ can’t give up part of its match when the second word mixes scripts. We could have used possessive quantifiers here too. \b(*script_run:\w++\s++\w++) produces the same result. So does \b(*atomic_script_run:\w++\s++\w++).
Detecting homographs in internationalized domain names is a real-world example where script runs are an essential feature.
Page URL: https://www.regular-expressions.info/unicodescriptrun.html
Page last updated: 19 June 2025
Site last updated: 09 January 2026
Copyright © 2003-2026 Jan Goyvaerts. All rights reserved.