Unicode Categories

Each Unicode character belongs to a certain category. Unicode categories, or “general categories” as they’re called by the Unicode standard, are the most fundamental Unicode property. Every regex flavor that supports Unicode properties at all supports Unicode categories. That includes .NET, Java, ICU, JavaScript with /u, Ruby, JGsoft, Perl, PCRE, and PCRE2. What is said below about PCRE and PCRE2 also applies to PHP, R, and Delphi.

All these flavors support the \p{Property} syntax, with the property being one or two letters representing the category. You can match a single character belonging to the “letter” category with \p{L}. You can match a single character not belonging to that category with \P{L}. \p{Ll} matches a lowercase letter while \P{Ll} matches any character that is not a lowercase letter.
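As a minimal sketch of the syntax above, the following Java program tests single characters against \p{L}, \P{L}, \p{Ll}, and \P{Ll}. The class name CategoryBasics and the helper matches are made up for this illustration; only java.util.regex.Pattern comes from the Java library.

```java
import java.util.regex.Pattern;

class CategoryBasics {
    // Hypothetical helper: true if the entire input is matched by the regex.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        System.out.println(matches("\\p{L}", "a"));   // true: "a" is a letter
        System.out.println(matches("\\p{L}", "1"));   // false: "1" is a digit
        System.out.println(matches("\\P{L}", "1"));   // true: "1" is not a letter
        System.out.println(matches("\\p{Ll}", "a"));  // true: lowercase letter
        System.out.println(matches("\\p{Ll}", "A"));  // false: uppercase letter
        System.out.println(matches("\\P{Ll}", "A"));  // true: not a lowercase letter
    }
}
```

Note that the backslash is doubled only because the regex is written as a Java string literal. The regex itself is still \p{L}.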

Again, “character” really means “Unicode code point”. \p{L} matches a single code point in the category “letter”. If your input string is à encoded as U+0061 U+0300 then it matches a without the accent. If the input is à encoded as U+00E0 then it matches à with the accent. The reason is that both the code points U+0061 (a) and U+00E0 (à) are in the category “letter”, while U+0300 is in the category “mark”.
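The difference between the two encodings of à can be checked with a short Java sketch. The class LetterVersusMark and its method firstLetter are hypothetical names chosen for this example. It searches each string for the first match of \p{L} and also confirms that U+0300 on its own is matched by \p{M}.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LetterVersusMark {
    // Hypothetical helper: first match of \p{L} in the input, or null if none.
    static String firstLetter(String input) {
        Matcher m = Pattern.compile("\\p{L}").matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String decomposed = "a\u0300";   // U+0061 followed by U+0300
        String precomposed = "\u00E0";   // the single code point U+00E0

        // The match on the decomposed string is one code point long: the bare "a".
        System.out.println(firstLetter(decomposed).length());           // 1
        System.out.println(firstLetter(decomposed).equals("a"));        // true

        // The precomposed string is matched whole, accent included.
        System.out.println(firstLetter(precomposed).equals("\u00E0"));  // true

        // The combining grave accent is a mark, not a letter.
        System.out.println(Pattern.matches("\\p{M}", "\u0300"));        // true
        System.out.println(Pattern.matches("\\p{L}", "\u0300"));        // false
    }
}
```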

ICU, Perl, Ruby, JavaScript, and the JGsoft applications allow you to spell out the full category names, such as \p{Letter} or \p{Lowercase_Letter}.

PCRE and .NET are case sensitive for the category letters. \p{Zs} will match any kind of space character, while \p{zs} will throw an error. The first letter needs to be uppercase and the second letter, if used, needs to be lowercase. PCRE2 was case sensitive originally, but became case insensitive with version 10.40. PHP became case insensitive with version 8.2.0 and R with version 4.2.2. All other regex engines described in this tutorial ignore the case of the category between the curly braces. \p{zs}, \p{ZS}, and \p{zS} all match a single space separator. Still, it’s best to stick with the capitalization required by the case sensitive flavors. It is how the category letters are defined in Unicode, and it will make your regular expressions work with all Unicode regex engines.

Java, ICU, Perl, and JavaScript allow you to use the full property set syntax for categories. \p{gc=Lu} matches an uppercase letter just like \p{Lu}. Except for Java, these flavors also support the long form \p{General_Category=Uppercase_Letter}.
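A small Java sketch can illustrate that \p{gc=Lu} and \p{Lu} behave the same. The class GeneralCategorySyntax and its methods are hypothetical names for this example. It only exercises the short gc= form, since that is the form described above for Java.

```java
import java.util.regex.Pattern;

class GeneralCategorySyntax {
    // Hypothetical helper: true if the input is a single uppercase letter
    // according to the property set syntax \p{gc=Lu}.
    static boolean isUpperViaGc(String input) {
        return Pattern.matches("\\p{gc=Lu}", input);
    }

    // Hypothetical helper: the same test using the plain category syntax \p{Lu}.
    static boolean isUpperViaCategory(String input) {
        return Pattern.matches("\\p{Lu}", input);
    }

    public static void main(String[] args) {
        for (String s : new String[] {"A", "a", "\u00C9", "1"}) {
            // Both columns should always agree.
            System.out.println(isUpperViaGc(s) + " " + isUpperViaCategory(s));
        }
    }
}
```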

In addition to the standard notation with curly braces, \p{L}, Java, Perl, PCRE, PCRE2, and the JGsoft engine allow you to use the shorthand \pL without curly braces. The shorthand only works with single-letter Unicode categories. \pLl is not the equivalent of \p{Ll}. It is the equivalent of \p{L}l which matches Al or àl or any Unicode letter followed by a literal l.
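The pitfall with \pLl is easy to reproduce. The following Java sketch, using a made-up class name ShorthandPitfall, compares \pL, \pLl, and \p{Ll} against a few inputs.

```java
import java.util.regex.Pattern;

class ShorthandPitfall {
    // Hypothetical helper: true if the entire input is matched by the regex.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // \pL is the shorthand for \p{L}.
        System.out.println(matches("\\pL", "a"));        // true

        // \pLl is parsed as \p{L} followed by a literal "l".
        System.out.println(matches("\\pLl", "Al"));      // true
        System.out.println(matches("\\pLl", "\u00E0l")); // true
        System.out.println(matches("\\pLl", "a"));       // false: no trailing "l"

        // \p{Ll} is what actually matches one lowercase letter.
        System.out.println(matches("\\p{Ll}", "a"));     // true
        System.out.println(matches("\\p{Ll}", "Al"));    // false: two characters
    }
}
```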

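Unicode defines the general categories listed below. Every code point is part of exactly one two-letter category. Single-letter categories include all the characters of all two-letter categories that start with the same letter.

L (letter): Lu (uppercase letter), Ll (lowercase letter), Lt (titlecase letter), Lm (modifier letter), Lo (other letter)
M (mark): Mn (non-spacing mark), Mc (spacing combining mark), Me (enclosing mark)
N (number): Nd (decimal digit number), Nl (letter number), No (other number)
P (punctuation): Pc (connector punctuation), Pd (dash punctuation), Ps (open punctuation), Pe (close punctuation), Pi (initial punctuation), Pf (final punctuation), Po (other punctuation)
S (symbol): Sm (math symbol), Sc (currency symbol), Sk (modifier symbol), So (other symbol)
Z (separator): Zs (space separator), Zl (line separator), Zp (paragraph separator)
C (other): Cc (control), Cf (format), Cs (surrogate), Co (private use), Cn (unassigned)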
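The relationship between single-letter and two-letter categories can be checked mechanically. The Java sketch below, with the hypothetical class name CategoryHierarchy, walks every code point and counts how often \p{L} disagrees with a character class that combines the five two-letter letter categories. The exact set of letters depends on the Unicode version of the Java runtime, but the two patterns are expected to agree everywhere.

```java
import java.util.regex.Pattern;

class CategoryHierarchy {
    // Counts the code points on which \p{L} and the union of
    // Lu, Ll, Lt, Lm, and Lo give different answers.
    static int countDisagreements() {
        Pattern single = Pattern.compile("\\p{L}");
        Pattern union = Pattern.compile("[\\p{Lu}\\p{Ll}\\p{Lt}\\p{Lm}\\p{Lo}]");
        int disagreements = 0;
        for (int cp = 0; cp <= Character.MAX_CODE_POINT; cp++) {
            // Build a string holding exactly this one code point.
            String s = new String(Character.toChars(cp));
            if (single.matcher(s).matches() != union.matcher(s).matches()) {
                disagreements++;
            }
        }
        return disagreements;
    }

    public static void main(String[] args) {
        System.out.println("Disagreements: " + countDisagreements());
    }
}
```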

The categories themselves are the same in every version of Unicode. But code points can be moved between categories with each new version of Unicode. Code points for new characters are always moved from the Unassigned category to another category. But previously assigned code points can also be moved. The Georgian letters U+10D0–U+10FA, for example, were originally in the lowercase letter category. Unicode 3.0.0 moved them to the “other letter” category because they didn’t have uppercase equivalents. Unicode 11.0.0 added uppercase equivalents of these letters and moved the original letters back to the lowercase letter category.
