Quick Start
Tutorial
Tools & Languages
Examples
Reference
Book Reviews
RegexBuddy Easily create and understand regular expressions today.
Compose and analyze regex patterns with RegexBuddy's easy-to-grasp regex blocks and intuitive regex tree, instead of or in combination with the traditional regex syntax. Developed by the author of this website, RegexBuddy makes learning and using regular expressions easier than ever. Get your own copy of RegexBuddy now

POSIX Bracket Expressions

POSIX bracket expressions are a special kind of character classes. POSIX bracket expressions match one character out of a set of characters, just like regular character classes. They use the same syntax with square brackets. A hyphen creates a range, and a caret at the start negates the bracket expression.

One key syntactic difference is that the backslash is NOT a metacharacter in a POSIX bracket expression. So in POSIX, the regular expression [\d] matches a \ or a d. To match a ], put it as the first character after the opening [ or the negating ^. To match a -, put it right before the closing ]. To match a ^, put it before the final literal - or the closing ]. Put together, []\d^-] matches ], \, d, ^ or -.

The main purpose of bracket expressions is that they adapt to the user's or application's locale. A locale is a collection of rules and settings that describe language and cultural conventions, like sort order, date format, etc. The POSIX standard defines these locales.

Generally, only POSIX-compliant regular expression engines have proper and full support for POSIX bracket expressions. Some non-POSIX regex engines support POSIX character classes, but usually don't support collating sequences and character equivalents. Regular expression engines that support Unicode use Unicode properties and scripts to provide functionality similar to POSIX bracket expressions. In Unicode regex engines, shorthand character classes like \w normally match all relevant Unicode characters, alleviating the need to use locales.

Character Classes

Don't confuse the POSIX term "character class" with what is normally called a regular expression character class. [x-z0-9] is an example of what this tutorial calls a "character class" and what POSIX calls a "bracket expression". [:digit:] is a POSIX character class, used inside a bracket expression like [x-z[:digit:]]. The POSIX character class names must be written all lowercase.

When used on ASCII strings, these two regular expressions find exactly the same matches: a single character that is either x, y, z, or a digit. When used on strings with non-ASCII characters, the [:digit:] class may include digits in other scripts, depending on the locale.

The POSIX standard defines 12 character classes. The table below lists all 12, plus the [:ascii:] and [:word:] classes that some regex flavors support. The table also shows equivalent character classes that you can use in ASCII and Unicode regular expressions if the POSIX classes are unavailable. The ASCII equivalents correspond exactly what is defined in the POSIX standard. The Unicode equivalents correspond to what most Unicode regex engines match. The POSIX standard does not define a Unicode locale. Some classes also have Perl-style shorthand equivalents.

Java does not support POSIX bracket expressions, but does support POSIX character classes using the \p operator. Though the \p syntax is borrowed from the syntax for Unicode properties, the POSIX classes in Java only match ASCII characters as indicated below. The class names are case sensitive. Unlike the POSIX syntax which can only be used inside a bracket expression, Java's \p can be used inside and outside bracket expressions.

POSIXDescriptionASCIIUnicodeShorthandJava
[:alnum:] Alphanumeric characters [a-zA-Z0-9] [\p{L&}\p{Nd}] \p{Alnum}
[:alpha:] Alphabetic characters [a-zA-Z] \p{L&} \p{Alpha}
[:ascii:] ASCII characters [\x00-\x7F] \p{InBasicLatin} \p{ASCII}
[:blank:] Space and tab [ \t] [\p{Zs}\t] \h \p{Blank}
[:cntrl:] Control characters [\x00-\x1F\x7F] \p{Cc} \p{Cntrl}
[:digit:] Digits [0-9] \p{Nd} \d \p{Digit}
[:graph:] Visible characters (i.e. anything except spaces, control characters, etc.) [\x21-\x7E] [^\p{Z}\p{C}] \p{Graph}
[:lower:] Lowercase letters [a-z] \p{Ll} \p{Lower}
[:print:] Visible characters and spaces (i.e. anything except control characters, etc.) [\x20-\x7E] \P{C} \p{Print}
[:punct:] Punctuation and symbols. [!"#$%&'()*+,
 \-./:;<=>?@
 [\\\]^_`{|}~]
[\p{P}\p{S}] \p{Punct}
[:space:] All whitespace characters, including line breaks [ \t\r\n\v\f] [\p{Z}\t\r\n\v\f] \s \p{Space}
[:upper:] Uppercase letters [A-Z] \p{Lu} \p{Upper}
[:word:] Word characters (letters, numbers and underscores) [A-Za-z0-9_] [\p{L}\p{N}\p{Pc}] \w
[:xdigit:] Hexadecimal digits [A-Fa-f0-9] [A-Fa-f0-9] \p{XDigit}

Collating Sequences

A POSIX locale can have collating sequences to describe how certain characters or groups of characters should be ordered. In Spanish, for example, ll as in tortilla is treated as one character, and is ordered between l and m in the alphabet. You can use the collating sequence element [.span-ll.] inside a bracket expression to match ll. The regex torti[[.span-ll.]]a matches tortilla. Notice the double square brackets. One pair for the bracket expression, and one pair for the collating sequence.

Other than POSIX-compliant engines part of a POSIX-compliant system, none of the regex flavors discussed in this tutorial support collating sequences.

Note that a fully POSIX-compliant regex engine treats ll as a single character when the locale is set to Spanish. This means that torti[^x]a also matches tortilla. [^x] matches a single character that is not an x, which includes ll in the Spanish POSIX locale.

In any other regular expression engine, or in a POSIX engine not using the Spanish locale, torti[^x]a matches the misspelled word tortila but not tortilla, as [^x] cannot match the two characters ll.

Finally, note that not all regex engines claiming to implement POSIX regular expressions actually have full support for collating sequences. Sometimes, these engines use the regular expression syntax defined by POSIX, but don't have full locale support. You may want to try the above matches to see if the engine you're using does. Tcl's regexp command, for example, supports the syntax for collating sequences. But Tcl only supports the Unicode locale, which does not define any collating sequences. The result is that in Tcl, a collating sequence specifying a single character matches just that character. All other collating sequences result in an error.

Character Equivalents

A POSIX locale can define character equivalents that indicate that certain characters should be considered as identical for sorting. In French, for example, accents are ignored when ordering words. élève comes before être which comes before événement. é and ê are all the same as e, but l comes before t which comes before v. With the locale set to French, a POSIX-compliant regular expression engine matches e, é, è and ê when you use the collating sequence [=e=] in the bracket expression [[=e=]].

If a character does not have any equivalents, the character equivalence token simply reverts to the character itself. [[=x=][=z=]], for example, is the same as [xz] in the French locale.

Like collating sequences, POSIX character equivalents are not available in any regex engine discussed in this tutorial, other than those following the POSIX standard. And those that do may not have the necessary POSIX locale support. Here too Tcl's regexp command supports the syntax for character equivalents. But the Unicode locale, the only one Tcl supports, does not define any character equivalents. This effectively means that [[=e=]] and [e] are exactly the same in Tcl, and only match e, for any character you may try instead of "e".

Make a Donation

Did this website just save you a trip to the bookstore? Please make a donation to support this site, and you'll get a lifetime of advertisement-free access to this site!

Regex Tutorial
Introduction
Table of Contents
Special Characters
Non-Printable Characters
Regex Engine Internals
Character Classes
Character Class Subtraction
Character Class Intersection
Shorthand Character Classes
Dot
Anchors
Word Boundaries
Alternation
Optional Items
Repetition
Grouping & Capturing
Backreferences
Backreferences, part 2
Named Groups
Branch Reset Groups
Free-Spacing & Comments
Unicode
Mode Modifiers
Atomic Grouping
Possessive Quantifiers
Lookahead & Lookbehind
Lookaround, part 2
Keep Text out of The Match
Conditionals
Balancing Groups
Recursion
Subroutines
Recursion & Capturing
Recursion & Backreferences
Recursion & Backtracking
POSIX Bracket Expressions
Zero-Length Matches
Continuing Matches
More on This Site
Introduction
Regular Expressions Quick Start
Regular Expressions Tutorial
Replacement Strings Tutorial
Applications and Languages
Regular Expressions Examples
Regular Expressions Reference
Replacement Strings Reference
Book Reviews
Printable PDF
About This Site
RSS Feed & Blog
PowerGREP 4
PowerGREP PowerGREP is probably the most powerful regex-based text processing tool available today. A knowledge worker's Swiss army knife for searching through, extracting information from, and updating piles of files.
Use regular expressions to search through large numbers of text and binary files. Quickly find the files you are looking for, or extract the information you need. Look through just a handful of files or folders, or scan entire drives and network shares.
Search and replace using text, binary data or one or more regular expressions to automate repetitive editing tasks. Preview replacements before modifying files, and stay safe with flexible backup and undo options.
Use regular expressions to rename files, copy files, or merge and split the contents of files. Work with plain text files, Unicode files, binary files, compressed files, and files in proprietary formats such as MS Office, OpenOffice, and PDF. Runs on Windows 2000, XP, Vista, 7, 8, and 8.1.
More information
Download PowerGREP now