|Easily create and understand regular expressions today. |
Compose and analyze regex patterns with RegexBuddy's easy-to-grasp regex blocks and intuitive regex tree, instead of or in combination with the traditional regex syntax. Developed by the author of this website, RegexBuddy makes learning and using regular expressions easier than ever. Get your own copy of RegexBuddy now
With a "character class", also called "character set", you can tell the regex engine to match only one out of several characters. Simply place the characters you want to match between square brackets. If you want to match an a or an e, use [ae]. You could use this in gr[ae]y to match either gray or grey. Very useful if you do not know whether the document you are searching through is written in American or British English.
A character class matches only a single character. gr[ae]y will not match graay, graey or any such thing. The order of the characters inside a character class does not matter. The results are identical.
You can use a hyphen inside a character class to specify a range of characters. [0-9] matches a single digit between 0 and 9. You can use more than one range. [0-9a-fA-F] matches a single hexadecimal digit, case insensitively. You can combine ranges and single characters. [0-9a-fxA-FX] matches a hexadecimal digit or the letter X. Again, the order of the characters and the ranges does not matter.
Find a word, even if it is misspelled, such as sep[ae]r[ae]te or li[cs]en[cs]e.
Find an identifier in a programming language with [A-Za-z_][A-Za-z_0-9]*.
Find a C-style hexadecimal number with 0[xX][A-Fa-f0-9]+.
Typing a caret after the opening square bracket will negate the character class. The result is that the character class will match any character that is not in the character class. Unlike the dot, negated character classes also match (invisible) line break characters.
It is important to remember that a negated character class still must match a character. q[^u] does not mean: "a q not followed by a u". It means: "a q followed by a character that is not a u". It will not match the q in the string Iraq. It will match the q and the space after the q in Iraq is a country. Indeed: the space will be part of the overall match, because it is the "character that is not a u" that is matched by the negated character class in the above regexp. If you want the regex to match the q, and only the q, in both strings, you need to use negative lookahead: q(?!u). But we will get to that later.
Note that the only special characters or metacharacters inside a character class are the closing bracket (]), the backslash (\), the caret (^) and the hyphen (-). The usual metacharacters are normal characters inside a character class, and do not need to be escaped by a backslash. To search for a star or plus, use [+*]. Your regex will work fine if you escape the regular metacharacters inside a character class, but doing so significantly reduces readability.
To include a backslash as a character without any special meaning inside a character class, you have to escape it with another backslash. [\\x] matches a backslash or an x. The closing bracket (]), the caret (^) and the hyphen (-) can be included by escaping them with a backslash, or by placing them in a position where they do not take on their special meaning. I recommend the latter method, since it improves readability. To include a caret, place it anywhere except right after the opening bracket. [x^] matches an x or a caret. You can put the closing bracket right after the opening bracket, or the negating caret. x] matches a closing bracket or an x. [^]x] matches any character that is not a closing bracket or an x. The hyphen can be included right after the opening bracket, or right before the closing bracket, or right after the negating caret. Both [-x] and [x-] match an x or a hyphen.
You can use all non-printable characters in character classes just like you can use them outside of character classes. E.g. [$\u20AC] matches a dollar or euro sign, assuming your regex flavor supports Unicode.
POSIX regular expressions treat the backslash as a literal character inside character classes. This means you can't use backslashes to escape the closing bracket (]), the caret (^) and the hyphen (-). To use these characters, position them as explained above in this section. This also means that special tokens like shorthands are not available in POSIX regular expressions. See the tutorial topic on POSIX bracket expressions for more information.
Since certain character classes are used often, a series of shorthand character classes are available. \d is short for [0-9].
\w stands for "word character", usually [A-Za-z0-9_]. Notice the inclusion of the underscore and digits.
\s stands for "whitespace character". Again, which characters this actually includes, depends on the regex flavor. In all flavors discussed in this tutorial, it includes [ \t\r\n]. That is: \s will match a space, a tab or a line break. Some flavors include additional, rarely used non-printable characters such as vertical tab and form feed.
Shorthand character classes can be used both inside and outside the square brackets. \s\d matches a whitespace character followed by a digit. [\s\d] matches a single character that is either whitespace or a digit. When applied to 1 + 2 = 3, the former regex will match 2 (space two), while the latter matches 1 (one). [\da-fA-F] matches a hexadecimal digit, and is equivalent to [0-9a-fA-F].
The above three shorthands also have negated versions. \D is the same as [^\d], \W is short for [^\w] and \S is the equivalent of [^\s].
Be careful when using the negated shorthands inside square brackets. [\D\S] is not the same as [^\d\s]. The latter will match any character that is not a digit or whitespace. So it will match x, but not 8. The former, however, will match any character that is either not a digit, or is not whitespace. Because a digit is not whitespace, and whitespace is not a digit, [\D\S] will match any character, digit, whitespace or otherwise.
If you repeat a character class by using the ?, * or + operators, you will repeat the entire character class, and not just the character that it matched. The regex [0-9]+ can match 837 as well as 222.
If you want to repeat the matched character, rather than the class, you will need to use backreferences. ([0-9])\1+ will match 222 but not 837. When applied to the string 833337, it will match 3333 in the middle of this string. If you do not want that, you need to use lookahead and lookbehind.
But I digress. I did not yet explain how character classes work inside the regex engine. Let us take a look at that first.
As I already said: the order of the characters inside a character class does not matter. gr[ae]y will match grey in Is his hair grey or gray?, because that is the leftmost match. We already saw how the engine applies a regex consisting only of literal characters. Below, I will explain how it applies a regex that has more than one permutation. That is: gr[ae]y can match both gray and grey.
Nothing noteworthy happens for the first twelve characters in the string. The engine will fail to match g at every step, and continue with the next character in the string. When the engine arrives at the 13th character, g is matched. The engine will then try to match the remainder of the regex with the text. The next token in the regex is the literal r, which matches the next character in the text. So the third token, [ae] is attempted at the next character in the text (e). The character class gives the engine two options: match a or match e. It will first attempt to match a, and fail.
But because we are using a regex-directed engine, it must continue trying to match all the other permutations of the regex pattern before deciding that the regex cannot be matched with the text starting at character 13. So it will continue with the other option, and find that e matches e. The last regex token is y, which can be matched with the following character as well. The engine has found a complete match with the text starting at character 13. It will return grey as the match result, and look no further. Again, the leftmost match was returned, even though we put the a first in the character class, and gray could have been matched in the string. But the engine simply did not get that far, because another equally valid match was found to the left of it.
Did this website just save you a trip to the bookstore? Please make a donation to support this site, and you'll get a lifetime of advertisement-free access to this site!
Page URL: http://www.Regular-Expressions.info/charclass.html
Page last updated: 15 August 2012
Site last updated: 18 April 2013
Copyright © 2003-2013 Jan Goyvaerts. All rights reserved.
|Table of Contents|
|Regex Engine Internals|
|Grouping & Backreferences|
|Lookahead & Lookbehind|
|Lookaround, part 2|
|XML Character Classes|
|POSIX Bracket Expressions|
|Tools and Languages|
|About This Site|
|RSS Feed & Blog|
|PowerGREP is probably the most powerful regex-based text processing tool available today. A knowledge worker's Swiss army knife for searching through, extracting information from, and updating piles of files.|
|Use regular expressions to search through large numbers of text and binary files. Quickly find the files you are looking for, or extract the information you need. Look through just a handful of files or folders, or scan entire drives and network shares.|
|Search and replace using text, binary data or one or more regular expressions to automate repetitive editing tasks. Preview replacements before modifying files, and stay safe with flexible backup and undo options.|
|Use regular expressions to rename files, copy files, or merge and split the contents of files. Work with plain text files, Unicode files, binary files, compressed files, and files in proprietary formats such as MS Office, OpenOffice, and PDF. Runs on Windows 2000, XP, Vista, 7, and 8.|
|Download PowerGREP now|