|Easily create and understand regular expressions today. |
Compose and analyze regex patterns with RegexBuddy's easy-to-grasp regex blocks and intuitive regex tree, instead of or in combination with the traditional regex syntax. Developed by the author of this website, RegexBuddy makes learning and using regular expressions easier than ever. Get your own copy of RegexBuddy now
I already explained how you can use character classes to match a single character out of several possible characters. Alternation is similar. You can use alternation to match a single regular expression out of several possible regular expressions.
The alternation operator has the lowest precedence of all regex operators. That is, it tells the regex engine to match either everything to the left of the vertical bar, or everything to the right of the vertical bar. If you want to limit the reach of the alternation, you will need to use round brackets for grouping. If we want to improve the first example to match whole words only, we would need to use \b(cat|dog)\b. This tells the regex engine to find a word boundary, then either "cat" or "dog", and then another word boundary. If we had omitted the round brackets, the regex engine would have searched for "a word boundary followed by cat", or, "dog" followed by a word boundary.
I already explained that the regex engine is eager. It will stop searching as soon as it finds a valid match. The consequence is that in certain situations, the order of the alternatives matters. Suppose you want to use a regex to match a list of function names in a programming language: Get, GetValue, Set or SetValue. The obvious solution is Get|GetValue|Set|SetValue. Let's see how this works out when the string is SetValue.
The regex engine starts at the first token in the regex, G, and at the first character in the string, S. The match fails. However, the regex engine studied the entire regular expression before starting. So it knows that this regular expression uses alternation, and that the entire regex has not failed yet. So it continues with the second option, being the second G in the regex. The match fails again. The next token is the first S in the regex. The match succeeds, and the engine continues with the next character in the string, as well as the next token in the regex. The next token in the regex is the e after the S that just successfully matched. e matches e. The next token, t matches t.
At this point, the third option in the alternation has been successfully matched. Because the regex engine is eager, it considers the entire alternation to have been successfully matched as soon as one of the options has. In this example, there are no other tokens in the regex outside the alternation, so the entire regex has successfully matched Set in SetValue.
Contrary to what we intended, the regex did not match the entire string. There are several solutions. One option is to take into account that the regex engine is eager, and change the order of the options. If we use GetValue|Get|SetValue|Set, SetValue will be attempted before Set, and the engine will match the entire string. We could also combine the four options into two and use the question mark to make part of them optional: Get(Value)?|Set(Value)?. Because the question mark is greedy, SetValue will be attempted before Set.
The best option is probably to express the fact that we only want to match complete words. We do not want to match Set or SetValue if the string is SetValueFunction. So the solution is \b(Get|GetValue|Set|SetValue)\b or \b(Get(Value)?|Set(Value)?)\b. Since all options have the same end, we can optimize this further to \b(Get|Set)(Value)?\b .
All regex flavors discussed on this website work this way, except one: the POSIX standard mandates that the longest match be returned, regardless if the regex engine is implemented using an NFA or DFA algorithm.
Did this website just save you a trip to the bookstore? Please make a donation to support this site, and you'll get a lifetime of advertisement-free access to this site!
Page URL: http://www.Regular-Expressions.info/alternation.html
Page last updated: 28 November 2010
Site last updated: 18 April 2013
Copyright © 2003-2013 Jan Goyvaerts. All rights reserved.
|Table of Contents|
|Regex Engine Internals|
|Grouping & Backreferences|
|Lookahead & Lookbehind|
|Lookaround, part 2|
|XML Character Classes|
|POSIX Bracket Expressions|
|Tools and Languages|
|About This Site|
|RSS Feed & Blog|