wxWidgets Supports Three Regular Expression Flavors

wxWidgets uses the exact same regular expression engine that was developed by Henry Spencer for Tcl 8.2. This means that wxWidgets supports the same three regular expression flavors: Tcl Advanced Regular Expressions (AREs), POSIX Extended Regular Expressions (EREs) and POSIX Basic Regular Expressions (BREs). Unlike in Tcl, EREs rather than the far more powerful AREs are the default. The wxRegEx::Replace() method uses the same syntax for the replacement text as Tcl’s regsub command.

The wxRegEx Class

To use the wxWidgets regex engine, you need to instantiate the wxRegEx class. The class has two constructors. wxRegEx() creates an empty regex object. Before you can use the object, you have to call wxRegEx::Compile(). wxRegEx::IsValid will return false until you do.

wxRegEx(const wxString& expr, int flags = wxRE_EXTENDED) creates a wxRegEx object with a compiled regular expression. The constructor will always create the object, even if your regular expression is invalid. Check wxRegEx::IsValid to determine if the regular expression was compiled successfully.

bool wxRegEx::Compile(const wxString& pattern, int flags = wxRE_EXTENDED) compiles a regular expression. You can call this method on any wxRegEx object, including one that already holds a compiled regular expression. Doing so will simply replace the regular expression held by the wxRegEx object. Pass your regular expression as a string as the first parameter. The second parameter allows you to set certain matching options.

To set the regex flavor, specify one of the flags wxRE_EXTENDED, wxRE_ADVANCED or wxRE_BASIC. If you don’t specify a flavor, wxRE_EXTENDED is the default. I recommend you always specify the wxRE_ADVANCED flag. AREs are far more powerful than EREs. Every valid ERE is also a valid ARE, and will give identical results. The only reason to use the ERE flavor is when your code has to work when wxWidgets is compiled without the “built-in” regular expression library (i.e. Henry Spencer’s code).

You can set three other flags in addition to the flavor: wxRE_ICASE, wxRE_NOSUB and wxRE_NEWLINE. wxRE_ICASE makes the regular expression case insensitive. The default is case sensitive. wxRE_NOSUB makes the regex engine treat all capturing groups as non-capturing. This means you won’t be able to use backreferences in the replacement text, or query the part of the regex matched by each capturing group. If you won’t be using these anyway, setting the wxRE_NOSUB flag improves performance. wxRE_NEWLINE selects one of the newline matching modes explained below.

As discussed in the Tcl section, Henry Spencer’s “ARE” regex engine did away with the confusing “single line” (?s) and “multi line” (?m) matching modes, replacing them with the equally confusing “non-newline-sensitive” (?s), “partial newline-sensitive” (?p), “inverse partial newline-sensitive” (?w) and “newline-sensitive matching” (?n). Since the wxRegEx class encapsulates the ARE engine, it supports all 4 modes when you use the mode modifiers inside the regular expression. But the flags parameter only allows you to set two.

If you add wxRE_NEWLINE to the flags, you’re turning on “newline-sensitive matching” (?n). In this mode, the dot will not match newline characters (\n). The caret and dollar will match after and before newlines in the string, as well as at the start and end of the subject string.

If you don’t set the wxRE_NEWLINE flag, the default is “non-newline-sensitive” (?s). In this mode, the dot will match all characters, including newline characters (\n). The caret and dollar will match only at the start and end of the subject string. Note that this default is different from the default in Perl and most other regex engines. In Perl, by default, the dot does not match newline characters, and the caret and dollar only match at the start and end of the subject string. The only way to get that Perl-style behavior in wxWidgets is to put (?p) at the start of your regex.

Putting it all together, wxRegEx(_T("(?p)^[a-z].*$"), wxRE_ADVANCED + wxRE_ICASE) will check if your subject string consists of a single line that starts with a letter. The equivalent in Perl is m/^[a-z].*$/i.

wxRegEx Status Functions

wxRegEx::IsValid() returns true when the wxRegEx object holds a compiled regular expression.

wxRegEx::GetMatchCount() is rather poorly named. It does not return the number of matches found by Matches(). In fact, you can call GetMatchCount() right after Compile(), before you call Matches(). GetMatchCount() returns the number of capturing groups in your regular expression, plus one for the overall regex match. You can use this to determine the number of backreferences you can use in the replacement text, and the highest index you can pass to GetMatch(), which is GetMatchCount() minus one. If your regex has no capturing groups, GetMatchCount() returns 1. In that case, \0 is the only valid backreference you can use in the replacement text.

GetMatchCount() returns 0 in case of an error. This will happen if the wxRegEx object does not hold a compiled regular expression, or if you compiled it with wxRE_NOSUB.

Finding and Extracting Matches

If you want to test whether a regex matches a string, or extract the substring matched by the regex, you first need to call the wxRegEx::Matches() method. It has three overloads, allowing you to pass the subject string as a const wxChar* or as a wxString. When using a const wxChar*, you can specify the length as a third parameter. If you don’t, wxStrLen() will be called to compute the length. If you plan to loop over all regex matches in a string, you should call wxStrLen() yourself outside the loop and pass the result to wxRegEx::Matches().

bool wxRegEx::Matches(const wxChar* text, int flags = 0) const
bool wxRegEx::Matches(const wxChar* text, int flags, size_t len) const
bool wxRegEx::Matches(const wxString& text, int flags = 0) const

Matches() returns true if the regex matches all or part of the subject string that you passed in the text parameter. Add anchors to your regex if you want to test whether the regex matches the whole subject string.

Do not confuse the flags parameter with the one you pass to the Compile() method or the wxRegEx() constructor. All the flavor and matching mode options can only be set when compiling the regex.

The Matches() method allows only two flags: wxRE_NOTBOL and wxRE_NOTEOL. If you set wxRE_NOTBOL, then ^ and \A will not match at the start of the string. They will still match after embedded newlines if you turned on that matching mode. Likewise, specifying wxRE_NOTEOL tells $ and \Z not to match at the end of the string.

wxRE_NOTBOL is commonly used to implement a “find next” routine. The wxRegEx class does not provide such a function. To find the second match in the string, you’ll need to call wxRegEx::Matches() and pass it the part of the original subject string after the first match. Pass the wxRE_NOTBOL flag to indicate that you’ve cut off the start of the string you’re passing.

wxRE_NOTEOL can be useful if you’re processing a large set of data, and you want to apply the regex before you’ve read the whole data. Pass wxRE_NOTEOL while calling wxRegEx::Matches() as long as you haven’t read the entire string yet. Pass both wxRE_NOTBOL and wxRE_NOTEOL when doing a “find next” on incomplete data.

After a call to Matches() returns true, and you compiled your regex without the wxRE_NOSUB flag, you can call GetMatch() to get details about the overall regex match, and the parts of the string matched by the capturing groups in your regex.

bool wxRegEx::GetMatch(size_t* start, size_t* len, size_t index = 0) const retrieves the starting position of the match in the subject string, and the number of characters in the match.

wxString wxRegEx::GetMatch(const wxString& text, size_t index = 0) const returns the text that was matched.

For both calls, set the index parameter to zero (or omit it) to get the overall regex match. Set 1 <= index < GetMatchCount() to get the match of a capturing group in your regular expression. To determine the number of a group, count the opening parentheses of the capturing groups in your regular expression from left to right.

Searching and Replacing

The wxRegEx class offers three methods to do a search-and-replace. Replace() is the method that does the actual work. You can use ReplaceAll() and ReplaceFirst() as more readable ways to specify the 3rd parameter to Replace().

int wxRegEx::ReplaceAll(wxString* text, const wxString& replacement) const replaces all regex matches in text with replacement.

int wxRegEx::ReplaceFirst(wxString* text, const wxString& replacement) const replaces the first match of the regular expression in text with replacement.

int wxRegEx::Replace(wxString* text, const wxString& replacement, size_t maxMatches = 0) const allows you to specify how many replacements will be made. Passing 0 for maxMatches or omitting it does the same as ReplaceAll(). Setting it to 1 does the same as ReplaceFirst(). Pass a number greater than 1 to replace only the first maxMatches matches. If text contains fewer matches than you’ve asked for, then all matches will be replaced, without triggering an error.

All three calls return the actual number of replacements made. They return zero if the regex failed to match the subject text. A return value of -1 indicates an error. The replacements are made directly to the wxString that you pass as the first parameter.

wxWidgets uses the same syntax as Tcl for the replacement text. You can use \0 as a placeholder for the whole regex match, and \1 through \9 for the text matched by one of the first nine capturing groups. You can also use & as a synonym of \0. Note that there’s no backslash in front of the ampersand. & is substituted with the whole regex match, while \& is substituted with a literal ampersand. Use \\ to insert a literal backslash. You only need to escape backslashes if they’re followed by a digit, to prevent the combination from being seen as a backreference. When specifying the replacement text as a literal string in C++ code, you need to double up all the backslashes, as the C++ compiler also treats backslashes as escape characters. So if you want to replace the match with the first backreference followed by the text &co, you’ll need to code that in C++ as _T("\\1\\&co").
