
International Components for Unicode (ICU)

The International Components for Unicode (or ICU for short) is an open source library that provides developers with comprehensive support for Unicode and globalization. It includes a regular expression engine with particularly strong support for matching Unicode properties. It is used in many applications and software platforms. Applications such as LibreOffice allow the user to use ICU regular expressions. The regular expressions tutorial and reference on this website cover the regular expression and replacement text syntax of ICU 55 and later. This page explains how you can use the ICU4C library to implement ICU regular expressions in your C++ and C applications. There is also an ICU4J library that allows Java developers to do the same.

C++ Code Using icu::RegexMatcher

For the C++ code snippets to work correctly, you need to include 3 header files:

#include <unicode/errorcode.h>
#include <unicode/regex.h>
#include <unicode/ustring.h>

You can compile a regular expression by constructing an instance of the RegexMatcher class:

icu::ErrorCode status;
icu::RegexMatcher matcher(u"my regex", 0, status);

Call status.isSuccess() to check whether the regular expression was compiled successfully. If not then you have a syntax error in your regular expression. The specific value of status then indicates what kind of error it is. The following code snippets reuse the matcher and status variables.

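The second parameter allows you to set matching modes outside the regular expression. The following constants, declared in unicode/uregex.h, are available. You can combine them using bitwise or. Pass 0 to use the default options.

UREGEX_CASE_INSENSITIVE: case insensitive matching
UREGEX_COMMENTS: free-spacing mode that allows whitespace and #comments in the regex
UREGEX_DOTALL: the dot also matches line terminators
UREGEX_MULTILINE: ^ and $ also match at line terminators within the subject string
UREGEX_UNIX_LINES: only the line feed \n is treated as a line terminator by the dot, ^, and $
UREGEX_UWORD: \b uses the Unicode definition of word boundaries
UREGEX_LITERAL: the entire regex is treated as a literal string
UREGEX_ERROR_ON_UNKNOWN_ESCAPES: unrecognized backslash escapes are reported as errors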

To use the compiled regular expression, call matcher.reset(subject). The subject can be a UnicodeString or a UText instance. The regular expression will work directly on your subject string. No copy is made. You must make sure that you do not alter the subject string until you have either disposed of the matcher object or called reset() with another string. You can call the reset() method repeatedly on the same RegexMatcher instance with different subject strings to reuse the same regex with different input.

Call matcher.matches(status) to check whether the regular expression can match the subject string entirely. Call matcher.matches(index, status) to check whether the regular expression can find a match starting at position index in the subject string and ending at the end of the string.

To allow partial matches, call matcher.find(status) to find a match that can begin and end anywhere in the string. If you call find() multiple times without calling reset() then the second and following calls start their match attempts at the end of the match returned by the previous call. You can call find(index, status) to start the match attempt at a specific position in the string.

The matches() and find() member functions return TRUE if a match can be found. If they return FALSE, call status.isSuccess() to determine whether the regex simply did not match the string (TRUE) or whether an error occurred (FALSE).

To retrieve the part of the string that was actually matched by the regex, call matcher.group(status). To retrieve the text matched by a capturing group, call matcher.group(groupNum, status). Valid group numbers range from 0 to matcher.groupCount() (inclusive). Group 0 is the overall regex match. Group 1 is the first capturing group. Both these calls return a UnicodeString.

If you prefer to clone the match into a UText call matcher.group(dest, group_len, status) for the overall regex match or matcher.group(groupNum, dest, group_len, status) for a capturing group. The group_len argument must be a reference to an int64_t to receive the number of characters that were copied. If you pass a pointer to a mutable UText as dest then the match is cloned into that UText. If you pass NULL as dest then a new UText is created and returned by the member function. The returned UText may or may not be mutable.

The group() member function returns an empty string for non-participating capturing groups. To distinguish such groups from capturing groups that matched the empty string, call matcher.start(groupNum, status) to retrieve the index in the subject string where the group's match begins. This returns -1 for a group that exists but did not participate in the match. It returns a position in the subject string for groups that did participate in the match, including groups that found a zero-length match.

To retrieve the text matched by a named capturing group, call matcher.pattern().groupNumberFromName(u"group", status) to retrieve the number of the named group. Then call matcher.group() with the group's number to actually retrieve the match.

matcher.replaceFirst(replacement, status) returns a new UnicodeString with the first regex match replaced. matcher.replaceAll(replacement, status) returns a new UnicodeString with all regex matches replaced. In both cases, replacement must be passed as a UnicodeString. The replacement is interpreted using the ICU replacement text syntax. These two member functions operate on the string most recently passed to reset(). They always restart the matching process from the start of that string. Any previous calls to find() have no effect. You should call reset() after calling replaceFirst() or replaceAll() before doing anything else with the matcher instance to ensure consistent behavior.

If you prefer to store the result in a UText then call matcher.replaceFirst(replacement, dest, status) or matcher.replaceAll(replacement, dest, status). In these overloads the replacement is also passed as a UText. If you pass a pointer to a mutable UText as dest then the result is written into that UText. If you pass NULL as dest then a new UText is created and returned by the member function. The returned UText may or may not be mutable.

If you want more control over the search-and-replace operation then you can call find() in a loop to iterate over all regex matches. For each match, decide whether it should be replaced. Call matcher.appendReplacement(dest, replacement, status) for each match that you do want to replace. You can vary the replacement string with each call. Do nothing for matches that you don't want to replace. Call matcher.appendTail(dest) after find() returns FALSE to complete the operation, or matcher.appendTail(dest, status) if dest is a UText. You can pass a reference to a UnicodeString or a pointer to a mutable UText instance as dest to receive the output. You cannot pass NULL as dest.

To split a string, call matcher.split(subject, dest, destCapacity, status). You don’t need to call reset() before using split(). You can pass a UnicodeString as subject and an array of UnicodeString objects as dest. Alternatively, you can pass a UText as subject and an array of mutable UText structs as dest. Pass the number of elements in the array as destCapacity. This number must be 2 or more for split() to be able to do anything. The function returns the number of elements in the destination array that it actually filled. Any remaining elements are not altered. The first element is the part of the subject string before the first regex match. It will be an empty string if the first match starts at the start of the string. The following elements are the text between each regex match and the next match. Adjacent matches add empty strings to the array. The last element is the part of the subject string after the last regex match that was found. If the function returns destCapacity then this is the unsplit remainder of the subject string as split() stops after finding destCapacity-1 regex matches. This is not an error.

C Code Using uregex Functions

For the C code snippets to work correctly, you need to include 3 header files:

#include <unicode/utypes.h>
#include <unicode/uregex.h>
#include <unicode/ustring.h>

You can compile a regular expression by calling the uregex_open function:

UErrorCode status = U_ZERO_ERROR;
URegularExpression *re = uregex_open(u"my regex", -1, 0, NULL, &status);

You can pass the length of the regex in characters as the second parameter, or -1 if it is null-terminated.

Call U_SUCCESS(status) to check whether the regular expression was compiled successfully. If it was then you need to eventually call uregex_close(re) when you are done with the regular expression. If the regex was not compiled successfully then you have a syntax error in your regular expression. The specific value of status then indicates what kind of error it is. If you pass a pointer to a UParseError struct as the 4th parameter then this provides the position of the error in the regex. The following code snippets reuse the re and status variables.

The third parameter to uregex_open() allows you to set matching modes outside the regular expression. You can use the same constants as listed in the C++ section above. You can combine them using bitwise or. Pass 0 to use the default options.

To use the compiled regular expression, call uregex_setText(re, text, textLength, &status). Pass a UChar pointer as text. Pass the number of characters in the subject string as textLength or -1 if it is null-terminated. The regular expression will work directly on your subject string. No copy is made. You must make sure not to alter the subject string until you have either called uregex_setText() with another string or called uregex_close(re) to dispose of the regular expression. You can call the uregex_setText() function repeatedly on the same URegularExpression struct with different subject strings to reuse the same regex with different input.

Call uregex_matches(re, startIndex, &status) to check whether the regular expression can match the subject string beginning at startIndex and ending at the end of the string. Pass 0 as startIndex to require the regex to match the whole string. Pass -1 as startIndex to require the regex to match the region you set with a preceding call to uregex_setRegion(re, regionStart, regionLimit, &status).

To allow partial matches, call uregex_find(re, startIndex, &status) to find a match that can begin and end anywhere in the string at or after startIndex. Pass 0 as startIndex to allow a match anywhere in the string. Pass -1 as startIndex to require the match to be within the region set by a preceding call to uregex_setRegion(). After uregex_find() has found a match, you can call uregex_findNext(re, &status) to look for another match in the remainder of the string after the previous match. You can call uregex_findNext() in a loop to iterate over all matches.

The uregex_matches(), uregex_find(), and uregex_findNext() functions return TRUE if a match can be found. If they return FALSE, call U_SUCCESS(status) to determine whether the regex simply did not match the string (TRUE) or whether an error occurred (FALSE).

To retrieve the part of the string that was actually matched by the regex, call uregex_groupUText(re, groupNum, dest, &groupLength, &status). Pass 0 as groupNum for the overall regex match or the number of a capturing group for that group's match. The groupLength argument must be a pointer to an int64_t that receives the length of the group's match. If you pass a pointer to a mutable UText as dest then the input is cloned into that UText. If you pass NULL as dest then a new UText is created as an immutable shallow clone of the entire input string. In both cases the current index of the UText is set to the start of the group's match. The function returns dest if it was not NULL. Otherwise it returns the newly created UText. Note that uregex_getUText(re, dest, &status) retrieves the entire subject string rather than the match.

To retrieve the text matched by a named capturing group, call uregex_groupNumberFromName(re, u"group", -1, &status) to retrieve the number of the named group. Then call uregex_groupUText() with the group's number to actually retrieve the match.

To distinguish between a non-participating group and a capturing group that matched the empty string, call uregex_start(re, groupNum, &status) to retrieve the index in the subject string where the group's match begins. This returns -1 for a group that exists but did not participate in the match. It returns a position in the subject string for groups that did participate in the match, including groups that found a zero-length match.

uregex_replaceFirstUText(re, replacement, dest, &status) returns a UText with the first regex match replaced. uregex_replaceAllUText(re, replacement, dest, &status) returns a UText with all regex matches replaced. Pass the replacement text as a UText using the ICU replacement text syntax. If you pass a pointer to a mutable UText as dest then the result is written into that UText and dest is returned. If you pass NULL as dest then a new UText is created and returned by the function. The returned UText may or may not be mutable.

If you want more control over the search-and-replace operation then you can call uregex_find() followed by uregex_findNext() in a loop to iterate over all regex matches. For each match, decide whether it should be replaced. Call uregex_appendReplacementUText(re, replacement, dest, &status) for each match that you do want to replace. You can vary the replacement string with each call. Do nothing for matches that you don't want to replace. Call uregex_appendTailUText(re, dest, &status) after uregex_findNext() returns FALSE to complete the operation. Pass a pointer to a mutable UText instance as dest to receive the output. You cannot pass NULL as dest.

To split the subject string most recently passed to uregex_setText(), call uregex_splitUText(re, destFields, destFieldsCapacity, &status). Pass an array of pointers to mutable UText structs as destFields. If an element in the array is NULL then a new UText is allocated to fill that element. This new UText may or may not be mutable. Pass the number of elements in the array as destFieldsCapacity. This number must be 2 or more to actually split the string. The function returns the number of elements in the destination array that it actually filled. Any remaining elements are not altered. The first element is the part of the subject string before the first regex match. It will be an empty string if the first match starts at the start of the string. The following elements are the text between each regex match and the next match. Adjacent matches add empty strings to the array. The last element is the part of the subject string after the last regex match that was found. If the function returns destFieldsCapacity then this is the unsplit remainder of the subject string as uregex_splitUText() stops after finding destFieldsCapacity-1 regex matches. This is not an error.
