
Mixing Unicode and 8-bit Character Codes

Internally, computers deal with numbers, not with characters. When you save a text file, each character is mapped to a number, and the numbers are stored on disk. When you open a text file, the numbers are read and mapped back to characters. When processing text with a regular expression, the regular expression needs to use the same mapping as you used to create the file or string you want the regex to process.

When you simply type in all the characters in your regular expression, you normally don’t have anything to worry about. The application or programming library that provides the regular expression functionality will know what text encoding your subject string uses, and process it accordingly. So if you want to search for the euro currency symbol, and you have a European keyboard, just press AltGr+E. Your regex € will find all euro symbols just fine.

But you can’t press AltGr+E on a US keyboard. Or perhaps you like your source code to be 7-bit clean (i.e. plain ASCII). In those cases, you’ll need to use a character escape in your regular expression.

If your regular expression engine supports Unicode, simply use the Unicode escape \u20AC (most Unicode flavors) or \x{20AC} (Perl and PCRE). U+20AC is the Unicode code point for the euro symbol. It will always match the euro symbol, whether your subject string is encoded in UTF-8, UTF-16, UCS-2 or any other Unicode encoding. Even when your subject string is encoded with a legacy 8-bit code page, there’s no confusion, because a Unicode engine matches code points rather than raw bytes. You may need to tell the application or regex engine what encoding your file uses so it can decode the file correctly. But \u20AC is always the euro symbol.
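As a minimal sketch of the point above, the following Python snippet decodes the same text from three different encodings and searches each result with the escape \u20AC. The sample string and the helper name count_euros are made up for illustration; Python's re module is used only as one example of a Unicode-aware engine.

```python
import re

# \u20AC is the Unicode escape for the euro symbol (U+20AC).
EURO = re.compile(r"\u20AC")

def count_euros(raw: bytes, encoding: str) -> int:
    """Decode raw bytes using the stated encoding, then count euro symbols."""
    text = raw.decode(encoding)
    return len(EURO.findall(text))

# Hypothetical sample text, written with escapes to keep the source plain ASCII.
sample = "Price: 5 \u20AC, 10 \u20AC"

counts = {
    enc: count_euros(sample.encode(enc), enc)
    for enc in ("utf-8", "utf-16-le", "cp1252")
}
print(counts)  # {'utf-8': 2, 'utf-16-le': 2, 'cp1252': 2}
```

The only thing that changes between the three runs is the encoding name passed to decode(). The regular expression itself stays the same.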

Most Unicode regex engines also support the 8-bit character escape \xFF. However, its use is not recommended. For characters \x00 through \x7F, there’s usually no trouble. The first 128 Unicode code points are identical to the ASCII table that most 8-bit code pages are based on.

But the interpretation of \x80 and above may vary. A pure Unicode engine will treat this identically to \u0080, which represents a Latin-1 control code. But what most people expect is that \x80 matches the euro symbol, as that occupies position 80h in most Windows code pages, including Windows 1252. And it will when using an 8-bit regex engine if your text file is encoded using such a Windows code page.
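Python's re module, when given a text (str) pattern, behaves like the pure Unicode engine described above. The short check below is only an illustration of that one engine; other engines may behave differently.

```python
import re

# In a str pattern, Python treats \x80 as the code point U+0080.
pattern = re.compile(r"\x80")

matches_control = pattern.search("\u0080") is not None  # Latin-1 control code
matches_euro = pattern.search("\u20AC") is not None     # euro symbol

print(matches_control, matches_euro)  # True False
```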

Since most people expect \x80 to be treated as an 8-bit character rather than the Unicode code point \u0080, some Unicode regex engines do exactly that. Some are hard-wired to use a particular code page, say Windows 1252 or your computer’s default code page, to interpret 8-bit character codes.

Other engines will let it depend on the input string. Just Great Software applications treat \x80 as \u0080 when searching through a Unicode text file, but as \u20AC when searching through a Windows 1252 text file. There’s no magic here. It matches the character with index 80h in the text file, regardless of the text file’s encoding. Unicode code point U+0080 is a Latin-1 control code, while Windows 1252 character index 80h is the euro symbol. In reverse, if you type in the euro symbol in a text editor, saving it as UTF-16LE will save two bytes AC 20, while saving as Windows 1252 will give you one byte 80.
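The byte values mentioned above can be checked with a few lines of Python. This is just a sketch that uses Python's built-in codecs; the variable names are arbitrary.

```python
euro = "\u20AC"

# Bytes written to disk when saving the euro symbol.
utf16le_bytes = euro.encode("utf-16-le")
cp1252_bytes = euro.encode("cp1252")
print(utf16le_bytes.hex(), cp1252_bytes.hex())  # ac20 80

# Going the other way: what byte 80h means depends on the encoding.
print(b"\x80".decode("cp1252") == "\u20AC")   # True: euro symbol
print(b"\x80".decode("latin-1") == "\u0080")  # True: control code
```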

If you find the above confusing, simply don’t use \x80 through \xFF with a regex engine that supports Unicode.

8-bit Regex Engines

When working with a legacy (obsolete?) regular expression engine that works on 8-bit data only, you can’t use Unicode escapes like \u20AC. \x80 is all you have. Note that even modern engines have legacy modes. The popular regex library PCRE, for example, runs as an 8-bit engine by default. You need to explicitly enable UTF-8 support if you want to use Unicode features. When you do, PCRE also expects you to convert your subject strings to UTF-8.

When crafting a regular expression for an 8-bit engine, you’ll have to take into account which character set or code page you’ll be working with. 8-bit regex engines just don’t care. If you type \x80 into your regex, it will match any byte 80h, regardless of what that byte represents. That’ll be the euro symbol in a Windows 1252 text file, a control code in a Latin-1 file, and the letter Ø in an EBCDIC code page 037 file.
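To illustrate, Python's re module compares raw byte values when it is given a bytes pattern, so it can stand in for an 8-bit engine. The sketch below assumes a made-up three-byte input and shows that the match itself carries no notion of a code page.

```python
import re

# A bytes pattern makes re work on raw byte values, like an 8-bit engine.
byte_80 = re.compile(rb"\x80")

raw = b"A\x80B"  # hypothetical file contents
match = byte_80.search(raw)
print(match.start())  # 1

# The meaning of the matched byte depends entirely on the code page.
as_cp1252 = b"\x80".decode("cp1252")   # euro symbol, U+20AC
as_latin1 = b"\x80".decode("latin-1")  # control code, U+0080
print(hex(ord(as_cp1252)), hex(ord(as_latin1)))  # 0x20ac 0x80
```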

Even for literal characters in your regex, you’ll have to match up the encoding you’re using in the regular expression with the subject encoding. If your application is using the Latin-1 code page, and you use the regex À, it’ll match Ŕ when you search through a Latin-2 text file. The application would duly display this as À on the screen, because it’s using the wrong code page. This problem is not really specific to regular expressions. You’ll encounter it any time you’re working with files and applications that use different 8-bit encodings.
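The mismatch described above can be reproduced in a small sketch. It assumes the pattern was stored by a Latin-1 application and the subject was saved as Latin-2 (ISO 8859-2); Python's bytes patterns again play the role of the 8-bit engine.

```python
import re

# The regex "À" (U+00C0) as stored by a Latin-1 application: the single byte C0.
pattern_bytes = "\u00C0".encode("latin-1")
pattern = re.compile(re.escape(pattern_bytes))

# A subject containing "Ŕ" (U+0154) saved as Latin-2: also the single byte C0.
subject = "\u0154".encode("iso-8859-2")

match = pattern.search(subject)
print(match is not None)  # True: the bytes are identical

# A Latin-1 application would display the matched byte as À, not Ŕ.
shown = match.group().decode("latin-1")
print(shown == "\u00C0")  # True
```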

So when working with 8-bit data, open the actual data you’re working with in a hex editor. See the bytes being used, and specify those in your regular expression.

Where it gets really hairy is if you’re processing Unicode files with an 8-bit engine. Let’s go back to our text file with just a euro symbol. When saved as little endian UTF-16 (called “Unicode” on Windows), an 8-bit regex engine will see two bytes AC 20 (remember that little endian reverses the bytes). When saved as UTF-8 (which has no endianness), our 8-bit engine will see three bytes E2 82 AC. You’d need \xE2\x82\xAC to match the euro symbol in a UTF-8 file with an 8-bit regex engine.
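The following sketch shows the byte sequences an 8-bit engine would have to match. It uses Python bytes patterns as a stand-in for such an engine; the variable names are illustrative only.

```python
import re

euro = "\u20AC"
utf8_file = euro.encode("utf-8")         # three bytes: E2 82 AC
utf16le_file = euro.encode("utf-16-le")  # two bytes: AC 20

# Byte-level patterns, one per encoding of the same character.
utf8_euro = re.compile(rb"\xE2\x82\xAC")
utf16le_euro = re.compile(rb"\xAC\x20")

print(utf8_euro.search(utf8_file) is not None)        # True
print(utf16le_euro.search(utf16le_file) is not None)  # True

# The pattern for one encoding does not match the other encoding's bytes.
print(utf8_euro.search(utf16le_file) is not None)     # False
```

Because each pattern is tied to one byte layout, a separate regex is needed for every encoding the data might arrive in.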