Homographs in International Domain Names

What’s the difference between apple.com and аррӏе.com? Both bring you to Apple’s website, but the latter might trigger a warning in your browser. If we use a different font as in apple.com and аррӏе.com then you might spot the difference. The regex ^[a-z]+\.com$ matches the former but not the latter. ^\p{Cyrillic}+\.com$ matches the latter. The second domain name consists entirely of letters from the Cyrillic script that just happen to look very much like letters from the Latin script. Such lookalike letters are called homographs.
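The difference between the two strings becomes visible once you look at code points instead of glyphs. The following is a minimal Python sketch, not the regexes from this page verbatim: Python's built-in re module does not support \p{Cyrillic}, so the code point range U+0400 to U+04FF (the basic Cyrillic block, which is only part of the Cyrillic script) stands in for it. The names LATIN_COM, CYRILLIC_BLOCK_COM and describe are made up for this illustration.

```python
import re
import unicodedata

latin = "apple.com"
# Five Cyrillic letters that resemble "apple", written as escapes
# so that the difference is visible in the source code.
cyrillic = "\u0430\u0440\u0440\u04cf\u0435.com"

# Stand-in for ^[a-z]+\.com$ (fullmatch supplies the anchors).
LATIN_COM = re.compile(r"[a-z]+\.com")
# Assumption: the basic Cyrillic block approximates \p{Cyrillic}.
CYRILLIC_BLOCK_COM = re.compile(r"[\u0400-\u04FF]+\.com")


def describe(text):
    """List the code point and Unicode name of each character."""
    return [f"U+{ord(c):04X} {unicodedata.name(c)}" for c in text]


if __name__ == "__main__":
    for domain in (latin, cyrillic):
        print(domain,
              bool(LATIN_COM.fullmatch(domain)),
              bool(CYRILLIC_BLOCK_COM.fullmatch(domain)))
        for line in describe(domain):
            print("   ", line)
```

Printing the character names is often the quickest way to confirm a suspected homograph, since the two domain names may render identically in a terminal.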

For the internet to be equally accessible to anyone, including to people for whom Latin letters are just as weird and foreign as Cyrillic letters might be to westerners, domain names can’t be restricted to Latin characters. българия.com (bulgaria.com; personal website of a software developer) and здоровое-питание.рф (Russian government website promoting healthy eating) are perfectly legitimate domain names. Your browser will display these domains and their websites normally.

But what about аpple.com and applе.com? No, these are not the same two strings from the first paragraph. The two regexes above match neither of these. If you paste them into your browser you’ll see xn--pple-43d.com and xn--appl-y4d.com. Neither domain is registered at the time of writing. Neither domain should be. In the first one, а is a Cyrillic letter. In the second one, е is a Cyrillic letter. All the other letters are ASCII. Domain names that mix homographs from different scripts have been used in phishing attacks to lure people to fake websites. When browsers detect a mixture of scripts in a domain name they convert it to Punycode, which is a method for representing internationalized domain names in plain ASCII.
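The Punycode conversion can be reproduced outside the browser. This sketch uses the idna codec that ships with Python, which implements the older IDNA 2003 rules; browsers apply newer rules plus their own display policies, so treat it only as an illustration of the encoding. The helper names to_ascii and to_unicode are made up for this example.

```python
def to_ascii(domain):
    """Encode each dotted part that needs it as an xn-- Punycode label."""
    return domain.encode("idna").decode("ascii")


def to_unicode(ascii_domain):
    """Decode xn-- labels back to their Unicode form."""
    return ascii_domain.encode("ascii").decode("idna")


# The two mixed-script names from the paragraph above, with the
# Cyrillic letter written as an escape so it stands out.
mixed_a = "\u0430pple.com"   # Cyrillic а followed by ASCII "pple"
mixed_e = "appl\u0435.com"   # ASCII "appl" followed by Cyrillic е

if __name__ == "__main__":
    for name in (mixed_a, mixed_e, "apple.com"):
        print(repr(name), "->", to_ascii(name))
```

A pure ASCII name such as apple.com passes through unchanged, so the xn-- prefix itself only tells you that a label contains non-ASCII characters. It does not tell you whether scripts are mixed.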

Detecting Homographs with Regular Expressions

We need a way to detect domain names that mix letters from different scripts as they are extremely likely to be fraudulent. But we still need to allow българия.com which is an all-Cyrillic domain on the classic .com top-level domain. We also need to allow здоровое-питание.рф which has only Cyrillic letters but still has an ASCII hyphen and dot.

Perl 5.28 introduced a new regular expression concept called a script run that is perfect for this. PCRE2 10.33 adopted this feature, which then brought it to PHP 7.4.0 and R 4.0.0.

A script run requires all characters to be part of the same script. The actual rules are a little more complicated, but that’s what they come down to. A script run can contain Latin letters or Cyrillic letters, but not both. The rules ignore characters that are commonly shared among scripts, which includes all ASCII punctuation, so they don’t break up script runs. In addition, if the run contains digits then they must all be part of a single range of 10 digits. A run can’t mix the ASCII digits 0 to 9 with the fullwidth digits ０ to ９, for example, even though both these digit ranges are “common” characters.
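To make the two rules concrete, here is a rough Python approximation. It is not the algorithm that Perl or PCRE2 use. Python's standard library has no Unicode script property, so this sketch assumes that the first word of a letter's Unicode name ("LATIN", "CYRILLIC", "GREEK") is a usable stand-in for its script. That assumption fails for cases such as fullwidth Latin letters, and it ignores the real rules that let certain scripts combine (Japanese text, for instance). The function names are made up for this illustration.

```python
import unicodedata


def rough_scripts(text):
    """Approximate set of scripts used by the letters in text.

    Non-letters (punctuation, digits) are treated as shared
    characters and ignored.
    """
    return {unicodedata.name(c, "UNKNOWN").split()[0]
            for c in text if c.isalpha()}


def digit_ranges(text):
    """Set of decimal digit ranges used, each identified by the
    code point of its zero digit."""
    return {ord(c) - unicodedata.decimal(c)
            for c in text if unicodedata.category(c) == "Nd"}


def is_rough_script_run(text):
    """True if the letters share one approximate script and all
    decimal digits come from a single range of 10 digits."""
    return len(rough_scripts(text)) <= 1 and len(digit_ranges(text)) <= 1


if __name__ == "__main__":
    samples = [
        "apple",          # all Latin
        "\u0430pple",     # Cyrillic а + Latin
        "route-66",       # Latin, hyphen and ASCII digits
        "1\uff12",        # ASCII 1 + fullwidth 2
    ]
    for s in samples:
        print(repr(s), is_rough_script_run(s))
```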

Regular Expressions to Prevent Homographs

The regular expression ^(*asr:[\p{L}\p{N}]++(?:-[\p{L}\p{N}]++)*+\.)++(*asr:[\p{L}\p{N}]{2,}+)$ checks whether a string consists entirely of a domain name that does not mix characters from different scripts in any of the dotted parts. It does allow a dotted part to use a different script than another dotted part. So it will match българия.com and аррӏе.com that have all-Cyrillic domains on a Latin top-level domain. It will also match Ελλάδα.българия.com with a Greek sub-domain.

To implement this, the regex uses a separate script run to match each of the dotted parts. The first one is repeated to allow any number of sub-domains. The script run checks its contents separately for each repetition. The first iteration can accept Ελλάδα. as a valid script run with Greek letters and the second iteration can accept българия. with Cyrillic letters.
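In a flavor without script runs, the same per-part policy can be approximated by splitting the domain on dots and checking each part on its own. The sketch below makes several assumptions: the first word of a character's Unicode name stands in for its script, Python's [^\W_] stands in for [\p{L}\p{N}], and ordinary greedy quantifiers replace the possessive ones. All names in the code are hypothetical.

```python
import re
import unicodedata

# Letters and digits separated by single hyphens.
LABEL = re.compile(r"[^\W_]+(?:-[^\W_]+)*")
# At least two letters or digits for the top-level domain.
TLD = re.compile(r"[^\W_]{2,}")


def single_script(text):
    """Rough script run test: one approximate script for all letters,
    one range of 10 digits for all decimal digits."""
    scripts = {unicodedata.name(c, "UNKNOWN").split()[0]
               for c in text if c.isalpha()}
    zeros = {ord(c) - unicodedata.decimal(c)
             for c in text if unicodedata.category(c) == "Nd"}
    return len(scripts) <= 1 and len(zeros) <= 1


def each_part_single_script(domain):
    """Each dotted part must be well formed and use one script,
    but different parts may use different scripts."""
    *subs, tld = domain.split(".")
    if not subs or not TLD.fullmatch(tld):
        return False
    if not all(LABEL.fullmatch(s) for s in subs):
        return False
    return all(single_script(part) for part in [*subs, tld])


if __name__ == "__main__":
    bulgaria = "\u0431\u044a\u043b\u0433\u0430\u0440\u0438\u044f"  # българия
    greece = "\u0395\u03bb\u03bb\u03ac\u03b4\u03b1"                # Ελλάδα
    for d in (bulgaria + ".com",
              greece + "." + bulgaria + ".com",
              "\u0430pple.com"):                                   # mixed
        print(repr(d), each_part_single_script(d))
```

Checking each part separately mirrors the way the repeated script run validates its contents once per iteration.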

You can change the regex to ^(*asr:(?:[\p{L}\p{N}]++(?:-[\p{L}\p{N}]++)*+\.)++)(*asr:[\p{L}\p{N}]{2,}+)$ if you want all the sub-domains to use the same script as the domain. The repeated group that matches each sub-domain is now a non-capturing group. It is wrapped inside a script run so that the entire string of sub-domains is validated as a single script run. The top-level domain still gets its own script run. This regex still matches българия.com but not Ελλάδα.българия.com.

If you want the entire domain, including the top-level domain, to use the same script then you can combine the two script runs from the preceding regex into one. ^(*asr:(?:[\p{L}\p{N}]++(?:-[\p{L}\p{N}]++)*+\.)++[\p{L}\p{N}]{2,}+)$ matches apple.com with only Latin letters and здоровое-питание.рф with only Cyrillic letters. It doesn’t match any of the other domain names on this page.
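The two stricter policies differ only in which stretch of text is treated as one script run. The sketch below illustrates that under the same assumptions as any emulation without real script run support: the first word of a character's Unicode name approximates its script, [^\W_] approximates [\p{L}\p{N}], and all function names are hypothetical.

```python
import re
import unicodedata

LABEL = re.compile(r"[^\W_]+(?:-[^\W_]+)*")
TLD = re.compile(r"[^\W_]{2,}")


def single_script(text):
    """Rough script run test; dots and hyphens are ignored as shared
    characters."""
    scripts = {unicodedata.name(c, "UNKNOWN").split()[0]
               for c in text if c.isalpha()}
    zeros = {ord(c) - unicodedata.decimal(c)
             for c in text if unicodedata.category(c) == "Nd"}
    return len(scripts) <= 1 and len(zeros) <= 1


def well_formed(domain):
    """Syntax only: one or more hyphenated labels plus a TLD."""
    *subs, tld = domain.split(".")
    return (bool(subs)
            and all(LABEL.fullmatch(s) for s in subs)
            and TLD.fullmatch(tld) is not None)


def subdomains_share_script(domain):
    """Everything before the TLD is one run; the TLD is another."""
    if not well_formed(domain):
        return False
    head, _, tld = domain.rpartition(".")
    return single_script(head) and single_script(tld)


def whole_domain_single_script(domain):
    """The entire domain, TLD included, is one run."""
    return well_formed(domain) and single_script(domain)


if __name__ == "__main__":
    bulgaria = "\u0431\u044a\u043b\u0433\u0430\u0440\u0438\u044f"  # българия
    greece = "\u0395\u03bb\u03bb\u03ac\u03b4\u03b1"                # Ελλάδα
    for d in ("apple.com", bulgaria + ".com",
              greece + "." + bulgaria + ".com"):
        print(repr(d), subdomains_share_script(d),
              whole_domain_single_script(d))
```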

All these regexes use possessive quantifiers to make them more efficient. This doesn’t change the results. The regexes don’t really backtrack anyway because each part of these regexes is mutually exclusive with whatever may follow. The possessive quantifiers tell the regex engine to not even try.
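The claim that possessive quantifiers leave the results unchanged can be spot-checked on a simplified ASCII-only version of the sub-domain pattern. This sketch assumes Python 3.11 or later for the possessive syntax, since older versions of the re module reject it, so the comparison is skipped there. The patterns and the helper name are illustrative and not taken from this page.

```python
import re
import sys

# Simplified, ASCII-only version of one dotted part.
GREEDY = r"[a-z]+(?:-[a-z]+)*\."
POSSESSIVE = r"[a-z]++(?:-[a-z]++)*+\."

SAMPLES = ["my-site.", "site.", "my--site.", "-site.", "my-site"]


def same_results(samples):
    """Compare both patterns on every sample. Returns None when the
    running Python version lacks possessive quantifiers."""
    if sys.version_info < (3, 11):
        return None
    return all(
        bool(re.fullmatch(GREEDY, s)) == bool(re.fullmatch(POSSESSIVE, s))
        for s in samples
    )


if __name__ == "__main__":
    print(same_results(SAMPLES))
```

A handful of samples is not a proof, of course. The results agree because a hyphen, a letter, and a dot can never be mistaken for one another, so giving characters back could never produce a match.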

Because the quantifiers are all possessive, none of the script runs have any contents that could ever backtrack. So it doesn’t matter whether the script runs are atomic or not. We’ve made them atomic anyway because in most cases where you want to make sure something consists of a single script you don’t want the script run to backtrack to try to find a shorter match. Thus it’s a good habit to use atomic script runs by default. Use non-atomic script runs only when you have a clear reason to do so.