What’s the difference between apple.com and аррӏе.com? Both bring you to Apple’s website, but the latter might trigger a warning in your browser. If we use a different font as in apple.com and аррӏе.com then you might spot the difference. The regex ^[a-z]+\.com$ matches the former but not the latter. ^\p{Cyrillic}+\.com$ matches the latter. The second domain name consists entirely of letters from the Cyrillic script that just happen to look very much like letters from the Latin script. Such lookalike letters are called homographs.
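The difference becomes visible once the code points are printed. The following is a minimal sketch, not the author's code. It uses Python's standard `re` module, which does not support `\p{Cyrillic}`, so only the ASCII regex from the paragraph above is tested. The names `LATIN_DOMAIN`, `CYRILLIC_DOMAIN`, and `describe` are hypothetical and exist only for this illustration.

```python
import re
import unicodedata

LATIN_DOMAIN = "apple.com"
# The Cyrillic lookalike, written with escapes so the code points are explicit.
CYRILLIC_DOMAIN = "\u0430\u0440\u0440\u04cf\u0435.com"

# The ASCII-only regex from the text.
ASCII_ONLY = re.compile(r"^[a-z]+\.com$")


def describe(text):
    """Return (code point, Unicode name) pairs for each character."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>")) for ch in text]


for domain in (LATIN_DOMAIN, CYRILLIC_DOMAIN):
    # Report whether the ASCII-only regex accepts the domain.
    print(domain, bool(ASCII_ONLY.match(domain)))
    # List the characters of the part before the dot.
    for code_point, name in describe(domain.split(".")[0]):
        print("   ", code_point, name)
```

The Unicode names make it plain that the second string contains no Latin letters before the dot, even though the two strings may render identically.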
For the internet to be equally accessible to anyone, including to people for whom Latin letters are just as weird and foreign as Cyrillic letters might be to westerners, domain names can’t be restricted to Latin characters. българия.com (bulgaria.com; personal website of a software developer) and здоровое-питание.рф (Russian government website promoting healthy eating) are perfectly legitimate domain names. Your browser will display these domains and their websites normally.
But what about аpple.com and applе.com? No, these are not the same two strings from the first paragraph. The two regexes above match neither of these. If you paste them into your browser you’ll see xn--pple-43d.com and xn--appl-y4d.com. Neither domain is registered at the time of writing. Neither domain should be. In the first one, а is a Cyrillic letter. In the second one, е is a Cyrillic letter. All the other letters are ASCII. Domain names that mix homographs from different scripts have been used in phishing attacks to lure people to fake websites. When browsers detect a mixture of scripts in a domain name they convert it to Punycode, which is a method for representing internationalized domain names in plain ASCII.
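The Punycode forms can be reproduced outside the browser. This sketch assumes Python's built-in `idna` codec, which implements the older IDNA 2003 rules and may disagree with current browsers on some names. For these two domains it gives the forms quoted above. The helper name `to_ascii` is hypothetical.

```python
def to_ascii(domain):
    """Convert a domain name to its ASCII (Punycode) form, label by label."""
    return domain.encode("idna").decode("ascii")


mixed_first = "\u0430pple.com"  # Cyrillic а followed by ASCII "pple"
mixed_last = "appl\u0435.com"   # ASCII "appl" followed by Cyrillic е

print(to_ascii(mixed_first))  # xn--pple-43d.com
print(to_ascii(mixed_last))   # xn--appl-y4d.com
```

The conversion only changes how the name is written. It does not decide whether the name is suspicious, which is why a separate check for mixed scripts is still needed.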
We need a way to detect domain names that mix letters from different scripts as they are extremely likely to be fraudulent. But we still need to allow българия.com which is an all-Cyrillic domain on the classic .com top-level domain. We also need to allow здоровое-питание.рф which has only Cyrillic letters but still has an ASCII hyphen and dot.
Perl 5.28 introduced a new regular expression concept called a script run that is perfect for this. PCRE2 10.33 adopted this feature, which then brought it to PHP 7.4.0 and R 4.0.0.
A script run requires all characters to be part of the same script. The actual rules are a little more complicated, but that’s what they come down to. A script run can contain Latin letters or Cyrillic letters, but not both. The rules ignore characters that are commonly shared among scripts, which includes all ASCII punctuation, so they don’t break up script runs. In addition, if the run contains digits then they must all be part of a single range of 10 digits. A run can’t mix the ASCII digits 0 to 9 with the fullwidth digits ０ to ９ (U+FF10 to U+FF19), for example, even though both these digit ranges are “common” characters.
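For engines without script runs, the simplified rule can be approximated in code. The sketch below is a rough Python stand-in, not the algorithm Perl or PCRE2 actually uses. It assumes that the first word of a letter's Unicode name (such as `LATIN` or `CYRILLIC`) is a good enough guess at its script, because the standard library does not expose the Script property. It also ignores the more complicated cases the real rules handle. The function names `script_of` and `is_single_script` are hypothetical.

```python
import unicodedata


def script_of(ch):
    """Rough script guess: the first word of the character's Unicode name.

    Assumption: this stands in for the real Script property.
    """
    return unicodedata.name(ch, "UNKNOWN").split()[0]


def is_single_script(text):
    """Approximate the simplified script-run rule for `text`."""
    scripts = set()
    digit_zeros = set()
    for ch in text:
        category = unicodedata.category(ch)
        if category == "Nd":
            # Code point of the zero in this digit's range of ten digits.
            digit_zeros.add(ord(ch) - unicodedata.decimal(ch))
        elif category.startswith("L"):
            scripts.add(script_of(ch))
        # Everything else (punctuation, symbols, marks) is treated as shared.
    return len(scripts) <= 1 and len(digit_zeros) <= 1


print(is_single_script("здоровое-питание"))  # True: Cyrillic letters plus a hyphen
print(is_single_script("\u0430pple"))        # False: Cyrillic а mixed with Latin
print(is_single_script("12\uff13"))          # False: ASCII and fullwidth digits
```

Subtracting the digit's value from its code point identifies which range of ten a digit belongs to, so two digits from different ranges produce two different zeros.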
The regular expression ^(*asr:[\p{L}\p{N}
To implement this, the regex uses a separate script run to match each of the dotted parts. The first one is repeated to allow any number of sub-domains. The script run checks its contents separately for each repetition. The first iteration can accept Ελλάδα. as a valid script run with Greek letters and the second iteration can accept българия. with Cyrillic letters.
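The per-part behaviour can be imitated without a regex engine by checking each dotted part on its own. This is only a sketch under the same assumption as before: a letter's script is guessed from the first word of its Unicode name. It omits the digit rule and does not validate which characters a domain may contain. The names `label_scripts` and `labels_are_unmixed` are hypothetical.

```python
import unicodedata


def label_scripts(label):
    """Scripts of the letters in one label, guessed from Unicode names."""
    return {
        unicodedata.name(ch, "UNKNOWN").split()[0]
        for ch in label
        if unicodedata.category(ch).startswith("L")
    }


def labels_are_unmixed(domain):
    """True if no single dotted part mixes letters from two scripts."""
    return all(len(label_scripts(label)) <= 1 for label in domain.split("."))


print(labels_are_unmixed("Ελλάδα.българия.com"))  # True: one script per part
print(labels_are_unmixed("\u0430pple.com"))       # False: first part is mixed
```

Because each part is checked independently, a Greek sub-domain followed by a Cyrillic one passes, just as each repetition of the script run is checked separately.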
You can change regex to ^(*asr:(?:[\p{L}\p{N}
If you want the entire domain, including the top-level domain, to use the same script then you can combine the two script runs from the preceding regex into one. ^(*asr:(?:[\p{L}\p{N}
All these regexes use possessive quantifiers to make them more efficient. This doesn’t change the results. The regexes don’t really backtrack anyway because each part of these regexes is mutually exclusive with whatever may follow. The possessive quantifiers tell the regex engine to not even try.
Because the quantifiers are all possessive, none of the script runs have any contents that could ever backtrack. So it doesn’t matter whether the script runs are atomic or not. We’ve made them atomic anyway because in most cases where you want to make sure something consists of a single script you don’t want the script run to backtrack to try to find a shorter match. Thus it’s a good habit to use atomic script runs by default. Use non-atomic script runs only when you have a clear reason to do so.
Page URL: https://www.regular-expressions.info/homographs.html
Page last updated: 19 June 2025
Site last updated: 29 October 2025
Copyright © 2003-2025 Jan Goyvaerts. All rights reserved.