XML Schema Regular Expressions

The W3C XML Schema standard defines its own regular expression flavor. You can use it in the pattern facet of simple type definitions in your XML schemas. For example, the following defines the simple type “SSN”, using a regular expression to require the element to contain text in the format of a US social security number, such as 123-45-6789.

<xsd:simpleType name="SSN">
    <xsd:restriction base="xsd:token">
        <xsd:pattern value="[0-9]{3}-[0-9]{2}-[0-9]{4}"/>
    </xsd:restriction>
</xsd:simpleType>

Compared with other regular expression flavors, the XML schema flavor is quite limited in features. Since it’s only used to validate whether an entire element matches a pattern or not, rather than for extracting matches from large blocks of data, you won’t really miss the features often found in other flavors. The limitations allow schema validators to be implemented with efficient text-directed engines.

Particularly noteworthy is the complete absence of anchors like the caret and dollar, word boundaries, and lookaround. XML Schema always implicitly anchors the entire regular expression: the regex must match the whole element for the element to be considered valid. If you have the pattern regexp, the XML Schema validator will apply it in the same way as, say, Perl, Java or .NET would apply the pattern ^regexp$. If you want to accept all elements with regex somewhere in the middle of their contents, you’ll need to use the regular expression .*regex.*. The two .* expand the match to cover the whole element, assuming it doesn’t contain line breaks. If you want to allow line breaks, you can use something like [\s\S]*regex[\s\S]*. Combining a shorthand character class with its negated version results in a character class that matches any character, including line breaks.
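The implicit anchoring can be imitated in another flavor to see its effect. The sketch below uses Python's re module, which is not the XML Schema flavor; re.fullmatch stands in for the validator's whole-element matching, and the helper name xsd_like_match is made up for this illustration.

```python
import re

def xsd_like_match(pattern: str, text: str) -> bool:
    """Emulate XML Schema's implicit anchoring: the whole text must match.

    Assumption: Python's re is only a stand-in here. It is a different
    flavor, though these simple patterns behave the same way in both.
    """
    return re.fullmatch(pattern, text) is not None

ssn = r"[0-9]{3}-[0-9]{2}-[0-9]{4}"
print(xsd_like_match(ssn, "123-45-6789"))       # True: whole element matches
print(xsd_like_match(ssn, "SSN: 123-45-6789"))  # False: extra text before the number

# Allow the pattern anywhere in the element by padding it with .*
print(xsd_like_match(r".*regex.*", "a regex here"))                    # True
# The dot does not match line breaks, so the padding cannot span lines...
print(xsd_like_match(r".*regex.*", "line one\nregex\nline three"))     # False
# ...while [\s\S]* covers line breaks as well.
print(xsd_like_match(r"[\s\S]*regex[\s\S]*", "line one\nregex\nline three"))  # True
```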

XML schemas do not provide a way to specify matching modes. The dot never matches line breaks, and patterns are always applied case sensitively. If you want to match the word literal case insensitively, you’ll need to rewrite the pattern as [lL][iI][tT][eE][rR][aA][lL].
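Writing out such character classes by hand gets tedious for longer words, so the rewrite can be generated. The following Python sketch is one possible way to do it. The function name case_insensitive_pattern and the set of metacharacters it escapes are assumptions of this example, and Python's re.fullmatch is used only to check the result.

```python
import re

# Characters escaped with a backslash when they appear in the literal.
# Assumption: this is the set of metacharacters in the XML Schema flavor.
XSD_META = set("\\|.?*+(){}[]-^")

def case_insensitive_pattern(literal: str) -> str:
    """Expand a literal into a pattern that matches it in any letter case."""
    parts = []
    for ch in literal:
        lower, upper = ch.lower(), ch.upper()
        if lower != upper and len(lower) == 1 and len(upper) == 1:
            # A cased letter becomes a two-character class such as [lL].
            parts.append(f"[{lower}{upper}]")
        elif ch in XSD_META:
            parts.append("\\" + ch)
        else:
            parts.append(ch)
    return "".join(parts)

pattern = case_insensitive_pattern("literal")
print(pattern)                                # [lL][iI][tT][eE][rR][aA][lL]
print(bool(re.fullmatch(pattern, "LiTeRaL")))  # True
```

The generated string can be pasted into the value attribute of a pattern facet.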

XML regular expressions don’t have any tokens like \xFF or \uFFFF to match particular (non-printable) characters. You have to add them as literal characters to your regex. If you are entering the regex into an XML file using a plain text editor, then you can use the &#xFFFF; XML syntax. Otherwise, you’ll need to paste in the characters from a character map.
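The character reference is resolved by the XML parser, not by the regex engine. The Python sketch below illustrates this with the standard xml.etree.ElementTree module; the schema fragment and the pattern value caf&#xE9; are hypothetical examples, not taken from any real schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical schema fragment: the pattern contains an XML character
# reference for U+00E9 rather than a regex escape such as \xE9.
fragment = (
    '<xsd:pattern xmlns:xsd="http://www.w3.org/2001/XMLSchema" '
    'value="caf&#xE9;"/>'
)

pattern_value = ET.fromstring(fragment).get("value")

# The parser has already replaced the reference with the character itself,
# so the regex engine only ever sees a four-character literal.
print(len(pattern_value))            # 4
print(pattern_value == "caf\u00e9")  # True
```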

Lazy quantifiers are not available. Since the pattern is anchored at the start and the end of the subject string anyway, and only a success/failure result is returned, the only potential difference between a greedy and lazy quantifier would be performance. You can never make a fully anchored pattern match or fail by changing a greedy quantifier into a lazy one or vice versa.
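This equivalence is easy to check in a flavor that does have lazy quantifiers. The sketch below uses Python's re module as a stand-in (an assumption, since it is not the XML Schema flavor), with re.fullmatch playing the role of the implicit anchors; the subject strings are arbitrary.

```python
import re

subjects = ["ab", "axb", "axbxb", "a", "abx", "xab", ""]

greedy = re.compile(r"a.*b")
lazy = re.compile(r"a.*?b")

# Fully anchored: both variants accept and reject exactly the same subjects.
for s in subjects:
    g = greedy.fullmatch(s) is not None
    l = lazy.fullmatch(s) is not None
    print(repr(s), g, l)

# Unanchored search, by contrast, shows the difference in match length.
print(greedy.search("axbxb").group())  # axbxb
print(lazy.search("axbxb").group())    # axb
```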

XML Schema regular expressions support the following:

Note that the regular expression functions available in XQuery and XPath use a different regular expression flavor. This flavor is a superset of the XML Schema flavor described here. It adds some of the features that are available in many modern regex flavors, but not in the XML Schema flavor.

XML Character Classes

Despite their limitations, XML Schema regular expressions introduce two handy features. The special shorthand character classes \i and \c make it easy to match XML names. No other regex flavor supports these.
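To give a feel for what the XML Schema pattern \i\c* accepts, here is a rough Python approximation. It is limited to ASCII as an assumption of this sketch: the real shorthands also cover the many non-ASCII characters allowed in XML names, and the constant names are invented for the example.

```python
import re

# ASCII-only approximations (an assumption of this sketch):
#   \i  initial name character: a letter, underscore or colon
#   \c  name character: \i plus digits, period and hyphen
NAME_START = r"[A-Za-z_:]"
NAME_CHAR = r"[A-Za-z0-9_:.\-]"

# Rough equivalent of the XML Schema pattern \i\c*
xml_name = re.compile(NAME_START + NAME_CHAR + "*")

for candidate in ["xsd:pattern", "_private", "my-element.v2", "2fast", "has space"]:
    print(candidate, xml_name.fullmatch(candidate) is not None)
```

The first three candidates are accepted; 2fast fails because a name cannot start with a digit, and has space fails because a space is not a name character.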

Character class subtraction makes it easy to match a character that is in a certain list, but not in another list. E.g. [a-z-[aeiou]] matches an English consonant. This feature is now also available in the JGsoft and .NET regex engines. It is particularly handy when working with Unicode properties. E.g. [\p{L}-[\p{IsBasicLatin}]] matches any letter that is not an English letter.
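Flavors without character class subtraction need a workaround to get the same result. As a point of comparison, the Python sketch below emulates [a-z-[aeiou]] with a negative lookahead. The lookahead is a substitute technique, not something the XML Schema flavor itself offers.

```python
import re

# XML Schema: [a-z-[aeiou]]
# Emulation: rule out a vowel first, then match any lowercase letter.
consonant = re.compile(r"(?![aeiou])[a-z]")

print([ch for ch in "regex flavor" if consonant.fullmatch(ch)])
# ['r', 'g', 'x', 'f', 'l', 'v', 'r']
```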
