Tcl Has Three Regular Expression Flavors

Tcl 8.2 and later support three regular expression flavors. The Tcl man pages dub them Basic Regular Expressions (BRE), Extended Regular Expressions (ERE) and Advanced Regular Expressions (ARE). BREs and EREs exist mainly for backward compatibility with previous versions of Tcl. They implement the two flavors defined in the POSIX standard. AREs are new in Tcl 8.2. They’re the default and recommended flavor. This flavor implements the POSIX ERE flavor, with many added features. Most of these features are inspired by similar features in Perl regular expressions.

Tcl’s regular expression support is based on a library developed for Tcl by Henry Spencer. This library has since been used in a number of other programming languages and applications, such as the PostgreSQL database and the wxWidgets GUI library for C++. Everything said about Tcl in this regular expressions tutorial applies to any tool that uses Henry Spencer’s Advanced Regular Expressions.

There are a number of important differences between Tcl Advanced Regular Expressions and Perl-style regular expressions. Tcl uses \m, \M, \y and \Y for word boundaries: \m matches at the start of a word, \M at the end of a word, \y at either, and \Y at any position that is not a word boundary. Perl and most other modern regex flavors use \b and \B. In Tcl, \b and \B match a backspace and a backslash, respectively.
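
To make the difference concrete, here is a small sketch of the Tcl word boundary tokens at work. The subject string and variable names are made up for illustration.

```tcl
# Illustrative subject string; the variable names are made up for this sketch.
set subject "cat catalog bobcat"

# \y matches at the start or the end of a word: only the standalone "cat".
set whole [regexp -all -inline {\ycat\y} $subject]

# \m matches only at the start of a word: "cat" and "catalog".
set starts [regexp -all {\mcat} $subject]

# \M matches only at the end of a word: "cat" and "bobcat".
set ends [regexp -all {cat\M} $subject]

# \b is a backspace character in a Tcl ARE, so this Perl-style regex finds nothing.
set perlStyle [regexp {\bcat\b} $subject]

puts "whole=$whole starts=$starts ends=$ends perlStyle=$perlStyle"
```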

Tcl also takes a completely different approach to mode modifiers. The (?letters) syntax is the same, but the available mode letters and their meanings are quite different. Instead of adding mode modifiers to the regular expression, you can pass more descriptive switches like -nocase to the regexp and regsub commands for some of the modes. Mode modifier spans in the style of (?modes:regex) are not supported. Mode modifiers must appear at the start of the regex. They affect the whole regex. Mode modifiers in the regex override command switches. Tcl supports these modes:

(?i) or -nocase makes the regex match case insensitive.
(?c) makes the regex match case sensitive. This mode is the default.
(?x) or -expanded activates the free-spacing regexp syntax.
(?t) disables the free-spacing regexp syntax. This mode is the default. The “t” stands for “tight”, the opposite of “expanded”.
(?b) tells Tcl to interpret the remainder of the regular expression as a Basic Regular Expression.
(?e) tells Tcl to interpret the remainder of the regular expression as an Extended Regular Expression.
(?q) tells Tcl to interpret the remainder of the regular expression as plain text. The “q” stands for “quoted”.
(?s) selects “non-newline-sensitive matching”, which is the default. The “s” stands for “single line”. In this mode, the dot and negated character classes match all characters, including newlines. The caret and dollar match only at the very start and end of the subject string.
(?p) or -linestop enables “partial newline-sensitive matching”. In this mode, the dot and negated character classes do not match newlines. The caret and dollar match only at the very start and end of the subject string.
(?w) or -lineanchor enables “inverse partial newline-sensitive matching”. The “w” stands for “weird”. (Don’t look at me! I didn’t come up with this.) In this mode, the dot and negated character classes match all characters, including newlines. The caret and dollar match after and before newlines.
(?n) or -line enables what Tcl calls “newline-sensitive matching”. The dot and negated character classes do not match newlines. The caret and dollar match after and before newlines. Specifying (?n) or -line is the same as specifying (?pw) or -linestop -lineanchor.
(?m) is a historical synonym for (?n).
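
As a rough illustration of the list above, the following sketch applies a few of these modes both as command switches and as embedded modifiers. The subject strings and variable names are invented for the example.

```tcl
set subject "Hello World"

# The -nocase switch and the (?i) modifier have the same effect.
set viaSwitch   [regexp -nocase {hello} $subject]
set viaModifier [regexp {(?i)hello} $subject]

# Without either one, matching is case sensitive, as with (?c).
set caseSensitive [regexp {hello} $subject]

# -expanded (or (?x)) ignores whitespace and #-comments inside the regex.
set freeSpacing [regexp -expanded {
    \d{3}   # three digits
    -       # a literal hyphen
    \d{4}   # four digits
} "555-1234"]

# (?q) treats the rest of the regex as plain text, so the dot is literal.
set quotedMiss [regexp {(?q)a.c} "abc"]
set quotedHit  [regexp {(?q)a.c} "a.c"]

puts "$viaSwitch $viaModifier $caseSensitive $freeSpacing $quotedMiss $quotedHit"
```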

If you use regular expressions with Tcl and other programming languages, be careful when dealing with the newline-related matching modes. Tcl’s designers found Perl’s /m and /s modes confusing. They are confusing, but at least Perl has only two, and each affects only one thing. In Perl, /m or (?m) enables “multi-line mode”, which makes the caret and dollar match after and before newlines. By default, they match at the very start and end of the string only. In Perl, /s or (?s) enables “single line mode”. This mode makes the dot match all characters, including line breaks. By default, it doesn’t match line breaks. Perl does not have a mode modifier to exclude line breaks from negated character classes. In Perl, [^a] matches anything except a, including newlines. The only way to exclude newlines is to write [^a\n]. Perl’s default matching mode is like Tcl’s (?p), except for the difference in negated character classes.

Why compare Tcl with Perl? Many popular regex flavors such as .NET, Java, PCRE and Python support the same (?m) and (?s) modifiers with the exact same defaults and effects as in Perl. Negated character classes work the same in all these languages and libraries. It’s unfortunate that Tcl didn’t follow Perl’s standard, since Tcl’s four options are just as confusing as Perl’s two options. Together they make a very nice alphabet soup.

If you ignore the fact that Tcl’s options affect negated character classes, you can use the following table to translate between Tcl’s newline modes and Perl-style newline modes. Note that the defaults are different. If you don’t use any switches, (?s). and . are equivalent in Tcl, but not in Perl.

Tcl	Perl	Anchors	Dot
`(?s)` (default)	`(?s)`	Start and end of string only	Any character
`(?p)`	(default)	Start and end of string only	Any character except newlines
`(?w)`	`(?sm)`	Start and end of string, and at newlines	Any character
`(?n)`	`(?m)`	Start and end of string, and at newlines	Any character except newlines
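
The sketch below shows how Tcl’s newline-related switches change what the dot, negated character classes and anchors match. The two-line subject string and the variable names are made up for the example.

```tcl
# A subject with one embedded newline (the quotes turn \n into a real newline).
set subject "first line\nsecond line"

# Default (?s): the dot matches newlines, so .* runs to the end of the string.
regexp {first.*} $subject dotDefault

# -linestop (?p): the dot stops at the newline.
regexp -linestop {first.*} $subject dotLinestop

# Negated character classes follow the same rule as the dot.
regexp {[^x]+} $subject classDefault
regexp -linestop {[^x]+} $subject classLinestop

# Default: the caret matches only at the very start of the subject.
set anchorDefault [regexp {^second} $subject]

# -lineanchor (?w): the caret also matches after a newline.
set anchorLine [regexp -lineanchor {^second} $subject]

# -line (?n) combines both: one word is found at the start of each line.
set lineStarts [regexp -all -line {^\w+} $subject]
```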

Regular Expressions as Tcl Words

You can insert regular expressions in your Tcl source code either by enclosing them with double quotes (e.g. "my regexp") or by enclosing them with curly braces (e.g. {my regexp}). Since braces, unlike quotes, don’t perform any substitution, they’re by far the best choice for regular expressions.

The only thing you need to worry about is that unescaped braces in the regular expression must be balanced. Escaped braces don’t need to be balanced, but the backslash used to escape the brace remains part of the regular expression. You can easily satisfy these requirements by escaping all braces in your regular expression, except those used as a quantifier. This way your regex will work as expected, and you don’t need to change it at all when pasting it into your Tcl source code, other than putting a pair of braces around it.

The regular expression ^\{\d{3}\\$ matches a string that consists entirely of an opening brace, three digits and one backslash. In Tcl, this becomes {^\{\d{3}\\$}. There’s no doubling of backslashes or any sort of escaping needed, as long as you escape literal braces in the regular expression. { and \{ are both valid regular expressions to match a single opening brace in a Tcl ARE (and any Perl-style regex flavor, for that matter). Only the latter works correctly in a Tcl literal enclosed with braces.
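
Here is a short sketch comparing the braced form of the regex that matches an opening brace, three digits and a backslash with its double-quoted equivalent. The variable names are illustrative only.

```tcl
# Braced: the regex is pasted as-is, with the literal brace escaped.
set braced {^\{\d{3}\\$}

# Quoted: every backslash must be doubled, and the dollar sign escaped,
# because Tcl performs substitution inside double quotes.
set quoted "^\\{\\d{3}\\\\\$"

# Both words hold the same regular expression.
set same [expr {$braced eq $quoted}]

# The subject is an opening brace, three digits and one backslash.
set subject "\{123\\"
set matched [regexp $braced $subject]

puts "same=$same matched=$matched"
```

The quoted version needs eight extra backslashes to say the same thing, which is the main argument for using braces.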

Finding Regex Matches

In Tcl, you can use the regexp command to test if a regular expression matches (part of) a string, and to retrieve the matched part(s). The syntax of the command is:

regexp ?switches? regexp subject ?matchvar? ?group1var group2var ...?

Immediately after the regexp command, you can place zero or more switches from the list above to indicate how Tcl should apply the regular expression. The only required parameters are the regular expression and the subject string. You can specify a literal regular expression using braces as I just explained. Or, you can reference any string variable holding a regular expression read from a file or user input.

If you pass the name of a variable as an additional argument, Tcl stores the part of the string matched by the regular expression into that variable. Tcl does not set the variable to an empty string if the match attempt fails; the variable keeps whatever value it had. If the regular expression has capturing groups, you can add additional variable names to capture the text matched by each group. If you specify fewer variables than the regex has capturing groups, the text matched by the additional groups is not stored. If you specify more variables than the regex has capturing groups, the additional variables are set to an empty string if the overall regex match was successful.
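
The following sketch shows these rules in action. The date-like subject and all variable names are made up for the example.

```tcl
set subject "2024-03-15"

# Two capturing groups, but three group variables: "extra" has no group.
set found [regexp {(\d{4})-(\d{2})} $subject whole year month extra]
# whole holds the overall match, year and month hold the two groups,
# and extra is set to an empty string because the match succeeded.

# A failed match leaves the match variable alone.
set kept "untouched"
set missed [regexp {[a-z]+} $subject kept]

puts "found=$found whole=$whole year=$year month=$month extra='$extra'"
puts "missed=$missed kept=$kept"
```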

The regexp command returns 1 if (part of) the string could be matched, and zero if there’s no match. The following script applies the regular expression my regex case insensitively to the string stored in the variable subjectstring and displays the result:

if {[regexp -nocase {my regex} $subjectstring matchresult]} {
  puts $matchresult
} else {
  puts "my regex could not match the subject string"
}

The regexp command supports three more switches that aren’t regex mode modifiers. The -all switch causes the command to return a number indicating how many times the regex could be matched. The variables storing the regex and group matches will store the last match in the string only.
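
A minimal sketch of -all, using an invented subject string and variable names:

```tcl
set subject "one two three"

# -all makes regexp return the number of matches instead of 1 or 0.
set count [regexp -all {(\w)\w*} $subject word initial]

# The variables hold the last match only: the final word and its first letter.
puts "count=$count word=$word initial=$initial"
```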

The -inline switch tells the regexp command to return a list with the substring matched by the regular expression and all substrings matched by all capturing groups. If you also specify the -all switch, the list will contain the first regex match, all the group matches of the first match, then the second regex match, the group matches of the second match, etc. You cannot pass match variables when using -inline.
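
As a sketch of how the returned list is laid out, the example below walks through it three elements at a time, since the regex has two capturing groups. The subject string and variable names are hypothetical.

```tcl
set subject "a=1, b=2"

# One match: the overall match followed by the two groups.
set first [regexp -inline {(\w)=(\d)} $subject]

# All matches: the same triple repeated for each match, in one flat list.
set every [regexp -all -inline {(\w)=(\d)} $subject]

# Walk the flat list in steps of three: overall match, group 1, group 2.
set keys {}
set values {}
foreach {pair key value} $every {
    lappend keys $key
    lappend values $value
}

puts "first=$first every=$every keys=$keys values=$values"
```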

The -start switch must be followed by a number (as a separate Tcl word) that indicates the character offset in the subject string at which Tcl should attempt the match. Everything before the starting position will be invisible to the regex engine. This means that \A will match at the character offset you specify with -start, even if that position is not at the start of the string.
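
A small sketch of -start, with a made-up subject string:

```tcl
set subject "abcdef"

# Without -start, \A anchors at offset 0, where "def" is not found.
set fromBeginning [regexp {\Adef} $subject]

# With -start 3, \A matches at offset 3, so the regex matches "def".
set fromOffset [regexp -start 3 {\Adef} $subject]

# The first three characters are never considered for a match.
set firstLetter [regexp -start 3 -inline {[a-z]} $subject]

puts "fromBeginning=$fromBeginning fromOffset=$fromOffset firstLetter=$firstLetter"
```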

Replacing Regex Matches

With the regsub command, you can replace regular expression matches in a string.

regsub ?switches? regexp subject replacement ?resultvar?

Just like the regexp command, regsub takes zero or more switches followed by a regular expression. It supports the same switches, except for -inline. Remember to specify -all if you want to replace all matches in the string.

The argument after the subject string should be the replacement text. You can specify a literal replacement using the brace syntax, or reference a string variable. The regsub command recognizes a few metacharacters in the replacement text. You can use \0 as a placeholder for the whole regex match, and \1 through \9 for the text matched by one of the first nine capturing groups. You can also use & as a synonym of \0. Note that there’s no backslash in front of the ampersand. & is substituted with the whole regex match, while \& is substituted with a literal ampersand. Use \\ to insert a literal backslash. You only need to escape backslashes if they’re followed by a digit, to prevent the combination from being seen as a backreference. Again, to prevent unnecessary duplication of backslashes, you should enclose the replacement text with braces instead of double quotes. The replacement text \1 becomes {\1} when using braces, and "\\1" when using quotes.
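
The sketch below exercises each of the replacement metacharacters. The subject strings and variable names are invented for the example.

```tcl
set subject "John Smith"

# \1 and \2 insert the text matched by the capturing groups.
regsub {(\w+) (\w+)} $subject {\2, \1} swapped

# & and \0 both insert the whole regex match.
regsub -all {\w+} $subject {<&>} viaAmpersand
regsub -all {\w+} $subject {<\0>} viaZero

# \& inserts a literal ampersand.
regsub {and} "salt and pepper" {\&} literalAmpersand

# \\ inserts a literal backslash, so the 0 after it is just a digit.
regsub {x} "x" {\\0} literalBackslash

# The braced and quoted forms of the replacement text \1 are the same string.
set sameText [expr {{\1} eq "\\1"}]

puts "$swapped | $viaAmpersand | $literalAmpersand | $literalBackslash"
```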

If you pass the name of a variable as the final argument, that variable receives the string with the replacements applied, and regsub returns an integer indicating the number of replacements made. Tcl 8.4 and later allow you to omit the final argument. In that case regsub returns the string with the replacements applied.
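
Both calling styles are sketched below, assuming Tcl 8.4 or later for the second one. The subject string and variable names are illustrative.

```tcl
set subject "a1b22c333"

# With a result variable: regsub returns the number of replacements.
set count [regsub -all {\d+} $subject {N} result]

# Without a result variable (Tcl 8.4 and later): regsub returns the new string.
set direct [regsub -all {\d+} $subject {N}]

puts "count=$count result=$result direct=$direct"
```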
