PostgreSQL Has Three Regular Expression Flavors

PostgreSQL 7.4 and later use the exact same regular expression engine that was developed by Henry Spencer for Tcl 8.2. This means that PostgreSQL supports the same three regular expressions flavors: Tcl Advanced Regular Expressions, POSIX Extended Regular Expressions and POSIX Basic Regular Expressions. Just like in Tcl, AREs are the default. All my comments on Tcl’s regular expression flavor, like the unusual mode modifiers and word boundary tokens, fully apply to PostgreSQL. You should definitely review them if you’re not familiar with Tcl’s AREs. Unfortunately, PostgreSQL’s regexp_replace function does not use the same syntax for the replacement text as Tcl’s regsub command, however.

PostgreSQL versions prior to 7.4 supported POSIX Extended Regular Expressions only. If you are migrating old database code to a new version of PostgreSQL, you can set PostgreSQL’s “regex_flavor” run-time parameter to “extended” instead of the default “advanced” to make EREs the default.

PostgreSQL also supports the traditional SQL LIKE operator, and the SQL:1999 SIMILAR TO operator. These use their own pattern languages, which are not discussed here. AREs are far more powerful, and no more complicated if you don’t use functionality not offered by LIKE or SIMILAR TO.

The Tilde Operator

The tilde infix operator returns true or false depending on whether a regular expression can match part of a string, or not. E.g. 'subject' ~ 'regexp' returns false, while 'subject' ~ '\\w' returns true. If the regex must match the whole string, you’ll need to use anchors. E.g. 'subject' ~ '^\\w$' returns false, while 'subject' ~ '^\\w+$' returns true. There are 4 variations of this operator:

~ attempts a case sensitive match
~* attempts a case insensitive match
!~ attempts a case sensitive match, and returns true if the regex does not match any part of the subject string
!~* attempts a case insensitive match, and returns true if the regex does not match any part of the subject string

While only case sensitivity can be toggled by the operator, all other options can be set using mode modifiers at the start of the regular expression. Mode modifiers override the operator type. E.g. ‘(?c)regex’ forces the to be regex case sensitive.

The most common use of this operator is to select rows based on whether a column matches a regular expression, e.g.:

select * from mytable where mycolumn ~* 'regexp'

Regular Expressions as Literal PostgreSQL Strings

The backslash is used to escape characters in PostgreSQL strings. So a regular expression like \w that contains a backslash becomes '\\w' when written as a literal string in a PostgreSQL statement. To match a single literal backslash, you’ll need the regex \\ which becomes '\\\\' in PostgreSQL.

PostgreSQL Regexp Functions

With the substring(string from pattern) function, you can extract part of a string or column. It takes two parameters: the string you want to extract the text from, and the pattern the extracted text should match. If there is no match, substring() returns null. E.g. substring('subject' from 'regexp') returns null. If there is a match, and the regex has one or more capturing groups, the text matched by the first capturing group is returned. E.g. substring('subject' from 's(\\w)') returns ‘u’. If there is a match, but the regex has no capturing groups, the whole regex match is returned. E.g. substring('subject' from 's\\w') returns ‘su’. If the regex matches the string more than once, only the first match is returned. Since the substring() function doesn’t take a “flags” parameter, you’ll need to toggle any matching options using mode modifiers.

This function is particularly useful to extract information from columns. E.g. to extract the first number from the column mycolumn for each row, use:

select substring(mycolumn from '\d+') from mytable

With regexp_replace(subject, pattern, replacement [, flags]) you can replace regex matches in a string. If you omit the flags parameter, the regex is applied case sensitively, and only the first match is replaced. If you set the flags to 'i', the regex is applied case insensitively. The 'g' flag (for “global”) causes all regex matches in the string to be replaced. You can combine both flags as 'gi'.

You can use the backreferences \1 through \9 in the replacement text to re-insert the text matched by a capturing group into the regular expression. \& re-inserts the whole regex match. Remember to double up the backslashes in literal strings.

E.g. regexp_replace('subject', '(\w)\w', '\&\1', 'g') returns 'susbjbecet'.

PostgreSQL 8.3 and later have two new functions to split a string along its regex matches. regexp_split_to_table(subject, pattern[, flags]) returns the split string as a new table. regexp_split_to_array(subject, pattern[, flags]) returns the split string as an array of text. If the regex finds no matches, both functions return the subject string.

| EditPad Lite | EditPad Pro |

| MySQL | Oracle | PostgreSQL |