Regular Expressions with The R Language

The R Project for Statistical Computing provides five regular expression functions in its base package: grep, regexpr, gregexpr, sub, and gsub. All of them support three regular expression flavors, which you select with two parameters called extended and perl.

If you omit these parameters, extended is TRUE and perl is FALSE, which selects the default flavor, GNU Extended Regular Expressions. R's documentation says it implements the POSIX standard for regular expressions, but it actually uses the GNU regex library, which extends POSIX. If you set both parameters to FALSE, GNU Basic Regular Expressions are used. Despite their names, GNU ERE and GNU BRE implement the same limited set of features; only the syntax differs slightly.

For maximum regex functionality, set the perl parameter to TRUE. The extended parameter is then ignored. This tells R to use the PCRE regular expressions library.
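As a small sketch of why the flavor matters, the following uses a lookahead, a feature that PCRE provides. The sample vector x is only an illustration, and perl is passed by name so that the call does not depend on argument positions.

```r
# Illustrative input vector (an assumption, not taken from any real data set).
x <- c("abc", "def", "cba a", "aa")

# "a(?=b)" matches an "a" only when it is directly followed by "b".
# Lookahead is a PCRE feature, so perl = TRUE is set here.
hits <- grep("a(?=b)", x, perl = TRUE)
print(hits)
```

Only "abc" contains an a directly followed by b, so hits holds the single index 1.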

Finding Regex Matches in String Vectors

The grep function takes your regex as the first argument and the input vector as the second. Set the third argument, ignore.case, to TRUE to make the regex case insensitive, or to FALSE to keep it case sensitive. The fourth and fifth arguments are extended and perl, which select the regex flavor. The sixth argument is value. If you set it to FALSE or omit it, grep returns a vector with the indices of the elements of the input vector in which the regex found a match anywhere in the string. If you set value to TRUE, grep instead returns a vector with copies of those matching elements.

> grep("a+", c("abc", "def", "cba a", "aa"), value=FALSE)
[1] 1 3 4
> grep("a+", c("abc", "def", "cba a", "aa"), value=TRUE)
[1] "abc" "cba a" "aa"
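The case sensitivity switch can be sketched in the same way. This example passes ignore.case and value by name instead of by position, which is a choice made here for readability; the mixed-case input vector is made up for the illustration.

```r
# Illustrative input with mixed case (an assumption for this sketch).
x <- c("Abc", "def", "cba A", "aa")

# Case sensitive (the default): "Abc" has no lowercase "a", so it is skipped.
idx_sensitive <- grep("a+", x)

# Case insensitive: "Abc" now matches as well.
idx_insensitive <- grep("a+", x, ignore.case = TRUE)

# value = TRUE returns the matching strings instead of their indices.
vals <- grep("a+", x, ignore.case = TRUE, value = TRUE)

print(idx_sensitive)
print(idx_insensitive)
print(vals)
```

Naming the arguments also keeps the call readable when only one of the optional parameters is needed.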

The regexpr function takes the same arguments as the grep function, except for the value argument, which is not supported. regexpr returns an integer vector with the same length as the input vector. Each element in the returned vector indicates the character position in each corresponding string element in the input vector at which the (first) regex match was found. A match at the start of the string is indicated with character position 1. If the regex could not find a match in a certain string, its corresponding element in the result vector is -1. The returned vector also has a match.length attribute. This is another integer vector with the number of characters in the (first) regex match in each string, or -1 for strings that didn't match.
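The positions and the match.length attribute are enough to cut the matched text out of each string. The following is a minimal sketch of that idea using substring; the variable names pos, len, hit, and first_match are chosen for this illustration only.

```r
x <- c("abc", "def", "cba a", "aa")

m   <- regexpr("a+", x)
pos <- as.integer(m)              # start positions, with attributes dropped
len <- attr(m, "match.length")    # match lengths, -1 where nothing matched

# Only strings with a match get a substring; the others stay NA.
hit <- pos != -1L
first_match <- rep(NA_character_, length(x))
first_match[hit] <- substring(x[hit], pos[hit], pos[hit] + len[hit] - 1L)

print(first_match)
```

The end position is computed as start plus length minus one because both ends of the range passed to substring are inclusive.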

gregexpr is the same as regexpr, except that it finds all matches in each string. It returns a list with the same length as the input vector. Each element of the list is an integer vector with one element for each match found in the corresponding string, indicating the character position at which that match was found. Each of these vectors also has a match.length attribute with the lengths of all matches. If no match could be found in a particular string, its element in the list is still a vector, but with the single element -1.

> regexpr("a+", c("abc", "def", "cba a", "aa"))
[1]  1 -1  3  1
attr(,"match.length")
[1]  1 -1  1  2
> gregexpr("a+", c("abc", "def", "cba a", "aa"))
[[1]]  [1] 1    attr(,"match.length")  [1] 1
[[2]]  [1] -1   attr(,"match.length")  [1] -1
[[3]]  [1] 3 5  attr(,"match.length")  [1] 1 1
[[4]]  [1] 1    attr(,"match.length")  [1] 2
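Because gregexpr returns one result per input string, its output is usually processed element by element. As a rough sketch, the following counts the matches in each string and then extracts their text; the helper variables counts and all_matches are names invented for this example.

```r
x <- c("abc", "def", "cba a", "aa")
all_pos <- gregexpr("a+", x)

# Number of matches per string; the -1 marker for "no match" is not counted.
counts <- vapply(all_pos, function(p) sum(p > 0), integer(1))

# Text of every match per string, or an empty character vector if none.
all_matches <- lapply(seq_along(x), function(i) {
  p   <- as.integer(all_pos[[i]])
  len <- attr(all_pos[[i]], "match.length")
  if (p[1] == -1L) return(character(0))
  substring(x[i], p, p + len - 1L)
})

print(counts)
print(all_matches)
```

Checking the first position against -1 is sufficient here, since a string without matches yields a vector with exactly that one element.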

Replacing Regex Matches in String Vectors

The sub function has three required parameters: a string with the regular expression, a string with the replacement text, and the input vector. Set the fourth argument, ignore.case, to TRUE to make the regex case insensitive, or to FALSE to keep it case sensitive. The fifth and sixth arguments are extended and perl, which select the regex flavor.

sub returns a new vector with the same length as the input vector. If a regex match could be found in a string element, it is replaced with the replacement text. Only the first match in each string element is replaced. If no matches could be found in some strings, those are copied into the result vector unchanged.

Use gsub instead of sub to replace all regex matches in all the string elements in your vector. Other than replacing all matches, gsub works in exactly the same way, and takes exactly the same arguments.

You can use the backreferences \1 through \9 in the replacement text to reinsert text matched by a capturing group. In an R string literal each backslash must itself be escaped, so \1 is written as "\\1". There is no replacement text token for the overall match. To reinsert it, place the entire regex in a capturing group and then use \1.

> sub("(a+)", "z\\1z", c("abc", "def", "cba a", "aa"))
[1] "zazbc" "def" "cbzaz a" "zaaz"
> gsub("(a+)", "z\\1z", c("abc", "def", "cba a", "aa"))
[1] "zazbc" "def" "cbzaz zaz" "zaaz"
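Backreferences are not limited to a single group. The sketch below swaps two captured characters with sub, wraps every match with gsub, and combines gsub with ignore.case passed by name; the patterns and variable names are illustrative choices, not part of the original examples.

```r
x <- c("abc", "def", "cba a", "aa")

# Two capturing groups: swap the first two lowercase letters of each string.
# sub only touches the first match in each element.
swapped <- sub("([a-z])([a-z])", "\\2\\1", x)

# Whole regex in a capturing group, reinserted between literal brackets.
bracketed <- gsub("(a+)", "[\\1]", x)

# Case-insensitive replacement of every run of "a" or "A".
masked <- gsub("a+", "_", "Aa bA", ignore.case = TRUE)

print(swapped)    # "bac" "edf" "bca a" "aa"
print(bracketed)  # "[a]bc" "def" "cb[a] [a]" "[aa]"
print(masked)     # "_ b_"
```

In "aa" the swap leaves the string unchanged, since exchanging two identical letters produces the same text.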
