
Using Regular Expressions in Java

Java 4 (JDK 1.4) and later have comprehensive support for regular expressions through the standard java.util.regex package. Because Java lacked a regex package for so long, there are also many 3rd party regex packages available for Java. I will only discuss Sun's regex library that is now part of the JDK. Its quality is excellent, better than most of the 3rd party packages. Unless you need to support older versions of the JDK, the java.util.regex package is the way to go.

Java 5 fixes some bugs and adds support for Unicode blocks. Java 6 fixes a few more bugs but doesn't add any features. Java 7 adds named capture and Unicode scripts.

Quick Regex Methods of The String Class

The Java String class has several methods that allow you to perform an operation using a regular expression on that string in a minimal amount of code. The downside is that you cannot specify options such as "case insensitive" or "dot matches newline". For performance reasons, you should also not use these methods if you will be using the same regular expression often.

myString.matches("regex") returns true or false depending on whether the string can be matched entirely by the regular expression. It is important to remember that String.matches() only returns true if the entire string can be matched. In other words: "regex" is applied as if you had written "^regex$" with start and end of string anchors. This is different from most other regex libraries, where the "quick match test" method returns true if the regex can be matched anywhere in the string. If myString is abc then myString.matches("bc") returns false. bc matches abc, but ^bc$ (which is really being used here) does not.
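The whole-string behavior can be illustrated with a minimal sketch. The class and method names below are made up for this example; the second helper uses the Pattern and Matcher classes covered later on this page to show a match anywhere in the string.

```java
import java.util.regex.Pattern;

public class MatchesDemo {
    // String.matches() succeeds only if the regex matches the entire string.
    static boolean matchesWhole(String subject, String regex) {
        return subject.matches(regex);
    }

    // Matcher.find() succeeds if the regex matches anywhere in the string.
    static boolean matchesAnywhere(String subject, String regex) {
        return Pattern.compile(regex).matcher(subject).find();
    }

    public static void main(String[] args) {
        System.out.println(matchesWhole("abc", "bc"));     // false
        System.out.println(matchesWhole("abc", ".*bc"));   // true
        System.out.println(matchesAnywhere("abc", "bc"));  // true
    }
}
```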

myString.replaceAll("regex", "replacement") replaces all regex matches inside the string with the replacement string you specified. No surprises here. All parts of the string that match the regex are replaced. You can use the contents of capturing parentheses in the replacement text via $1, $2, $3, etc. $0 (dollar zero) inserts the entire regex match. $12 is replaced with the 12th backreference if it exists, or with the 1st backreference followed by the literal "2" if there are fewer than 12 backreferences. If there are 12 or more backreferences, it is not possible to insert the first backreference immediately followed by the literal "2" in the replacement text.
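As a small sketch of backreferences in the replacement text, the helpers below (names invented for illustration) use $1, $2, $0 and the ambiguous $12 case described above.

```java
public class ReplaceAllDemo {
    // Swap two words separated by a single space using $1 and $2.
    static String swapPair(String s) {
        return s.replaceAll("(\\w+) (\\w+)", "$2 $1");
    }

    // Wrap each run of digits in brackets; $0 is the entire match.
    static String bracketDigits(String s) {
        return s.replaceAll("\\d+", "[$0]");
    }

    // Only one group exists, so $12 means group 1 followed by a literal "2".
    static String groupOneThenTwo(String s) {
        return s.replaceAll("(a)", "$12");
    }

    public static void main(String[] args) {
        System.out.println(swapPair("hello world"));   // world hello
        System.out.println(bracketDigits("a1b22"));    // a[1]b[22]
        System.out.println(groupOneThenTwo("xay"));    // xa2y
    }
}
```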

In the replacement text, a dollar sign not followed by a digit (or, in Java 7 and later, by a named group reference such as ${name}) causes an IllegalArgumentException to be thrown. If there are fewer than 9 backreferences, a dollar sign followed by a digit greater than the number of backreferences throws an IndexOutOfBoundsException. So be careful if the replacement string is a user-specified string. To insert a dollar sign as literal text, use \$ in the replacement text. When coding the replacement text as a literal string in your source code, remember that the backslash itself must be escaped too: "\\$".
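The following sketch shows one way to deal with dollar signs in replacement text. The class and method names are illustrative only. For user-specified replacement strings it relies on Matcher.quoteReplacement(), a standard method available since Java 5 that escapes dollar signs and backslashes so the text is inserted literally.

```java
import java.util.regex.Matcher;

public class DollarDemo {
    // "\\$" in Java source is the replacement text \$, a literal dollar sign.
    static String priceTag(String s) {
        return s.replaceAll("USD ", "\\$");
    }

    // Escape $ and \ in untrusted text so it is inserted as-is.
    static String replaceWithUserText(String s, String regex, String userText) {
        return s.replaceAll(regex, Matcher.quoteReplacement(userText));
    }

    // A dollar sign followed by a non-digit is rejected at replacement time.
    static boolean throwsOnBareDollar() {
        try {
            "abc".replaceAll("b", "$x");
            return false;
        } catch (IllegalArgumentException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(priceTag("USD 5"));                          // $5
        System.out.println(replaceWithUserText("cost: X", "X", "$1"));  // cost: $1
        System.out.println(throwsOnBareDollar());                       // true
    }
}
```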

myString.split("regex") splits the string at each regex match. The method returns an array of strings where each element is a part of the original string between two regex matches. The matches themselves are not included in the array. Use myString.split("regex", n) to get an array containing at most n items. The result is that the string is split at most n-1 times. The last item in the array is the unsplit remainder of the original string.
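A short sketch of both split() variants follows; the class and method names are placeholders chosen for this example.

```java
import java.util.Arrays;

public class SplitDemo {
    // Split at every comma.
    static String[] splitAll(String s) {
        return s.split(",");
    }

    // Split at most n-1 times; the last element holds the unsplit remainder.
    static String[] splitLimited(String s, int n) {
        return s.split(",", n);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitAll("a,b,c,d")));         // [a, b, c, d]
        System.out.println(Arrays.toString(splitLimited("a,b,c,d", 2)));  // [a, b,c,d]
    }
}
```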

Using The Pattern Class

In Java, you compile a regular expression by using the static Pattern.compile() factory method. This method returns an object of type Pattern. E.g.: Pattern myPattern = Pattern.compile("regex"); You can specify certain options as an optional second parameter. Pattern.compile("regex", Pattern.CASE_INSENSITIVE | Pattern.DOTALL | Pattern.MULTILINE) makes the regex case insensitive for US ASCII characters, causes the dot to match line breaks and causes the start and end of string anchors to match at embedded line breaks as well. When working with Unicode strings, specify Pattern.UNICODE_CASE if you want to make the regex case insensitive for all characters in all languages. You should always specify Pattern.CANON_EQ to ignore differences in Unicode encodings, unless you are sure your strings contain only US ASCII characters and you want to increase performance.
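To make the effect of these options concrete, here is a minimal sketch. The class name, field names and sample regexes are assumptions made for the example. The non-ASCII letters are written as Unicode escapes to avoid depending on the source file encoding.

```java
import java.util.regex.Pattern;

public class CompileFlagsDemo {
    static final Pattern WITH_FLAGS = Pattern.compile("^hello.world$",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL | Pattern.MULTILINE);
    static final Pattern NO_FLAGS = Pattern.compile("^hello.world$");

    // CASE_INSENSITIVE alone only folds US ASCII letters.
    static final Pattern ASCII_ONLY =
            Pattern.compile("\u00e9", Pattern.CASE_INSENSITIVE);
    // Adding UNICODE_CASE extends case folding to non-ASCII letters.
    static final Pattern UNICODE =
            Pattern.compile("\u00e9", Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);

    static boolean found(Pattern pattern, String subject) {
        return pattern.matcher(subject).find();
    }

    public static void main(String[] args) {
        String text = "intro\nHELLO\nWORLD\noutro";
        System.out.println(found(WITH_FLAGS, text));      // true
        System.out.println(found(NO_FLAGS, text));        // false
        System.out.println(found(ASCII_ONLY, "\u00c9"));  // false
        System.out.println(found(UNICODE, "\u00c9"));     // true
    }
}
```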

If you will be using the same regular expression often in your source code, you should create a Pattern object to increase performance. Creating a Pattern object also allows you to pass matching options as a second parameter to the Pattern.compile() factory method. If you use one of the String methods above, the only way to specify options is to embed mode modifiers into the regex. Putting (?i) at the start of the regex makes it case insensitive. (?m) is the equivalent of Pattern.MULTILINE, (?s) equals Pattern.DOTALL and (?u) is the same as Pattern.UNICODE_CASE. Unfortunately, Pattern.CANON_EQ does not have an embedded mode modifier equivalent.
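The embedded mode modifiers can be sketched with the quick String methods as follows. The class and method names are invented for this illustration.

```java
public class InlineModifierDemo {
    // (?i) makes the match case insensitive.
    static boolean ignoreCase(String s) {
        return s.matches("(?i)hello");
    }

    // (?s) lets the dot match a line break.
    static boolean dotAll(String s) {
        return s.matches("(?s)a.b");
    }

    // Without (?s) the dot does not match a line break.
    static boolean plainDot(String s) {
        return s.matches("a.b");
    }

    // (?m) lets ^ match at the start of every line, not just the string.
    static String quoteLines(String s) {
        return s.replaceAll("(?m)^", "> ");
    }

    public static void main(String[] args) {
        System.out.println(ignoreCase("HELLO"));  // true
        System.out.println(dotAll("a\nb"));       // true
        System.out.println(plainDot("a\nb"));     // false
        System.out.println(quoteLines("one\ntwo"));
    }
}
```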

Use myPattern.split("subject") to split the subject string using the compiled regular expression. This call has exactly the same results as myString.split("regex"). The difference is that the former is faster since the regex was already compiled.
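A minimal sketch of reusing one compiled pattern for splitting is shown below. The class name, the COMMA constant and the regex are assumptions chosen for the example.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

public class PatternSplitDemo {
    // Compiled once and reused for every call to fields().
    private static final Pattern COMMA = Pattern.compile("\\s*,\\s*");

    // Split a line at commas, ignoring whitespace around each comma.
    static String[] fields(String line) {
        return COMMA.split(line);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(fields("a , b,c")));  // [a, b, c]
    }
}
```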

Using The Matcher Class

Except for splitting a string (see previous paragraph), you need to create a Matcher object from the Pattern object. The Matcher will do the actual work. The advantage of having two separate classes is that you can create many Matcher objects from a single Pattern object, and thus apply the regular expression to many subject strings simultaneously.

To create a Matcher object, simply call the matcher() method of your Pattern object like this: Matcher myMatcher = myPattern.matcher("subject"). If you already created a Matcher object from the same pattern, call myMatcher.reset("newsubject") instead of creating a new matcher object, for reduced garbage and increased performance. Either way, myMatcher is now ready for duty.
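Reusing a single Matcher across several subject strings might look like the sketch below. The class name, method name and the digit regex are placeholders for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MatcherResetDemo {
    // Counts how many of the given strings contain at least one digit.
    static int countWithDigits(String[] subjects) {
        Pattern pattern = Pattern.compile("\\d+");
        // Create the Matcher once with a dummy subject.
        Matcher matcher = pattern.matcher("");
        int count = 0;
        for (String subject : subjects) {
            // reset() points the existing Matcher at a new subject string.
            matcher.reset(subject);
            if (matcher.find()) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countWithDigits(new String[] {"a1", "b", "22"}));  // 2
    }
}
```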

To find the first match of the regex in the subject string, call myMatcher.find(). To find the next match, call myMatcher.find() again. When myMatcher.find() returns false, indicating there are no further matches, the next call to myMatcher.find() will find the first match again. The Matcher is automatically reset to the start of the string when find() fails.

The Matcher object holds the results of the last match. Call its methods start(), end() and group() to get details about the entire regex match and the matches between capturing parentheses. Each of these methods accepts a single int parameter indicating the number of the backreference. Omit the parameter to get information about the entire regex match. start() is the index of the first character in the match. end() is the index of the first character after the match. Both are relative to the start of the subject string. So the length of the match is end() - start(). group() returns the string matched by the regular expression or pair of capturing parentheses.
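The find() loop and the start(), end() and group() methods can be put together as in this sketch. The class name, method name, regex and output format are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FindLoopDemo {
    // Returns one description per match of key=number in the subject.
    static List<String> describeMatches(String subject) {
        Pattern pattern = Pattern.compile("(\\w+)=(\\d+)");
        Matcher matcher = pattern.matcher(subject);
        List<String> result = new ArrayList<String>();
        while (matcher.find()) {
            // No argument: the entire match. An int argument: that group.
            result.add(matcher.group()
                    + " [" + matcher.start() + "," + matcher.end() + ")"
                    + " key=" + matcher.group(1)
                    + " value=" + matcher.group(2)
                    + " keyStart=" + matcher.start(1));
        }
        return result;
    }

    public static void main(String[] args) {
        for (String line : describeMatches("a=1, bb=22")) {
            System.out.println(line);
        }
        // a=1 [0,3) key=a value=1 keyStart=0
        // bb=22 [5,10) key=bb value=22 keyStart=5
    }
}
```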

myMatcher.replaceAll("replacement") has exactly the same results as myString.replaceAll("regex", "replacement"). Again, the difference is speed.

The Matcher class allows you to do a search-and-replace and compute the replacement text for each regex match in your own code. You can do this with the appendReplacement() and appendTail() methods. Here is how:

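// Requires java.util.regex.Matcher and java.util.regex.Pattern.
// Assumes myPattern was created earlier with Pattern.compile().
StringBuffer myStringBuffer = new StringBuffer();
Matcher myMatcher = myPattern.matcher("subject");
while (myMatcher.find()) {
  // Skipped matches are copied unchanged by the next append call.
  if (checkIfThisMatchShouldBeReplaced()) {
    myMatcher.appendReplacement(myStringBuffer, computeReplacementString());
  }
}
// Copy the text that follows the last replaced match.
myMatcher.appendTail(myStringBuffer);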

Obviously, checkIfThisMatchShouldBeReplaced() and computeReplacementString() are placeholders for methods that you supply. The first returns true or false indicating if a replacement should be made at all. Note that skipping replacements is way faster than replacing a match with exactly the same text as was matched. computeReplacementString() returns the actual replacement string.
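A complete, runnable variation on this technique is sketched below. The task (doubling numbers of 10 or more and leaving smaller ones alone), the class name and the method name are all made up for the example. The computed text is passed through Matcher.quoteReplacement() because appendReplacement() interprets dollar signs and backslashes in the same way as replaceAll().

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AppendReplacementDemo {
    // Doubles every number >= 10 and leaves smaller numbers untouched.
    // Assumes the numbers in the subject fit in an int.
    static String doubleLargeNumbers(String subject) {
        Pattern pattern = Pattern.compile("\\d+");
        Matcher matcher = pattern.matcher(subject);
        StringBuffer out = new StringBuffer();
        while (matcher.find()) {
            int value = Integer.parseInt(matcher.group());
            if (value >= 10) {
                String replacement = String.valueOf(value * 2);
                // Appends the text since the last append, then the replacement.
                matcher.appendReplacement(out, Matcher.quoteReplacement(replacement));
            }
            // Skipped matches are copied verbatim by a later append call.
        }
        // Appends whatever follows the last replaced match.
        matcher.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(doubleLargeNumbers("1 20 3 40"));  // 1 40 3 80
    }
}
```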

Regular Expressions, Literal Strings and Backslashes

In literal Java strings the backslash is an escape character. The literal string "\\" is a single backslash. In regular expressions, the backslash is also an escape character. The regular expression \\ matches a single backslash. Written as a Java string, this regular expression becomes "\\\\". That's right: 4 backslashes to match a single one.

The regex \w matches a word character. As a Java string, this is written as "\\w".

The same backslash-mess occurs when providing replacement strings for methods like String.replaceAll() as literal Java strings in your Java code. In the replacement text, a dollar sign must be encoded as \$ and a backslash as \\ when you want to replace the regex match with an actual dollar sign or backslash. However, backslashes must also be escaped in literal Java strings. So a single dollar sign in the replacement text becomes "\\$" when written as a literal Java string. The single backslash becomes "\\\\". Right again: 4 backslashes to insert a single one.
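The double escaping can be seen at work in this small sketch. The class and method names are illustrative, and the file path is just sample data.

```java
public class BackslashDemo {
    // Regex \\ written as the Java literal "\\\\": matches one backslash.
    static String toForwardSlashes(String path) {
        return path.replaceAll("\\\\", "/");
    }

    // Replacement text \\ written as "\\\\": inserts one backslash.
    static String toBackslashes(String path) {
        return path.replaceAll("/", "\\\\");
    }

    // Regex \w+ written as the Java literal "\\w+".
    static boolean isWord(String s) {
        return s.matches("\\w+");
    }

    public static void main(String[] args) {
        // The literal "C:\\temp\\file.txt" is the text C:\temp\file.txt
        System.out.println(toForwardSlashes("C:\\temp\\file.txt"));  // C:/temp/file.txt
        System.out.println(toBackslashes("C:/temp/file.txt"));       // C:\temp\file.txt
        System.out.println(isWord("regex_1"));                       // true
    }
}
```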

Further Reading

If you'd like a more detailed overview of all the functionality offered by the java.util.regex package, you may want to get yourself a copy of "Java Regular Expressions: Taming the java.util.regex Engine" written by Mehran Habibi and published by Apress. While the regular expressions tutorial on this website is more complete and more detailed, Mr. Habibi's book contains a more slow-paced regular expressions tutorial. The book definitely provides more information and examples on the java.util.regex package than this single web page can provide.

My review of the book Java Regular Expressions
