
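If you have a file in which all lines are sorted (alphabetically or otherwise), you can easily delete (subsequent) duplicate lines. Simply open the file in your favorite text editor, and do a search-and-replace, searching for ^(.*)(\r?\n\1)+$ and replacing each match with \1. For this to work, the caret and dollar must match at the start and end of each line rather than only at the start and end of the file, and the dot must not match line breaks.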
Here is how this works. The caret will match only at the start of a line. So the regex engine will only attempt to match the remainder of the regex there. The dot and star combination simply matches an entire line, whatever its contents, if any. The round brackets store the matched line into the first backreference.
Next we will match the line separator. I put the question mark into \r?\n to make this regex work with both Windows (\r\n) and UNIX (\n) text files. So up to this point we matched a line and the following line break.
Now we need to check if this combination is followed by a duplicate of that same line. We do this simply with \1. This is the first backreference which holds the line we matched. The backreference will match that very same text.
If the backreference fails to match, the regex match and the backreference are discarded, and the regex engine tries again at the start of the next line. If the backreference succeeds, the plus symbol in the regular expression will try to match additional copies of the line. Finally, the dollar symbol forces the regex engine to check if the text matched by the backreference is a complete line. We already know the text matched by the backreference is preceded by a line break (matched by \r?\n). Therefore, we now check if it is also followed by a line break or if it is at the end of the file using the dollar sign.
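To see why the dollar matters, here is a minimal sketch using Python's re module (my choice of tool; the text above assumes a text editor). The sample text and the variable names are made up for illustration. re.MULTILINE makes the caret and dollar match at line breaks.

```python
import re

# Hypothetical sample: "cat" is a prefix of the next line, not a duplicate of it.
text = "cat\ncatalog\n"

# The regex from the text, and a variant with the trailing $ left out.
with_dollar = re.compile(r"^(.*)(\r?\n\1)+$", re.MULTILINE)
without_dollar = re.compile(r"^(.*)(\r?\n\1)+", re.MULTILINE)

# With $, the backreference must end at a line end, so nothing matches.
kept = with_dollar.sub(r"\1", text)

# Without $, "cat\ncat" matches and the first line is swallowed.
damaged = without_dollar.sub(r"\1", text)

print(repr(kept))
print(repr(damaged))
```

Without the dollar, the backreference is satisfied by the first three characters of "catalog", and the replacement merges the two lines into one.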
The entire match becomes line\nline (or line\nline\nline etc.). Because we are doing a search and replace, the line, its duplicates, and the line breaks in between them, are all deleted from the file. Since we want to keep the original line, but not the duplicates, we use \1 as the replacement text to put the original line back in.
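The same search-and-replace can be sketched outside a text editor. The following assumes Python's re module; the function name delete_duplicate_lines and the sample data are hypothetical. Python's dot does not match \n by default, and re.MULTILINE makes the anchors match at every line.

```python
import re

# ^ and $ match at each line (MULTILINE); the dot stops at \n by default.
DUPLICATE_LINES = re.compile(r"^(.*)(\r?\n\1)+$", re.MULTILINE)

def delete_duplicate_lines(text: str) -> str:
    """Collapse each run of identical adjacent lines into a single line."""
    # \1 puts the original line back; the duplicates and the line
    # breaks between them are dropped.
    return DUPLICATE_LINES.sub(r"\1", text)

if __name__ == "__main__":
    sample = "apple\napple\nbanana\ncherry\ncherry\ncherry\n"
    print(delete_duplicate_lines(sample))
```

Only adjacent duplicates are removed, which is why the input needs to be sorted first.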
We can generalize the above example to afterseparator(item)(separator\1)+beforeseparator, where afterseparator and beforeseparator are zero-width. So if you want to remove subsequent duplicates from a comma-delimited list, you could use (?<=,|^)([^,]*)(,\1)+(?=,|$).
The positive lookbehind (?<=,|^) forces the regex engine to start matching at the start of the string or after a comma. ([^,]*) captures the item. (,\1)+ matches subsequent duplicate items. Finally, the positive lookahead (?=,|$) checks if the duplicate items are complete items by checking for a comma or the end of the string.
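Here is a rough sketch of the comma-delimited case, again assuming Python's re module. Python only accepts fixed-width lookbehind and rejects (?<=,|^), so the sketch rewrites it as (?:(?<=,)|^), which tests the same two conditions. That rewrite is my adaptation, not the regex from the text, and the function name is hypothetical.

```python
import re

# (?:(?<=,)|^) stands in for (?<=,|^): the item must start after a comma
# or at the start of the string. The rest of the regex is unchanged.
DUPLICATE_ITEMS = re.compile(r"(?:(?<=,)|^)([^,]*)(,\1)+(?=,|$)")

def delete_duplicate_items(csv_line: str) -> str:
    """Collapse runs of identical adjacent items in a comma-delimited list."""
    return DUPLICATE_ITEMS.sub(r"\1", csv_line)

if __name__ == "__main__":
    print(delete_duplicate_items("a,a,b,b,b,c"))
    # Partial overlaps are left alone thanks to the lookarounds.
    print(delete_duplicate_items("ab,abc"))
    print(delete_duplicate_items("abc,bc"))
```

The second and third calls show the two lookarounds at work: "ab" is not a complete item in "abc", and "bc" does not start an item in "abc".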
Page URL: http://www.Regular-Expressions.info/duplicatelines.html
Page last updated: 17 June 2009
Site last updated: 05 March 2010
Copyright © 2003-2010 Jan Goyvaerts. All rights reserved.