Yet another programming solutions log

Sample bits from programming for the future generations.

Java regex remove duplicated words

farenda, 2017-03-24

Regular expressions are very handy for text processing. In this article we’ll show a Java regex that removes duplicated words, which is a common text-cleanup task.

Regular Expression to match subsequent repeated words

The Java regex to remove duplicated words is not very complex, but it can be tricky to write the first time:

String regex = "\\b(\\w+)(\\s+\\1\\b)+";

What all that means (in a Java string literal every backslash is doubled, so \b is written as \\b):

  1. \b: match a word boundary, so the match starts at the beginning of a word rather than somewhere in the middle;
  2. (\w+): match one or more word characters and remember them as a group (the parentheses), which we can refer to later by number; this matches a complete word and remembers it;
  3. \s+: match one or more whitespace characters;
  4. \1: match the word remembered in step 2;
  5. \b: a word boundary as in step 1; here it makes sure the repeated word is not part of some longer word;
  6. (\s+\1\b)+: match one or more repetitions of the word captured in step 2, each preceded by whitespace.
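
The steps above can be checked with a small sketch. The class name RegexParts and the helper firstMatch are made up for illustration; the helper returns the whole match and the captured group for the first run of repeated words it finds.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexParts {

    static final Pattern DUPLICATES = Pattern.compile("\\b(\\w+)(\\s+\\1\\b)+");

    // Returns "<whole match>|<group 1>" for the first run of repeated words,
    // or null when the input contains no such run.
    static String firstMatch(String input) {
        Matcher m = DUPLICATES.matcher(input);
        return m.find() ? m.group() + "|" + m.group(1) : null;
    }

    public static void main(String[] args) {
        // Three repetitions are covered by the trailing (...)+ from step 6.
        System.out.println(firstMatch("it is is is fine"));  // is is is|is
        // The final \b stops "string" from matching inside "stringing".
        System.out.println(firstMatch("string stringing"));  // null
    }
}
```

The second call shows why step 5 matters: without the closing \b, "string stringing" would be treated as a duplicate.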

That’s it! To match words case-insensitively, compile the above regular expression with the CASE_INSENSITIVE flag:

Pattern p = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
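
To see what the flag changes, here is a minimal sketch that runs the same regex with and without CASE_INSENSITIVE. The class CaseFlagDemo and the method hasDuplicates are hypothetical names used only for this illustration.

```java
import java.util.regex.Pattern;

class CaseFlagDemo {

    static final String REGEX = "\\b(\\w+)(\\s+\\1\\b)+";

    // True when the input contains at least one run of repeated words.
    static boolean hasDuplicates(String input, boolean ignoreCase) {
        int flags = ignoreCase ? Pattern.CASE_INSENSITIVE : 0;
        return Pattern.compile(REGEX, flags).matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(hasDuplicates("The the cat", false)); // false
        System.out.println(hasDuplicates("The the cat", true));  // true
    }
}
```

With the flag set, the backreference \1 also compares case-insensitively, which is why "The" and "the" count as the same word.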

Replacement loop

The second important part of word de-duplication is the replacement loop, which replaces each run of duplicated words with a single word:

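String input = "The the string String string stringing.";
Matcher m = p.matcher(input);
// The matcher keeps scanning the original string; input is reassigned on each hit.
while (m.find()) {
    // replace() treats the matched text literally, unlike the regex-based replaceAll().
    input = input.replace(m.group(), m.group(1));
}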

The loop finds every occurrence of the regular expression defined above and replaces the whole matched text (m.group()) with the content of the first captured group (m.group(1)), which is our single word.

When applied to the input string above, m.group() and m.group(1) have the following values in successive iterations of the while loop:

  • m.group(): “The the” and m.group(1): “The”;
  • m.group(): “string String string” and m.group(1): “string”.
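
These values can be reproduced with a short trace. This is only a sketch: the class LoopTrace and the method trace are illustrative names, and the method records both groups for each match instead of modifying the input.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LoopTrace {

    // Collects "m.group() -> m.group(1)" for every match found in the input.
    static List<String> trace(String input) {
        Pattern p = Pattern.compile("\\b(\\w+)(\\s+\\1\\b)+", Pattern.CASE_INSENSITIVE);
        Matcher m = p.matcher(input);
        List<String> steps = new ArrayList<>();
        while (m.find()) {
            steps.add(m.group() + " -> " + m.group(1));
        }
        return steps;
    }

    public static void main(String[] args) {
        trace("The the string String string stringing.").forEach(System.out::println);
        // The the -> The
        // string String string -> string
    }
}
```

Note that "stringing" never appears in the trace, because the closing \b rejects it.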

Remove duplicated words complete example

A complete Java application that removes duplicated words may look like this:

package com.farenda.java.util.regex;

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Dedupper {

    public static void main(String[] args) {
        String input = "The the string String string stringing.";

        String regex = "\\b(\\w+)(\\s+\\1\\b)+";

        // Use Pattern.compile(regex) instead for case-sensitive matching.
        Pattern p = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);

        Matcher m = p.matcher(input);
        while (m.find()) {
            // Replace the whole run of duplicates (literally) with its first word.
            input = input.replace(m.group(), m.group(1));
        }

        System.out.println(input);
    }
}

The above code produces the following output:

The string stringing.
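
The same output can also be obtained without an explicit loop. As an alternative that is not part of the example above, Matcher.replaceAll performs the substitution in one pass, and $1 in its replacement string refers to the first captured group. A minimal sketch, with the class name OnePassDedup and the method dedup made up for illustration:

```java
import java.util.regex.Pattern;

class OnePassDedup {

    static final Pattern DUPLICATES =
            Pattern.compile("\\b(\\w+)(\\s+\\1\\b)+", Pattern.CASE_INSENSITIVE);

    // Replaces every run of repeated words with its first word.
    static String dedup(String input) {
        return DUPLICATES.matcher(input).replaceAll("$1");
    }

    public static void main(String[] args) {
        System.out.println(dedup("The the string String string stringing."));
        // The string stringing.
    }
}
```

This variant replaces each match exactly where it was found and never searches the string a second time for the matched text, which can matter for longer inputs.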
