How to split a String into separate words in Java?


Hello everyone! I have the following task: split a string into words, store them in an array, and then compare every element of the array with every other element (that is, each word with each word), deleting the ones that match. I recently read about equals(), which does a great job of comparing strings, but for some reason it does not work with my array. I started learning Java recently, so please don't judge the code too harshly. Thank you all!

public static void main(String[] args) {

    String b = "Привет Привет Привет";
    String s[] = b.split(" ");

    int i;
    for (i = 0; i < s.length; i++) {
        if (s[i].equals(s[i + 1])) {
            System.out.println(s[i]);
        }
    }
}
Author: V.March, 2017-07-07

2 answers

Your version does not handle multiple consecutive spaces. A regular expression is needed here: extract all the words and then put them into a SortedSet. Why a SortedSet? First, it does not allow duplicates; second, it keeps the words sorted in ascending order, which makes the result easier to check.

import java.util.SortedSet;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WordParser {

    private static final String EXAMPLE_TEST =
            "На дворе — трава, на траве — дрова. Не руби дрова на траве двора!";

    public static void main(String[] args) {
        Pattern pattern =
                Pattern.compile("\\w+", Pattern.UNICODE_CHARACTER_CLASS 
                        | Pattern.CASE_INSENSITIVE);
        Matcher matcher = pattern.matcher(EXAMPLE_TEST);
        SortedSet<String> words = new TreeSet<>();

        while (matcher.find())
            words.add(matcher.group().toLowerCase());

        for (String word : words)
            System.out.println("word = " + word);
    }
}

"\w+" - this pattern matches only words (runs of letters, digits, and underscores), so punctuation and other symbols are excluded.

Pattern.UNICODE_CHARACTER_CLASS - enables Unicode-aware character classes, so that \w matches letters in any script rather than only ASCII. (To be honest, I don't know how to find words in Asian languages such as Chinese, Korean, or Japanese.)
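
To illustrate what the flag changes, here is a minimal sketch that runs the same "\w+" pattern with and without Pattern.UNICODE_CHARACTER_CLASS. The class name UnicodeFlagDemo, the helper findAll, and the sample text are made up for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnicodeFlagDemo {

    // Hypothetical helper: collects every match of the pattern in the text.
    static List<String> findAll(Pattern pattern, String text) {
        List<String> result = new ArrayList<>();
        Matcher matcher = pattern.matcher(text);
        while (matcher.find()) {
            result.add(matcher.group());
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "Hello, Привет!";

        // Without the flag, \w is ASCII-only: [a-zA-Z_0-9].
        Pattern ascii = Pattern.compile("\\w+");
        // With the flag, \w also matches letters from other scripts.
        Pattern unicode = Pattern.compile("\\w+", Pattern.UNICODE_CHARACTER_CLASS);

        System.out.println(findAll(ascii, text));   // [Hello]
        System.out.println(findAll(unicode, text)); // [Hello, Привет]
    }
}
```

Without the flag the Cyrillic word is skipped entirely, which is why the answer's example needs it.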

Here's what this class prints when it runs:

word = двора
word = дворе
word = дрова
word = на
word = не
word = руби
word = трава
word = траве
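
The point about repeated spaces made at the start of this answer can be seen in a small sketch. It compares split(" ") with splitting on runs of whitespace. The class name SplitSpacesDemo and the helper splitWords are hypothetical, and the input is the question's string with extra spaces added.

```java
import java.util.Arrays;

class SplitSpacesDemo {

    // Hypothetical helper: splits on runs of whitespace.
    // trim() avoids an empty first element when the text starts with spaces.
    static String[] splitWords(String text) {
        return text.trim().split("\\s+");
    }

    public static void main(String[] args) {
        // Two spaces after the first word, three after the second.
        String text = "Привет  Привет   Привет";

        // A single-space delimiter leaves empty strings between repeated spaces.
        System.out.println(Arrays.toString(text.split(" ")));
        // [Привет, , Привет, , , Привет]

        System.out.println(Arrays.toString(splitWords(text)));
        // [Привет, Привет, Привет]
    }
}
```

If only whitespace needs to be handled, this is a lighter alternative to Pattern and Matcher, but unlike "\w+" it leaves punctuation attached to the words.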

Author: Vanguard, 2017-07-08 12:20:48

With the Stream API, the problem can be solved even more simply:

String s = ...
Stream.of(s.split("[^A-Za-zА-Яа-я]+"))
    .map(String::toLowerCase)
    .distinct().sorted()
    .forEach(System.out::println);
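
A self-contained version of the same pipeline might look like the sketch below. The class name StreamWordsDemo, the method uniqueWords, and the sample input are assumptions, since the snippet above leaves the string unspecified. Two details differ from the original: the character class also lists Ё and ё, which fall outside the А-Я and а-я ranges, and empty strings are filtered out, because split() returns an empty first element when the text begins with a delimiter.

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class StreamWordsDemo {

    // Hypothetical helper: distinct lower-cased words of the text, sorted.
    static List<String> uniqueWords(String text) {
        return Stream.of(text.split("[^A-Za-zА-Яа-яЁё]+"))
                // Drop the empty element produced by a leading delimiter.
                .filter(word -> !word.isEmpty())
                .map(String::toLowerCase)
                .distinct()
                .sorted()
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Sample input chosen for illustration only.
        uniqueWords("Привет, привет! Hello hello мир")
                .forEach(System.out::println);
        // Prints hello, мир, привет on separate lines.
    }
}
```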
Author: Alex Chermenin, 2017-07-17 18:52:11