Converting a String to Unicode Code Points in Java

Converting a Java String into its Unicode representation sounds simple until you hit emoji, historical scripts, or rare CJK characters: code points above U+FFFF. This guide covers both the fast path (char-based) and the correct path (code-point-based).

char vs code point

A Java char is a 16-bit unsigned integer representing a UTF-16 code unit. Characters up to U+FFFF fit in a single char; characters above that (emoji, most ancient scripts) are stored as a surrogate pair: two char values.

String text = "Hi \uD83D\uDE00"; // "Hi 😀"
System.out.println(text.length()); // 5 - the surrogate pair counts as 2 chars

The actual number of code points is 4: H, i, space, and 😀. To count code points correctly, use codePointCount:

System.out.println(text.codePointCount(0, text.length())); // 4
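
Under the hood, codePointCount walks the string with codePointAt and Character.charCount. The same loop, written by hand (countCodePoints is an illustrative helper, not a JDK method):

```java
// Count code points manually: advance by 2 chars whenever
// codePointAt merged a surrogate pair into one code point.
static int countCodePoints(String s) {
    int count = 0;
    for (int i = 0; i < s.length(); ) {
        int cp = s.codePointAt(i);    // combines a surrogate pair if one starts here
        i += Character.charCount(cp); // 1 for BMP, 2 beyond U+FFFF
        count++;
    }
    return count;
}

countCodePoints("Hi \uD83D\uDE00"); // 4
```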

The simple case: no surrogates

If you're sure the string contains only BMP characters (Latin, accents, standard CJK), chars() is enough:

String s = "Java";
s.chars().forEach(c -> System.out.printf("U+%04X%n", c));
// U+004A
// U+0061
// U+0076
// U+0061
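
Run the same stream over a string containing an emoji, though, and chars() emits the two surrogate code units instead of the code point:

```java
// chars() yields raw UTF-16 code units, so the emoji splits in two
"Hi \uD83D\uDE00".chars().forEach(c -> System.out.printf("U+%04X%n", c));
// U+0048
// U+0069
// U+0020
// U+D83D   <- high surrogate
// U+DE00   <- low surrogate, not U+1F600
```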

The correct case: codePoints() for any string

String s = "café 😀";
s.codePoints().forEach(cp -> System.out.printf("U+%04X%n", cp));
// U+0063
// U+0061
// U+0066
// U+00E9
// U+0020
// U+1F600

codePoints() automatically combines surrogate pairs into a single int, making it the simplest correct way to iterate over Unicode text in Java.

Collect code points into an array

int[] points = s.codePoints().toArray();
System.out.println(points.length);              // 6
System.out.println(Integer.toHexString(points[5])); // 1f600

Produce a \u... representation

String escape(String s) {
    StringBuilder out = new StringBuilder();
    s.codePoints().forEach(cp -> {
        if (cp < 0x80) out.append((char) cp);          // ASCII passes through unescaped
        else if (cp <= 0xFFFF) out.append(String.format("\\u%04X", cp));
        else out.append(String.format("\\U%08X", cp)); // beyond BMP
    });
    return out.toString();
}

escape("café 😀");
// caf\u00E9 \U0001F600

Convert code points back to a String

int[] points = { 0x48, 0x69, 0x1F600 };
String s = new String(points, 0, points.length);
System.out.println(s); // Hi 😀

Or from a single code point:

String smiley = new String(Character.toChars(0x1F600)); // 😀
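
When building a string incrementally, StringBuilder.appendCodePoint accepts a full code point and writes one or two chars as needed. A small sketch:

```java
// appendCodePoint handles the BMP/supplementary split for you
StringBuilder sb = new StringBuilder();
sb.appendCodePoint(0x48);    // 'H' - one char
sb.appendCodePoint(0x69);    // 'i' - one char
sb.appendCodePoint(0x1F600); // 😀 - written as a surrogate pair
String result = sb.toString(); // "Hi😀", 4 chars long
```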

Character metadata

Character exposes rich Unicode metadata: script, type, numeric value, and more:

int cp = "é".codePointAt(0);
System.out.println(Character.getName(cp));
// LATIN SMALL LETTER E WITH ACUTE

System.out.println(Character.UnicodeScript.of(cp));
// LATIN

System.out.println(Character.isLetter(cp));
// true

System.out.println(Character.getType(cp));
// 2 (LOWERCASE_LETTER)
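
Combined with codePoints(), this makes a one-liner for inspecting a whole string (Character.getName returns the official Unicode name, or null for unassigned code points):

```java
// Print the Unicode name of every code point, emoji included
"Hi\uD83D\uDE00".codePoints()
        .mapToObj(Character::getName)
        .forEach(System.out::println);
// LATIN CAPITAL LETTER H
// LATIN SMALL LETTER I
// GRINNING FACE
```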

Trim by Unicode, not by char

Simple slicing can break a surrogate pair in half, producing invalid UTF-16:

String s = "Hi 😀 there";
String bad = s.substring(0, 4);   // "Hi \uD83D" - half an emoji

// Correct: convert to code points, slice, convert back
int[] cps = s.codePoints().toArray();
String good = new String(cps, 0, 4); // "Hi 😀"
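
For truncation specifically, offsetByCodePoints maps a code-point count to the corresponding char index, avoiding the array round-trip. A sketch (truncateByCodePoints is a hypothetical helper name):

```java
// Hypothetical helper: cut after at most maxCodePoints code points,
// never splitting a surrogate pair.
static String truncateByCodePoints(String s, int maxCodePoints) {
    if (s.codePointCount(0, s.length()) <= maxCodePoints) return s;
    return s.substring(0, s.offsetByCodePoints(0, maxCodePoints));
}

truncateByCodePoints("Hi 😀 there", 4); // "Hi 😀"
```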

Reverse a string with emoji intact

String reverseByCodePoint(String s) {
    int[] cps = s.codePoints().toArray();
    int[] reversed = new int[cps.length];
    for (int i = 0; i < cps.length; i++) reversed[i] = cps[cps.length - 1 - i];
    return new String(reversed, 0, reversed.length);
}

reverseByCodePoint("Hi 😀"); // "😀 iH" - emoji preserved
new StringBuilder("Hi 😀").reverse().toString(); // also "😀 iH"

StringBuilder.reverse() also treats surrogate pairs as single characters (its Javadoc guarantees this), so it preserves emoji too. What neither approach preserves is combining sequences and multi-code-point emoji such as flags or ZWJ families, which span several code points.
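
When those grapheme clusters matter, reversal has to work cluster by cluster. A sketch using java.text.BreakIterator (reverseByGrapheme is a hypothetical helper; exact cluster boundaries depend on the JDK's break rules):

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper: reverse cluster by cluster so combining marks
// stay attached to their base characters.
static String reverseByGrapheme(String s) {
    BreakIterator it = BreakIterator.getCharacterInstance();
    it.setText(s);
    List<String> clusters = new ArrayList<>();
    int start = it.first();
    for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
        clusters.add(s.substring(start, end));
    }
    Collections.reverse(clusters);
    return String.join("", clusters);
}

reverseByGrapheme("xe\u0301"); // "e\u0301x" - the accent stays on the e
```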

Normalisation considerations

The same visual character can have multiple code-point representations. For example, é can be U+00E9 (composed) or U+0065 U+0301 (e + combining accent). Use Normalizer to get a canonical form:

import java.text.Normalizer;

String s = "café";
String nfc = Normalizer.normalize(s, Normalizer.Form.NFC); // composed
String nfd = Normalizer.normalize(s, Normalizer.Form.NFD); // decomposed

If you're comparing user input against stored data, always normalise both sides first.
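
A minimal sketch of why this matters for equality checks:

```java
import java.text.Normalizer;

String composed   = "caf\u00E9";  // é as a single code point
String decomposed = "cafe\u0301"; // e + combining acute accent
System.out.println(composed.equals(decomposed)); // false - different code points
System.out.println(
    Normalizer.normalize(composed, Normalizer.Form.NFC)
        .equals(Normalizer.normalize(decomposed, Normalizer.Form.NFC))); // true
```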

Quick reference

Task                              API
Iterate BMP characters only       s.chars()
Iterate any Unicode (correct)     s.codePoints()
Get a single code point           s.codePointAt(index)
Count code points                 s.codePointCount(0, s.length())
Code point → String               new String(Character.toChars(cp))
int[] → String                    new String(points, 0, len)

Whenever you handle user content that can contain emoji, languages beyond the BMP, or historical scripts, default to codePoints(): it's barely more verbose and spares you a category of rare but nasty bugs.