Counting UTF-16 Characters in Java 5

If you take I18N seriously and think the following code sample is always the correct way to count the characters in a string, you should take a look at Sun’s latest “Core Java Technologies Technical Tips” covering Strings:

String testString = "abcd\u5B66\uD800\uDF30";
int charCount = testString.length();

Java 5 introduces support for Unicode 4.0, which defines a significant number of new characters above U+FFFF (the U+ prefix signifies a valid Unicode character value as a hexadecimal number). Thus, the 16-bit char type no longer represents all characters!

The consequence is that a single char value can still represent a Unicode character, but only up to U+FFFF. Supplementary characters, those above U+FFFF, require a pair of char values called a surrogate pair. The leading or high value of the pair is in the U+D800 through U+DBFF range; the trailing or low value is in the U+DC00 through U+DFFF range. Using surrogate pairs, programmers can represent any character in the Unicode Standard. This encoding based on 16-bit units is called UTF-16, and Java 5 uses it to represent Unicode 4.0 characters. The char type is now a UTF-16 code unit, not necessarily a complete Unicode character. A complete Unicode character is called a code point.
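As a rough sketch of this mapping, the class below uses the Character methods added in Java 5 to split a supplementary code point into its surrogate pair and to join the pair back together. The class and method names (SurrogatePairDemo, toUtf16, fromSurrogates) are made up for illustration and do not come from the Tech Tip.

```java
public class SurrogatePairDemo {

    // Splits a code point into UTF-16 code units: one char up to U+FFFF,
    // a high/low surrogate pair for a supplementary character.
    public static char[] toUtf16(int codePoint) {
        return Character.toChars(codePoint);
    }

    // Joins a high and a low surrogate back into the code point they encode.
    public static int fromSurrogates(char high, char low) {
        if (!Character.isHighSurrogate(high) || !Character.isLowSurrogate(low)) {
            throw new IllegalArgumentException("not a valid surrogate pair");
        }
        return Character.toCodePoint(high, low);
    }

    public static void main(String[] args) {
        char[] units = toUtf16(0x10330); // a Gothic letter above U+FFFF
        System.out.printf("U+10330 -> %04X %04X%n", (int) units[0], (int) units[1]);
        System.out.printf("pair -> U+%X%n", fromSurrogates(units[0], units[1]));
    }
}
```

Character.toChars() is also the way to go from an int code point to the char values a String or StringBuilder expects, since a single char cannot hold a supplementary character.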

For example, in the code sample above, the (UTF-16) string contains the Gothic letter U+10330 as the surrogate pair \uD800\uDF30. The pair represents a single Unicode code point, so the string contains 6 code points, although length() returns 7!

To find out how many Unicode character code points are in a string, use the codePointCount() method:

String testString = "abcd\u5B66\uD800\uDF30";
int charCount = testString.length();
int characterCount = testString.codePointCount(0, charCount);
System.out.printf("character count: %d\n", characterCount);

The output is “character count: 6”.
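Counting is only half of the story: code that walks a string index by index has the same problem. A minimal sketch, assuming you want to visit each code point exactly once, advances the index by Character.charCount() instead of by 1. The class name CodePointWalker and its codePoints() helper are hypothetical names chosen for this example.

```java
import java.util.ArrayList;
import java.util.List;

public class CodePointWalker {

    // Collects the code points of a string, advancing by one or two chars
    // depending on whether the current code point needs a surrogate pair.
    public static List<Integer> codePoints(String s) {
        List<Integer> result = new ArrayList<Integer>();
        int i = 0;
        while (i < s.length()) {
            int cp = s.codePointAt(i);
            result.add(cp);
            i += Character.charCount(cp);
        }
        return result;
    }

    public static void main(String[] args) {
        String testString = "abcd\u5B66\uD800\uDF30";
        for (int cp : codePoints(testString)) {
            System.out.printf("U+%04X%n", cp);
        }
    }
}
```

For the test string this visits six code points, the last one being the supplementary character that a plain charAt() loop would split into two halves.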

1 Comment »

  1. joe said

    so how do you create a Character for the supplementary range using a char as input which is how the Character class requires it?
