If you take I18N seriously, you should take a look at Sun’s latest “Core Java Technologies Technical Tips” covering Strings, especially if you think the following code sample is always the correct way to count the length of a string:
private String testString = "abcd\u5B66\uD800\uDF30";
int charCount = testString.length();
Java 5 introduces support for Unicode 4.0, which defines a significant number of new characters above U+FFFF (the U+ prefix marks a Unicode character value written as a hexadecimal number). As a consequence, the 16-bit char type can no longer represent all characters!
A single char value can still represent a Unicode value, but only up to U+FFFF. To represent supplementary characters, you need a surrogate pair of two char values: the leading, or high, value lies in the range U+D800 through U+DBFF; the trailing, or low, value lies in the range U+DC00 through U+DFFF. Using surrogate pairs, programmers can represent any character in the Unicode Standard. This special use of 16-bit units is called UTF-16, and Java 5 uses it to represent Unicode 4.0 characters. The char type is now a UTF-16 code unit, not necessarily a complete Unicode character. A complete Unicode character is called a code point.
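How the two halves of a surrogate pair combine into one code point can be sketched with the Character class, which gained surrogate-handling methods in Java 5 (the class name below is my own):

```java
public class SurrogateDemo {
    public static void main(String[] args) {
        char high = '\uD800'; // leading (high) surrogate
        char low  = '\uDF30'; // trailing (low) surrogate

        // Both halves are code units from the surrogate ranges:
        System.out.println(Character.isHighSurrogate(high)); // true
        System.out.println(Character.isLowSurrogate(low));   // true

        // Combined, they encode a single supplementary code point:
        int codePoint = Character.toCodePoint(high, low);
        System.out.printf("U+%04X%n", codePoint);            // U+10330

        // A supplementary code point occupies two chars in UTF-16:
        System.out.println(Character.charCount(codePoint));  // 2
    }
}
```

Note that Character.charCount() returns 1 for code points up to U+FFFF and 2 for everything above, which is exactly the distinction length() ignores.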
For example, in the code sample above, the (UTF-16) string contains the Gothic letter Ahsa (U+10330) as the surrogate pair \uD800\uDF30. The pair represents a single Unicode code point, so the code point count of the entire string is 6 instead of 7!
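The difference shows up when you index into the string: charAt() returns raw code units, while codePointAt() (new in Java 5) resolves a surrogate pair to the full code point. A small sketch (class name is mine):

```java
public class CodePointAtDemo {
    public static void main(String[] args) {
        String testString = "abcd\u5B66\uD800\uDF30";

        // charAt(5) yields only the high surrogate, i.e. half a character:
        System.out.printf("char at 5:       U+%04X%n", (int) testString.charAt(5));

        // codePointAt(5) combines the pair into the complete code point:
        System.out.printf("code point at 5: U+%04X%n", testString.codePointAt(5));
    }
}
```

The first line prints U+D800, the second U+10330.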
To find out how many Unicode code points a string contains, use the codePointCount() method:
private String testString = "abcd\u5B66\uD800\uDF30";
int charCount = testString.length();
int characterCount = testString.codePointCount(0, charCount);
System.out.printf("character count: %d\n", characterCount);
The output is “6”.