Character information in Java SE 8 is based on the Unicode Standard, version 6.2.0 [Unicode 2012]. A Unicode transformation format (UTF) is an algorithmic mapping from every Unicode code point (except surrogate code points) to a unique byte sequence. The Java platform uses the UTF-16 encoding in char arrays and in the String and StringBuffer classes. However, Java programs must often process character data in various character encodings. The java.lang.String class and the classes in the java.nio.charset package can convert between UTF-16 and a number of other character encodings. The supported encodings vary among different implementations of the Java Platform. The class description for java.nio.charset.Charset lists the encodings that every implementation of the Java Platform is required to support. These include US-ASCII, ISO-8859-1, and UTF-8.
UTF-8 is an example of a variable-width encoding for Unicode. UTF-8 uses 1 to 4 bytes per character, depending on the Unicode code point. All Unicode code points (U+0000 through U+10FFFF, which fit in 21 bits) can be encoded using UTF-8. The following table lists the well-formed UTF-8 byte sequences.
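|Code point range|Byte 1|Byte 2|Byte 3|Byte 4|
|U+0000-U+007F|0xxxxxxx| | | |
|U+0080-U+07FF|110xxxxx|10xxxxxx| | |
|U+0800-U+FFFF|1110xxxx|10xxxxxx|10xxxxxx| |
|U+10000-U+10FFFF|11110xxx|10xxxxxx|10xxxxxx|10xxxxxx|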
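The variable width can be observed directly by encoding single code points and counting the resulting bytes. The following is a minimal sketch; the class name Utf8Widths and the method utf8Length are illustrative and not part of any library.

```java
import java.nio.charset.StandardCharsets;

final class Utf8Widths {
    /** Returns the number of bytes UTF-8 uses to encode one code point. */
    static int utf8Length(int codePoint) {
        // Character.toChars yields one or two UTF-16 code units for the code point.
        String s = new String(Character.toChars(codePoint));
        return s.getBytes(StandardCharsets.UTF_8).length;
    }
}
```

For example, utf8Length(0x41) is 1 for the letter A, and utf8Length(0x20AC) is 3 for the euro sign, whose UTF-8 form is the byte sequence E2 82 AC.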
UTF-8 is self-synchronizing: character boundaries are easily identified by scanning for well-defined bit patterns in either direction. If bytes are lost due to error or corruption, the beginning of the next valid character can be located and processing resumed. Many variable-length encodings are harder to resynchronize. In some older variable-length encodings (such as Shift JIS), the end byte of a character and the first byte of the next character could look like another valid character [Phillips 2005].
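Resynchronization in UTF-8 relies on the fact that every continuation byte has the bit pattern 10xxxxxx, while no leading byte does. A minimal sketch of locating the next character boundary follows; the class Utf8Resync and its methods are hypothetical names chosen for illustration, and the sketch does not validate that the sequence found is otherwise well formed.

```java
final class Utf8Resync {
    /** True if b is a UTF-8 continuation byte (bit pattern 10xxxxxx). */
    static boolean isContinuation(byte b) {
        return (b & 0xC0) == 0x80;
    }

    /**
     * Returns the index of the first byte at or after 'from' that is not a
     * continuation byte, or data.length if there is none.
     */
    static int nextBoundary(byte[] data, int from) {
        int i = from;
        while (i < data.length && isContinuation(data[i])) {
            i++;
        }
        return i;
    }
}
```

Scanning backward works the same way: step back while the current byte is a continuation byte.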
Similar to UTF-8, UTF-16 is a variable-width encoding. Unicode code points between U+10000 and U+10FFFF are called supplementary code points, and Unicode-encoded characters having a supplementary code point are called supplementary characters. UTF-16 uses a single 16-bit code unit to encode the most common 63K characters and a pair of 16-bit code units, called surrogates, to encode supplementary characters. The first code unit of the pair is taken from the high-surrogates range (U+D800-U+DBFF), and the second is taken from the low-surrogates range (U+DC00-U+DFFF). Because the ranges for high surrogates, low surrogates, and single units are all completely disjoint, there are no false matches, the location of the character boundary can be directly determined from each code unit value, and a dropped surrogate will corrupt only a single character.
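Because the surrogate ranges are disjoint, a character boundary in a Java String can be checked by inspecting the code units on either side of it. The sketch below illustrates this with standard java.lang.Character methods; SurrogateDemo, fromCodePoint, and safePrefix are hypothetical names introduced only for this example.

```java
final class SurrogateDemo {
    /** Builds a String that holds exactly one code point. */
    static String fromCodePoint(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    /**
     * Returns the longest prefix of s that has at most maxUnits UTF-16 code
     * units and does not end between a high and a low surrogate.
     */
    static String safePrefix(String s, int maxUnits) {
        if (maxUnits <= 0) {
            return "";
        }
        if (maxUnits >= s.length()) {
            return s;
        }
        int end = maxUnits;
        // If the cut would fall inside a surrogate pair, back up one unit.
        if (Character.isHighSurrogate(s.charAt(end - 1))
                && Character.isLowSurrogate(s.charAt(end))) {
            end--;
        }
        return s.substring(0, end);
    }
}
```

A supplementary character such as U+1D11E occupies two char values, so a String holding only that character has length() 2 but codePointCount(0, 2) equal to 1.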
Programmers must not form strings containing partial characters, for example, when converting variable-width encoded character data to strings.
Noncompliant Code Example (Read)
This noncompliant code example tries to read up to 1024 bytes from a socket and build a String from this data. It does so by reading the bytes in a while loop, as recommended by FIO10-J, Ensure the array is filled when using read() to fill an array. If it ever detects that the socket has more than 1024 bytes available, it throws an exception, which prevents untrusted input from potentially exhausting the program's memory.
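A minimal sketch of code matching this description is shown here. It assumes UTF-8 input and, so that it can run without a network connection, takes an InputStream (such as the one returned by socket.getInputStream()) instead of a Socket; the class name NoncompliantRead is illustrative.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

final class NoncompliantRead {
    static final int MAX_SIZE = 1024;

    // NONCOMPLIANT: each chunk returned by read() is decoded on its own.
    static String readBytes(InputStream in) throws IOException {
        byte[] data = new byte[MAX_SIZE + 1];
        int offset = 0;
        int bytesRead;
        StringBuilder str = new StringBuilder();
        while ((bytesRead = in.read(data, offset, data.length - offset)) != -1) {
            // A multibyte character split across two reads is decoded in pieces.
            str.append(new String(data, offset, bytesRead, StandardCharsets.UTF_8));
            offset += bytesRead;
            if (offset >= data.length) {
                throw new IOException("Too much input");
            }
        }
        return str.toString();
    }
}
```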
This code fails to account for the interaction between variable-width character encodings and the boundaries between the loop iterations. If the last byte read from the data stream in one read() operation is the leading byte of a character, the trailing bytes are not encountered until the next iteration of the while loop. However, variable-width encoding is resolved during construction of the new String within the loop. Consequently, the variable-width encoding can be interpreted incorrectly. A similar problem can occur when constructing strings from UTF-16 data if the two surrogates of a supplementary character are separated.
Compliant Solution (Read)
This compliant solution defers creation of the string until all the data is available:
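A minimal sketch of this approach follows, assuming UTF-8 input. To keep the sketch runnable without a network connection, it takes an InputStream (such as the one returned by socket.getInputStream()) instead of a Socket; the class name DeferredDecodeRead is illustrative.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

final class DeferredDecodeRead {
    static final int MAX_SIZE = 1024;

    static String readBytes(InputStream in) throws IOException {
        byte[] data = new byte[MAX_SIZE + 1];
        int offset = 0;
        int bytesRead;
        // Accumulate raw bytes only; no decoding happens inside the loop.
        while ((bytesRead = in.read(data, offset, data.length - offset)) != -1) {
            offset += bytesRead;
            if (offset >= data.length) {
                throw new IOException("Too much input");
            }
        }
        // Decode once, after every byte of the input is available.
        return new String(data, 0, offset, StandardCharsets.UTF_8);
    }
}
```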
This code avoids splitting multibyte encoded characters across buffers by deferring construction of the result string until the data has been read in full.
Compliant Solution (Reader)
This compliant solution uses a Reader rather than an InputStream. The Reader class converts bytes into characters on the fly, so it avoids the hazard of splitting variable-width characters. This routine aborts if the socket provides more than 1024 characters rather than 1024 bytes.
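A minimal sketch of a Reader-based routine follows, assuming UTF-8 input. It wraps an InputStream (such as the one returned by socket.getInputStream()) in an InputStreamReader; the class name ReaderRead and the method readChars are illustrative.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

final class ReaderRead {
    static final int MAX_SIZE = 1024;

    static String readChars(InputStream in) throws IOException {
        // The reader keeps incomplete byte sequences buffered until the
        // remaining bytes arrive, so it never emits a partial character.
        Reader r = new InputStreamReader(in, StandardCharsets.UTF_8);
        char[] data = new char[MAX_SIZE + 1];
        int offset = 0;
        int charsRead;
        while ((charsRead = r.read(data, offset, data.length - offset)) != -1) {
            offset += charsRead;
            if (offset >= data.length) {
                throw new IOException("Too much input");
            }
        }
        return new String(data, 0, offset);
    }
}
```

Note that the limit here counts UTF-16 code units (char values), so a supplementary character counts as two toward the 1024 limit.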
Forming strings from character data containing partial characters can result in data corruption.
STR00-J. Don't form strings containing partial characters from variable-width encodings (LiveLesson)