Character information in Java SE 8 is based on the Unicode Standard, version 6.2.0 [Unicode 2012]. A Unicode transformation format (UTF) is an algorithmic mapping from every Unicode code point (except surrogate code points) to a unique byte sequence. The Java platform uses the UTF-16 encoding in char arrays and in the String, StringBuilder, and StringBuffer classes. However, Java programs must often process character data in various character encodings. The java.io.InputStreamReader, java.io.OutputStreamWriter, and java.lang.String classes, as well as the classes in the java.nio.charset package, can convert between UTF-16 and a number of other character encodings. The supported encodings vary among implementations of the Java platform. The class description for java.nio.charset.Charset lists the encodings that every implementation of the Java platform is required to support. These include US-ASCII, ISO-8859-1, and UTF-8.

UTF-8 is an example of a variable-width encoding for Unicode. UTF-8 uses 1 to 4 bytes per character, depending on the code point. UTF-8 can encode every Unicode code point; the largest, U+10FFFF, fits in 21 bits. The following table lists the well-formed UTF-8 byte sequences.

Bits of      First        Last         Bytes in
Code Point   Code Point   Code Point   Sequence   Byte 1     Byte 2     Byte 3     Byte 4
7            U+0000       U+007F       1          0xxxxxxx
11           U+0080       U+07FF       2          110xxxxx   10xxxxxx
16           U+0800       U+FFFF       3          1110xxxx   10xxxxxx   10xxxxxx
21           U+10000      U+10FFFF     4          11110xxx   10xxxxxx   10xxxxxx   10xxxxxx
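
The byte counts in the table can be checked with a short program. The following is a minimal sketch, not part of the original rule; the class name Utf8Lengths and the helper utf8Length are hypothetical names chosen for illustration. It encodes one sample code point from each row and reports how many bytes UTF-8 needs.

```java
import java.nio.charset.StandardCharsets;

public class Utf8Lengths {
  // Hypothetical helper: number of bytes UTF-8 uses for a single
  // (non-surrogate) code point.
  static int utf8Length(int codePoint) {
    String s = new String(Character.toChars(codePoint));
    return s.getBytes(StandardCharsets.UTF_8).length;
  }

  public static void main(String[] args) {
    // One sample per table row: 'A', e-acute, euro sign, an emoji.
    int[] samples = {0x41, 0xE9, 0x20AC, 0x1F600};
    for (int cp : samples) {
      System.out.printf("U+%04X -> %d byte(s)%n", cp, utf8Length(cp));
    }
  }
}
```

The four samples fall in the 1-, 2-, 3-, and 4-byte rows of the table, respectively.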

UTF-8 is self-synchronizing: character boundaries are easily identified by scanning for well-defined bit patterns in either direction. If bytes are lost because of error or corruption, the beginning of the next valid character can be located and processing resumed. Many variable-width encodings are harder to resynchronize. In some older variable-width encodings (such as Shift JIS), the last byte of one character and the first byte of the next character could together look like another valid character [Phillips 2005].
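
As an illustration of self-synchronization, the sketch below finds the start of the next character by skipping continuation bytes, which always have the bit pattern 10xxxxxx. The class and method names (Utf8Resync, isContinuation, nextBoundary) are hypothetical, and the code assumes the remaining bytes are otherwise well-formed UTF-8.

```java
public class Utf8Resync {
  // A UTF-8 continuation byte has the bit pattern 10xxxxxx.
  static boolean isContinuation(byte b) {
    return (b & 0xC0) == 0x80;
  }

  // Returns the index of the first byte at or after 'from' that is not a
  // continuation byte (that is, the start of a character), or data.length
  // if no such byte exists.
  static int nextBoundary(byte[] data, int from) {
    int i = from;
    while (i < data.length && isContinuation(data[i])) {
      i++;
    }
    return i;
  }

  public static void main(String[] args) {
    // 'a', the euro sign U+20AC (E2 82 AC), 'b'
    byte[] data = {0x61, (byte) 0xE2, (byte) 0x82, (byte) 0xAC, 0x62};
    // Starting in the middle of the euro sign, resynchronize at 'b'.
    System.out.println(nextBoundary(data, 2));
  }
}
```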

Like UTF-8, UTF-16 is a variable-width encoding. Unicode code points between U+10000 and U+10FFFF are called supplementary code points, and characters that have a supplementary code point are called supplementary characters. UTF-16 uses a single 16-bit code unit to encode the most common 63K characters and a pair of 16-bit code units, called surrogates, to encode each supplementary character. The first code unit of the pair is taken from the high-surrogates range (U+D800-U+DBFF), and the second is taken from the low-surrogates range (U+DC00-U+DFFF). Because the UTF-16 code unit ranges for high surrogates, low surrogates, and single units are all completely disjoint, there are no false matches, the location of a character boundary can be determined directly from each code unit value, and a dropped surrogate corrupts only a single character.
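
The following sketch shows how a supplementary character appears in a Java String and how a split position can be adjusted so that it never separates a surrogate pair. The class SurrogateDemo and the helper safeSplitIndex are hypothetical names used only for illustration; the Character and String methods are standard library calls.

```java
public class SurrogateDemo {
  // Hypothetical helper: if 'index' falls between a high surrogate and its
  // low surrogate, move it back by one so the pair stays together.
  static int safeSplitIndex(String s, int index) {
    if (index > 0 && index < s.length()
        && Character.isHighSurrogate(s.charAt(index - 1))
        && Character.isLowSurrogate(s.charAt(index))) {
      return index - 1;
    }
    return index;
  }

  public static void main(String[] args) {
    // U+1F600 is a supplementary code point, so it needs two code units.
    String s = "a" + new String(Character.toChars(0x1F600)) + "b";
    System.out.println(s.length());                      // UTF-16 code units
    System.out.println(s.codePointCount(0, s.length())); // code points
    System.out.println(safeSplitIndex(s, 2));            // adjusted split
  }
}
```

Here the string holds three code points in four code units, and a requested split at index 2 is moved back to index 1.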

Programmers must not form strings containing partial characters, for example, when converting variable-width encoded character data to strings.

Noncompliant Code Example (Read)

This noncompliant code example tries to read up to 1024 bytes from a socket and build a String from this data. It does so by reading the bytes in a while loop, as recommended by "FIO10-J. Ensure the array is filled when using read() to fill an array." If it ever detects that the socket has more than 1024 bytes available, it throws an exception, which prevents untrusted input from potentially exhausting the program's memory.

public final int MAX_SIZE = 1024;

public String readBytes(Socket socket) throws IOException {
  InputStream in = socket.getInputStream();
  byte[] data = new byte[MAX_SIZE+1];
  int offset = 0;
  int bytesRead = 0;
  String str = new String();
  while ((bytesRead = in.read(data, offset, data.length - offset)) != -1) {
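    // Noncompliant: each chunk is decoded on its own, so a character whose
    // bytes straddle two reads is decoded incorrectly
    str += new String(data, offset, bytesRead, "UTF-8");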
    offset += bytesRead;
    if (offset >= data.length) {
      throw new IOException("Too much input");
    }
  }
  in.close();
  return str;
}

This code fails to account for the interaction between variable-width character encodings and the boundaries between the loop iterations. If the last byte read from the data stream in one read() operation is the leading byte of a multibyte character, the trailing bytes are not encountered until the next iteration of the while loop. However, variable-width encoding is resolved during construction of the new String within the loop. Consequently, the variable-width encoding can be interpreted incorrectly. A similar problem can occur when constructing strings from UTF-16 data if the surrogate pair for a supplementary character is separated.
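
The failure can be reproduced without a socket. The sketch below is an illustration only; the class SplitDecodeDemo and its methods are hypothetical names. It decodes the three UTF-8 bytes of the euro sign (U+20AC) in two pieces, as the loop above would if a read() returned after the first byte, and compares the result with decoding the bytes all at once. The String constructors that take a Charset substitute the replacement character U+FFFD for malformed input.

```java
import java.nio.charset.StandardCharsets;

public class SplitDecodeDemo {
  // Decodes the two chunks separately, as the noncompliant loop does.
  static String decodePiecewise(byte[] data, int split) {
    return new String(data, 0, split, StandardCharsets.UTF_8)
        + new String(data, split, data.length - split, StandardCharsets.UTF_8);
  }

  // Decodes all of the bytes in one step.
  static String decodeWhole(byte[] data) {
    return new String(data, 0, data.length, StandardCharsets.UTF_8);
  }

  public static void main(String[] args) {
    byte[] data = "\u20AC".getBytes(StandardCharsets.UTF_8); // E2 82 AC
    String broken = decodePiecewise(data, 1);
    String intact = decodeWhole(data);
    System.out.println("piecewise equals original: " + broken.equals("\u20AC"));
    System.out.println("whole equals original:     " + intact.equals("\u20AC"));
  }
}
```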

Compliant Solution (Read)

This compliant solution defers creation of the string until all the data is available:

public final int MAX_SIZE = 1024;

public String readBytes(Socket socket) throws IOException {
  InputStream in = socket.getInputStream();
  byte[] data = new byte[MAX_SIZE+1];
  int offset = 0;
  int bytesRead = 0;
  while ((bytesRead = in.read(data, offset, data.length - offset)) != -1) {
    offset += bytesRead;
    if (offset >= data.length) {
      throw new IOException("Too much input");
    }
  }
  // Decode once, after all of the bytes have been read
  String str = new String(data, 0, offset, "UTF-8");
  in.close();
  return str;
}

This code avoids splitting multibyte encoded characters across buffers by deferring construction of the result string until the data has been read in full.

Compliant Solution (Reader)

This compliant solution uses a Reader rather than an InputStream. The Reader class converts bytes into characters on the fly, so it avoids the hazard of splitting variable-width characters. Because the buffer holds char values (UTF-16 code units), this routine aborts if the socket provides more than 1024 chars rather than more than 1024 bytes.

public final int MAX_SIZE = 1024;

public String readBytes(Socket socket) throws IOException {
  InputStream in = socket.getInputStream();
  Reader r = new InputStreamReader(in, "UTF-8");
  char[] data = new char[MAX_SIZE+1];
  int offset = 0;
  int charsRead = 0;
  String str = new String();
  while ((charsRead = r.read(data, offset, data.length - offset)) != -1) {
    str += new String(data, offset, charsRead);
    offset += charsRead;
    if (offset >= data.length) {
      throw new IOException("Too much input");
    }
  }
  in.close();
  return str;
}
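
To see that a Reader tolerates arbitrary read boundaries, the following sketch feeds an InputStreamReader from a stream that delivers at most one byte per read() call. It is an illustration under stated assumptions, not part of the original rule: ReaderSplitDemo, oneByteAtATime, and readAll are hypothetical names, and the one-byte stream merely stands in for a slow socket.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class ReaderSplitDemo {
  // Hypothetical stand-in for a slow socket: every bulk read returns at
  // most one byte, so every multibyte character is split across reads.
  static InputStream oneByteAtATime(byte[] data) {
    return new FilterInputStream(new ByteArrayInputStream(data)) {
      @Override
      public int read(byte[] b, int off, int len) throws IOException {
        return super.read(b, off, Math.min(len, 1));
      }
    };
  }

  // Reads the whole stream as UTF-8 text through a Reader.
  static String readAll(InputStream in) {
    StringBuilder sb = new StringBuilder();
    try (Reader r = new InputStreamReader(in, StandardCharsets.UTF_8)) {
      char[] buf = new char[16];
      int n;
      while ((n = r.read(buf, 0, buf.length)) != -1) {
        sb.append(buf, 0, n);
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    // 'a', euro sign (3 bytes), U+1F600 (4 bytes), 'b'
    String text = "a\u20AC\uD83D\uDE00b";
    byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
    System.out.println(readAll(oneByteAtATime(bytes)).equals(text));
  }
}
```

The reader buffers incomplete byte sequences internally and emits characters only once they are complete, which is why the decoded text matches the original even though no read() call delivers a whole multibyte character.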

Risk Assessment

Forming strings from character data containing partial characters can result in data corruption.

Rule      Severity   Likelihood   Remediation Cost   Priority   Level
STR00-J   Low        Unlikely     Medium             P2         L3

Automated Detection

Tool             Version   Checker     Description
Parasoft Jtest   9.5       INTER.COS   Implemented

Bibliography

[API 2014]         Classes Character and BreakIterator
[Java Tutorials]   Character Boundaries
[Phillips 2005]
[Seacord 2015]     STR00-J. Don't form strings containing partial characters from variable-width encodings LiveLesson
[Unicode 2012]