Character information in Java SE 8 is based on the Unicode Standard, version 6.2.0 [Unicode 2012]. A Unicode transformation format (UTF) is an algorithmic mapping from every Unicode code point (except surrogate code points) to a unique byte sequence. The Java platform uses the UTF-16 encoding in char arrays and in the String, StringBuilder, and StringBuffer classes. However, Java programs must often process character data in other character encodings. The java.io.InputStreamReader, java.io.OutputStreamWriter, and java.lang.String classes, along with the classes in the java.nio.charset package, can convert between UTF-16 and a number of other character encodings. The supported encodings vary among implementations of the Java platform; the class description for java.nio.charset.Charset lists the encodings that every implementation is required to support. These include US-ASCII, ISO-8859-1, and UTF-8.
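
As a brief illustration (a sketch, not part of the original rule; the class name is hypothetical), the following code round-trips a string through two of the required charsets using the java.nio.charset.StandardCharsets constants:

import java.nio.charset.StandardCharsets;

public class CharsetRoundTrip {
  public static void main(String[] args) {
    String s = "caf\u00E9";                                  // stored internally as UTF-16
    byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);        // 5 bytes: 0xE9 becomes 2 bytes
    byte[] latin1 = s.getBytes(StandardCharsets.ISO_8859_1); // 4 bytes: one byte per character
    System.out.println(s.equals(new String(utf8, StandardCharsets.UTF_8)));       // true
    System.out.println(s.equals(new String(latin1, StandardCharsets.ISO_8859_1))); // true
  }
}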

UTF-8 is an example of a variable-width encoding for Unicode. UTF-8 uses 1 to 4 bytes per character, depending on the code point. All 2^21 possible Unicode code point values can be encoded using UTF-8. The following table lists the well-formed UTF-8 byte sequences.

Bits of     First        Last         Bytes in
Code Point  Code Point   Code Point   Sequence   Byte 1     Byte 2     Byte 3     Byte 4
7           U+0000       U+007F       1          0xxxxxxx
11          U+0080       U+07FF       2          110xxxxx   10xxxxxx
16          U+0800       U+FFFF       3          1110xxxx   10xxxxxx   10xxxxxx
21          U+10000      U+10FFFF     4          11110xxx   10xxxxxx   10xxxxxx   10xxxxxx
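
The table can be verified with a short sketch (hypothetical code, standard library only) that prints the UTF-8 byte length of one character from each row:

import java.nio.charset.StandardCharsets;

public class Utf8Lengths {
  public static void main(String[] args) {
    int[] codePoints = {0x0041, 0x00E9, 0x20AC, 0x1F600}; // one code point per table row
    for (int cp : codePoints) {
      String s = new String(Character.toChars(cp));
      System.out.printf("U+%04X -> %d byte(s)%n",
                        cp, s.getBytes(StandardCharsets.UTF_8).length); // 1, 2, 3, 4
    }
  }
}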

UTF-8 is self-synchronizing: character boundaries are easily identified by scanning for well-defined bit patterns in either direction. If bytes are lost to error or corruption, the beginning of the next valid character can be located and processing can resume. Many variable-width encodings are harder to resynchronize. In some older variable-width encodings (such as Shift JIS), the end byte of a character and the first byte of the next character could look like another valid character [Phillips 2005].
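
To illustrate this property (a hypothetical sketch, not from the original rule), the start of the last, possibly truncated, UTF-8 character in a buffer can be found by scanning backward past continuation bytes, which always match the bit pattern 10xxxxxx:

public class Utf8Resync {
  // Returns the index of the first byte of the last (possibly truncated)
  // UTF-8 character in buf[0..len).
  static int lastCharBoundary(byte[] buf, int len) {
    int i = len - 1;
    while (i > 0 && (buf[i] & 0xC0) == 0x80) {
      i--; // skip continuation bytes (10xxxxxx)
    }
    return i;
  }

  public static void main(String[] args) {
    byte[] data = "a\u20AC".getBytes(java.nio.charset.StandardCharsets.UTF_8);
    System.out.println(lastCharBoundary(data, data.length)); // 1 (start of the 3-byte euro sign)
  }
}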

Like UTF-8, UTF-16 is a variable-width encoding. Unicode code points between U+10000 and U+10FFFF are called supplementary code points, and Unicode-encoded characters having a supplementary code point are called supplementary characters. UTF-16 uses a single 16-bit code unit to encode the most common 63K characters, and a pair of 16-bit code units called surrogates to encode supplementary characters. The first unit of the pair is taken from the high-surrogates range (U+D800-U+DBFF), and the second from the low-surrogates range (U+DC00-U+DFFF). Because the UTF-16 code unit ranges for high surrogates, low surrogates, and single units are all completely disjoint, there are no false matches, the location of a character boundary can be determined directly from each code unit value, and a dropped surrogate corrupts only a single character.
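
For example, a supplementary character such as U+1F600 occupies two char values in Java, which the following sketch (hypothetical, using the standard Character methods) makes visible:

public class SurrogateDemo {
  public static void main(String[] args) {
    char[] units = Character.toChars(0x1F600); // a supplementary character
    System.out.println(units.length);                        // 2
    System.out.println(Character.isHighSurrogate(units[0])); // true: U+D83D
    System.out.println(Character.isLowSurrogate(units[1]));  // true: U+DE00
    System.out.println(Character.toCodePoint(units[0], units[1]) == 0x1F600); // true
  }
}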

Programmers must not form strings containing partial characters, for example, when converting variable-width encoded character data to strings.

Noncompliant Code Example (Read)

This noncompliant code example tries to read up to 1024 bytes from a socket and build a String from the data. It does so by reading the bytes in a while loop, as recommended by FIO10-J. Ensure the array is filled when using read() to fill an array. If the socket supplies more than 1024 bytes, the method throws an exception, which prevents untrusted input from potentially exhausting the program's memory.

public final int MAX_SIZE = 1024;

public String readBytes(Socket socket) throws IOException {
  InputStream in = socket.getInputStream();
  byte[] data = new byte[MAX_SIZE+1];
  int offset = 0;
  int bytesRead = 0;
  String str = new String();
  while ((bytesRead = in.read(data, offset, data.length - offset)) != -1) {
    str += new String(data, offset, bytesRead, "UTF-8"); // may decode a partial character at the buffer boundary
    offset += bytesRead;
    if (offset >= data.length) {
      throw new IOException("Too much input");
    }
  }
  in.close();
  return str;
}

This code fails to account for the interaction between variable-width character encodings and the boundaries between loop iterations. If the last byte read from the data stream in one read() call is the leading byte of a multibyte character, the trailing bytes are not encountered until the next iteration of the while loop. However, the variable-width encoding is resolved during construction of the new String inside the loop, so the partial character is decoded incorrectly. A similar problem can occur when constructing strings from UTF-16 data if the surrogate pair for a supplementary character is separated.
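
The decoding failure is easy to reproduce in isolation. This sketch (not part of the original example; the class name is hypothetical) splits the 3-byte UTF-8 encoding of the euro sign across two String constructions:

import java.nio.charset.StandardCharsets;

public class SplitDecodeDemo {
  public static void main(String[] args) {
    byte[] euro = "\u20AC".getBytes(StandardCharsets.UTF_8); // 3 bytes: 0xE2 0x82 0xAC
    String first = new String(euro, 0, 2, StandardCharsets.UTF_8);  // truncated lead sequence
    String second = new String(euro, 2, 1, StandardCharsets.UTF_8); // stray continuation byte
    // Each piece decodes to the replacement character U+FFFD, so the
    // concatenation does not reproduce the original character:
    System.out.println("\u20AC".equals(first + second)); // false
  }
}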

Compliant Solution (Read)

This compliant solution defers creation of the string until all the data is available:

public final int MAX_SIZE = 1024;

public String readBytes(Socket socket) throws IOException {
  InputStream in = socket.getInputStream();
  byte[] data = new byte[MAX_SIZE+1];
  int offset = 0;
  int bytesRead = 0;
  while ((bytesRead = in.read(data, offset, data.length - offset)) != -1) {
    offset += bytesRead;
    if (offset >= data.length) {
      throw new IOException("Too much input");
    }
  }
  String str = new String(data, 0, offset, "UTF-8");
  in.close();
  return str;
}

This code avoids splitting multibyte encoded characters across buffers by deferring construction of the result string until the data has been read in full.

Compliant Solution (Reader)

This compliant solution uses a Reader rather than an InputStream. The Reader class converts bytes into characters on the fly, so it avoids the hazard of splitting variable-width characters. This routine aborts if the socket provides more than 1024 characters rather than 1024 bytes.

public final int MAX_SIZE = 1024;

public String readBytes(Socket socket) throws IOException {
  InputStream in = socket.getInputStream();
  Reader r = new InputStreamReader(in, "UTF-8");
  char[] data = new char[MAX_SIZE+1];
  int offset = 0;
  int charsRead = 0;
  String str = new String();
  while ((charsRead = r.read(data, offset, data.length - offset)) != -1) {
    str += new String(data, offset, charsRead);
    offset += charsRead;
    if (offset >= data.length) {
      throw new IOException("Too much input");
    }
  }
  in.close();
  return str;
}
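
Where the raw bytes must also be retained, the same boundary-safe decoding can be done by hand with java.nio.charset.CharsetDecoder, which leaves an incomplete trailing sequence in the input buffer when endOfInput is false so that it can be completed by the next chunk. A minimal sketch follows (hypothetical, with CoderResult error checking omitted):

import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.StandardCharsets;

public class IncrementalDecode {
  public static void main(String[] args) {
    CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder();
    byte[] chunk1 = {0x61, (byte) 0xE2, (byte) 0x82}; // 'a' plus two of the three euro-sign bytes
    byte[] chunk2 = {(byte) 0xAC};                    // final euro-sign byte
    CharBuffer out = CharBuffer.allocate(16);

    ByteBuffer in1 = ByteBuffer.wrap(chunk1);
    decoder.decode(in1, out, false); // endOfInput=false: the partial character stays in in1

    ByteBuffer in2 = ByteBuffer.allocate(8);
    in2.put(in1).put(chunk2).flip(); // carry the leftover bytes into the next buffer
    decoder.decode(in2, out, true);  // endOfInput=true: no more chunks
    decoder.flush(out);

    out.flip();
    System.out.println(out); // prints "a" followed by the euro sign
  }
}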

Risk Assessment

Forming strings from character data containing partial characters can result in data corruption.

Rule     Severity  Likelihood  Remediation Cost  Priority  Level
STR00-J  Low       Unlikely    Medium            P2        L3

Automated Detection

Tool            Version  Checker         Description
Parasoft Jtest  2023.1   CERT.STR00.COS  Do not use String concatenation in an Internationalized environment

Bibliography

[Phillips 2005]
[Unicode 2012]   The Unicode Consortium. The Unicode Standard, Version 6.2.0. Mountain View, CA: The Unicode Consortium, 2012.

11 Comments

  1. I've changed the description in the front from Shift JIS to UTF-8 for a variety of reasons, but mainly because it is used in the examples.  I removed the following statement that applies to Shift JIS but not UTF-8 and therefore did not apply to the examples:

    The trailing byte ranges overlap the range of both the single-byte and lead-byte characters. When a multibyte character is separated across a buffer boundary, it can be interpreted differently than if it were not separated across the buffer boundary; this difference arises because of the ambiguity of its composing bytes [Phillips 2005].

    Here is a description of some of the differences between these encodings:

    • UTF-8 is self-synchronizing: character boundaries are easily identified by scanning for well-defined bit patterns in either direction. If bytes are lost due to error or corruption, one can always locate the beginning of the next valid character and resume processing. Many multi-byte encodings are much harder to resynchronize.
    • Any byte oriented string searching algorithm can be used with UTF-8 data, since the sequence of bytes for a character cannot occur anywhere else. Some older variable-length encodings (such as Shift JIS) did not have this property and thus made string-matching algorithms rather complicated. In Shift JIS the end byte of a character and the first byte of the next character could look like another legal character, something that can't happen in UTF-8.
  2. There are a number of ways to refer to encodings like UTF-8 and Shift JIS including:  multibyte, variable-width, variable-length, and byte encodings.

    I've gone here with "variable-width".  I don't like "multibyte" because it applies to an encoding like UTF-32 where each character uses four bytes.

    I went with width over length just because of wide characters in C and other stuff I'm used to, I suppose.  The term "variable-width" is also used on this page: http://www.unicode.org/faq/utf_bom.html

  3. A better table for illustrating how UTF-8 is encoded in 4 bytes
    https://en.wikipedia.org/wiki/UTF-8#Description

    This rule smells like a FIO rule right now, prob b/c of the examples.

    I think this rule and STR03-J could be merged to make a stronger rule "don't build strings from byte arrays that are not designed to hold a complete String"

    1. I think that STR is definitely the correct home for this rule.

      Concerning your last point, I think that you mean STR01-J.  There are certainly similarities, and the example here is almost the same as the first example in STR01-J.  However, combining the two might make an overly complicated rule.  Perhaps it would be better to insert a reference to STR01-J in this rule.  (There is already a reference to this rule in STR01-J.)

      1. Yes, I think I would like to keep them separate.  The biggest difference to me is that for STR01-J you also need to be concerned about operations using the various string types because these are all UTF-16 encoded.

  4. There seem to be two additional errors in the noncompliant code fragment.

    • bytesRead is added to offset before the new string is constructed, so the new string does not consist of the data just read.
    • Creation of the new string from the input always takes the rest of the buffer into account. The more precise code would be
      str += new String(data, offset, bytesRead, "UTF-8");

    1. Thanks, I've fixed all the code samples to read text properly. The NCCE correctly reads text if it does not break up a variable-width character.

  5. The priority and level scores for STR00-J on the parent page (P12, L1) disagree with those on this page (P2, L3). Which is correct?

    1. The scores on this page (P2, L3) are correct; I've fixed the parent page.

      1. That was fast! Thanks!

  6. The last line of that UTF-8 table doesn't seem right.