...

Note that the read() methods return as soon as they find available input data. Ignoring the result returned by the read() methods is a violation of guideline EXP00-J. Do not ignore values returned by methods. Security issues can arise even when return values are considered, because the default behavior of the read() methods lacks any guarantee that the entire buffer array will be filled. The programmer must check the number of bytes actually read and call the read() method again as required.
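The short-read behavior described above can be reproduced with a small sketch. The class ShortReadDemo and the nested TrickleStream are hypothetical names introduced only for illustration; TrickleStream simply caps how many bytes each read() call may deliver, which is one way a real file or network stream might behave.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

final class ShortReadDemo {

  /** Hypothetical stream that hands out at most 'chunk' bytes per read() call. */
  static final class TrickleStream extends ByteArrayInputStream {
    private final int chunk;

    TrickleStream(byte[] source, int chunk) {
      super(source);
      this.chunk = chunk;
    }

    @Override
    public synchronized int read(byte[] b, int off, int len) {
      // Deliver fewer bytes than requested whenever len exceeds the cap.
      return super.read(b, off, Math.min(len, chunk));
    }
  }

  /** Returns the count reported by a single read() into a 1024-byte buffer. */
  static int singleRead(InputStream in) throws IOException {
    byte[] data = new byte[1024];
    return in.read(data, 0, data.length);
  }

  public static void main(String[] args) throws IOException {
    InputStream in = new TrickleStream(new byte[1024], 100);
    // 1024 bytes are available, but one call fills only part of the buffer.
    System.out.println("requested 1024, got " + singleRead(in)); // got 100
  }
}
```

A caller that assumes the buffer is full after one call would treat the remaining 924 zero bytes as input data.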

Another source of data read errors is failure to correctly handle multibyte encoded data. Multibyte encodings such as UTF-8 are used for character sets that require more than one byte to uniquely identify each constituent character. For example, the Japanese encoding Shift-JIS (shown below) supports multibyte encoding wherein the maximum character length is two bytes (one leading and one trailing byte).

...

The trailing byte ranges overlap the range of both the single byte and lead byte characters. When a multibyte character is separated across a buffer boundary, it can be interpreted differently than if it were not separated across the buffer boundary; this difference arises due to the ambiguity of its composing bytes [Phillips 2005].

...

This noncompliant code example attempts to read 1024 bytes from a FileInputStream but suffers from a few pitfalls. The objective is to read 1024 bytes and to return them as a String. Unfortunately, this may not happen because of the general contract of the read() methods.

...

A second issue involves multibyte character encoding. When the last byte read from the data stream is the leading byte of a multibyte character, the trailing bytes will not be encountered until the next iteration of the while loop. However, multibyte encoding is resolved during construction of the new String within the loop. Consequently, the multibyte encoding will be interpreted incorrectly in this case.
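The effect of decoding each buffer separately can be shown with a minimal sketch. It uses UTF-8 rather than Shift-JIS because UTF-8 support is guaranteed on every Java platform; the class and method names (SplitCharacterDemo, decodePerChunk, decodeWhole) are hypothetical and are not taken from the noncompliant example.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class SplitCharacterDemo {

  /** Decodes each fixed-size chunk on its own, as a per-iteration new String(...) would. */
  static String decodePerChunk(byte[] bytes, int chunkSize) {
    StringBuilder sb = new StringBuilder();
    for (int start = 0; start < bytes.length; start += chunkSize) {
      int end = Math.min(start + chunkSize, bytes.length);
      byte[] chunk = Arrays.copyOfRange(bytes, start, end);
      sb.append(new String(chunk, StandardCharsets.UTF_8));
    }
    return sb.toString();
  }

  /** Decodes the whole array once, after all bytes are available. */
  static String decodeWhole(byte[] bytes) {
    return new String(bytes, StandardCharsets.UTF_8);
  }

  public static void main(String[] args) {
    // 'a' is one byte in UTF-8; U+00E9 is two bytes (0xC3 0xA9).
    byte[] bytes = "a\u00e9".getBytes(StandardCharsets.UTF_8);
    System.out.println(decodeWhole(bytes).equals("a\u00e9"));       // true
    // A chunk size of 2 separates 0xC3 from 0xA9, so neither half decodes.
    System.out.println(decodePerChunk(bytes, 2).equals("a\u00e9")); // false
  }
}
```

In the per-chunk case the String constructor substitutes the replacement character U+FFFD for each malformed fragment, so the original character is lost without any exception being thrown.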

Finally, because no specific character encoding is specified in the call to the String class constructor, the buffer str contains data represented by the system's default character encoding. This will be problematic when the system's default character encoding differs from the intended character encoding.
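As a rough illustration of why the encoding matters, the sketch below decodes the same two bytes with two different character encodings. ISO-8859-1 stands in for a platform default that differs from the intended UTF-8; that choice, and the names ExplicitCharsetDemo and decode, are assumptions made for the example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class ExplicitCharsetDemo {

  /** Decodes the bytes with the charset named by the caller, never the platform default. */
  static String decode(byte[] data, Charset charset) {
    return new String(data, charset);
  }

  public static void main(String[] args) {
    // U+00E9 encoded as UTF-8 occupies two bytes: 0xC3 0xA9.
    byte[] data = "\u00e9".getBytes(StandardCharsets.UTF_8);

    // Intended encoding: one character.
    System.out.println(decode(data, StandardCharsets.UTF_8).length());      // 1
    // Mismatched encoding: each byte becomes its own character.
    System.out.println(decode(data, StandardCharsets.ISO_8859_1).length()); // 2

    // The charset that new String(byte[]) would silently use on this system.
    System.out.println(Charset.defaultCharset());
  }
}
```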

Compliant Solution (1)

This compliant solution reads all the desired bytes into its buffer, accounting for the total number of bytes read and adjusting the remaining bytes' offset, thus ensuring that the required data is read in full. It avoids splitting multibyte encoded characters across buffers by deferring construction of the result String until all of the desired data has been read. It also specifies an explicit character encoding for the String constructor to facilitate portability across systems that use different default character encodings.

public static String readBytes(FileInputStream in) throws IOException {
  int offset = 0;
  int bytesRead = 0;
  byte[] data = new byte[1024];
  // Keep reading until the buffer is full or end of stream (-1) is reached.
  while (offset < data.length
      && (bytesRead = in.read(data, offset, data.length - offset)) != -1) {
    offset += bytesRead;
  }
  // Decode once, using only the bytes actually read, with an explicit encoding.
  String str = new String(data, 0, offset, "UTF-8");
  return str;
}

The size of the data byte buffer depends on the maximum number of bytes required to write an encoded character. For example, UTF-8 encoded data requires a maximum of four bytes to represent any character above U+FFFF. Because Java uses the UTF-16 character encoding to represent char data, such sequences are split into two separate char values of two bytes each. Consequently, the buffer size should be four times the size of a typical byte sequence.
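The byte and char counts mentioned above can be checked with a short sketch. The class Utf8LengthDemo and its helper utf8Length are hypothetical names; the code points are arbitrary samples from the one-, two-, three-, and four-byte UTF-8 ranges.

```java
import java.nio.charset.StandardCharsets;

final class Utf8LengthDemo {

  /** Number of bytes UTF-8 needs for one code point (surrogate code points excluded). */
  static int utf8Length(int codePoint) {
    String s = new String(Character.toChars(codePoint));
    return s.getBytes(StandardCharsets.UTF_8).length;
  }

  public static void main(String[] args) {
    int[] samples = {0x41, 0xE9, 0x20AC, 0x1F600};
    for (int cp : samples) {
      // Character.charCount reports how many UTF-16 char values Java uses.
      System.out.printf("U+%04X: %d UTF-8 byte(s), %d char(s)%n",
          cp, utf8Length(cp), Character.charCount(cp));
    }
  }
}
```

For U+1F600, which lies above U+FFFF, this reports four UTF-8 bytes and two char values (a surrogate pair).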

...

The one-argument and three-argument readFully() methods of the DataInputStream class can be used to read all the requested data. An EOFException is thrown if the end of input is detected before the required number of bytes has been read, and an IOException is thrown if some other input/output error occurs. The caller's exception handler decides how to proceed.

public static String readBytes(FileInputStream fis) throws IOException {
  byte[] data = new byte[1024];
  DataInputStream dis = new DataInputStream(fis);
  dis.readFully(data);
  String str = new String(data, "UTF-8");
  return str;
}
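One possible way for a caller to handle truncated input is sketched below. The class ReadFullyDemo and the method tryReadFully are hypothetical, and returning false on early end of input is only one policy among several; an application might instead rethrow or log the exception.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

final class ReadFullyDemo {

  /** Returns true if data was filled completely, false if input ended first. */
  static boolean tryReadFully(InputStream in, byte[] data) throws IOException {
    DataInputStream dis = new DataInputStream(in);
    try {
      dis.readFully(data); // blocks until data.length bytes have been read
      return true;
    } catch (EOFException e) {
      // End of input arrived early; the contents of data are incomplete.
      return false;
    }
  }

  public static void main(String[] args) throws IOException {
    byte[] data = new byte[1024];
    System.out.println(tryReadFully(new ByteArrayInputStream(new byte[1024]), data)); // true
    System.out.println(tryReadFully(new ByteArrayInputStream(new byte[10]), data));   // false
  }
}
```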

...

[API 2006] Class InputStream, DataInputStream
[Chess 2007] 8.1 Handling Errors with Return Codes
[Harold 1999] Chapter 7: Data Streams, Reading Byte Arrays
[MITRE 2009] CWE ID 135 (http://cwe.mitre.org/data/definitions/135.html) "Incorrect Calculation of Multi-Byte String Length"
[Phillips 2005]

...
