You are viewing an old version of this page. View the current version.

Compare with Current View Page History

« Previous Version 16 Next »

UTF-8 is a variable-width encoding for Unicode. UTF-8 uses one to four bytes per character, depending on the Unicode symbol. UTF-8 has the following properties.

  • The classical US-ASCII characters (0 to 0x7f) encode as themselves, so files and strings that are encoded with ASCII values have the same encoding under both ASCII and UTF-8.
  • All UCS characters beyond (0x7f) are encoded as a multibyte sequence consisting only of bytes in the range of 0x80 to 0xfd. This means that no ASCII byte can appear as part of another character.
  • It's easy to convert between UTF-8 and UCS-2 and UCS-4 fixed-width representations of characters.
  • The lexicographic sorting order of UCS-4 strings is preserved.
  • All possible 2^31 UCS codes can be encoded using UTF-8

Generally, all programs should perform checks for any UTF-8 data for UTF-8 legality before performing other checks. The table below listed all Legal UTF-8 Sequences.

UCS Code (HEX)

Binary UTF-8 Format

Legal UTF-8 Values (HEX)

00-7F

0xxxxxxx

00-7F

80-7FF

110xxxxx 10xxxxxx

C2-DF 80-BF

800-FFF

1110xxxx 10xxxxxx 10xxxxxx

E0 A0*-BF 80-BF

1000-FFFF

1110xxxx 10xxxxxx 10xxxxxx

E1-EF 80-BF 80-BF

10000-3FFFF

11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

F0 90*-BF 80-BF 80-BF

40000-FFFFFF

11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

F1-F3 80-BF 80-BF 80-BF

40000-FFFFFF

11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

F1-F3 80-BF 80-BF 80-BF

100000-10FFFFF

11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

F4 80-8F* 80-BF 80-BF

Security Related Issues

According to [[Yergeau 98]]:

Implementors of UTF-8 need to consider the security aspects of how they handle illegal UTF-8 sequences. It is conceivable that in some circumstances an attacker would be able to exploit an incautious UTF-8 parser by sending it an octet sequence that is not permitted by the UTF-8 syntax.

A particularly subtle form of this attack could be carried out against a parser which performs security-critical validity checks against the UTF-8 encoded form of its input, but interprets certain illegal octet sequences as characters. For example, a parser might prohibit the NUL character when encoded as the single-octet sequence 00, but allow the illegal two-octet sequence C0 80 and interpret it as a NUL character. Another example might be a parser which prohibits the octet sequence 2F 2E 2E 2F ("/../"), yet permits the illegal octet sequence 2F C0 AE 2E 2F.

Below are more specific recommendations.

Only Accept the "shortest" form

Only the "shortest" form of UTF-8 should be permitted. Naive decoders might accept encoding that are longer than necessary, allowing for potentially dangerous input to have multiple representations. For example:

  1. Process A performs security checks, but does not check for non-shortest UTF-8 forms.
  2. Process B accepts the byte sequence from process A, and transform it into UTF-16 while interpreting possible non-shortest forms.
  3. The UTF-16 text may then contain characters that should have been filtered out by process A, and could potentially be dangerous.

These non-"shortest" UTF-8 attacks have been used to bypass security validations in high profile products, such as Microsoft's IIS web server.

Handling Invalid Inputs

UTF-8 decoders have no uniformly defined behavior upon encountering an invalid input. Below are several ways a UTF-8 decoder might behave in the event of an invalid byte sequence:

  1. Insert a replacement character (e.g. '?', the "wild-card" character)
  2. Ignore the bytes.
  3. Interpret the bytes according to a different character encoding (often the ISO-8859-1 character map)
  4. Not notice and decode as if the bytes were some similar bit of UTF-8.
  5. Stop decoding and report an error

The following function from [[Viega 03]] will detect illegal character sequences in a string. It returns 1 if the string is comprised only of legitimate sequences, else it returns 0:

int spc_utf8_isvalid(const unsigned char *input) {
  int nb;
  const unsigned char *c = input;

  for (c = input;  *c;  c += (nb + 1)) {
    if (!(*c & 0x80)) nb = 0;
    else if ((*c & 0xc0) =  = 0x80) return 0;
    else if ((*c & 0xe0) =  = 0xc0) nb = 1;
    else if ((*c & 0xf0) =  = 0xe0) nb = 2;
    else if ((*c & 0xf8) =  = 0xf0) nb = 3;
    else if ((*c & 0xfc) =  = 0xf8) nb = 4;
    else if ((*c & 0xfe) =  = 0xfc) nb = 5;
    while (nb-- > 0)
      if ((*(c + nb) & 0xc0) != 0x80) return 0;
  }
  return 1;
}

Broken Surrogates

Encoding of individual or out of order surrogate halves should not be permitted. Broken surrogates are illegal in Unicode, and introduce ambiguity when they appear in Unicode data. Broken surrogates are often signs of bad data transmission. They could also indicate internal bugs in an application, or intentional efforts to find security vulnerabilities.

Reference

[[Kuhn 06]] UTF-8 and Unicode FAQ for Unix/Linux
[[Viega 03]] Section 3.12. "Detecting Illegal UTF-8 Characters"
[[Wheeler 06]] Secure Programming for Linux and Unix HOWTO
[[Yergeau 98]] RFC 2279 - UTF-8, a transformation format of ISO 10646

  • No labels