
...

|| UCS Code (HEX) || Binary UTF-8 Format || Valid UTF-8 Values (HEX) ||
| 00-7F | 0xxxxxxx | 00-7F |
| 80-7FF | 110xxxxx 10xxxxxx | C2-DF 80-BF |
| 800-FFF | 1110xxxx 10xxxxxx 10xxxxxx | E0 A0*-BF 80-BF |
| 1000-FFFF | 1110xxxx 10xxxxxx 10xxxxxx | E1-EF 80-BF 80-BF |
| 10000-3FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx | F0 90*-BF 80-BF 80-BF |
| 40000-FFFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx | F1-F3 80-BF 80-BF 80-BF |
| 100000-10FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx | F4 80-8F* 80-BF 80-BF |
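
As a rough illustration of the bit layouts in the table, the following sketch encodes a single code point into UTF-8. The name {{utf8_encode}} and its interface are assumptions made for this example and are not taken from any library. It covers only the range 00-10FFFF shown in the table and, like the table, makes no special case for individual code points within that range.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not from any library): encode code point cp into
 * out[], which must have room for 4 bytes. Returns the number of bytes
 * written, or 0 if cp is above 10FFFF.
 */
size_t utf8_encode(uint32_t cp, unsigned char out[4]) {
  if (cp <= 0x7F) {                      /* 0xxxxxxx */
    out[0] = (unsigned char)cp;
    return 1;
  }
  if (cp <= 0x7FF) {                     /* 110xxxxx 10xxxxxx */
    out[0] = (unsigned char)(0xC0 | (cp >> 6));
    out[1] = (unsigned char)(0x80 | (cp & 0x3F));
    return 2;
  }
  if (cp <= 0xFFFF) {                    /* 1110xxxx 10xxxxxx 10xxxxxx */
    out[0] = (unsigned char)(0xE0 | (cp >> 12));
    out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[2] = (unsigned char)(0x80 | (cp & 0x3F));
    return 3;
  }
  if (cp <= 0x10FFFF) {                  /* 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
    out[0] = (unsigned char)(0xF0 | (cp >> 18));
    out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (unsigned char)(0x80 | (cp & 0x3F));
    return 4;
  }
  return 0;                              /* outside the range covered by the table */
}
```

Because each branch is selected by the smallest range that contains the code point, an encoder written this way always emits the shortest form; for example, code point 20AC comes out as the three bytes E2 82 AC.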

Although UTF-8 originated from the Plan 9 developers \[[Pike 1993|AA. Bibliography#Pike 93]\], Plan 9's own support covers only the low 16-bit range. More generally, many "Unicode" systems support only the low 16-bit range rather than the full 31-bit ISO 10646 code space \[[ISO/IEC 10646:2003(E)|AA. Bibliography#ISO/IEC 10646-2003]\].

Security-Related Issues

According to \[[Yergeau 1998|AA. Bibliography#Yergeau 98]\]:

Implementors of UTF-8 need to consider the security aspects of how they handle invalid UTF-8 sequences. It is conceivable that, in some circumstances, an attacker would be able to exploit an incautious UTF-8 parser by sending it an octet sequence that is not permitted by the UTF-8 syntax.

A particularly subtle form of this attack can be carried out against a parser that performs security-critical validity checks against the UTF-8 encoded form of its input, but interprets certain invalid octet sequences as characters. For example, a parser might prohibit the null character when encoded as the single-octet sequence 00, but allow the invalid two-octet sequence C0 80 and interpret it as a null character. Another example might be a parser which prohibits the octet sequence 2F 2E 2E 2F ("/../"), yet permits the invalid octet sequence 2F C0 AE 2E 2F.
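
The second example in the quotation can be sketched in a few lines of C. Both functions below are hypothetical and written only for this illustration: {{has_dotdot}} stands in for a security check performed on the encoded bytes, and {{lenient_decode}} stands in for an incautious decoder that handles one- and two-octet sequences without checking for the shortest form.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical byte-level filter: reports whether s contains "/../". */
int has_dotdot(const char *s) {
  return strstr(s, "/../") != NULL;
}

/*
 * Hypothetical, deliberately incautious decoder. It understands only
 * 1- and 2-octet sequences and never checks that a 2-octet sequence is
 * the shortest form. Decoded values up to 0xFF are stored in out[] as a
 * NUL-terminated byte string. Returns the output length, or (size_t)-1
 * on input it does not understand or if out[] is too small.
 */
size_t lenient_decode(const unsigned char *in, unsigned char *out, size_t outsz) {
  size_t n = 0;

  if (outsz == 0) return (size_t)-1;
  while (*in) {
    unsigned int cp;
    if (*in < 0x80) {
      cp = *in++;
    } else if ((*in & 0xE0) == 0xC0 && (in[1] & 0xC0) == 0x80) {
      /* No shortest-form check: C0 AE is accepted and decodes to 2E ('.') */
      cp = ((unsigned int)(in[0] & 0x1F) << 6) | (in[1] & 0x3F);
      in += 2;
    } else {
      return (size_t)-1;
    }
    if (cp > 0xFF || n + 1 >= outsz) return (size_t)-1;
    out[n++] = (unsigned char)cp;
  }
  out[n] = '\0';
  return n;
}
```

Given the octets 2F C0 AE 2E 2F, {{has_dotdot}} finds nothing to object to in the encoded input, yet the output of {{lenient_decode}} is the four-character string "/../", which the same check would have rejected.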

...

  1. Process A performs security checks, but does not check for non-shortest UTF-8 forms.
  2. Process B accepts the byte sequence from process A and transforms it into UTF-16 while interpreting possible non-shortest forms.
  3. The UTF-16 text may contain characters that should have been filtered out by process A and can potentially be dangerous. These non-"shortest" UTF-8 attacks have been used to bypass security validations in high-profile products, such as Microsoft's IIS web server.

[Corrigendum #1: UTF-8 Shortest Form|http://www.unicode.org/versions/corrigendum1.html] to the Unicode Standard \[[Unicode 2006|AA. Bibliography#Unicode 06]\] describes the modifications to Version 3.0 of The Unicode Standard needed to define what is meant by the shortest form.
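
One possible way to enforce the shortest form is to check each sequence against the byte ranges in the "Valid UTF-8 Values" column of the table earlier in this section. The function below is a minimal sketch under that assumption; the names {{utf8_is_shortest_form}} and {{in_range}} are hypothetical, and the function follows the table as given rather than any stricter definition.

```c
#include <stddef.h>

/* Returns 1 if b lies in the inclusive range lo..hi, else 0. */
static int in_range(unsigned char b, unsigned char lo, unsigned char hi) {
  return b >= lo && b <= hi;
}

/*
 * Hypothetical strict validator for a NUL-terminated string. Returns 1 if
 * every sequence matches the valid byte ranges (and is therefore in
 * shortest form), otherwise 0.
 */
int utf8_is_shortest_form(const unsigned char *s) {
  while (*s) {
    unsigned char lead = *s;
    unsigned char lo = 0x80, hi = 0xBF;   /* allowed range for the second byte */
    int nb;                               /* continuation bytes expected */
    int i;

    if (lead <= 0x7F) { s++; continue; }
    else if (in_range(lead, 0xC2, 0xDF)) nb = 1;     /* C0 and C1 are always overlong */
    else if (lead == 0xE0) { nb = 2; lo = 0xA0; }    /* below A0 would encode < 800 */
    else if (in_range(lead, 0xE1, 0xEF)) nb = 2;
    else if (lead == 0xF0) { nb = 3; lo = 0x90; }    /* below 90 would encode < 10000 */
    else if (in_range(lead, 0xF1, 0xF3)) nb = 3;
    else if (lead == 0xF4) { nb = 3; hi = 0x8F; }    /* above 8F would exceed 10FFFF */
    else return 0;                                   /* 80-C1 and F5-FF cannot start a sequence */

    /* A NUL byte fails every range test, so the scan stops at the terminator. */
    if (!in_range(s[1], lo, hi)) return 0;
    for (i = 2; i <= nb; i++)
      if (!in_range(s[i], 0x80, 0xBF)) return 0;
    s += nb + 1;
  }
  return 1;
}
```

With this check in place of process A's filter, the sequences C0 80 and 2F C0 AE 2E 2F are rejected before any decoder sees them, while a well-formed sequence such as E2 82 AC is accepted.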

Handling Invalid Inputs

UTF-8 decoders have no uniformly defined behavior upon encountering an invalid input. Below are several ways a UTF-8 decoder might behave in the event of an invalid byte sequence:

  1. Insert a replacement character (e.g., "?," the "wild-card" character).
  2. Ignore the bytes.
  3. Interpret the bytes according to a different character encoding (often the ISO-8859-1 character map).
  4. Fail to notice the error and decode the bytes as if they were some similar bit of UTF-8.
  5. Stop decoding and report an error.
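
To make the difference between these behaviors concrete, the sketch below implements behaviors 1, 2, and 5 behind a single policy parameter. All names ({{utf8_policy}}, {{decode_one}}, {{utf8_decode}}) are hypothetical, and the use of U+FFFD as the replacement character, the one-byte resynchronization step, and the rejection of non-shortest forms are assumptions of this example rather than requirements stated above.

```c
#include <stddef.h>
#include <stdint.h>

enum utf8_policy { UTF8_REPLACE, UTF8_SKIP, UTF8_FAIL };

/*
 * Decode one sequence from the NUL-terminated string s. On success, store
 * the code point in *cp and return the sequence length; return 0 if the
 * bytes at s are not a valid shortest-form sequence.
 */
static size_t decode_one(const unsigned char *s, uint32_t *cp) {
  static const uint32_t min_cp[] = {0, 0, 0x80, 0x800, 0x10000};
  size_t len, i;
  uint32_t v;

  if (s[0] < 0x80) { *cp = s[0]; return 1; }
  else if ((s[0] & 0xE0) == 0xC0) { len = 2; v = s[0] & 0x1F; }
  else if ((s[0] & 0xF0) == 0xE0) { len = 3; v = s[0] & 0x0F; }
  else if ((s[0] & 0xF8) == 0xF0) { len = 4; v = s[0] & 0x07; }
  else return 0;                       /* stray continuation byte or F8-FF */

  for (i = 1; i < len; i++) {
    if ((s[i] & 0xC0) != 0x80) return 0;   /* also stops at the NUL terminator */
    v = (v << 6) | (s[i] & 0x3F);
  }
  if (v < min_cp[len] || v > 0x10FFFF) return 0;  /* non-shortest or out of range */
  *cp = v;
  return len;
}

/*
 * Decode s into out[] (room for outcap code points). Returns the number of
 * code points written, or (size_t)-1 if out[] is too small or if policy is
 * UTF8_FAIL and an invalid byte is found.
 */
size_t utf8_decode(const unsigned char *s, uint32_t *out, size_t outcap,
                   enum utf8_policy policy) {
  size_t n = 0;

  while (*s) {
    uint32_t cp;
    size_t len = decode_one(s, &cp);
    if (len == 0) {
      if (policy == UTF8_FAIL) return (size_t)-1;  /* behavior 5: report an error */
      s++;                                         /* resynchronize one byte at a time */
      if (policy == UTF8_SKIP) continue;           /* behavior 2: ignore the byte */
      cp = 0xFFFD;                                 /* behavior 1: replacement character */
    } else {
      s += len;
    }
    if (n >= outcap) return (size_t)-1;
    out[n++] = cp;
  }
  return n;
}
```

For the input 61 C0 AE 62, the three policies yield four code points (61, FFFD, FFFD, 62), two code points (61, 62), and an error, respectively, which shows how the same invalid bytes can lead to different text downstream depending on the decoder.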

...

The following function from \[[Viega 2003|AA. Bibliography#Viega 03]\] detects invalid character sequences in a string but does not reject non-minimal forms. It returns {{1}} if the string is composed only of legitimate sequences; otherwise, it returns {{0}}.

Code Block
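int spc_utf8_isvalid(const unsigned char *input) {
  int nb;  /* number of continuation bytes expected after the lead byte */
  int i;
  const unsigned char *c;

  for (c = input;  *c;  c += (nb + 1)) {
    if (!(*c & 0x80)) nb = 0;                  /* 0xxxxxxx: single byte */
    else if ((*c & 0xc0) == 0x80) return 0;    /* stray continuation byte */
    else if ((*c & 0xe0) == 0xc0) nb = 1;
    else if ((*c & 0xf0) == 0xe0) nb = 2;
    else if ((*c & 0xf8) == 0xf0) nb = 3;
    else if ((*c & 0xfc) == 0xf8) nb = 4;
    else if ((*c & 0xfe) == 0xfc) nb = 5;
    else return 0;                             /* 0xFE and 0xFF are never lead bytes */
    /* Each of the next nb bytes must have the form 10xxxxxx. A NUL byte
       fails this test, so the scan never runs past the terminator. */
    for (i = 1; i <= nb; i++)
      if ((*(c + i) & 0xc0) != 0x80) return 0;
  }
  return 1;
}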

...

|| Tool || Version || Checker || Description ||
| LDRA tool suite | {include:c:LDRA_V} | 176 S
376 S | Fully Implemented |

...

MITRE CWE: CWE-176, "Failure to Handle Unicode Encoding" and CWE-116, "Improper Encoding or Escaping of Output"

Bibliography

...

\[[ISO/IEC 10646:2003|AA. Bibliography#ISO/IEC 10646-2003]\]
\[[Kuhn 2006|AA. Bibliography#Kuhn 06]\]
\[[Pike 1993|AA. Bibliography#Pike 93]\]
\[[Unicode 2006|AA. Bibliography#Unicode 06]\]
\[[Viega 2003|AA. Bibliography#Viega 03]\] Section 3.12, "Detecting Illegal UTF-8 Characters"
\[[Wheeler 2003|AA. Bibliography#Wheeler 03]\]
\[[Yergeau 1998|AA. Bibliography#Yergeau 98]\]

...
