Page History

...

Wiki Markup

Although UTF-8 originated from the Plan 9 developers \[[Pike 93|AA. C References#Pike 93]\], Plan 9's own support only covers the low 16-bit range.  In general, many "Unicode" systems only support the low 16-bit range, not the full 31-bit ISO 10646 code space \[[ISO/IEC 10646:2003(E)|AA. C References#ISO/IEC 10646-2003]\].

Security-Related Issues

Wiki Markup
According to \[[Yergeau 98\|AA. C References#Yergeau 98]\]:

...

Below are more specific recommendations.

Accept Only the "Shortest" Form

Only the "shortest" form of UTF-8 should be permitted. Naive decoders might accept encodings that are longer than necessary, allowing for potentially dangerous input to have multiple representations. For example:

Process A performs security checks, but does not check for non-shortest UTF-8 forms.
Process B accepts the byte sequence from process A and transforms it into UTF-16 while interpreting possible non-shortest forms.
The UTF-16 text may contain characters that should have been filtered out by process A and can potentially be dangerous. These non-"shortest" UTF-8 attacks have been used to bypass security validations in high-profile products, such as Microsoft's IIS web server.

Handling Invalid Inputs

UTF-8 decoders have no uniformly defined behavior upon encountering an invalid input. Below are several ways a UTF-8 decoder might behave in the event of an invalid byte sequence:

...

Code Block

int spc_utf8_isvalid(const unsigned char *input) {
  int nb;
  const unsigned char *c = input;

  for (c = input;  *c;  c += (nb + 1)) {
    if (!(*c & 0x80)) nb = 0;
    else if ((*c & 0xc0) == 0x80) return 0;
    else if ((*c & 0xe0) == 0xc0) nb = 1;
    else if ((*c & 0xf0) == 0xe0) nb = 2;
    else if ((*c & 0xf8) == 0xf0) nb = 3;
    else if ((*c & 0xfc) == 0xf8) nb = 4;
    else if ((*c & 0xfe) == 0xfc) nb = 5;
    while (nb-- > 0)
      if ((*(c + nb) & 0xc0) != 0x80) return 0;
  }
  return 1;
}

Broken Surrogates

Encoding of individual or out of order surrogate halves should not be permitted. Broken surrogates are invalid in Unicode and introduce ambiguity when they appear in Unicode data. Broken surrogates are often signs of bad data transmission. They can also indicate internal bugs in an application or intentional efforts to find security vulnerabilities.

Risk Assessment

Failing to properly handle UTF8-encoded data can result in a data integrity violation or denial-of-service attack.

Recommendation	Severity	Likelihood	Remediation Cost	Priority	Level
MSC10-A C	medium	unlikely	high	P2	L3

Automated Detection

The LDRA tool suite V 7.6.0 is able to can detect violations of this recommendation.

Related Vulnerabilities

Search for vulnerabilities resulting from the violation of this rule on the CERT website.

References

Wiki Markup

\[[ISO/IEC 10646:2003|AA. C References#ISO/IEC 10646-2003]\]
\[[ISO/IEC PDTR 24772|AA. C References#ISO/IEC PDTR 24772]\] "AJN Choice of Filenames and other External Identifiers"
\[[Kuhn 06|AA. C References#Kuhn 06]\]
\[[Pike 93|AA. C References#Pike 93]\]
\[[Viega 03|AA. C References#Viega 03]\] Section 3.12, "Detecting Illegal UTF-8 Characters"
\[[Wheeler 03|AA. C References#Wheeler 03]\]
\[[Yergeau 98|AA. C References#Yergeau 98]\]

...

MSC09-C. Character Encoding - Use Subset of ASCII for Safety 13. Miscellaneous (MSC) MSC11-A. Incorporate diagnostic tests using assertions Image Added

Space shortcuts

Page tree

Versions Compared

Old Version 39

New Version 40

Key