Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

Although UTF-8 originated from the Plan 9 developers [Pike 1993], Plan 9's own support covers only the low 16-bit range. In general, many "Unicode" systems support only the low 16-bit range, not the full 21-bit ISO 10646 code space [ISO/IEC 10646:2012].

Security-Related Issues

According to [Yergeau 1998],

...

Following are more specific recommendations.

Accept Only the Shortest Form

Only the "shortest" form of UTF-8 should be permitted. Naive decoders might accept encodings that are longer than necessary, allowing for potentially dangerous input to have multiple representations. For example,

...

Corrigendum #1: UTF-8 Shortest Form to the Unicode Standard [Unicode 2006] describes modifications made to version 3.0 of the Unicode Standard to forbid the interpretation of the nonshortest forms.

Handling Invalid Inputs

UTF-8 decoders have no uniformly defined behavior upon encountering an invalid input. Following are several ways a UTF-8 decoder might behave in the event of an invalid byte sequence. Note that implementing these behaviors requires careful security considerations. 

...

Code Block
int spc_utf8_isvalid(const unsigned char *input) {
  int nb;
  const unsigned char *c = input;

  for (c = input;  *c;  c += (nb + 1)) {
    if (!(*c & 0x80)) nb = 0;
    else if ((*c & 0xc0) == 0x80) return 0;
    else if ((*c & 0xe0) == 0xc0) nb = 1;
    else if ((*c & 0xf0) == 0xe0) nb = 2;
    else if ((*c & 0xf8) == 0xf0) nb = 3;
    else if ((*c & 0xfc) == 0xf8) nb = 4;
    else if ((*c & 0xfe) == 0xfc) nb = 5;
    while (nb-- > 0)
      if ((*(c + nb) & 0xc0) != 0x80) return 0;
  }
  return 1;
}

Broken Surrogates

Encoding of individual or out-of-order surrogate halves should not be permitted. Broken surrogates are invalid in Unicode and introduce ambiguity when they appear in Unicode data. Broken surrogates are often signs of bad data transmission. They can also indicate internal bugs in an application or intentional efforts to find security vulnerabilities.

Risk Assessment

Failing to properly handle UTF8-encoded data can result in a data integrity violation or denial-of-service attack.

Recommendation

Severity

Likelihood

Remediation Cost

Priority

Level

MSC10-C

medium

unlikely

high

P2

L3

Automated Detection

Tool

Version

Checker

Description

LDRA tool suite

Include Page
LDRA_V
LDRA_V

176 S
376 S

Fully implemented.

Related Vulnerabilities

Search for vulnerabilities resulting from the violation of this rule on the CERT website.

Related Guidelines

CERT C++ Secure Coding StandardMSC10-CPP. Character Encoding - UTF8 Related Issues
MITRE CWECWE-176, Failure to handle Unicode encoding
CWE-116, Improper encoding or escaping of output

Bibliography

...