Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.
Comment: REM Cost Reform

UTF-8 is a variable-width encoding for Unicode. UTF-8 uses one 1 to four 4 bytes per character, depending on the Unicode symbol. UTF-8 has the following properties.:

  • The classical US-ASCII characters (0 to 0x7f) encode as themselves, so files and strings that are encoded with ASCII values have the same encoding under both ASCII and UTF-8.
  • All UCS characters beyond (0x7f) are encoded as a multibyte sequence consisting only of bytes in the range of 0x80 to 0xfd. This means that no ASCII byte can appear as part of another character.
  • It's It is easy to convert between UTF-8 and UCS-2 and UCS-4 fixed-width representations of characters.
  • The lexicographic sorting order of UCS-4 strings is preserved.
  • All possible 2^31 2^21 UCS codes can be encoded using UTF-8.

Generally, all programs should perform checks for any validate UTF-8 data for UTF-8 legality before performing other checks. The table below listed all Legal following table lists the well-formed UTF-8 Sequences.

UCS Code (HEX)

Binary UTF-8 Format

Legal UTF-8 Values (HEX)

00-7F

0xxxxxxx

00-7F

80-7FF

110xxxxx 10xxxxxx

C2-DF 80-BF

800-FFF

1110xxxx 10xxxxxx 10xxxxxx

E0 A0*-BF 80-BF

1000-FFFF

1110xxxx 10xxxxxx 10xxxxxx

E1-EF 80-BF 80-BF

10000-3FFFF

11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

F0 90*-BF 80-BF 80-BF

40000-FFFFFF

11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

F1-F3 80-BF 80-BF 80-BF

40000-FFFFFF

11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

F1-F3 80-BF 80-BF 80-BF

100000-10FFFFF

11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

F4 80-8F* 80-BF 80-BF

Security Related Issues

byte sequences.

Bits of code pointFirst code pointLast code pointBytes in sequenceByte 1Byte 2Byte 3Byte 4
  7U+0000U+007F10xxxxxxx
11U+0080U+07FF2110xxxxx10xxxxxx
16U+0800U+FFFF31110xxxx10xxxxxx10xxxxxx
21U+10000U+1FFFFF411110xxx10xxxxxx10xxxxxx10xxxxxx

Although UTF-8 originated from the Plan 9 developers [Pike 1993], Plan 9's own support covers only the low 16-bit range. In general, many "Unicode" systems support only the low 16-bit range, not the full 21-bit ISO 10646 code space [ISO/IEC 10646:2012].

Security-Related Issues

According to RFC 2279: UTF-8, a transformation format of ISO 10646 [Yergeau 1998], Wiki MarkupAccording to \[[Yergeau 98|AA. C References#Yergeau 98]\]:

Implementors of UTF-8 need to consider the security aspects of how they handle illegal invalid UTF-8 sequences. It is conceivable that, in some circumstances, an attacker would be able to exploit an incautious UTF-8 parser by sending it an octet sequence that is not permitted by the UTF-8 syntax.

A particularly subtle form of this attack could can be carried out against a parser which that performs security-critical validity checks against the UTF-8 encoded form of its input, but interprets certain illegal invalid octet sequences as characters. For example, a parser might prohibit the NUL null character when encoded as the single-octet sequence 00, but allow the illegal invalid two-octet sequence C0 80 and interpret it as a NUL null character. Another example might be a parser which prohibits the octet sequence 2F 2E 2E 2F ("/../"), yet permits the illegal invalid octet sequence 2F C0 AE 2E 2F.

Below Following are more specific recommendations.

Accept Only

...

the Shortest Form

Only the "shortest" form of UTF-8 should be permitted. Naive decoders might accept encoding encodings that are longer than necessary, allowing for potentially dangerous input to have multiple representations. For example:,

  1. Process A performs security checks , but does not check for non-shortest nonshortest UTF-8 forms.
  2. Process B accepts the byte sequence from process A , and transform transforms it into UTF-16 while interpreting possible non-shortest nonshortest forms.
  3. The UTF-16 text may then contain characters that should have been filtered out by process A , and could can potentially be dangerous. These

...

  1. "

...

  1. nonshortest" UTF-8 attacks have been used to bypass security validations in high-profile products, such as Microsoft's IIS

...

  1. Web server.

Corrigendum #1: UTF-8 Shortest Form to the Unicode Standard [Unicode 2006] describes modifications made to version 3.0 of the Unicode Standard to forbid the interpretation of the nonshortest forms.

Handling Invalid Inputs

UTF-8 decoders have no uniformly defined behavior upon encountering an invalid input. Below Following are several ways a UTF-8 decoder might behave in the event of an invalid byte sequence:. Note that implementing these behaviors requires careful security considerations. 

  1. Substitute for the replacement character "U+FFFD" or the wildcard character such as "?" when U+FFFD is not available.
  2. Ignore the bytes (for example, delete the invalid byte before the validation process; see "Unicode Technical Report #36, 3.5 Deletion of Code Points" for more information).
  3. Insert a replacement character (e.g. '?', the "wild-card" character)
  4. Ignore the bytes.
  5. Interpret the bytes according to a different character encoding (often the ISO-8859-1 character map; other encoding, such as Shift_JIS, is known to trigger self-XSS, and so is potentially dangerous).
  6. Not notice and Fail to notice but decode as if the bytes were some similar bit of UTF-8.
  7. Stop decoding and report an error

...

  1. .

The following function from \[[Viega 03|AA. C References#Viega 03]\] will detect illegal character sequences in a string. It returns {{1}} if the string is comprised only of legitimate sequences, else it returns {{0}}:from John Viega's "Protecting Sensitive Data in Memory" [Viega 2003] detects invalid character sequences in a string but does not reject nonminimal forms. It returns 1 if the string is composed only of legitimate sequences; otherwise, it returns 0.

Code Block
Code Block

int spc_utf8_isvalid(const unsigned char *input) {
  int nb;
  const unsigned char *c = input;

  for (c = input;  *c;  c += (nb + 1)) {
    if (!(*c & 0x80)) nb = 0;
    else if ((*c & 0xc0) =  = 0x80) return 0;
    else if ((*c & 0xe0) =  = 0xc0) nb = 1;
    else if ((*c & 0xf0) =  = 0xe0) nb = 2;
    else if ((*c & 0xf8) =  = 0xf0) nb = 3;
    else if ((*c & 0xfc) =  = 0xf8) nb = 4;
    else if ((*c & 0xfe) =  = 0xfc) nb = 5;
    while (nb-- > 0)
      if ((*(c + nb) & 0xc0) != 0x80) return 0;
  }
  return 1;
}

Broken Surrogates

Encoding of individual or out-of-order surrogate halves should not be permitted. Broken surrogates are illegal invalid in Unicode , and introduce ambiguity when they appear in Unicode data. Broken surrogates are often signs of bad data transmission. They could can also indicate internal bugs in an application , or intentional efforts to find security vulnerabilities.

Risk Assessment

Failing to properly handle UTF8-encoded data can result in a data integrity violation or denial-of-service attack.

Recommendation

Severity

Likelihood

Detectable

Repairable

Priority

Level

MSC10-C

Medium

Unlikely

No

No

P2

L3

Automated Detection

Tool

Version

Checker

Description

LDRA tool suite
Include Page
LDRA_V
LDRA_V

176 S, 376 S

Partially implemented

Related Vulnerabilities

Search for vulnerabilities resulting from the violation of this rule on the CERT website.

Reference

Related Guidelines

SEI CERT C++ Coding StandardVOID MSC10-CPP. Character encoding: UTF8-related issues
MITRE CWECWE-176, Failure to handle Unicode encoding
CWE-116, Improper encoding or escaping of output

Bibliography


...

Image Added Image Added Image Added Wiki Markup\[[Kuhn 06|AA. C References#Kuhn 06]\] UTF-8 and Unicode FAQ for Unix/Linux \[[Viega 03|AA. C References#Viega 03]\] Section 3.12. "Detecting Illegal UTF-8 Characters" \[[Wheeler 06|AA. C References#Wheeler 06]\] Secure Programming for Linux and Unix HOWTO \[[Yergeau 98|AA. C References#Yergeau 98]\] RFC 2279 - UTF-8, a transformation format of ISO 10646