UTF-8 is a variable-width encoding for Unicode. UTF-8 uses 1 to 4 bytes per character, depending on the Unicode symbol. UTF-8 has the following properties:

  • The classical US-ASCII characters (0 to 0x7f) encode as themselves, so files and strings that are encoded with ASCII values have the same encoding under both ASCII and UTF-8.
  • All UCS characters beyond 0x7f are encoded as a multibyte sequence consisting only of bytes in the range of 0x80 to 0xfd. This means that no ASCII byte (including a null byte) can appear as part of another character. This property supports the use of string handling functions.
  • It is easy to convert between UTF-8 and the UCS-2 and UCS-4 fixed-width representations of characters.
  • The lexicographic sorting order of UCS-4 strings is preserved.
  • All possible 2^21 UCS codes can be encoded using UTF-8.
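As a small illustration of the second property, the sketch below relies on the fact that continuation bytes always have the form 10xxxxxx and that no ASCII byte occurs inside a multibyte sequence. The function names utf8_count_code_points and utf8_find_slash are hypothetical, and both assume the input has already been validated as UTF-8.

```c
#include <stddef.h>
#include <string.h>

/* Counts code points in a null-terminated string by counting every
 * byte that is not a continuation byte (10xxxxxx).
 * Assumption: s has already been validated as UTF-8. */
size_t utf8_count_code_points(const char *s) {
  size_t n = 0;
  for (; *s; s++)
    if (((unsigned char)*s & 0xC0) != 0x80) n++;
  return n;
}

/* A byte-oriented search for an ASCII character is safe on valid
 * UTF-8: the byte 0x2F can only ever be the character '/'. */
const char *utf8_find_slash(const char *s) {
  return strchr(s, '/');
}
```

Searching for a non-ASCII character this way would require matching its whole byte sequence, not a single byte.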

Generally, programs should validate UTF-8 data before performing other checks. The following table lists the well-formed UTF-8 byte sequences.

UCS Code (HEX) | Binary UTF-8 Format | Valid UTF-8 Values (HEX)
---------------|---------------------|-------------------------
00-7F | 0xxxxxxx | 00-7F
80-7FF | 110xxxxx 10xxxxxx | C2-DF 80-BF
800-FFF | 1110xxxx 10xxxxxx 10xxxxxx | E0 A0*-BF 80-BF
1000-FFFF | 1110xxxx 10xxxxxx 10xxxxxx | E1-EF 80-BF 80-BF
10000-3FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx | F0 90*-BF 80-BF 80-BF
40000-FFFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx | F1-F3 80-BF 80-BF 80-BF
100000-10FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx | F4 80-8F* 80-BF 80-BF

* An asterisk marks a second byte whose range is narrower than the usual 80-BF: the lower bounds A0 and 90 exclude nonshortest forms, and the upper bound 8F excludes code points above 10FFFF.

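The bit layout for each sequence length is as follows:

Bits of code point | First code point | Last code point | Bytes in sequence | Byte 1 | Byte 2 | Byte 3 | Byte 4
-------------------|------------------|-----------------|-------------------|--------|--------|--------|-------
7 | U+0000 | U+007F | 1 | 0xxxxxxx | | |
11 | U+0080 | U+07FF | 2 | 110xxxxx | 10xxxxxx | |
16 | U+0800 | U+FFFF | 3 | 1110xxxx | 10xxxxxx | 10xxxxxx |
21 | U+10000 | U+1FFFFF | 4 | 11110xxx | 10xxxxxx | 10xxxxxx | 10xxxxxx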
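The bit layout can be sketched as an encoder. This is a minimal illustration, not a reference implementation; the name utf8_encode is hypothetical. It assumes the caller wants only Unicode scalar values, so it refuses surrogate halves and anything above U+10FFFF even though the four-byte bit pattern could carry values up to U+1FFFFF.

```c
#include <stddef.h>
#include <stdint.h>

/* Encodes code point cp into out, which must hold at least 4 bytes.
 * Returns the number of bytes written, or 0 if cp is a surrogate
 * half or lies above U+10FFFF. */
size_t utf8_encode(uint32_t cp, unsigned char out[4]) {
  if (cp <= 0x7F) {                        /* 0xxxxxxx */
    out[0] = (unsigned char)cp;
    return 1;
  }
  if (cp <= 0x7FF) {                       /* 110xxxxx 10xxxxxx */
    out[0] = (unsigned char)(0xC0 | (cp >> 6));
    out[1] = (unsigned char)(0x80 | (cp & 0x3F));
    return 2;
  }
  if (cp <= 0xFFFF) {                      /* 1110xxxx 10xxxxxx 10xxxxxx */
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0;   /* surrogate half */
    out[0] = (unsigned char)(0xE0 | (cp >> 12));
    out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[2] = (unsigned char)(0x80 | (cp & 0x3F));
    return 3;
  }
  if (cp <= 0x10FFFF) {                    /* 11110xxx + 3 continuation */
    out[0] = (unsigned char)(0xF0 | (cp >> 18));
    out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (unsigned char)(0x80 | (cp & 0x3F));
    return 4;
  }
  return 0;
}
```

Because each range is tested in ascending order, the encoder always emits the shortest form for a given code point.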

Although UTF-8 originated from the Plan 9 developers [Pike 1993], Plan 9's own support covers only the low 16-bit range. In general, many "Unicode" systems support only the low 16-bit range, not the full 21-bit ISO 10646 code space [ISO/IEC 10646:2012].

Security-Related Issues

According to RFC 2279: UTF-8, a transformation format of ISO 10646 [Yergeau 1998],

Implementors of UTF-8 need to consider the security aspects of how they handle invalid UTF-8 sequences. It is conceivable that, in some circumstances, an attacker would be able to exploit an incautious UTF-8 parser by sending it an octet sequence that is not permitted by the UTF-8 syntax.

A particularly subtle form of this attack can be carried out against a parser that performs security-critical validity checks against the UTF-8 encoded form of its input, but interprets certain invalid octet sequences as characters. For example, a parser might prohibit the null character when encoded as the single-octet sequence 00, but allow the invalid two-octet sequence C0 80 and interpret it as a null character. Another example might be a parser which prohibits the octet sequence 2F 2E 2E 2F ("/../"), yet permits the invalid octet sequence 2F C0 AE 2E 2F.
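The two examples in the quoted passage can be reproduced with a deliberately naive decoder. The sketch below is illustrative only; naive_decode2 and is_overlong2 are hypothetical names. It shows that masking alone maps C0 AE to 0x2E (".") and C0 80 to the null character, and that the lead byte by itself is enough to recognize an overlong two-byte sequence.

```c
#include <stdint.h>

/* Deliberately naive: decodes a two-byte sequence by masking alone,
 * without checking that the value actually needed two bytes. */
uint32_t naive_decode2(unsigned char b0, unsigned char b1) {
  return ((uint32_t)(b0 & 0x1F) << 6) | (uint32_t)(b1 & 0x3F);
}

/* A two-byte sequence must encode U+0080 or above, so its lead byte
 * is at least 0xC2. Lead bytes 0xC0 and 0xC1 can only produce values
 * below 0x80 and are therefore never valid. */
int is_overlong2(unsigned char b0) {
  return b0 == 0xC0 || b0 == 0xC1;
}
```

A filter that looks for the byte 0x2E before such a decoder runs never sees the "." that the decoder later produces.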

Following are more specific recommendations.

Accept Only the Shortest Form

Only the "shortest" form of UTF-8 should be permitted. Naive decoders might accept encodings that are longer than necessary, allowing for potentially dangerous input to have multiple representations. For example,

  1. Process A performs security checks but does not check for nonshortest UTF-8 forms.
  2. Process B accepts the byte sequence from process A and transforms it into UTF-16 while interpreting possible nonshortest forms.
  3. The UTF-16 text may contain characters that should have been filtered out by process A and can potentially be dangerous. These "nonshortest" UTF-8 attacks have been used to bypass security validations in high-profile products, such as Microsoft's IIS web server.

Corrigendum #1: UTF-8 Shortest Form to the Unicode Standard [Unicode 2006] describes modifications made to version 3.0 of the Unicode Standard to forbid the interpretation of the nonshortest forms.
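One way to enforce the shortest form is to check each sequence against the permitted byte ranges instead of decoding first and checking afterward. The following is a sketch under stated assumptions, not a definitive implementation: the name utf8_is_well_formed is hypothetical, the input is a length-delimited buffer, and the ED lead byte is additionally restricted to second bytes 80-9F so that encoded surrogate halves are rejected, as the Unicode Standard's definition of well-formed UTF-8 requires.

```c
#include <stddef.h>

/* Returns 1 if s[0..len) is well-formed UTF-8, 0 otherwise.
 * Rejects nonshortest forms, surrogates, and values above U+10FFFF. */
int utf8_is_well_formed(const unsigned char *s, size_t len) {
  size_t i = 0;
  while (i < len) {
    unsigned char b = s[i];
    unsigned char lo = 0x80, hi = 0xBF;  /* allowed range for byte 2 */
    size_t n;                            /* continuation bytes expected */

    if (b <= 0x7F) { i++; continue; }
    else if (b >= 0xC2 && b <= 0xDF) n = 1;
    else if (b == 0xE0) { n = 2; lo = 0xA0; }   /* excludes nonshortest */
    else if (b == 0xED) { n = 2; hi = 0x9F; }   /* excludes surrogates */
    else if (b >= 0xE1 && b <= 0xEF) n = 2;
    else if (b == 0xF0) { n = 3; lo = 0x90; }   /* excludes nonshortest */
    else if (b >= 0xF1 && b <= 0xF3) n = 3;
    else if (b == 0xF4) { n = 3; hi = 0x8F; }   /* excludes > U+10FFFF */
    else return 0;                              /* 80-C1 and F5-FF */

    if (len - i <= n) return 0;                 /* truncated sequence */
    if (s[i + 1] < lo || s[i + 1] > hi) return 0;
    for (size_t k = 2; k <= n; k++)
      if (s[i + k] < 0x80 || s[i + k] > 0xBF) return 0;
    i += n + 1;
  }
  return 1;
}
```

With this check in process A, the sequence 2F C0 AE 2E 2F from the RFC example is rejected before process B ever interprets it.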

Handling Invalid Inputs

UTF-8 decoders have no uniformly defined behavior upon encountering an invalid input. Implementing any of the behaviors below requires careful security consideration. Following are several ways a UTF-8 decoder might behave in the event of an invalid byte sequence:

  1. Substitute the replacement character U+FFFD, or a wildcard character such as "?" when U+FFFD is not available.
  2. Ignore the bytes (for example, delete the invalid byte before the validation process; see "Unicode Technical Report #36, 3.5 Deletion of Code Points" for more information).
  3. Interpret the bytes according to a different character encoding (often the ISO-8859-1 character map; other encodings, such as Shift_JIS, are known to trigger self-XSS and so are potentially dangerous).
  4. Fail to notice the error and decode as if the bytes were some similar bit of UTF-8.
  5. Stop decoding and report an error.
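Behaviors 1 and 5 from the list above can be sketched as follows. The names utf8_seq_len, utf8_sanitize, and utf8_first_error are hypothetical, and substituting one U+FFFD per offending byte is only one of several possible replacement policies. Any security-critical check should run on the sanitized output, not on the original bytes.

```c
#include <stddef.h>

/* Length of the well-formed sequence starting at s (avail >= 1 bytes
 * readable), or 0 if s does not begin a well-formed sequence. */
static size_t utf8_seq_len(const unsigned char *s, size_t avail) {
  unsigned char b = s[0], lo = 0x80, hi = 0xBF;
  size_t n;
  if (b <= 0x7F) return 1;
  else if (b >= 0xC2 && b <= 0xDF) n = 1;
  else if (b >= 0xE0 && b <= 0xEF) {
    n = 2;
    if (b == 0xE0) lo = 0xA0;            /* excludes nonshortest forms */
    if (b == 0xED) hi = 0x9F;            /* excludes surrogates */
  } else if (b >= 0xF0 && b <= 0xF4) {
    n = 3;
    if (b == 0xF0) lo = 0x90;            /* excludes nonshortest forms */
    if (b == 0xF4) hi = 0x8F;            /* excludes > U+10FFFF */
  } else return 0;
  if (avail <= n || s[1] < lo || s[1] > hi) return 0;
  for (size_t k = 2; k <= n; k++)
    if ((s[k] & 0xC0) != 0x80) return 0;
  return n + 1;
}

/* Behavior 1: copy in to out, writing U+FFFD (EF BF BD) in place of
 * each byte that does not begin a well-formed sequence.
 * out must hold at least 3 * len bytes. Returns bytes written. */
size_t utf8_sanitize(const unsigned char *in, size_t len,
                     unsigned char *out) {
  size_t i = 0, o = 0;
  while (i < len) {
    size_t n = utf8_seq_len(in + i, len - i);
    if (n == 0) {
      out[o++] = 0xEF; out[o++] = 0xBF; out[o++] = 0xBD;
      i++;
    } else {
      while (n--) out[o++] = in[i++];
    }
  }
  return o;
}

/* Behavior 5: stop at the first problem. Returns the offset of the
 * first invalid byte, or len if the whole buffer is well formed. */
size_t utf8_first_error(const unsigned char *in, size_t len) {
  size_t i = 0;
  while (i < len) {
    size_t n = utf8_seq_len(in + i, len - i);
    if (n == 0) break;
    i += n;
  }
  return i;
}
```

For the input 2F C0 AE 2E, utf8_sanitize writes 2F EF BF BD EF BF BD 2E, and utf8_first_error returns 1.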

The following function, adapted from John Viega's "Protecting Sensitive Data in Memory" [Viega 2003], detects invalid character sequences in a null-terminated string but does not reject nonminimal forms. It returns 1 if the string is composed only of legitimate sequences; otherwise, it returns 0.

int spc_utf8_isvalid(const unsigned char *input) {
  int nb;  /* number of continuation bytes expected after the lead byte */
  int i;
  const unsigned char *c;

  for (c = input;  *c;  c += (nb + 1)) {
    if (!(*c & 0x80)) nb = 0;                /* single-byte (ASCII) */
    else if ((*c & 0xc0) == 0x80) return 0;  /* stray continuation byte */
    else if ((*c & 0xe0) == 0xc0) nb = 1;
    else if ((*c & 0xf0) == 0xe0) nb = 2;
    else if ((*c & 0xf8) == 0xf0) nb = 3;
    else return 0;                           /* 0xf8-0xff: not a lead byte */
    /* Check continuation bytes in order; a null terminator fails the
       test, so the scan never reads past the end of the string. */
    for (i = 1; i <= nb; i++)
      if ((c[i] & 0xc0) != 0x80) return 0;
  }
  return 1;
}

Broken Surrogates

Encoding of individual or out-of-order surrogate halves should not be permitted. Broken surrogates are invalid in Unicode and introduce ambiguity when they appear in Unicode data. Broken surrogates are often signs of bad data transmission. They can also indicate internal bugs in an application or intentional efforts to find security vulnerabilities.
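A check for broken surrogates in UTF-16 data might look like the following sketch. The function name utf16_surrogates_ok is hypothetical, and the input is assumed to be an array of 16-bit code units in native byte order.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if every surrogate in s[0..len) belongs to a
 * high-then-low pair, 0 if any half is isolated or out of order. */
int utf16_surrogates_ok(const uint16_t *s, size_t len) {
  for (size_t i = 0; i < len; i++) {
    if (s[i] >= 0xD800 && s[i] <= 0xDBFF) {        /* high surrogate */
      if (i + 1 >= len || s[i + 1] < 0xDC00 || s[i + 1] > 0xDFFF)
        return 0;                                  /* no low half follows */
      i++;                                         /* skip the low half */
    } else if (s[i] >= 0xDC00 && s[i] <= 0xDFFF) {
      return 0;                                    /* low half came first */
    }
  }
  return 1;
}
```

In UTF-8 input, the corresponding check is to reject any three-byte sequence whose lead byte is ED and whose second byte is in the range A0-BF.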

Risk Assessment

Failing to properly handle UTF-8-encoded data can result in a data integrity violation or denial-of-service attack.

Recommendation | Severity | Likelihood | Detectable | Repairable | Priority | Level
---------------|----------|------------|------------|------------|----------|------
MSC10-C | Medium | Unlikely | No | No | P2 | L3

Automated Detection

Tool | Version | Checker | Description
-----|---------|---------|------------
LDRA tool suite | | 176 S, 376 S | Partially implemented

Related Vulnerabilities

Search for vulnerabilities resulting from the violation of this rule on the CERT website.

Related Guidelines

...

ISO/IEC TR 24772 "AJN Choice of Filenames and other External Identifiers"

...

Failure to handle Unicode encoding
CWE-116, Improper encoding or escaping of output

Bibliography

...


...
