
...

Generally, programs should validate UTF-8 data before performing other checks. The following table lists the well-formed UTF-8 byte sequences.

Code Points	1st Byte	2nd Byte	3rd Byte	4th Byte
U+0000..U+007F	00..7F
U+0080..U+07FF	C2..DF	80..BF
U+0800..U+0FFF	E0	A0..BF	80..BF
U+1000..U+CFFF	E1..EC	80..BF	80..BF
U+D000..U+D7FF	ED	80..9F	80..BF
U+E000..U+FFFF	EE..EF	80..BF	80..BF
U+10000..U+3FFFF	F0	90..BF	80..BF	80..BF
U+40000..U+FFFFF	F1..F3	80..BF	80..BF	80..BF
U+100000..U+10FFFF	F4	80..8F	80..BF	80..BF
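
As an illustration, the byte ranges in the table can be checked one sequence at a time. The following is a minimal sketch under stated assumptions, not a vetted implementation: the function name `utf8_is_well_formed` and its explicit-length interface are hypothetical and do not come from the original text.

```c
#include <stddef.h>

/*
 * Hypothetical helper: returns 1 if buf[0..len) consists only of
 * well-formed UTF-8 byte sequences as listed in the table, 0 otherwise.
 * An explicit length is used so that an embedded 00 byte is treated as
 * the ordinary one-byte encoding of U+0000.
 */
int utf8_is_well_formed(const unsigned char *buf, size_t len) {
  size_t i = 0;

  while (i < len) {
    unsigned char b = buf[i];
    unsigned char lo = 0x80, hi = 0xBF;  /* allowed range for the 2nd byte */
    size_t n;                            /* number of trailing bytes */
    size_t k;

    if (b <= 0x7F) { i++; continue; }                  /* 00..7F */
    else if (b >= 0xC2 && b <= 0xDF) n = 1;
    else if (b == 0xE0) { n = 2; lo = 0xA0; }          /* excludes overlong forms */
    else if (b >= 0xE1 && b <= 0xEC) n = 2;
    else if (b == 0xED) { n = 2; hi = 0x9F; }          /* excludes surrogates */
    else if (b == 0xEE || b == 0xEF) n = 2;
    else if (b == 0xF0) { n = 3; lo = 0x90; }          /* excludes overlong forms */
    else if (b >= 0xF1 && b <= 0xF3) n = 3;
    else if (b == 0xF4) { n = 3; hi = 0x8F; }          /* stops at U+10FFFF */
    else return 0;                       /* 80..C1 and F5..FF never lead */

    if (len - i <= n) return 0;          /* sequence is truncated */
    if (buf[i + 1] < lo || buf[i + 1] > hi) return 0;
    for (k = 2; k <= n; k++)
      if (buf[i + k] < 0x80 || buf[i + k] > 0xBF) return 0;
    i += n + 1;
  }
  return 1;
}
```

Because the second-byte range is narrowed for the lead bytes E0, ED, F0, and F4, this check rejects overlong forms, encoded surrogates, and values above U+10FFFF without decoding any code point.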

Although UTF-8 originated from the Plan 9 developers [Pike 1993], Plan 9's own support covers only the low 16-bit range. In general, many "Unicode" systems support only the low 16-bit range, not the full 21-bit ISO 10646 code space [ISO/IEC 10646:2012].

Security-Related Issues

According to RFC 2279: UTF-8, a transformation format of ISO 10646 [Yergeau 1998],

Implementors of UTF-8 need to consider the security aspects of how they handle invalid UTF-8 sequences. It is conceivable that, in some circumstances, an attacker would be able to exploit an incautious UTF-8 parser by sending it an octet sequence that is not permitted by the UTF-8 syntax.
A particularly subtle form of this attack can be carried out against a parser that performs security-critical validity checks against the UTF-8 encoded form of its input, but interprets certain invalid octet sequences as characters. For example, a parser might prohibit the null character when encoded as the single-octet sequence 00, but allow the invalid two-octet sequence C0 80 and interpret it as a null character. Another example might be a parser which prohibits the octet sequence 2F 2E 2E 2F ("/../"), yet permits the invalid octet sequence 2F C0 AE 2E 2F.

Following are more specific recommendations.

Accept Only the Shortest Form

Only the "shortest" form of UTF-8 should be permitted. Naive decoders might accept encodings that are longer than necessary, allowing for potentially dangerous input to have multiple representations. For example,

...

Corrigendum #1: UTF-8 Shortest Form to the Unicode Standard [Unicode 2006] describes modifications made to version 3.0 of the Unicode Standard to forbid the interpretation of nonshortest forms.
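
One way to enforce the shortest-form rule in a decoder is to decode the sequence and then reject the result if the code point is smaller than the minimum value that genuinely requires that many bytes. The following is a sketch only; the name `utf8_decode_one`, its return convention (code point, or -1 on rejection), and the optional `used` output parameter are assumptions made for this example.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical decoder: decodes one code point from s[0..len).
 * Returns the code point, or -1 if the sequence is malformed, is not in
 * shortest form, encodes a surrogate, or exceeds U+10FFFF. If used is
 * not NULL, it receives the number of bytes consumed on success.
 */
int32_t utf8_decode_one(const unsigned char *s, size_t len, size_t *used) {
  /* Smallest code point that legitimately needs a sequence of n bytes. */
  static const int32_t min_cp[5] = {0, 0, 0x80, 0x800, 0x10000};
  size_t n, i;
  int32_t cp;

  if (len == 0) return -1;
  if (s[0] < 0x80)                { n = 1; cp = s[0]; }
  else if ((s[0] & 0xE0) == 0xC0) { n = 2; cp = s[0] & 0x1F; }
  else if ((s[0] & 0xF0) == 0xE0) { n = 3; cp = s[0] & 0x0F; }
  else if ((s[0] & 0xF8) == 0xF0) { n = 4; cp = s[0] & 0x07; }
  else return -1;                 /* stray continuation byte or F8..FF */

  if (len < n) return -1;         /* sequence is truncated */
  for (i = 1; i < n; i++) {
    if ((s[i] & 0xC0) != 0x80) return -1;
    cp = (cp << 6) | (s[i] & 0x3F);
  }

  if (cp < min_cp[n]) return -1;               /* not the shortest form */
  if (cp > 0x10FFFF) return -1;                /* outside the code space */
  if (cp >= 0xD800 && cp <= 0xDFFF) return -1; /* surrogate half */

  if (used) *used = n;
  return cp;
}
```

With this check, the two-octet sequence C0 80 from the RFC 2279 excerpt is rejected rather than decoded as a null character, and C0 AE is rejected rather than decoded as ".".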

Handling Invalid Inputs

UTF-8 decoders have no uniformly defined behavior upon encountering an invalid input. Following are several ways a UTF-8 decoder might behave in the event of an invalid byte sequence. Note that implementing these behaviors requires careful security considerations.

Substitute the replacement character U+FFFD, or a wildcard character such as "?" when U+FFFD is not available.
Ignore the bytes (for example, delete the invalid byte before the validation process; see "Unicode Technical Report #36, 3.5 Deletion of Code Points" for more information).
Interpret the bytes according to a different character encoding (often the ISO-8859-1 character map; other encodings, such as Shift_JIS, are known to trigger self-XSS and so are potentially dangerous).
Fail to notice but decode as if the bytes were some similar bit of UTF-8.
Stop decoding and report an error.
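
To make the first behavior in the list concrete, the sketch below copies a buffer and substitutes U+FFFD (encoded as EF BF BD) for every byte that does not begin a well-formed sequence. It is an illustration under stated assumptions rather than a recommended implementation: the names `seq_len` and `utf8_replace_invalid` are hypothetical, and replacing one byte at a time is only one of several possible substitution granularities.

```c
#include <stddef.h>

/* Hypothetical helper: length of the well-formed sequence that starts at
 * s[0] (with len >= 1 bytes available), or 0 if there is none. */
static size_t seq_len(const unsigned char *s, size_t len) {
  unsigned char lo = 0x80, hi = 0xBF;  /* allowed range for the 2nd byte */
  size_t n, k;

  if (s[0] <= 0x7F) return 1;
  if (s[0] >= 0xC2 && s[0] <= 0xDF) {
    n = 2;
  } else if ((s[0] & 0xF0) == 0xE0) {
    n = 3;
    if (s[0] == 0xE0) lo = 0xA0;       /* no overlong forms */
    if (s[0] == 0xED) hi = 0x9F;       /* no surrogates */
  } else if (s[0] >= 0xF0 && s[0] <= 0xF4) {
    n = 4;
    if (s[0] == 0xF0) lo = 0x90;       /* no overlong forms */
    if (s[0] == 0xF4) hi = 0x8F;       /* nothing above U+10FFFF */
  } else {
    return 0;
  }
  if (len < n || s[1] < lo || s[1] > hi) return 0;
  for (k = 2; k < n; k++)
    if ((s[k] & 0xC0) != 0x80) return 0;
  return n;
}

/* Copies in[0..in_len) to out, writing EF BF BD in place of each byte that
 * does not start a well-formed sequence. Returns the number of bytes
 * written, or (size_t)-1 if out_cap is too small. */
size_t utf8_replace_invalid(const unsigned char *in, size_t in_len,
                            unsigned char *out, size_t out_cap) {
  size_t i = 0, o = 0, k;

  while (i < in_len) {
    size_t n = seq_len(in + i, in_len - i);
    if (n == 0) {
      if (out_cap - o < 3) return (size_t)-1;
      out[o++] = 0xEF; out[o++] = 0xBF; out[o++] = 0xBD;
      i++;                             /* skip only the offending byte */
    } else {
      if (out_cap - o < n) return (size_t)-1;
      for (k = 0; k < n; k++) out[o++] = in[i + k];
      i += n;
    }
  }
  return o;
}
```

A caller that prefers the last behavior in the list (stop decoding and report an error) could instead return a failure as soon as `seq_len` yields 0.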

...


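/*
 * Returns 1 if input is structurally valid: every lead byte is followed by
 * the number of continuation bytes it announces. Returns 0 otherwise.
 * Structure only: nonshortest forms and surrogates are not rejected.
 */
int spc_utf8_isvalid(const unsigned char *input) {
  int nb, i;
  const unsigned char *c;

  for (c = input;  *c;  c += (nb + 1)) {
    if (!(*c & 0x80)) nb = 0;                  /* 0xxxxxxx: single byte */
    else if ((*c & 0xc0) == 0x80) return 0;    /* stray continuation byte */
    else if ((*c & 0xe0) == 0xc0) nb = 1;
    else if ((*c & 0xf0) == 0xe0) nb = 2;
    else if ((*c & 0xf8) == 0xf0) nb = 3;
    else if ((*c & 0xfc) == 0xf8) nb = 4;      /* 5-byte form, RFC 2279 only */
    else if ((*c & 0xfe) == 0xfc) nb = 5;      /* 6-byte form, RFC 2279 only */
    else return 0;                             /* 0xFE and 0xFF are never valid */
    /* Check c[1]..c[nb] in order. A terminating NUL fails the test, so the
       loop never reads past the end of the string, and nb is left unchanged
       so that the increment in the for statement advances correctly. */
    for (i = 1; i <= nb; i++)
      if ((c[i] & 0xc0) != 0x80) return 0;
  }
  return 1;
}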

Broken Surrogates

Encoding of individual or out-of-order surrogate halves should not be permitted. Broken surrogates are invalid in Unicode and introduce ambiguity when they appear in Unicode data. Broken surrogates are often signs of bad data transmission. They can also indicate internal bugs in an application or intentional efforts to find security vulnerabilities.
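
In UTF-8 data, an encoded surrogate has a recognizable byte pattern: the lead byte ED followed by a second byte in the range A0..BF, which corresponds to U+D800..U+DFFF. The sketch below scans a buffer for that pattern; the function name `utf8_has_encoded_surrogate` is hypothetical, and the check is meant as an illustration, not as a substitute for full validation.

```c
#include <stddef.h>

/*
 * Hypothetical helper: returns 1 if buf[0..len) contains the byte pair that
 * begins a UTF-8-style encoding of a surrogate (ED followed by A0..BF),
 * 0 otherwise. ED has the bit pattern of a lead byte, so it cannot be
 * mistaken for a continuation byte of some other sequence.
 */
int utf8_has_encoded_surrogate(const unsigned char *buf, size_t len) {
  size_t i;

  for (i = 0; i + 1 < len; i++) {
    if (buf[i] == 0xED && buf[i + 1] >= 0xA0 && buf[i + 1] <= 0xBF)
      return 1;
  }
  return 0;
}
```

For example, ED A0 80 (an encoded U+D800) is flagged, whereas ED 9F BF (U+D7FF, the last code point before the surrogate range) is not.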

Risk Assessment

Failing to properly handle UTF-8-encoded data can result in a data integrity violation or a denial-of-service attack.

Recommendation	Severity	Likelihood	Detectable	Repairable	Priority	Level
MSC10-C	Medium	Unlikely	No	No	P2	L3

Automated Detection

Tool	Version	Checker	Description
LDRA tool suite		176 S, 376 S	Partially implemented

Related Vulnerabilities

Search for vulnerabilities resulting from the violation of this rule on the CERT website.

Related Guidelines

SEI CERT C++ Coding Standard	VOID MSC10-CPP. Character encoding: UTF8-related issues
MITRE CWE	CWE-176, Failure to handle Unicode encoding
MITRE CWE	CWE-116, Improper encoding or escaping of output

Bibliography

[ISO/IEC 10646:2012]
[Kuhn 2006]	UTF-8 and Unicode FAQ for Unix/Linux
[Pike 1993]	"Hello World"
[Unicode 2006]
[Viega 2003]	Section 3.12, "Detecting Illegal UTF-8 Characters"
[Wheeler 2003]	Secure Programmer: Call Components Safely
[Yergeau 1998]	RFC 2279

...
