UTF-8 is a variable-width encoding for Unicode. UTF-8 uses one to four bytes per character, depending on the Unicode symbol. UTF-8 has the following properties.
Generally, programs should validate UTF-8 data before performing other checks. The table below lists all valid UTF-8 sequences.
UCS Code (HEX) | Binary UTF-8 Format | Valid UTF-8 Values (HEX) |
---|---|---|
00-7F | 0xxxxxxx | 00-7F |
80-7FF | 110xxxxx 10xxxxxx | C2-DF 80-BF |
800-FFF | 1110xxxx 10xxxxxx 10xxxxxx | E0 A0*-BF 80-BF |
1000-FFFF | 1110xxxx 10xxxxxx 10xxxxxx | E1-EF 80-BF 80-BF |
10000-3FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx | F0 90*-BF 80-BF 80-BF |
40000-FFFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx | F1-F3 80-BF 80-BF 80-BF |
100000-10FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx | F4 80-8F* 80-BF 80-BF |
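The byte ranges in the table above can be checked directly, without decoding code points. The following is a minimal sketch under that assumption; the function name `utf8_matches_table` and its length-based interface are hypothetical and chosen only for this illustration. Like the table, it does not single out encoded surrogates.

```c
#include <stddef.h>

/* Hypothetical validator: applies the byte ranges from the table to a
 * buffer of n bytes. Returns 1 if every sequence matches a table row,
 * 0 otherwise. Encoded surrogates (ED A0-BF ...) are NOT rejected here,
 * because the table does not exclude them. */
static int utf8_matches_table(const unsigned char *s, size_t n) {
  size_t i = 0;
  while (i < n) {
    unsigned char b = s[i];
    unsigned char lo = 0x80, hi = 0xBF;  /* allowed range for the second byte */
    size_t extra, k;

    if (b <= 0x7F)                   extra = 0;
    else if (b >= 0xC2 && b <= 0xDF) extra = 1;
    else if (b == 0xE0)              { extra = 2; lo = 0xA0; }
    else if (b >= 0xE1 && b <= 0xEF) extra = 2;
    else if (b == 0xF0)              { extra = 3; lo = 0x90; }
    else if (b >= 0xF1 && b <= 0xF3) extra = 3;
    else if (b == 0xF4)              { extra = 3; hi = 0x8F; }
    else return 0;                   /* 80-C1 and F5-FF never start a sequence */

    if (extra > n - i - 1) return 0; /* truncated sequence */
    for (k = 1; k <= extra; k++) {
      unsigned char c = s[i + k];
      if (k == 1) {
        if (c < lo || c > hi) return 0;       /* restricted second byte */
      } else if (c < 0x80 || c > 0xBF) {
        return 0;                             /* ordinary continuation byte */
      }
    }
    i += extra + 1;
  }
  return 1;
}
```

Because the interface takes an explicit length, embedded null bytes are treated as ordinary ASCII rather than as a terminator.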
Although UTF-8 originated from the Plan 9 developers \[[Pike 93|AA. C References#Pike 93]\], Plan 9's own support only covers the low 16-bit range. In general, many "Unicode" systems only support the low 16-bit range, not the full 31-bit ISO 10646 code space \[[ISO/IEC 10646:2003(E)|AA. C References#ISO/IEC 10646-2003]\].
According to \[[Yergeau 98|AA. C References#Yergeau 98]\]:
Implementors of UTF-8 need to consider the security aspects of how they handle invalid UTF-8 sequences. It is conceivable that in some circumstances an attacker would be able to exploit an incautious UTF-8 parser by sending it an octet sequence that is not permitted by the UTF-8 syntax.
A particularly subtle form of this attack can be carried out against a parser which performs security-critical validity checks against the UTF-8 encoded form of its input, but interprets certain invalid octet sequences as characters. For example, a parser might prohibit the null character when encoded as the single-octet sequence 00, but allow the invalid two-octet sequence C0 80 and interpret it as a null character. Another example might be a parser which prohibits the octet sequence 2F 2E 2E 2F ("/../"), yet permits the invalid octet sequence 2F C0 AE 2E 2F.
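To see why the quoted attack works, consider a minimal sketch of an incautious two-byte decoder. The function name `naive_decode2` is hypothetical and not taken from any library; it simply strips the marker bits and joins the payload bits, without checking that the result actually required two bytes.

```c
#include <stdint.h>

/* Hypothetical, deliberately UNSAFE decoder for a two-byte sequence.
 * It strips the 110xxxxx / 10xxxxxx marker bits and joins the payload
 * bits, but never checks that the result is >= 0x80. */
static uint32_t naive_decode2(unsigned char b0, unsigned char b1) {
  return ((uint32_t)(b0 & 0x1F) << 6) | (uint32_t)(b1 & 0x3F);
}
```

With this decoder, C0 80 yields 0x00 and C0 AE yields 0x2E ('.'), so a filter that only looks for the raw bytes 00 or 2E 2E in the encoded input is bypassed.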
Below are more specific recommendations.
Only the "shortest" form of UTF-8 should be permitted. Naive decoders might accept encodings that are longer than necessary, allowing for potentially dangerous input to have multiple representations. For example:
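The "/" character (2F) would also be produced by the overlong sequences C0 AF and E0 80 AF if a decoder accepted them. As a hedged illustration, the sketch below decodes one sequence and rejects it when the decoded value could have been encoded in fewer bytes. The name `utf8_decode_shortest` and its return convention are assumptions made for this example.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decode one UTF-8 sequence from s (at most n bytes).
 * Returns the number of bytes consumed and stores the code point in *cp,
 * or returns 0 if the sequence is truncated, malformed, or not the
 * shortest form. Surrogates and the 0x10FFFF limit are NOT checked here. */
static size_t utf8_decode_shortest(const unsigned char *s, size_t n,
                                   uint32_t *cp) {
  /* Smallest code point that legitimately needs 1, 2, 3, or 4 bytes
   * (index 0 is unused). */
  static const uint32_t min_value[5] = { 0, 0x0, 0x80, 0x800, 0x10000 };
  size_t len, i;
  uint32_t v;

  if (n == 0) return 0;
  if (s[0] < 0x80)                { len = 1; v = s[0]; }
  else if ((s[0] & 0xE0) == 0xC0) { len = 2; v = s[0] & 0x1F; }
  else if ((s[0] & 0xF0) == 0xE0) { len = 3; v = s[0] & 0x0F; }
  else if ((s[0] & 0xF8) == 0xF0) { len = 4; v = s[0] & 0x07; }
  else return 0;                  /* continuation byte or invalid lead byte */

  if (len > n) return 0;          /* truncated input */
  for (i = 1; i < len; i++) {
    if ((s[i] & 0xC0) != 0x80) return 0;  /* not a continuation byte */
    v = (v << 6) | (uint32_t)(s[i] & 0x3F);
  }
  if (v < min_value[len]) return 0;       /* overlong (non-shortest) form */
  *cp = v;
  return len;
}
```

Comparing the decoded value against the minimum for its length keeps the check independent of which particular character is being smuggled in.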
UTF-8 decoders have no uniformly defined behavior upon encountering an invalid input. Below are several ways a UTF-8 decoder might behave in the event of an invalid byte sequence:
The following function, adapted from \[[Viega 03|AA. C References#Viega 03]\], detects invalid character sequences in a string but does not reject non-minimal forms. It returns {{1}} if the string is composed only of legitimate sequences; otherwise it returns {{0}}.

    int spc_utf8_isvalid(const unsigned char *input) {
      int nb, i;
      const unsigned char *c;

      for (c = input; *c; c += (nb + 1)) {
        if (!(*c & 0x80)) nb = 0;                 /* single-byte (ASCII) */
        else if ((*c & 0xc0) == 0x80) return 0;   /* stray continuation byte */
        else if ((*c & 0xe0) == 0xc0) nb = 1;
        else if ((*c & 0xf0) == 0xe0) nb = 2;
        else if ((*c & 0xf8) == 0xf0) nb = 3;
        else if ((*c & 0xfc) == 0xf8) nb = 4;
        else if ((*c & 0xfe) == 0xfc) nb = 5;
        else return 0;                            /* 0xFE and 0xFF are never valid */
        /* Check the nb trailing bytes without modifying nb. A terminating
           null byte fails the test, so the loop never reads past the string. */
        for (i = 1; i <= nb; i++)
          if ((c[i] & 0xc0) != 0x80) return 0;
      }
      return 1;
    }
Encoding of individual or out-of-order surrogate halves should not be permitted. Broken surrogates are invalid in Unicode and introduce ambiguity when they appear in Unicode data. Broken surrogates are often signs of bad data transmission. They can also indicate internal bugs in an application or intentional efforts to find security vulnerabilities.
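A minimal sketch of a surrogate check follows. Code points D800 through DFFF are reserved for UTF-16 surrogates, and in UTF-8 they would appear as three-byte sequences beginning with ED followed by a byte in the range A0-BF. Both function names are hypothetical and exist only for this illustration.

```c
#include <stdint.h>

/* Hypothetical helper: nonzero if the decoded code point cp lies in the
 * UTF-16 surrogate range D800-DFFF. */
static int is_surrogate(uint32_t cp) {
  return cp >= 0xD800 && cp <= 0xDFFF;
}

/* Hypothetical helper: nonzero if the two bytes begin a UTF-8-encoded
 * surrogate (ED followed by A0-BF). */
static int starts_encoded_surrogate(unsigned char b0, unsigned char b1) {
  return b0 == 0xED && b1 >= 0xA0 && b1 <= 0xBF;
}
```

The first form suits decoders that already produce code points; the second can be applied to raw bytes before any decoding takes place.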
Failing to properly handle UTF-8-encoded data can result in a data integrity violation or a denial-of-service attack.
Recommendation | Severity | Likelihood | Remediation Cost | Priority | Level |
---|---|---|---|---|---|
MSC10-C | medium | unlikely | high | P2 | L3 |
The LDRA tool suite V 7.6.0 can detect violations of this recommendation.
Search for vulnerabilities resulting from the violation of this rule on the CERT website.
This rule appears in the C++ Secure Coding Standard as MSC10-CPP. Character Encoding - UTF8 Related Issues.
\[[ISO/IEC 10646:2003|AA. C References#ISO/IEC 10646-2003]\]
\[[ISO/IEC PDTR 24772|AA. C References#ISO/IEC PDTR 24772]\] "AJN Choice of Filenames and other External Identifiers"
\[[Kuhn 06|AA. C References#Kuhn 06]\]
\[[MITRE 07|AA. C References#MITRE 07]\] [CWE ID 176|http://cwe.mitre.org/data/definitions/176.html], "Failure to Handle Unicode Encoding"
\[[Pike 93|AA. C References#Pike 93]\]
\[[Viega 03|AA. C References#Viega 03]\] Section 3.12, "Detecting Illegal UTF-8 Characters"
\[[Wheeler 03|AA. C References#Wheeler 03]\]
\[[Yergeau 98|AA. C References#Yergeau 98]\]