UTF-8 is a variable-width encoding for Unicode. UTF-8 uses 1 to 4 bytes per character, depending on the Unicode symbol. UTF-8 has the following properties:
- The classical US-ASCII characters (0 to 0x7f) encode as themselves, so files and strings that are encoded with ASCII values have the same encoding under both ASCII and UTF-8.
- All UCS characters beyond 0x7f are encoded as a multibyte sequence consisting only of bytes in the range of 0x80 to 0xfd. This means that no ASCII byte (including a null byte) can appear as part of another character. This property supports the use of byte-oriented string-handling functions.
- It is easy to convert between UTF-8 and the UCS-2 and UCS-4 fixed-width representations of characters.
- The lexicographic sorting order of UCS-4 strings is preserved.
- All possible 2^21 UCS codes can be encoded using UTF-8.
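
The second property can be illustrated with a standard byte-oriented search. The following is a minimal sketch, assuming the input is well-formed UTF-8; the helper name `first_slash` is hypothetical. Because every byte of a multibyte sequence is 0x80 or greater, `strchr` can match the ASCII byte 0x2F only where a real "/" character occurs.

```c
#include <string.h>

/*
 * Hypothetical helper: returns the byte offset of the first '/' in a
 * well-formed UTF-8 string, or -1 if there is none. The offset counts
 * bytes, not characters.
 */
long first_slash(const char *utf8) {
  const char *p = strchr(utf8, '/');
  return p ? (long)(p - utf8) : -1;
}
```

For the string "caf\xC3\xA9/x" (the word "café" followed by "/x"), the slash is found at byte offset 5, because the accented letter occupies 2 bytes.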
Generally, programs should validate UTF-8 data before performing other checks. The following table lists the well-formed UTF-8 byte sequences.
| UCS Code (HEX) | Binary UTF-8 Format | Valid UTF-8 Values (HEX) |
|---|---|---|
| 00-7F | 0xxxxxxx | 00-7F |
| 80-7FF | 110xxxxx 10xxxxxx | C2-DF 80-BF |
| 800-FFF | 1110xxxx 10xxxxxx 10xxxxxx | E0 A0*-BF 80-BF |
| 1000-FFFF | 1110xxxx 10xxxxxx 10xxxxxx | E1-EF 80-BF 80-BF |
| 10000-3FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx | F0 90*-BF 80-BF 80-BF |
| 40000-FFFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx | F1-F3 80-BF 80-BF 80-BF |
| 100000-10FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx | F4 80-8F* 80-BF 80-BF |
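
The byte ranges in the table can be checked directly. The following is a minimal sketch rather than a definitive implementation; the function name `utf8_seq_len` and its interface are assumptions made for illustration. It follows the table as written, so it accepts sequences beginning with ED A0 through ED BF, which encode surrogate halves; a stricter validator would need to exclude them.

```c
#include <stddef.h>

/*
 * Hypothetical helper: returns the length (1 to 4) of the well-formed
 * UTF-8 sequence at the start of s, or 0 if the bytes do not match any
 * row of the table. n is the number of bytes available at s.
 */
size_t utf8_seq_len(const unsigned char *s, size_t n) {
  unsigned char lo = 0x80, hi = 0xBF;  /* allowed range for byte 2 */
  size_t len, i;

  if (n == 0) return 0;
  if (s[0] <= 0x7F) return 1;                       /* 00-7F */
  else if (s[0] >= 0xC2 && s[0] <= 0xDF) len = 2;   /* C2-DF */
  else if (s[0] == 0xE0) { len = 3; lo = 0xA0; }    /* E0 A0-BF */
  else if (s[0] >= 0xE1 && s[0] <= 0xEF) len = 3;   /* E1-EF */
  else if (s[0] == 0xF0) { len = 4; lo = 0x90; }    /* F0 90-BF */
  else if (s[0] >= 0xF1 && s[0] <= 0xF3) len = 4;   /* F1-F3 */
  else if (s[0] == 0xF4) { len = 4; hi = 0x8F; }    /* F4 80-8F */
  else return 0;            /* 80-C1 and F5-FF never start a sequence */

  if (n < len) return 0;                            /* truncated input */
  if (s[1] < lo || s[1] > hi) return 0;             /* restricted byte 2 */
  for (i = 2; i < len; i++)
    if (s[i] < 0x80 || s[i] > 0xBF) return 0;       /* remaining bytes 80-BF */
  return len;
}
```

A caller can validate a whole buffer by calling the function repeatedly and advancing by the returned length, treating a return value of 0 as an invalid sequence.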
The following table shows the bit layout of UTF-8 byte sequences for each range of code points.
| Bits of code point | First code point | Last code point | Bytes in sequence | Byte 1 | Byte 2 | Byte 3 | Byte 4 |
|---|---|---|---|---|---|---|---|
| 7 | U+0000 | U+007F | 1 | 0xxxxxxx | | | |
| 11 | U+0080 | U+07FF | 2 | 110xxxxx | 10xxxxxx | | |
| 16 | U+0800 | U+FFFF | 3 | 1110xxxx | 10xxxxxx | 10xxxxxx | |
| 21 | U+10000 | U+1FFFFF | 4 | 11110xxx | 10xxxxxx | 10xxxxxx | 10xxxxxx |
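
The bit layout above can be turned into an encoder by shifting and masking 6 bits at a time. The following is a minimal sketch; the name `utf8_encode` and its interface are assumptions. It is deliberately stricter than the layout alone: it stops at U+10FFFF, the upper limit of Unicode code points, and refuses surrogate values.

```c
#include <stddef.h>

/*
 * Hypothetical helper: encodes code point cp into out, which must hold
 * at least 4 bytes. Returns the number of bytes written, or 0 if cp is
 * a surrogate (U+D800 to U+DFFF) or lies beyond U+10FFFF.
 */
size_t utf8_encode(unsigned long cp, unsigned char *out) {
  if (cp <= 0x7F) {                         /* 7 bits: 0xxxxxxx */
    out[0] = (unsigned char)cp;
    return 1;
  }
  if (cp <= 0x7FF) {                        /* 11 bits: 110xxxxx 10xxxxxx */
    out[0] = (unsigned char)(0xC0 | (cp >> 6));
    out[1] = (unsigned char)(0x80 | (cp & 0x3F));
    return 2;
  }
  if (cp <= 0xFFFF) {                       /* 16 bits: 1110xxxx + 2 more */
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0;
    out[0] = (unsigned char)(0xE0 | (cp >> 12));
    out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[2] = (unsigned char)(0x80 | (cp & 0x3F));
    return 3;
  }
  if (cp <= 0x10FFFF) {                     /* 21 bits: 11110xxx + 3 more */
    out[0] = (unsigned char)(0xF0 | (cp >> 18));
    out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (unsigned char)(0x80 | (cp & 0x3F));
    return 4;
  }
  return 0;
}
```

For example, U+20AC is encoded as the 3 bytes E2 82 AC. Because the encoder always chooses the smallest length that fits, its output is in shortest form by construction.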
Although UTF-8 originated from the Plan 9 developers [Pike 1993], Plan 9's own support covers only the low 16-bit range. In general, many "Unicode" systems support only the low 16-bit range, not the full 21-bit ISO 10646 code space [ISO/IEC 10646:2012].
Security-Related Issues
According to RFC 2279: UTF-8, a transformation format of ISO 10646 [Yergeau 1998]:
Implementors of UTF-8 need to consider the security aspects of how they handle invalid UTF-8 sequences. It is conceivable that, in some circumstances, an attacker would be able to exploit an incautious UTF-8 parser by sending it an octet sequence that is not permitted by the UTF-8 syntax.
A particularly subtle form of this attack can be carried out against a parser that performs security-critical validity checks against the UTF-8 encoded form of its input, but interprets certain invalid octet sequences as characters. For example, a parser might prohibit the null character when encoded as the single-octet sequence 00, but allow the invalid two-octet sequence C0 80 and interpret it as a null character. Another example might be a parser that prohibits the octet sequence 2F 2E 2E 2F ("/../"), yet permits the invalid octet sequence 2F C0 AE 2E 2F.
Following are more specific recommendations.
Accept Only the Shortest Form
Only the "shortest" form of UTF-8 should be permitted. Naive decoders might accept encodings that are longer than necessary, allowing potentially dangerous input to have multiple representations. For example:
- Process A performs security checks but does not check for nonshortest UTF-8 forms.
- Process B accepts the byte sequence from process A and transforms it into UTF-16 while interpreting possible nonshortest forms.
- The UTF-16 text may contain characters that should have been filtered out by process A and can potentially be dangerous.

These "nonshortest" UTF-8 attacks have been used to bypass security validations in high-profile products, such as Microsoft's IIS web server.
Corrigendum #1: UTF-8 Shortest Form to the Unicode Standard [Unicode 2006] describes modifications made to version 3.0 of the Unicode Standard to forbid the interpretation of the nonshortest forms.
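
One way to detect a nonshortest form is to decode the sequence by bit pattern and then check that the resulting code point really needs that many bytes. The following is a minimal sketch under that approach; the name `utf8_is_shortest_form` is hypothetical, and the caller is assumed to supply exactly the bytes of one candidate sequence.

```c
#include <stddef.h>

/*
 * Hypothetical helper: decodes the len-byte sequence at s by bit pattern
 * alone and returns 1 only if the code point could not have been encoded
 * in fewer bytes. Assumes len bytes are readable at s.
 */
int utf8_is_shortest_form(const unsigned char *s, size_t len) {
  /* lead-byte pattern, pattern mask, and payload mask for 1 to 4 bytes */
  static const unsigned char pat[]      = { 0x00, 0xC0, 0xE0, 0xF0 };
  static const unsigned char pat_mask[] = { 0x80, 0xE0, 0xF0, 0xF8 };
  static const unsigned char payload[]  = { 0x7F, 0x1F, 0x0F, 0x07 };
  /* smallest code point that needs 1, 2, 3, or 4 bytes */
  static const unsigned long min_cp[]   = { 0x0, 0x80, 0x800, 0x10000 };
  unsigned long cp;
  size_t i;

  if (len < 1 || len > 4) return 0;
  if ((s[0] & pat_mask[len - 1]) != pat[len - 1]) return 0;
  cp = s[0] & payload[len - 1];
  for (i = 1; i < len; i++) {
    if ((s[i] & 0xC0) != 0x80) return 0;    /* not a continuation byte */
    cp = (cp << 6) | (s[i] & 0x3F);
  }
  return cp >= min_cp[len - 1];
}
```

With this check, the two-octet sequence C0 AE from the RFC example is rejected because it decodes to 0x2E ("."), which fits in a single byte, so the disguised "/../" path no longer passes.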
Handling Invalid Inputs
UTF-8 decoders have no uniformly defined behavior upon encountering an invalid input. Following are several ways a UTF-8 decoder might behave in the event of an invalid byte sequence. Note that implementing these behaviors requires careful security considerations.
- Substitute the replacement character U+FFFD, or a wildcard character such as "?" when U+FFFD is not available.
- Ignore the bytes (for example, delete the invalid byte before the validation process; see "Unicode Technical Report #36, 3.5 Deletion of Code Points" for more information).
- Interpret the bytes according to a different character encoding (often the ISO-8859-1 character map; other encodings, such as Shift_JIS, are known to trigger self-XSS and so are potentially dangerous).
- Fail to notice the error and decode as if the bytes were some similar bit of UTF-8.
- Stop decoding and report an error.
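
Two of these behaviors, substituting U+FFFD and stopping with an error, can be sketched in one routine. The following is an illustration only; the names `utf8_policy`, `seq_len`, and `utf8_sanitize` are hypothetical, and the internal validity check is an assumption that also treats encoded surrogates as invalid.

```c
#include <stddef.h>

enum utf8_policy { UTF8_STRICT, UTF8_REPLACE };   /* hypothetical names */

/* Length of the well-formed sequence at s (n >= 1 bytes available), or 0. */
static size_t seq_len(const unsigned char *s, size_t n) {
  unsigned char lo = 0x80, hi = 0xBF;
  size_t len, i;
  if (s[0] <= 0x7F) return 1;
  if (s[0] >= 0xC2 && s[0] <= 0xDF) len = 2;
  else if (s[0] >= 0xE0 && s[0] <= 0xEF) len = 3;
  else if (s[0] >= 0xF0 && s[0] <= 0xF4) len = 4;
  else return 0;
  if (s[0] == 0xE0) lo = 0xA0;      /* no overlong 3-byte forms */
  if (s[0] == 0xED) hi = 0x9F;      /* no surrogates (assumption of this sketch) */
  if (s[0] == 0xF0) lo = 0x90;      /* no overlong 4-byte forms */
  if (s[0] == 0xF4) hi = 0x8F;      /* nothing above U+10FFFF */
  if (n < len || s[1] < lo || s[1] > hi) return 0;
  for (i = 2; i < len; i++)
    if ((s[i] & 0xC0) != 0x80) return 0;
  return len;
}

/*
 * Copies in[0..n) to out, which must hold at least 3 * n bytes.
 * UTF8_REPLACE writes U+FFFD (EF BF BD) for each invalid byte;
 * UTF8_STRICT stops and returns (size_t)-1 at the first invalid byte.
 * Otherwise returns the number of bytes written.
 */
size_t utf8_sanitize(const unsigned char *in, size_t n,
                     unsigned char *out, enum utf8_policy policy) {
  size_t i = 0, o = 0, len, k;
  while (i < n) {
    len = seq_len(in + i, n - i);
    if (len == 0) {
      if (policy == UTF8_STRICT) return (size_t)-1;
      out[o++] = 0xEF; out[o++] = 0xBF; out[o++] = 0xBD;
      i++;                          /* skip one byte and resynchronize */
      continue;
    }
    for (k = 0; k < len; k++) out[o++] = in[i++];
  }
  return o;
}
```

The sketch replaces invalid bytes instead of deleting them, so the text on either side of a bad byte is never joined into a new string that an earlier filter did not see.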
The following function, adapted from John Viega's "Protecting Sensitive Data in Memory" [Viega 2003], detects invalid character sequences in a string but does not reject nonminimal forms. It returns 1 if the string is composed only of legitimate sequences; otherwise, it returns 0.
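```c
int spc_utf8_isvalid(const unsigned char *input) {
  int nb;  /* number of continuation bytes expected after the lead byte */
  int i;
  const unsigned char *c;

  for (c = input; *c; c += (nb + 1)) {
    if (!(*c & 0x80)) nb = 0;                  /* 0xxxxxxx: ASCII */
    else if ((*c & 0xc0) == 0x80) return 0;    /* stray continuation byte */
    else if ((*c & 0xe0) == 0xc0) nb = 1;
    else if ((*c & 0xf0) == 0xe0) nb = 2;
    else if ((*c & 0xf8) == 0xf0) nb = 3;
    else if ((*c & 0xfc) == 0xf8) nb = 4;      /* 5-byte form (RFC 2279) */
    else if ((*c & 0xfe) == 0xfc) nb = 5;      /* 6-byte form (RFC 2279) */
    else return 0;                             /* 0xfe and 0xff are never valid */
    /* Each following byte must be 10xxxxxx; a terminating null fails here. */
    for (i = 1; i <= nb; i++)
      if ((c[i] & 0xc0) != 0x80) return 0;
  }
  return 1;
}
```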
Broken Surrogates
Encoding of individual or out-of-order surrogate halves should not be permitted. Broken surrogates are invalid in Unicode and introduce ambiguity when they appear in Unicode data. Broken surrogates are often signs of bad data transmission. They can also indicate internal bugs in an application or intentional efforts to find security vulnerabilities.
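
A check for broken surrogates in UTF-16 data can be sketched as a single pass over the code units. The following is a minimal illustration; the name `utf16_surrogates_ok` is hypothetical, and the buffer is assumed to hold one UTF-16 code unit per `unsigned short` element.

```c
#include <stddef.h>

/*
 * Hypothetical helper: returns 1 if every surrogate in the UTF-16
 * buffer u[0..n) belongs to a high-then-low pair, and 0 otherwise.
 */
int utf16_surrogates_ok(const unsigned short *u, size_t n) {
  size_t i;
  for (i = 0; i < n; i++) {
    if (u[i] >= 0xD800 && u[i] <= 0xDBFF) {        /* high surrogate */
      if (i + 1 >= n || u[i + 1] < 0xDC00 || u[i + 1] > 0xDFFF)
        return 0;                                  /* unpaired high half */
      i++;                                         /* skip the low half */
    } else if (u[i] >= 0xDC00 && u[i] <= 0xDFFF) {
      return 0;                                    /* low half with no high half */
    }
  }
  return 1;
}
```

The pair D83D DE00 passes, while the same two units in reverse order, or either unit on its own, are reported as broken.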
Risk Assessment
Failing to properly handle UTF-8-encoded data can result in a data integrity violation or a denial-of-service attack.
| Recommendation | Severity | Likelihood | Detectable | Repairable | Priority | Level |
|---|---|---|---|---|---|---|
| MSC10-C | Medium | Unlikely | No | No | P2 | L3 |
Automated Detection
...
| Tool | Version | Checker | Description |
|---|---|---|---|
| LDRA tool suite | ... | 176 S, 376 S | Partially implemented |
Related Vulnerabilities
Search for vulnerabilities resulting from the violation of this rule on the CERT website.
Related Guidelines
| SEI CERT C++ Coding Standard | VOID MSC10-CPP. Character encoding: UTF8-related issues |
| MITRE CWE | CWE-176, Failure to handle Unicode encoding; CWE-116, Improper encoding or escaping of output |
Bibliography
| [ISO/IEC 10646:2012] | |
| [Kuhn 2006] | UTF-8 and Unicode FAQ for Unix/Linux |
| [Pike 1993] | "Hello World" |
| [Unicode 2006] | |
| [Viega 2003] | Section 3.12, "Detecting Illegal UTF-8 Characters" |
| [Wheeler 2003] | Secure Programmer: Call Components Safely |
| [Yergeau 1998] | RFC 2279 |