UTF-8 is a variable-width encoding for Unicode. It uses one to four bytes per character, depending on the Unicode code point being encoded.
In general, programs should verify that UTF-8 data is legal before performing any other checks on it. The table below lists all legal UTF-8 sequences.
| UCS Code (HEX) | Binary UTF-8 Format | Legal UTF-8 Values (HEX) |
|---|---|---|
| 00-7F | 0xxxxxxx | 00-7F |
| 80-7FF | 110xxxxx 10xxxxxx | C2-DF 80-BF |
| 800-FFF | 1110xxxx 10xxxxxx 10xxxxxx | E0 A0*-BF 80-BF |
| 1000-FFFF | 1110xxxx 10xxxxxx 10xxxxxx | E1-EF 80-BF 80-BF |
| 10000-3FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx | F0 90*-BF 80-BF 80-BF |
| 40000-FFFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx | F1-F3 80-BF 80-BF 80-BF |
| 100000-10FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx | F4 80-8F* 80-BF 80-BF |

Bounds marked with an asterisk (*) are narrower than the usual 80-BF range for a trailing byte; values outside them would produce a non-shortest form or a code point above 10FFFF.
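The byte ranges in the table can be checked mechanically. The following is a minimal sketch rather than a definitive implementation: the name `utf8_legal_len` is hypothetical, and the function assumes the caller passes a buffer together with the number of bytes available in it. It follows the table literally, so it does not exclude encoded surrogate halves.

```c
#include <stddef.h>

/* Hypothetical helper: returns the length (1-4) of the legal UTF-8
 * sequence that starts at s, or 0 if the bytes match no row of the
 * table. n is the number of bytes available at s. */
size_t utf8_legal_len(const unsigned char *s, size_t n) {
    unsigned char lo = 0x80, hi = 0xBF;  /* default range for the second byte */
    size_t len, i;

    if (n == 0) return 0;
    if (s[0] <= 0x7F) return 1;                       /* 00-7F */
    else if (s[0] >= 0xC2 && s[0] <= 0xDF) len = 2;   /* C2-DF */
    else if (s[0] == 0xE0) { len = 3; lo = 0xA0; }    /* E0 A0*-BF */
    else if (s[0] >= 0xE1 && s[0] <= 0xEF) len = 3;   /* E1-EF */
    else if (s[0] == 0xF0) { len = 4; lo = 0x90; }    /* F0 90*-BF */
    else if (s[0] >= 0xF1 && s[0] <= 0xF3) len = 4;   /* F1-F3 */
    else if (s[0] == 0xF4) { len = 4; hi = 0x8F; }    /* F4 80-8F* */
    else return 0;        /* 80-C1 and F5-FF never start a legal sequence */

    if (n < len) return 0;                            /* truncated input */
    if (s[1] < lo || s[1] > hi) return 0;             /* restricted second byte */
    for (i = 2; i < len; i++)
        if (s[i] < 0x80 || s[i] > 0xBF) return 0;     /* remaining bytes: 80-BF */
    return len;
}
```

A caller could walk a buffer by repeatedly advancing by the returned length and treating a return value of 0 as an illegal sequence.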
The UTF-8 encoding scheme is fairly simple, but a few clarifications are important for security. One of the most important is the requirement that only the "shortest" form of UTF-8 be permitted. A naive decoder may accept encodings that are longer than necessary. Potentially dangerous input can then be represented in multiple ways, which defeats security checks that look for that input. For example, the slash character (U+002F) is legally encoded as the single byte 2F, but a naive decoder may also accept the two-byte form C0 AF.
These non-"shortest" UTF-8 forms have been used to bypass security validations in high-profile products, including Microsoft's IIS web server.
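To make the problem concrete, the sketch below shows how a decoder that only combines payload bits maps a two-byte sequence onto an ASCII character, and how a length check catches it. Both function names are hypothetical and the code is an illustration under those assumptions, not the decoder of any particular product.

```c
#include <stddef.h>

/* Naive decoder for a two-byte sequence 110xxxxx 10xxxxxx. It combines
 * the payload bits without asking whether two bytes were necessary. */
unsigned long naive_decode2(const unsigned char *s) {
    return ((unsigned long)(s[0] & 0x1F) << 6) | (unsigned long)(s[1] & 0x3F);
}

/* Returns 1 if code point cp cannot be encoded in fewer than len bytes,
 * i.e. a len-byte encoding of cp is not longer than necessary. */
int is_shortest_form(unsigned long cp, size_t len) {
    /* Smallest code point that requires 1, 2, 3, and 4 bytes. */
    static const unsigned long min_cp[] = { 0x0, 0x80, 0x800, 0x10000 };
    if (len < 1 || len > 4) return 0;
    return cp >= min_cp[len - 1];
}
```

With these definitions, `naive_decode2` turns the bytes C0 AF into 0x2F, the same value as the single byte 2F ('/'), so a filter that scans only for the byte 2F would miss it. `is_shortest_form(0x2F, 2)` returns 0, which flags the two-byte form as longer than necessary.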
The standard does not define a uniform response for a UTF-8 decoder that receives an invalid form of UTF-8. In practice, decoders differ in how they behave when they encounter an invalid byte sequence.
The following function from Viega 03 detects illegal character sequences in a string. It returns 1 if the string is composed only of legitimate sequences; otherwise it returns 0:
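/* Checks lead-byte and continuation-byte structure only. It does not
 * reject non-shortest forms, surrogate halves, or 5- and 6-byte forms. */
int spc_utf8_isvalid(const unsigned char *input) {
  int nb, i;
  const unsigned char *c;

  for (c = input; *c; c += (nb + 1)) {
    if (!(*c & 0x80)) nb = 0;                 /* 0xxxxxxx: single byte */
    else if ((*c & 0xc0) == 0x80) return 0;   /* stray continuation byte */
    else if ((*c & 0xe0) == 0xc0) nb = 1;     /* 110xxxxx */
    else if ((*c & 0xf0) == 0xe0) nb = 2;     /* 1110xxxx */
    else if ((*c & 0xf8) == 0xf0) nb = 3;     /* 11110xxx */
    else if ((*c & 0xfc) == 0xf8) nb = 4;     /* 5-byte form, kept from the original */
    else if ((*c & 0xfe) == 0xfc) nb = 5;     /* 6-byte form, kept from the original */
    else return 0;                            /* FE and FF are never valid */
    /* Each of the nb bytes after the lead byte must be 10xxxxxx. Checking
     * in order stops at a terminating NUL before reading past it. */
    for (i = 1; i <= nb; i++)
      if ((c[i] & 0xc0) != 0x80) return 0;
  }
  return 1;
}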
The most recent requirement for UTF-8 encoding is that encodings of individual or out-of-order surrogate halves must not be permitted. Broken surrogates are illegal in Unicode, so they introduce ambiguity when they appear in Unicode data. Again, they could be used to create strings that appear similar but are not actually the same, particularly when applications ignore the bad data. Broken surrogates can be a sign of bad data transmission. They can also indicate internal bugs in an application or intentional efforts to find security problems.
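A surrogate half encoded in UTF-8 style is a three-byte sequence whose decoded value falls in the range U+D800 to U+DFFF. The sketch below, with the hypothetical name `utf8_is_surrogate_seq`, shows one way such a sequence might be detected. It inspects a single three-byte sequence and assumes the caller supplies the number of bytes available.

```c
#include <stddef.h>

/* Hypothetical helper: returns 1 if the first three bytes at s have the
 * form 1110xxxx 10xxxxxx 10xxxxxx and decode to a surrogate half
 * (U+D800-U+DFFF), which a decoder should reject. Returns 0 otherwise. */
int utf8_is_surrogate_seq(const unsigned char *s, size_t n) {
    unsigned long cp;

    if (n < 3) return 0;
    if ((s[0] & 0xF0) != 0xE0) return 0;          /* not a 3-byte lead byte */
    if ((s[1] & 0xC0) != 0x80) return 0;          /* not a continuation byte */
    if ((s[2] & 0xC0) != 0x80) return 0;          /* not a continuation byte */

    /* Combine the 4 + 6 + 6 payload bits into a code point. */
    cp = ((unsigned long)(s[0] & 0x0F) << 12)
       | ((unsigned long)(s[1] & 0x3F) << 6)
       |  (unsigned long)(s[2] & 0x3F);
    return cp >= 0xD800 && cp <= 0xDFFF;
}
```

In byte terms this is equivalent to a lead byte of ED followed by a second byte in the range A0-BF, so the check can also be folded into a range-based validator.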
\[[Kuhn 06|AA. C References#Kuhn 06]\] UTF-8 and Unicode FAQ for Unix/Linux
\[[Viega 03|AA. C References#Viega 03]\] Section 3.12, "Detecting Illegal UTF-8 Characters"
\[[Wheeler 06|AA. C References#Wheeler 06]\] Secure Programming for Linux and Unix HOWTO
\[[Yergeau 98|AA. C References#Yergeau 98]\] RFC 2279 - UTF-8, a transformation format of ISO 10646