...
UCS Code (HEX) | Binary UTF-8 Format | Legal UTF-8 Values (HEX) |
|---|---|---|
00-7F | 0xxxxxxx | 00-7F |
80-7FF | 110xxxxx 10xxxxxx | C2-DF 80-BF |
800-FFF | 1110xxxx 10xxxxxx 10xxxxxx | E0 A0*-BF 80-BF |
1000-FFFF | 1110xxxx 10xxxxxx 10xxxxxx | E1-EF 80-BF 80-BF |
10000-3FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx | F0 90*-BF 80-BF 80-BF |
40000-FFFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx | F1-F3 80-BF 80-BF 80-BF |
100000-10FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx | F4 80-8F* 80-BF 80-BF |
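The bit layouts in the table can be illustrated with a small encoder. This is a minimal sketch, not part of the original guideline: the function name `utf8_encode` and its return convention are assumptions made for illustration, and it also refuses surrogate code points (U+D800 to U+DFFF), which the table itself does not single out.

```c
#include <stddef.h>

/* Hypothetical helper: encodes code point cp into out (room for 4 bytes).
 * Returns the number of bytes written, or 0 if cp cannot be encoded. */
size_t utf8_encode(unsigned long cp, unsigned char out[4]) {
  if (cp <= 0x7F) {                       /* 0xxxxxxx */
    out[0] = (unsigned char)cp;
    return 1;
  }
  if (cp <= 0x7FF) {                      /* 110xxxxx 10xxxxxx */
    out[0] = (unsigned char)(0xC0 | (cp >> 6));
    out[1] = (unsigned char)(0x80 | (cp & 0x3F));
    return 2;
  }
  if (cp <= 0xFFFF) {                     /* 1110xxxx 10xxxxxx 10xxxxxx */
    if (cp >= 0xD800 && cp <= 0xDFFF)     /* assumption: refuse surrogates */
      return 0;
    out[0] = (unsigned char)(0xE0 | (cp >> 12));
    out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[2] = (unsigned char)(0x80 | (cp & 0x3F));
    return 3;
  }
  if (cp <= 0x10FFFF) {                   /* 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
    out[0] = (unsigned char)(0xF0 | (cp >> 18));
    out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (unsigned char)(0x80 | (cp & 0x3F));
    return 4;
  }
  return 0;                               /* beyond U+10FFFF */
}
```

Because each branch picks the smallest length that can hold the code point, an encoder written this way only ever produces bytes from the "Legal UTF-8 Values" column.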
Security Related Issues
According to [Yergeau 98]:
...
Below are more specific recommendations.
Only Accept the "Shortest" Form
Only the "shortest" form of UTF-8 should be permitted. Naive decoders might accept encodings that are longer than necessary, allowing potentially dangerous input to have multiple representations. For example:
...
These non-"shortest" UTF-8 attacks have been used to bypass security validations in high-profile products, such as Microsoft's IIS web server.
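One way to enforce the shortest-form rule is to decode each sequence and compare the result against the smallest code point that genuinely needs that many bytes. The sketch below is illustrative only: the name `utf8_decode_shortest`, its length-based interface, and its limit of four bytes per sequence are assumptions, not something the guideline prescribes.

```c
#include <stddef.h>

/* Hypothetical helper: decodes one UTF-8 sequence from s (len bytes available).
 * On success stores the code point in *cp and returns the bytes consumed.
 * Returns 0 for truncated, malformed, or non-shortest input. */
size_t utf8_decode_shortest(const unsigned char *s, size_t len,
                            unsigned long *cp) {
  /* min_cp[n] is the smallest code point that needs n bytes. */
  static const unsigned long min_cp[5] = {0, 0x0, 0x80, 0x800, 0x10000};
  size_t n, i;
  unsigned long v;

  if (len == 0) return 0;
  if (s[0] < 0x80)                { n = 1; v = s[0]; }
  else if ((s[0] & 0xE0) == 0xC0) { n = 2; v = s[0] & 0x1F; }
  else if ((s[0] & 0xF0) == 0xE0) { n = 3; v = s[0] & 0x0F; }
  else if ((s[0] & 0xF8) == 0xF0) { n = 4; v = s[0] & 0x07; }
  else return 0;                  /* stray continuation or invalid lead byte */

  if (len < n) return 0;          /* truncated sequence */
  for (i = 1; i < n; i++) {
    if ((s[i] & 0xC0) != 0x80) return 0;
    v = (v << 6) | (s[i] & 0x3F);
  }
  if (v < min_cp[n]) return 0;    /* longer than necessary: reject */
  if (v > 0x10FFFF) return 0;     /* outside the range in the table */
  *cp = v;
  return n;
}
```

With this check, the two-byte sequence C0 AF, an overlong encoding of '/' (U+002F), is rejected, while the one-byte form 2F is accepted.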
Handling Invalid Inputs
UTF-8 decoders have no uniformly defined behavior upon encountering an invalid input. Below are several ways a UTF-8 decoder might behave in the event of an invalid byte sequence:
...
| Code Block |
|---|
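/* Returns 1 if input is a structurally valid UTF-8 string, 0 otherwise.
 * Only lead and continuation byte patterns are checked; overlong forms
 * and surrogates are not rejected here. */
int spc_utf8_isvalid(const unsigned char *input) {
  int nb;  /* number of continuation bytes expected after the lead byte */
  int i;
  const unsigned char *c;

  for (c = input; *c; c += (nb + 1)) {
    if (!(*c & 0x80)) nb = 0;                /* 0xxxxxxx: single byte */
    else if ((*c & 0xc0) == 0x80) return 0;  /* stray continuation byte */
    else if ((*c & 0xe0) == 0xc0) nb = 1;
    else if ((*c & 0xf0) == 0xe0) nb = 2;
    else if ((*c & 0xf8) == 0xf0) nb = 3;
    else if ((*c & 0xfc) == 0xf8) nb = 4;
    else if ((*c & 0xfe) == 0xfc) nb = 5;
    else return 0;                           /* 0xfe and 0xff never lead */
    /* Every following byte must be 10xxxxxx. A NUL fails this test,
     * so the loop never reads past the terminator. */
    for (i = 1; i <= nb; i++)
      if ((c[i] & 0xc0) != 0x80) return 0;
  }
  return 1;
}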
|
Broken Surrogates
Encoding of individual or out-of-order surrogate halves should not be permitted. Broken surrogates are illegal in Unicode and introduce ambiguity when they appear in Unicode data. They are often a sign of bad data transmission, but they can also indicate internal bugs in an application or intentional efforts to find security vulnerabilities.
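A surrogate half encoded in UTF-8 always starts with the byte ED followed by a byte in the range A0-BF, so it can be spotted without fully decoding the input. The following is a minimal sketch under stated assumptions: the name `utf8_has_surrogate` is hypothetical, the input is assumed to have already passed a structural validity check, and the function flags every encoded surrogate, paired or not, which is stricter than rejecting only broken ones.

```c
#include <stddef.h>

/* Hypothetical helper: returns 1 if the len-byte buffer s contains a
 * UTF-8-encoded surrogate (U+D800 to U+DFFF), 0 otherwise.
 * Assumes s is otherwise structurally valid UTF-8. */
int utf8_has_surrogate(const unsigned char *s, size_t len) {
  size_t i;

  for (i = 0; i + 1 < len; i++) {
    /* ED A0..BF decodes to 1101 1xxx ..., i.e. U+D800 to U+DFFF. */
    if (s[i] == 0xED && s[i + 1] >= 0xA0 && s[i + 1] <= 0xBF)
      return 1;
  }
  return 0;
}
```

The byte ED can never be a continuation byte, so in structurally valid input the test only ever matches at the start of a three-byte sequence.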
Reference
[Kuhn 06] UTF-8 and Unicode FAQ for Unix/Linux
[Viega 03] Section 3.12, "Detecting Illegal UTF-8 Characters"
[Wheeler 06] Secure Programming for Linux and Unix HOWTO
[Yergeau 98] RFC 2279 - UTF-8, a transformation format of ISO 10646