...
- The classical US-ASCII characters (0 to 0x7f) encode as themselves, so files and strings that are encoded with ASCII values have the same encoding under both ASCII and UTF-8.
- All UCS characters beyond 0x7f are encoded as a multibyte sequence consisting only of bytes in the range 0x80 to 0xfd. This means that no ASCII byte (including a null byte) can appear as part of another character. This property allows string handling functions that operate on null-terminated byte strings to be used with UTF-8 data.
- It's easy to convert between UTF-8 and UCS-2 and UCS-4 fixed-width representations of characters.
- The lexicographic sorting order of UCS-4 strings is preserved.
- All possible 2^31 UCS codes can be encoded using UTF-8.
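The properties above can be seen in a minimal encoder sketch. The function name `utf8_encode` and its interface are hypothetical and do not come from the original text. The sketch follows the original 1-to-6-byte scheme described here, which covers all 2^31 codes; note that RFC 3629 later restricted UTF-8 to code points up to U+10FFFF (at most 4 bytes), so a production encoder would likely reject larger values.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encodes code point cp (below 2^31) into out, which
 * must hold at least 6 bytes. Returns the number of bytes written, or 0 if
 * cp is out of range. */
size_t utf8_encode(uint32_t cp, unsigned char out[6]) {
  /* Largest code point representable with 1..6 bytes. */
  static const uint32_t limit[6] = {
    0x7f, 0x7ff, 0xffff, 0x1fffff, 0x3ffffff, 0x7fffffff
  };
  /* Lead-byte prefix for each sequence length. */
  static const unsigned char lead[6] = {
    0x00, 0xc0, 0xe0, 0xf0, 0xf8, 0xfc
  };
  size_t len, i;

  /* Pick the shortest length that can hold cp. */
  for (len = 1; len <= 6; len++)
    if (cp <= limit[len - 1]) break;
  if (len > 6) return 0;

  /* Fill continuation bytes (10xxxxxx) from the end, 6 bits at a time. */
  for (i = len - 1; i > 0; i--) {
    out[i] = (unsigned char)(0x80 | (cp & 0x3f));
    cp >>= 6;
  }
  out[0] = (unsigned char)(lead[len - 1] | cp);
  return len;
}
```

For code points up to 0x7f the single output byte equals the code point itself. For anything larger, every output byte has its high bit set, so no ASCII byte can appear inside a multibyte sequence.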
Generally, programs should validate any UTF-8 data before performing other checks. The table below lists all valid UTF-8 sequences.
...
Below are more specific recommendations.
Accept Only the "Shortest" Form
Only the "shortest" form of UTF-8 should be permitted. Naive decoders might accept encodings that are longer than necessary, allowing potentially dangerous input to have multiple representations. For example:
- Process A performs security checks, but does not check for non-shortest UTF-8 forms.
- Process B accepts the byte sequence from process A and transforms it into UTF-16 while interpreting possible non-shortest forms.
- The UTF-16 text may then contain characters that should have been filtered out by process A and could potentially be dangerous. These non-"shortest" UTF-8 attacks have been used to bypass security validations in high-profile products, such as Microsoft's IIS web server.
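As an illustration, the character `/` (0x2f) has the shortest form `2f`, but a naive decoder would also decode the overlong sequences `c0 af` and `e0 80 af` to the same character. The following is a minimal sketch of a check that process A could apply. The name `utf8_is_shortest` and its interface are hypothetical, and the sketch assumes the original 1-to-6-byte scheme; it does not check for other ill-formed input such as surrogate code points.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns 1 if the len bytes at s are structurally
 * valid UTF-8 with every character in shortest form, 0 otherwise. */
int utf8_is_shortest(const unsigned char *s, size_t len) {
  /* Smallest code point that genuinely needs 1..6 bytes. */
  static const uint32_t min_cp[6] = {
    0x0, 0x80, 0x800, 0x10000, 0x200000, 0x4000000
  };
  size_t i = 0;

  while (i < len) {
    unsigned char b = s[i];
    size_t nb, n;   /* nb: continuation bytes expected after the lead byte */
    uint32_t cp;    /* decoded code point */

    if (b < 0x80)                { nb = 0; cp = b; }
    else if ((b & 0xe0) == 0xc0) { nb = 1; cp = b & 0x1f; }
    else if ((b & 0xf0) == 0xe0) { nb = 2; cp = b & 0x0f; }
    else if ((b & 0xf8) == 0xf0) { nb = 3; cp = b & 0x07; }
    else if ((b & 0xfc) == 0xf8) { nb = 4; cp = b & 0x03; }
    else if ((b & 0xfe) == 0xfc) { nb = 5; cp = b & 0x01; }
    else return 0;  /* stray continuation byte, 0xfe, or 0xff */

    if (nb > len - i - 1) return 0;  /* sequence truncated by end of input */
    for (n = 1; n <= nb; n++) {
      if ((s[i + n] & 0xc0) != 0x80) return 0;
      cp = (cp << 6) | (s[i + n] & 0x3f);
    }
    /* Reject encodings that are longer than the value requires. */
    if (cp < min_cp[nb]) return 0;
    i += nb + 1;
  }
  return 1;
}
```

With this check, the single byte `2f` is accepted, while `c0 af` and `e0 80 af` are rejected before any filtering or conversion to UTF-16 takes place.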
Handling Invalid Inputs
UTF-8 decoders have no uniformly defined behavior upon encountering an invalid input. Below are several ways a UTF-8 decoder might behave in the event of an invalid byte sequence:
...
| Wiki Markup |
|---|
The following function from \[[Viega 03|AA. C References#Viega 03]\] detects invalid character sequences in a string but does not reject non-minimal forms. It returns {{1}} if the string consists only of legitimate sequences; otherwise, it returns {{0}}. |
| Code Block |
|---|
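int spc_utf8_isvalid(const unsigned char *input) {
  int nb;  /* number of continuation bytes expected after the lead byte */
  int n;
  const unsigned char *c;

  for (c = input; *c; c += (nb + 1)) {
    if (!(*c & 0x80)) nb = 0;                /* 0xxxxxxx: ASCII */
    else if ((*c & 0xc0) == 0x80) return 0;  /* stray continuation byte */
    else if ((*c & 0xe0) == 0xc0) nb = 1;
    else if ((*c & 0xf0) == 0xe0) nb = 2;
    else if ((*c & 0xf8) == 0xf0) nb = 3;
    else if ((*c & 0xfc) == 0xf8) nb = 4;
    else if ((*c & 0xfe) == 0xfc) nb = 5;
    else return 0;                           /* 0xfe and 0xff are never valid */
    /* Each of the nb bytes after the lead byte must have the form 10xxxxxx.
     * nb is left unchanged so that the loop increment skips the sequence. */
    for (n = 1; n <= nb; n++)
      if ((*(c + n) & 0xc0) != 0x80) return 0;
  }
  return 1;
}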
|
...
Failing to properly handle UTF-8 encoded data can result in a data integrity violation or denial-of-service attack.
Recommendation | Severity | Likelihood | Remediation Cost | Priority | Level |
|---|---|---|---|---|---|
MSC10-A | medium | unlikely | high | P2 | L3 |
...