
...
Generally, programs should validate UTF-8 data before performing other checks. The following table lists the well-formed UTF-8 byte sequences.
Bits of code point | First code point | Last code point | Bytes in sequence | Byte 1 | Byte 2 | Byte 3 | Byte 4 |
---|---|---|---|---|---|---|---|
7 | U+0000 | U+007F | 1 | 0xxxxxxx | | | |
11 | U+0080 | U+07FF | 2 | 110xxxxx | 10xxxxxx | | |
16 | U+0800 | U+FFFF | 3 | 1110xxxx | 10xxxxxx | 10xxxxxx | |
21 | U+10000 | U+1FFFFF | 4 | 11110xxx | 10xxxxxx | 10xxxxxx | 10xxxxxx |
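The byte patterns in the table can be checked mechanically. The following is a minimal sketch, not a definitive implementation: `utf8_is_valid` is a hypothetical helper name, not a standard library function. It assumes the stricter rules of RFC 3629, so in addition to matching the bit patterns it rejects overlong encodings, UTF-16 surrogate values (U+D800 through U+DFFF), and code points above U+10FFFF, even though the four-byte pattern itself can represent values up to U+1FFFFF.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical helper (an assumption for illustration, not a library API):
 * returns true if the len bytes at s are well-formed UTF-8 under the
 * RFC 3629 rules, false otherwise.
 */
bool utf8_is_valid(const unsigned char *s, size_t len) {
  size_t i = 0;

  while (i < len) {
    unsigned char b = s[i];
    size_t extra;      /* number of continuation bytes expected */
    unsigned long cp;  /* code point decoded so far */
    unsigned long min; /* smallest code point legal for this length */

    if (b < 0x80) {                  /* 0xxxxxxx: one-byte sequence */
      i++;
      continue;
    } else if ((b & 0xE0) == 0xC0) { /* 110xxxxx: two-byte sequence */
      extra = 1; cp = b & 0x1F; min = 0x80;
    } else if ((b & 0xF0) == 0xE0) { /* 1110xxxx: three-byte sequence */
      extra = 2; cp = b & 0x0F; min = 0x800;
    } else if ((b & 0xF8) == 0xF0) { /* 11110xxx: four-byte sequence */
      extra = 3; cp = b & 0x07; min = 0x10000;
    } else {
      return false; /* stray continuation byte or invalid lead byte */
    }

    /* i < len holds here, so len - i - 1 cannot wrap around. */
    if (len - i - 1 < extra) {
      return false; /* sequence is truncated */
    }

    for (size_t k = 1; k <= extra; k++) {
      unsigned char c = s[i + k];
      if ((c & 0xC0) != 0x80) {
        return false; /* continuation bytes must match 10xxxxxx */
      }
      cp = (cp << 6) | (c & 0x3F);
    }

    if (cp < min) {
      return false; /* overlong encoding */
    }
    if (cp >= 0xD800 && cp <= 0xDFFF) {
      return false; /* surrogate values are not valid in UTF-8 */
    }
    if (cp > 0x10FFFF) {
      return false; /* beyond the Unicode code space */
    }

    i += extra + 1;
  }
  return true;
}
```

A caller would typically run this check on untrusted input before any other parsing or filtering, for example `utf8_is_valid((const unsigned char *)buf, buf_len)`, and reject the input outright when it returns false rather than attempting a repair.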
Although UTF-8 originated from the Plan 9 developers [Pike 1993], Plan 9's own support covers only the low 16-bit range. In general, many "Unicode" systems support only the low 16-bit range, not the full 21-bit ISO 10646 code space [ISO/IEC 10646:2012].
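To make the 16-bit limitation concrete, the sketch below encodes a single code point into UTF-8. Any code point above U+FFFF needs the four-byte form, which a system limited to the low 16-bit range cannot represent. The name `utf8_encode` and its return convention are assumptions made for illustration, and the U+10FFFF upper bound follows RFC 3629 rather than the full 21-bit pattern.

```c
#include <stddef.h>

/*
 * Hypothetical helper (an assumption for illustration, not a library API):
 * writes the UTF-8 encoding of cp into out and returns the number of bytes
 * written, or 0 if cp is a surrogate or lies above U+10FFFF.
 */
size_t utf8_encode(unsigned long cp, unsigned char out[4]) {
  if (cp <= 0x7F) {                  /* 7 bits: 0xxxxxxx */
    out[0] = (unsigned char)cp;
    return 1;
  }
  if (cp <= 0x7FF) {                 /* 11 bits: 110xxxxx 10xxxxxx */
    out[0] = (unsigned char)(0xC0 | (cp >> 6));
    out[1] = (unsigned char)(0x80 | (cp & 0x3F));
    return 2;
  }
  if (cp <= 0xFFFF) {                /* 16 bits: 1110xxxx 10xxxxxx 10xxxxxx */
    if (cp >= 0xD800 && cp <= 0xDFFF) {
      return 0;                      /* surrogates are not encodable */
    }
    out[0] = (unsigned char)(0xE0 | (cp >> 12));
    out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[2] = (unsigned char)(0x80 | (cp & 0x3F));
    return 3;
  }
  if (cp <= 0x10FFFF) {              /* 21 bits: 11110xxx and three 10xxxxxx */
    out[0] = (unsigned char)(0xF0 | (cp >> 18));
    out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (unsigned char)(0x80 | (cp & 0x3F));
    return 4;
  }
  return 0;                          /* outside the Unicode code space */
}
```

For example, U+20AC fits in 16 bits and encodes to the three bytes E2 82 AC, whereas U+1F600 lies outside the 16-bit range and encodes to the four bytes F0 9F 98 80.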
...
Failing to properly handle UTF-8-encoded data can result in a data integrity violation or a denial-of-service attack.
Recommendation | Severity | Likelihood | Detectable | Repairable | Priority | Level |
---|---|---|---|---|---|---|
MSC10-C | Medium | Unlikely | No | No | P2 | L3 |
Automated Detection
Tool | Version | Checker | Description |
---|---|---|---|
LDRA tool suite | | 176 S, 376 S | Partially implemented |
Related Vulnerabilities
Search for vulnerabilities resulting from the violation of this rule on the CERT website.
Related Guidelines
SEI CERT C++ Coding Standard | VOID MSC10-CPP. Character encoding: UTF8-related issues |
MITRE CWE | CWE-176, Failure to handle Unicode encoding; CWE-116, Improper encoding or escaping of output |
Bibliography
[ISO/IEC 10646:2012] |
[Kuhn 2006] | UTF-8 and Unicode FAQ for Unix/Linux |
[Pike 1993] | "Hello World" |
[Unicode 2006] |
[Viega 2003] | Section 3.12, "Detecting Illegal UTF-8 Characters" |
[Wheeler 2003] | Secure Programmer: Call Components Safely |
[Yergeau 1998] | RFC 2279 |
...
...