...
Generally, programs should validate UTF-8 data before performing other checks. The following table lists the well-formed UTF-8 byte sequences.
| Bits of code point | First code point | Last code point | Bytes in sequence | Byte 1 | Byte 2 | Byte 3 | Byte 4 |
|---|---|---|---|---|---|---|---|
| 7 | U+0000 | U+007F | 1 | 0xxxxxxx | | | |
| 11 | U+0080 | U+07FF | 2 | 110xxxxx | 10xxxxxx | | |
| 16 | U+0800 | U+FFFF | 3 | 1110xxxx | 10xxxxxx | 10xxxxxx | |
| 21 | U+10000 | U+1FFFFF | 4 | 11110xxx | 10xxxxxx | 10xxxxxx | 10xxxxxx |
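As an illustration of how the bit patterns in the table can drive a validity check, the following is a minimal sketch rather than a definitive implementation. The function name `utf8_is_well_formed` is hypothetical and does not come from any standard library. The sketch classifies the lead byte by its high bits, requires every following byte to match `10xxxxxx`, and rejects any decoded value below the row's first code point (an overlong encoding). It also rejects surrogates (U+D800 through U+DFFF) and values above U+10FFFF; those two checks are an assumption taken from RFC 3629 and are stricter than the 21-bit range shown in the table.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns true if s[0..len-1] is well-formed UTF-8. */
bool utf8_is_well_formed(const unsigned char *s, size_t len) {
  size_t i = 0;

  while (i < len) {
    unsigned char lead = s[i];
    size_t n;      /* bytes in this sequence */
    uint32_t cp;   /* code point decoded so far */
    uint32_t min;  /* "First code point" for this sequence length */

    if (lead <= 0x7F) {                 /* 0xxxxxxx */
      n = 1; cp = lead; min = 0x0000;
    } else if ((lead & 0xE0) == 0xC0) { /* 110xxxxx */
      n = 2; cp = lead & 0x1F; min = 0x0080;
    } else if ((lead & 0xF0) == 0xE0) { /* 1110xxxx */
      n = 3; cp = lead & 0x0F; min = 0x0800;
    } else if ((lead & 0xF8) == 0xF0) { /* 11110xxx */
      n = 4; cp = lead & 0x07; min = 0x10000;
    } else {
      return false; /* stray continuation byte or invalid lead byte */
    }

    if (n > len - i) {
      return false; /* sequence is truncated */
    }

    for (size_t k = 1; k < n; k++) {
      unsigned char c = s[i + k];
      if ((c & 0xC0) != 0x80) {         /* must be 10xxxxxx */
        return false;
      }
      cp = (cp << 6) | (uint32_t)(c & 0x3F);
    }

    if (cp < min) {
      return false; /* overlong: a shorter sequence could encode cp */
    }
    /* Assumption (RFC 3629): no surrogates, nothing above U+10FFFF. */
    if ((cp >= 0xD800 && cp <= 0xDFFF) || cp > 0x10FFFF) {
      return false;
    }

    i += n;
  }
  return true;
}
```

A caller would typically run such a check on untrusted input before any other filtering. For example, `utf8_is_well_formed((const unsigned char *)"\xC0\xAF", 2)` returns `false` because the two bytes are an overlong encoding of `/`, which a filter that looks only for the single byte `0x2F` would miss.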
Although UTF-8 originated from the Plan 9 developers [Pike 1993], Plan 9's own support covers only the low 16-bit range. In general, many "Unicode" systems support only the low 16-bit range, not the full 21-bit ISO 10646 code space [ISO/IEC 10646:2012].
...
Failing to properly handle UTF-8-encoded data can result in a data integrity violation or a denial-of-service attack.
| Recommendation | Severity | Likelihood | Detectable | Repairable | Priority | Level |
|---|---|---|---|---|---|---|
| MSC10-C | Medium | Unlikely | No | No | P2 | L3 |
Automated Detection
| Tool | Version | Checker | Description |
|---|---|---|---|
| LDRA tool suite | | 176 S, 376 S | Partially implemented |
Related Vulnerabilities
Search for vulnerabilities resulting from the violation of this rule on the CERT website.
Related Guidelines
| SEI CERT C++ Coding Standard | VOID MSC10-CPP. Character encoding: UTF8-related issues |
| MITRE CWE | CWE-176, Failure to handle Unicode encoding; CWE-116, Improper encoding or escaping of output |
Bibliography
| [ISO/IEC 10646:2012] |
| [Kuhn 2006] | UTF-8 and Unicode FAQ for Unix/Linux |
| [Pike 1993] | "Hello World" |
| [Unicode 2006] |
| [Viega 2003] | Section 3.12, "Detecting Illegal UTF-8 Characters" |
| [Wheeler 2003] | Secure Programmer: Call Components Safely |
| [Yergeau 1998] | RFC 2279 |
...
...