...
"Corrigendum #1: UTF-8 Shortest Form" to the Unicode Standard [Unicode 2006] describes modifications made to version 3.0 of the Unicode Standard to forbid the interpretation of nonshortest forms.
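The shortest-form requirement can be illustrated with a minimal sketch in C. The helper names `utf8_min_len` and `utf8_is_shortest_form` are hypothetical and do not come from any standard library. The sketch assumes a decoder has already extracted a code point and knows how many bytes the sequence occupied.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: the smallest number of UTF-8 bytes that can
 * encode cp, or 0 if cp lies beyond U+10FFFF. */
size_t utf8_min_len(uint32_t cp) {
    if (cp <= 0x7F)     return 1;
    if (cp <= 0x7FF)    return 2;
    if (cp <= 0xFFFF)   return 3;
    if (cp <= 0x10FFFF) return 4;
    return 0;
}

/* Hypothetical helper: returns 1 if a code point that was decoded from
 * used_len bytes was encoded in its shortest form, and 0 otherwise. */
int utf8_is_shortest_form(uint32_t cp, size_t used_len) {
    size_t min_len = utf8_min_len(cp);
    return min_len != 0 && used_len == min_len;
}
```

For example, the two-byte sequence 0xC0 0xAF carries the payload bits of U+002F ("/"), which fits in one byte, so `utf8_is_shortest_form(0x2F, 2)` returns 0 and the sequence should be rejected rather than interpreted.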
Handling Invalid Inputs
...
- Substitute the replacement character U+FFFD, or a wildcard character such as "?" when U+FFFD is not available.
- Ignore the bytes (for example, delete the invalid byte before the validation process; see "Unicode Technical Report #36, 3.5 Deletion of Code Points" for more information).
- Interpret the bytes according to a different character encoding (often the ISO-8859-1 character map; other encodings, such as Shift_JIS, are known to trigger self-XSS and so are potentially dangerous).
- Fail to notice the error and decode the bytes as if they were some similar sequence of valid UTF-8.
- Stop decoding and report an error.
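Two of the behaviors listed above, substituting U+FFFD and stopping with an error, can be supported by the same decoding routine. The following is a minimal sketch, not a definitive implementation. The function name `utf8_decode_one`, the macro `UTF8_REPLACEMENT`, and the choice to consume exactly one byte per malformed sequence are all assumptions made for illustration; other resynchronization policies are possible.

```c
#include <stddef.h>
#include <stdint.h>

#define UTF8_REPLACEMENT 0xFFFDu  /* U+FFFD REPLACEMENT CHARACTER */

/* Hypothetical decoder: reads one code point from s[0..len).
 * On success, stores the code point in *cp, sets *ok to 1, and returns
 * the number of bytes consumed (1 to 4).
 * On malformed input, stores U+FFFD in *cp, sets *ok to 0, and returns 1
 * so that the caller can either substitute and continue or stop and
 * report an error. Returns 0 only when len is 0. */
size_t utf8_decode_one(const unsigned char *s, size_t len,
                       uint32_t *cp, int *ok) {
    /* Smallest code point that legitimately needs 2, 3, or 4 bytes. */
    static const uint32_t min_cp[5] = {0, 0, 0x80, 0x800, 0x10000};
    size_t need, i;
    uint32_t value;

    *cp = UTF8_REPLACEMENT;
    *ok = 0;
    if (len == 0) return 0;

    if (s[0] < 0x80) {                      /* ASCII */
        *cp = s[0];
        *ok = 1;
        return 1;
    } else if ((s[0] & 0xE0) == 0xC0) {     /* 110xxxxx */
        need = 2; value = s[0] & 0x1F;
    } else if ((s[0] & 0xF0) == 0xE0) {     /* 1110xxxx */
        need = 3; value = s[0] & 0x0F;
    } else if ((s[0] & 0xF8) == 0xF0) {     /* 11110xxx */
        need = 4; value = s[0] & 0x07;
    } else {
        return 1;  /* stray continuation byte or invalid lead byte */
    }

    if (len < need) return 1;               /* truncated sequence */
    for (i = 1; i < need; i++) {
        if ((s[i] & 0xC0) != 0x80) return 1;  /* not 10xxxxxx */
        value = (value << 6) | (uint32_t)(s[i] & 0x3F);
    }

    /* Reject nonshortest forms, surrogates, and values above U+10FFFF. */
    if (value < min_cp[need] || value > 0x10FFFF ||
        (value >= 0xD800 && value <= 0xDFFF)) {
        return 1;
    }

    *cp = value;
    *ok = 1;
    return need;
}
```

A caller that prefers to stop decoding can return an error as soon as `*ok` is 0. A caller that prefers substitution can emit the U+FFFD already stored in `*cp` and advance by the returned byte count.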
...
Search for vulnerabilities resulting from the violation of this rule on the CERT website.
Related Guidelines
...
...
...
...
| Failure to handle Unicode encoding |
...
...
| Improper encoding or escaping of output |
...
...
Bibliography
...
| 2012] |
...
| [Kuhn 2006] | UTF-8 and Unicode FAQ for Unix/Linux |
| [Pike 1993] | "Hello World" |
| [Unicode 2006] | |
| [Viega 2003] | Section 3.12, "Detecting |
...
| Illegal UTF-8 |
...
...