...
UCS Code (HEX) | Binary UTF-8 Format | Legal UTF-8 Values (HEX) | |
|---|---|---|---|
00-7F | 0xxxxxxx | 00-7F | |
80-7FF | 110xxxxx 10xxxxxx | C2-DF 80-BF | |
800-FFF | 1110xxxx 10xxxxxx 10xxxxxx | E0 A0*-BF 80-BF | |
1000-FFFF | 1110xxxx 10xxxxxx 10xxxxxx | E1-EF 80-BF 80-BF | |
10000-3FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx | F0 90*-BF 80-BF 80-BF | |
40000-FFFFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx | F1-F3 80-BF 80-BF 80-BF | |
40000-FFFFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx | F1-F3 80-BF 80-BF 80-BF | |
100000-10FFFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx | F4 80-8F* 80-BF 80-BF |
Security Related Issues
The UTF-8 encoding scheme is fairly simple, but there are a few clarifications that are important for security reasons. One of the most important ones is the requirement that only the "shortest" form of UTF-8 should be permitted. Naive decoder may accept encoding that are longer than necessary, this means that potentially dangerous input could be represented multiple ways, and this will defeat the security checking for dangerous inputs.
...