UTF-8 is a variable-width encoding for gUnicodeUnicode. UTF-8 uses one to four bytes per character, depending on the Unicode symbol. UTF-8 has the following properties.
- The classical US-ASCII characters (0 to 0x7f) encode as themselves, so files and strings that are encoded with ASCII values have the same encoding under both ASCII and UTF-8.
- All UCS characters beyond (0x7f) are encoded as a multibyte sequence consisting only of bytes in the range of 0x80 to 0xfd. This means that no ASCII byte (including a null byte) can appear as part of another character. This property supports the use of string handling functions.
- It's easy to convert between UTF-8 and UCS-2 and UCS-4 fixed-width representations of characters.
- The lexicographic sorting order of UCS-4 strings is preserved.
- All possible 2^31 UCS codes can be encoded using UTF-8
...
UCS Code (HEX) | Binary UTF-8 Format | Valid UTF-8 Values (HEX) |
|---|---|---|
00-7F | 0xxxxxxx | 00-7F |
80-7FF | 110xxxxx 10xxxxxx | C2-DF 80-BF |
800-FFF | 1110xxxx 10xxxxxx 10xxxxxx | E0 A0*-BF 80-BF |
1000-FFFF | 1110xxxx 10xxxxxx 10xxxxxx | E1-EF 80-BF 80-BF |
10000-3FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx | F0 90*-BF 80-BF 80-BF |
40000-FFFFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx | F1-F3 80-BF 80-BF 80-BF |
40000-FFFFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx | F1-F3 80-BF 80-BF 80-BF |
100000-10FFFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx | F4 80-8F* 80-BF 80-BF |
| Wiki Markup |
|---|
Although UTF-8 originated from the Plan 9 developers \[[Pike 93]\], Plan 9's own support only covers the low 16-bit range. In general, many "Unicode" systems only support the low 16-bit range, not the full 31-bit ISO 10646 code space \[ISO/IEC 10646:2003(E)\]. |
Security Related Issues
| Wiki Markup |
|---|
According to \[[Yergeau 98|AA. C References#Yergeau 98]\]: |
Implementors of UTF-8 need to consider the security aspects of how they handle invalid UTF-8 sequences. It is conceivable that in some circumstances an attacker would be able to exploit an incautious UTF-8 parser by sending it an octet sequence that is not permitted by the UTF-8 syntax.
A particularly subtle form of this attack could be carried out against a parser which performs security-critical validity checks against the UTF-8 encoded form of its input, but interprets certain invalid octet sequences as characters. For example, a parser might prohibit the NUL null character when encoded as the single-octet sequence
00, but allow the invalid two-octet sequenceC0 80and interpret it as a NUL null character. Another example might be a parser which prohibits the octet sequence2F 2E 2E 2F("/../"), yet permits the invalid octet sequence2F C0 AE 2E 2F.
Below are more specific recommendations.
...
Search for vulnerabilities resulting from the violation of this rule on the CERT website.
Reference
| Wiki Markup |
|---|
\[ISO/IEC 10646:2003(E)\] Information technology -- Universal Multiple-Octet Coded Character Set (UCS) \[[Kuhn 06|AA. C References#Kuhn 06]\] UTF-8 and Unicode FAQ for Unix/Linux \[[Pike 93]\] Rob Pike, Ken Thompson. _Hello World_. USENIX Winter 1993 Technical Conference, San Diego, January 25-29, 1993, Proceedings, pp. 43-50. \[[Viega 03|AA. C References#Viega 03]\] Section 3.12. "Detecting Illegal UTF-8 Characters" \[[Wheeler 06|AA. C References#Wheeler 06]\] Secure Programming for Linux and Unix HOWTO \[[Yergeau 98|AA. C References#Yergeau 98]\] RFC 2279 - UTF-8, a transformation format of ISO 10646 |
...