...
- The classical US-ASCII characters (0 to 0x7f) encode as themselves, so files and strings that are encoded with ASCII values have the same encoding under both ASCII and UTF-8.
- All UCS characters beyond 0x7f are encoded as a multibyte sequence consisting only of bytes in the range of 0x80 to 0xfd. This means that no ASCII byte (including a null byte) can appear as part of another character. This property allows byte-oriented string-handling functions, including those that treat a null byte as a terminator, to operate on UTF-8 data.
- It is easy to convert between UTF-8 and UCS-2 and UCS-4 fixed-width representations of characters.
- The lexicographic sorting order of UCS-4 strings is preserved.
- All possible 2^31 UCS codes can be encoded using UTF-8.
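Several of the properties above can be checked directly. The following is a minimal sketch in Python, relying only on the built-in `str.encode` method; the helper names (`ascii_is_unchanged`, `multibyte_has_no_ascii_bytes`, `sort_order_is_preserved`) are hypothetical and exist only for illustration.

```python
def ascii_is_unchanged(text: str) -> bool:
    """True when pure-ASCII text has identical bytes in ASCII and UTF-8."""
    return text.encode("ascii") == text.encode("utf-8")


def multibyte_has_no_ascii_bytes(ch: str) -> bool:
    """True when every byte of a multibyte sequence is 0x80 or above."""
    encoded = ch.encode("utf-8")
    return len(encoded) == 1 or all(b >= 0x80 for b in encoded)


def sort_order_is_preserved(strings: list) -> bool:
    """True when sorting by code point and sorting by UTF-8 bytes agree."""
    by_code_point = sorted(strings)  # Python compares str by code point
    by_utf8_bytes = sorted(strings, key=lambda s: s.encode("utf-8"))
    return by_code_point == by_utf8_bytes
```

For instance, `multibyte_has_no_ascii_bytes("\u20ac")` inspects the three bytes E2 82 AC, none of which falls in the ASCII range, which is why a byte-wise search for `/` or a null byte cannot match inside a multibyte character.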
Generally, programs should validate UTF-8 data before performing other checks. The following table lists all well-formed UTF-8 byte sequences.
| Code Points | 1st Byte | 2nd Byte | 3rd Byte | 4th Byte |
|---|---|---|---|---|
| U+0000..U+007F | 00..7F | | | |
| U+0080..U+07FF | C2..DF | 80..BF | | |
| U+0800..U+0FFF | E0 | A0..BF | 80..BF | |
| U+1000..U+CFFF | E1..EC | 80..BF | 80..BF | |
| U+D000..U+D7FF | ED | 80..9F | 80..BF | |
| U+E000..U+FFFF | EE..EF | 80..BF | 80..BF | |
| U+10000..U+3FFFF | F0 | 90..BF | 80..BF | 80..BF |
| U+40000..U+FFFFF | F1..F3 | 80..BF | 80..BF | 80..BF |
| U+100000..U+10FFFF | F4 | 80..8F | 80..BF | 80..BF |
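
The table above can be turned into a validator almost row for row. The following is a minimal sketch in Python, not a hardened implementation; the names `_MULTIBYTE_ROWS` and `is_well_formed_utf8` are hypothetical. Each tuple mirrors one multibyte row of the table: the lead-byte range, the restricted second-byte range, and the number of bytes that follow the lead byte.

```python
# Each row: (lead low, lead high, second-byte low, second-byte high,
#            number of bytes that follow the lead byte)
_MULTIBYTE_ROWS = [
    (0xC2, 0xDF, 0x80, 0xBF, 1),
    (0xE0, 0xE0, 0xA0, 0xBF, 2),
    (0xE1, 0xEC, 0x80, 0xBF, 2),
    (0xED, 0xED, 0x80, 0x9F, 2),
    (0xEE, 0xEF, 0x80, 0xBF, 2),
    (0xF0, 0xF0, 0x90, 0xBF, 3),
    (0xF1, 0xF3, 0x80, 0xBF, 3),
    (0xF4, 0xF4, 0x80, 0x8F, 3),
]


def is_well_formed_utf8(data: bytes) -> bool:
    """Return True only if data consists solely of sequences from the table."""
    i, n = 0, len(data)
    while i < n:
        lead = data[i]
        if lead <= 0x7F:  # single-byte row (ASCII)
            i += 1
            continue
        row = next((r for r in _MULTIBYTE_ROWS if r[0] <= lead <= r[1]), None)
        if row is None:  # 80..C1 and F5..FF never begin a sequence
            return False
        _, _, second_lo, second_hi, count = row
        if i + count >= n:  # the sequence is truncated
            return False
        if not second_lo <= data[i + 1] <= second_hi:
            return False
        # Any third and fourth bytes must be plain continuation bytes.
        if any(not 0x80 <= b <= 0xBF for b in data[i + 2 : i + count + 1]):
            return False
        i += count + 1
    return True
```

Under this sketch, the overlong form `b"\xc0\x80"`, the encoded surrogate `b"\xed\xa0\x80"`, and the out-of-range sequence `b"\xf4\x90\x80\x80"` are all rejected, because each fails a lead-byte or second-byte range taken from the table.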

| UCS Code (HEX) | Binary UTF-8 Format | Valid UTF-8 Values (HEX) |
|---|---|---|
| 00-7F | 0xxxxxxx | 00-7F |
| 80-7FF | 110xxxxx 10xxxxxx | C2-DF 80-BF |
| 800-FFF | 1110xxxx 10xxxxxx 10xxxxxx | E0 A0*-BF 80-BF |
| 1000-FFFF | 1110xxxx 10xxxxxx 10xxxxxx | E1-EF 80-BF 80-BF |
| 10000-3FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx | F0 90*-BF 80-BF 80-BF |
| 40000-FFFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx | F1-F3 80-BF 80-BF 80-BF |
| 100000-10FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx | F4 80-8F* 80-BF 80-BF |
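
The binary formats listed above describe how the bits of a code point are distributed across the bytes of a sequence. As an illustration, the following Python sketch encodes a single code point by hand; the function name `encode_code_point` is hypothetical, and the sketch assumes the U+10FFFF upper limit and the exclusion of surrogates (U+D800..U+DFFF) reflected in the well-formed sequence table.

```python
def encode_code_point(cp: int) -> bytes:
    """Encode one code point by filling the x bits of its binary format."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("code point cannot be encoded: %r" % (cp,))
    if cp <= 0x7F:  # 0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:  # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:  # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])
```

Because the shortest format that fits is always chosen, this sketch never produces an overlong sequence; for example, `encode_code_point(0x20AC)` yields the bytes E2 82 AC, matching Python's built-in `"\u20ac".encode("utf-8")`.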
Although UTF-8 originated from the Plan 9 developers [Pike 1993], Plan 9's own support covers only the low 16-bit range. In general, many "Unicode" systems support only the low 16-bit range, not the full 31-bit ISO 10646 code space [ISO/IEC 10646:2003(E)].
...
UTF-8 decoders have no uniformly defined behavior upon encountering invalid input. Implementing any of these behaviors requires careful consideration of the security implications. Following are several ways a UTF-8 decoder might behave in the event of an invalid byte sequence:
- Substitute the replacement character U+FFFD, or a wildcard character such as "?" when U+FFFD is not available.
- Ignore the bytes (for example, delete the invalid bytes before the validation process; see Unicode Technical Report #36, Section 3.5, "Deletion of Code Points," for more information).
- Interpret the bytes according to a different character encoding (often the ISO-8859-1 character map; other encodings, such as Shift_JIS, are known to trigger self-XSS and are therefore potentially dangerous).
- Fail to notice the error and decode as if the bytes were some similar bit of UTF-8.
- Stop decoding and report an error.
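
Most of the behaviors listed above can be observed with the error handlers of Python's built-in `bytes.decode`. The sketch below is illustrative only: the wrapper `decode_with_policy` and its policy names are hypothetical, while the `errors="replace"`, `errors="ignore"`, and `errors="strict"` handlers are part of the standard codec machinery. The "fail to notice" behavior is deliberately not shown.

```python
def decode_with_policy(data: bytes, policy: str) -> str:
    """Decode data as UTF-8, handling invalid bytes according to policy."""
    if policy == "replace":
        # Substitute U+FFFD for each invalid sequence.
        return data.decode("utf-8", errors="replace")
    if policy == "ignore":
        # Silently drop invalid bytes (see the deletion caveat in UTR #36).
        return data.decode("utf-8", errors="ignore")
    if policy == "fallback":
        # Reinterpret the whole input as ISO-8859-1 if it is not valid UTF-8.
        try:
            return data.decode("utf-8")
        except UnicodeDecodeError:
            return data.decode("iso-8859-1")
    if policy == "strict":
        # Stop decoding and report an error (raises UnicodeDecodeError).
        return data.decode("utf-8", errors="strict")
    raise ValueError("unknown policy: %r" % (policy,))
```

Given the invalid input `b"ab\xffcd"`, the "replace" policy returns `"ab\ufffdcd"`, "ignore" returns `"abcd"`, "fallback" returns `"ab\xffcd"` (with 0xFF read as the ISO-8859-1 character ÿ), and "strict" raises `UnicodeDecodeError`. Note that "ignore" can join the text on either side of the deleted byte, which is the risk the deletion guidance warns about.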
...
Sources
[ISO/IEC 10646:2012]
[Kuhn 2006]
[Pike 1993]
[Unicode 2006]
[Viega 2003] Section 3.12, "Detecting illegal UTF-8 characters"
[Wheeler 2003]
[Yergeau 1998]
...