
...

The classical US-ASCII characters (0 to 0x7f) encode as themselves, so files and strings that are encoded with ASCII values have the same encoding under both ASCII and UTF-8.
All UCS characters beyond 0x7f are encoded as a multibyte sequence consisting only of bytes in the range of 0x80 to 0xfd. This means that no ASCII byte (including a null byte) can appear as part of another character. This property supports the use of byte-oriented string-handling functions.
It is easy to convert between UTF-8 and UCS-2 and UCS-4 fixed-width representations of characters.
The lexicographic sorting order of UCS-4 strings is preserved.
All possible 2^31 UCS codes can be encoded using UTF-8.
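
Three of the properties above can be checked empirically. The following is a minimal sketch, not part of the original guideline; the helper names `is_ascii_transparent`, `multibyte_bytes_are_high`, and `order_preserved` are hypothetical and chosen only for illustration. It relies on Python's built-in UTF-8 codec, which covers code points up to U+10FFFF only.

```python
def is_ascii_transparent(text: str) -> bool:
    """True if pure-ASCII text has the same bytes under ASCII and UTF-8."""
    return text.encode("ascii") == text.encode("utf-8")


def multibyte_bytes_are_high(ch: str) -> bool:
    """True if every byte of the character's UTF-8 encoding is >= 0x80.

    This holds for any character beyond 0x7f, so no ASCII byte
    (including the null byte) can appear inside a multibyte sequence.
    """
    return all(b >= 0x80 for b in ch.encode("utf-8"))


def order_preserved(strings: list) -> bool:
    """True if sorting by code point matches sorting by UTF-8 bytes.

    Python compares str values code point by code point, so sorted(strings)
    stands in for the UCS-4 lexicographic order here.
    """
    by_code_point = sorted(strings)
    by_utf8_bytes = sorted(strings, key=lambda s: s.encode("utf-8"))
    return by_code_point == by_utf8_bytes
```

For instance, `is_ascii_transparent("hello\x00world")` holds even with an embedded null byte, and `multibyte_bytes_are_high("€")` holds because the euro sign encodes as the bytes E2 82 AC.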

Generally, programs should validate UTF-8 data before performing other checks. The following table lists all well-formed UTF-8 byte sequences.

Code Points	1st Byte	2nd Byte	3rd Byte	4th Byte
U+0000..U+007F	00..7F
U+0080..U+07FF	C2..DF	80..BF
U+0800..U+0FFF	E0	A0..BF	80..BF
U+1000..U+CFFF	E1..EC	80..BF	80..BF
U+D000..U+D7FF	ED	80..9F	80..BF
U+E000..U+FFFF	EE..EF	80..BF	80..BF
U+10000..U+3FFFF	F0	90..BF	80..BF	80..BF
U+40000..U+FFFFF	F1..F3	80..BF	80..BF	80..BF
U+100000..U+10FFFF	F4	80..8F	80..BF	80..BF

UCS Code (HEX)	Binary UTF-8 Format	Valid UTF-8 Values (HEX)
00-7F	0xxxxxxx	00-7F
80-7FF	110xxxxx 10xxxxxx	C2-DF 80-BF
800-FFF	1110xxxx 10xxxxxx 10xxxxxx	E0 A0*-BF 80-BF
1000-CFFF	1110xxxx 10xxxxxx 10xxxxxx	E1-EC 80-BF 80-BF
D000-D7FF	1110xxxx 10xxxxxx 10xxxxxx	ED 80-9F* 80-BF
E000-FFFF	1110xxxx 10xxxxxx 10xxxxxx	EE-EF 80-BF 80-BF
10000-3FFFF	11110xxx 10xxxxxx 10xxxxxx 10xxxxxx	F0 90*-BF 80-BF 80-BF
40000-FFFFF	11110xxx 10xxxxxx 10xxxxxx 10xxxxxx	F1-F3 80-BF 80-BF 80-BF
100000-10FFFF	11110xxx 10xxxxxx 10xxxxxx 10xxxxxx	F4 80-8F* 80-BF 80-BF
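
The byte ranges in the table translate directly into a table-driven validator. The following is a minimal sketch rather than a definitive implementation; the names `WELL_FORMED_ROWS` and `is_well_formed_utf8` are hypothetical. Each row pairs a lead-byte range with the allowed range for every trailing byte, using the well-formed ranges F4 80..8F 80..BF 80..BF for the last row.

```python
# One entry per table row: (lead-byte range, [range of each trailing byte]).
WELL_FORMED_ROWS = [
    ((0x00, 0x7F), []),
    ((0xC2, 0xDF), [(0x80, 0xBF)]),
    ((0xE0, 0xE0), [(0xA0, 0xBF), (0x80, 0xBF)]),
    ((0xE1, 0xEC), [(0x80, 0xBF), (0x80, 0xBF)]),
    ((0xED, 0xED), [(0x80, 0x9F), (0x80, 0xBF)]),
    ((0xEE, 0xEF), [(0x80, 0xBF), (0x80, 0xBF)]),
    ((0xF0, 0xF0), [(0x90, 0xBF), (0x80, 0xBF), (0x80, 0xBF)]),
    ((0xF1, 0xF3), [(0x80, 0xBF), (0x80, 0xBF), (0x80, 0xBF)]),
    ((0xF4, 0xF4), [(0x80, 0x8F), (0x80, 0xBF), (0x80, 0xBF)]),
]


def is_well_formed_utf8(data: bytes) -> bool:
    """Return True if data consists only of sequences listed in the table."""
    i = 0
    while i < len(data):
        lead = data[i]
        trailing = None
        for (first, last), ranges in WELL_FORMED_ROWS:
            if first <= lead <= last:
                trailing = ranges
                break
        if trailing is None:
            return False  # lead byte is 80..C1 or F5..FF: never valid
        if i + len(trailing) >= len(data) and trailing:
            return False  # sequence is truncated at the end of the input
        for offset, (first, last) in enumerate(trailing, start=1):
            if not first <= data[i + offset] <= last:
                return False  # trailing byte falls outside the allowed range
        i += 1 + len(trailing)
    return True
```

Under this sketch, the overlong form C0 80, the encoded surrogate ED A0 80, and the out-of-range sequence F4 90 80 80 are all rejected, because none of them matches a row of the table.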

Although UTF-8 originated from the Plan 9 developers [Pike 1993], Plan 9's own support covers only the low 16-bit range. In general, many "Unicode" systems support only the low 16-bit range, not the full 31-bit ISO 10646 code space [ISO/IEC 10646:2003(E)].

...

UTF-8 decoders have no uniformly defined behavior upon encountering invalid input. A UTF-8 decoder might behave in any of the following ways when it encounters an invalid byte sequence; implementing any of these behaviors requires careful security consideration:

Substitute the replacement character U+FFFD, or a wildcard character such as "?" when U+FFFD is not available.
Ignore the bytes (for example, delete the invalid bytes before the validation process; see Unicode Technical Report #36, Section 3.5, "Deletion of Code Points," for more information).
Interpret the bytes according to a different character encoding (often the ISO-8859-1 character map; other encodings, such as Shift_JIS, are known to trigger self-XSS and are therefore potentially dangerous).
Fail to notice the error and decode the bytes as if they were a similar, valid UTF-8 sequence.
Stop decoding and report an error.
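
Several of these behaviors can be observed with the error handlers of Python's built-in `bytes.decode`. The following is a minimal sketch for illustration only; the helper name `decode_variants` and the dictionary keys are hypothetical, and Python's codec is just one decoder among many.

```python
def decode_variants(data: bytes) -> dict:
    """Decode the same bytes under four different invalid-input policies."""
    results = {}
    # Substitute U+FFFD for each invalid sequence.
    results["replace"] = data.decode("utf-8", errors="replace")
    # Silently drop the invalid bytes.
    results["ignore"] = data.decode("utf-8", errors="ignore")
    # Reinterpret every byte as ISO-8859-1, which never fails.
    results["iso-8859-1"] = data.decode("iso-8859-1")
    # Stop decoding and report an error (recorded here as None).
    try:
        results["strict"] = data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        results["strict"] = None
    return results
```

Called on `b"<scr\xffipt>"`, the "ignore" policy yields the string `<script>`, which shows why deleting invalid bytes before validation is risky: a filter that inspected the raw input would not have seen that tag.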

...

Sources

[ISO/IEC 10646:2003]
[ISO/IEC 10646:2012]
[Kuhn 2006]
[Pike 1993]
[Unicode 2006]
[Viega 2003] Section 3.12, "Detecting illegal UTF-8 characters"
[Wheeler 2003]
[Yergeau 1998]

...
