...

  • The classical US-ASCII characters (0 to 0x7f) encode as themselves, so files and strings that are encoded with ASCII values have the same encoding under both ASCII and UTF-8.
  • All UCS characters beyond 0x7f are encoded as a multibyte sequence consisting only of bytes in the range 0x80 to 0xfd. This means that no ASCII byte (including a null byte) can appear as part of another character. This property supports the use of string-handling functions.
  • It is easy to convert between UTF-8 and UCS-2 and UCS-4 fixed-width representations of characters.
  • The lexicographic sorting order of UCS-4 strings is preserved.
  • All possible 2^31 UCS codes can be encoded using UTF-8.
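
Three of the properties above can be checked directly. The following is a minimal Python sketch that relies on Python's built-in codecs; the sample strings and variable names are illustrative assumptions, not part of the guideline.

```python
# ASCII text has identical bytes under ASCII and UTF-8.
text = "plain ASCII"
ascii_same = text.encode("ascii") == text.encode("utf-8")

# Every byte of a multibyte sequence is >= 0x80, so no ASCII byte
# (including the null byte) can appear inside one.
samples = ["\u00e9", "\u20ac", "\U0001F600"]  # 2-, 3-, and 4-byte characters
no_ascii_inside = all(
    len(ch.encode("utf-8")) > 1 and all(b >= 0x80 for b in ch.encode("utf-8"))
    for ch in samples
)

# Sorting strings by code point and sorting them by their UTF-8 bytes
# produce the same order.
words = ["z", "\u00e9", "\u20ac", "\U0001F600", "a", "\uffff", "\U00010000"]
by_code_point = sorted(words)  # Python compares str values by code point
by_bytes = sorted(words, key=lambda w: w.encode("utf-8"))
order_preserved = by_code_point == by_bytes

print(ascii_same, no_ascii_inside, order_preserved)
```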

Generally, programs should validate UTF-8 data before performing other checks. The following table lists all well-formed UTF-8 byte sequences.

Code Points           1st Byte   2nd Byte   3rd Byte   4th Byte
U+0000..U+007F        00..7F
U+0080..U+07FF        C2..DF     80..BF
U+0800..U+0FFF        E0         A0..BF     80..BF
U+1000..U+CFFF        E1..EC     80..BF     80..BF
U+D000..U+D7FF        ED         80..9F     80..BF
U+E000..U+FFFF        EE..EF     80..BF     80..BF
U+10000..U+3FFFF      F0         90..BF     80..BF     80..BF
U+40000..U+FFFFF      F1..F3     80..BF     80..BF     80..BF
U+100000..U+10FFFF    F4         80..8F     80..BF     80..BF
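
The table translates naturally into a table-driven check. The following Python sketch is one possible transcription, not a definitive implementation: the names `WELL_FORMED` and `is_well_formed_utf8` are hypothetical, and the fourth byte of the last row is assumed to be 80..BF, as in the other four-byte rows.

```python
# Each row: (first-byte range, [allowed range for each following byte]),
# transcribed from the table of well-formed UTF-8 byte sequences.
CONT = (0x80, 0xBF)
WELL_FORMED = [
    ((0x00, 0x7F), []),
    ((0xC2, 0xDF), [CONT]),
    ((0xE0, 0xE0), [(0xA0, 0xBF), CONT]),
    ((0xE1, 0xEC), [CONT, CONT]),
    ((0xED, 0xED), [(0x80, 0x9F), CONT]),
    ((0xEE, 0xEF), [CONT, CONT]),
    ((0xF0, 0xF0), [(0x90, 0xBF), CONT, CONT]),
    ((0xF1, 0xF3), [CONT, CONT, CONT]),
    ((0xF4, 0xF4), [(0x80, 0x8F), CONT, CONT]),  # 4th byte assumed 80..BF
]


def is_well_formed_utf8(data: bytes) -> bool:
    """Return True if every sequence in data matches a row of the table."""
    i, n = 0, len(data)
    while i < n:
        first = data[i]
        for (lo, hi), tails in WELL_FORMED:
            if lo <= first <= hi:
                break
        else:
            return False  # first byte is in no row (80..C1 or F5..FF)
        i += 1
        for t_lo, t_hi in tails:
            # Reject truncated sequences and out-of-range following bytes.
            if i >= n or not (t_lo <= data[i] <= t_hi):
                return False
            i += 1
    return True


print(is_well_formed_utf8("caf\u00e9".encode("utf-8")))
print(is_well_formed_utf8(b"\xc0\x80"))
```

The first call passes ordinary encoded text; the second passes the overlong two-byte form of the null byte, which no row of the table admits because C0 is not a valid first byte.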

UCS Code (HEX)   Binary UTF-8 Format                    Valid UTF-8 Values (HEX)
00-7F            0xxxxxxx                               00-7F
80-7FF           110xxxxx 10xxxxxx                      C2-DF 80-BF
800-FFF          1110xxxx 10xxxxxx 10xxxxxx             E0 A0*-BF 80-BF
1000-FFFF        1110xxxx 10xxxxxx 10xxxxxx             E1-EF 80-BF 80-BF
10000-3FFFF      11110xxx 10xxxxxx 10xxxxxx 10xxxxxx    F0 90*-BF 80-BF 80-BF
40000-FFFFF      11110xxx 10xxxxxx 10xxxxxx 10xxxxxx    F1-F3 80-BF 80-BF 80-BF
100000-10FFFF    11110xxx 10xxxxxx 10xxxxxx 10xxxxxx    F4 80-8F* 80-BF 80-BF
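
The binary formats show how the bits of a code are distributed across the bytes: the `x` positions are filled from the most significant bit down. A minimal Python sketch of that packing follows. The function name `encode_utf8` is hypothetical, and rejecting the surrogate range D800-DFFF and values above 10FFFF is an assumption made here to match the well-formed byte ranges.

```python
def encode_utf8(cp: int) -> bytes:
    """Pack one code point into UTF-8 using the binary formats above."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not an encodable code point: {cp:#x}")
    if cp <= 0x7F:
        return bytes([cp])                       # 0xxxxxxx
    if cp <= 0x7FF:
        return bytes([0xC0 | (cp >> 6),          # 110xxxxx
                      0x80 | (cp & 0x3F)])       # 10xxxxxx
    if cp <= 0xFFFF:
        return bytes([0xE0 | (cp >> 12),         # 1110xxxx
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),             # 11110xxx
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


# U+20AC falls in the 1000-FFFF row, so it takes three bytes.
print(encode_utf8(0x20AC).hex())  # prints "e282ac"
```

Because each range is encoded with the shortest format that can hold it, the sketch never produces an overlong sequence such as C0 80.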

Although UTF-8 originated with the Plan 9 developers [Pike 1993], Plan 9's own support covers only the low 16-bit range. In general, many "Unicode" systems support only the low 16-bit range, not the full 31-bit ISO 10646 code space [ISO/IEC 10646:2003].

...

UTF-8 decoders have no uniformly defined behavior upon encountering invalid input. Following are several ways a UTF-8 decoder might behave in the event of an invalid byte sequence; implementing any of them requires careful security consideration:

  1. Substitute the replacement character U+FFFD, or a wildcard character such as "?" when U+FFFD is not available.
  2. Ignore the bytes (for example, delete the invalid bytes before the validation process; see Unicode Technical Report #36, Section 3.5, "Deletion of Code Points," for more information).
  3. Interpret the bytes according to a different character encoding (often the ISO-8859-1 character map; other encodings, such as Shift_JIS, are known to trigger self-XSS and are therefore potentially dangerous).
  4. Fail to notice the error and decode the bytes as if they were a similar piece of valid UTF-8.
  5. Stop decoding and report an error.
  5. Stop decoding and report an error.
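
Several of these behaviors can be observed with the error handlers of Python's built-in `bytes.decode`. This is an illustrative sketch only; the input string is a made-up example, and behavior 4 is not shown because Python's decoder does not do it.

```python
bad = b"<scr\xffipt>"  # 0xFF never appears in well-formed UTF-8

# Behavior 1: substitute U+FFFD for the invalid byte.
replaced = bad.decode("utf-8", errors="replace")

# Behavior 2: silently drop the invalid byte.
ignored = bad.decode("utf-8", errors="ignore")

# Behavior 3: reinterpret the bytes as ISO-8859-1, where every byte is valid.
latin1 = bad.decode("iso-8859-1")

# Behavior 5: stop and report an error (errors="strict" is the default).
try:
    bad.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

print(repr(replaced), repr(ignored), repr(latin1), strict_failed)
```

Note that dropping the byte turns the input into `<script>`, a string that a filter applied before decoding would not have seen. That is the kind of risk the deletion discussion in Unicode Technical Report #36 warns about.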

...

Sources

[ISO/IEC 10646:2003]
[ISO/IEC 10646:2012]
[Kuhn 2006]
[Pike 1993]
[Unicode 2006]
[Viega 2003] Section 3.12, "Detecting illegal UTF-8 characters"
[Wheeler 2003]
[Yergeau 1998]

...