...

Generally, programs should validate UTF-8 data before performing other checks. The following table lists the well-formed UTF-8 byte sequences.

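| Code Points | First Byte | Second Byte | Third Byte | Fourth Byte |
|---|---|---|---|---|
| U+0000..U+007F | 00..7F | | | |
| U+0080..U+07FF | C2..DF | 80..BF | | |
| U+0800..U+0FFF | E0 | A0..BF | 80..BF | |
| U+1000..U+CFFF | E1..EC | 80..BF | 80..BF | |
| U+D000..U+D7FF | ED | 80..9F | 80..BF | |
| U+E000..U+FFFF | EE..EF | 80..BF | 80..BF | |
| U+10000..U+3FFFF | F0 | 90..BF | 80..BF | 80..BF |
| U+40000..U+FFFFF | F1..F3 | 80..BF | 80..BF | 80..BF |
| U+100000..U+10FFFF | F4 | 80..8F | 80..BF | 80..BF |

The bit layout for each sequence length is as follows. The four-byte pattern has room for 21 bits (up to U+1FFFFF), but well-formed sequences stop at U+10FFFF, as the byte ranges above show.

| Bits of code point | First code point | Last code point | Bytes in sequence | Byte 1 | Byte 2 | Byte 3 | Byte 4 |
|---|---|---|---|---|---|---|---|
| 7 | U+0000 | U+007F | 1 | 0xxxxxxx | | | |
| 11 | U+0080 | U+07FF | 2 | 110xxxxx | 10xxxxxx | | |
| 16 | U+0800 | U+FFFF | 3 | 1110xxxx | 10xxxxxx | 10xxxxxx | |
| 21 | U+10000 | U+1FFFFF | 4 | 11110xxx | 10xxxxxx | 10xxxxxx | 10xxxxxx |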
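
The byte ranges in the table above translate fairly directly into a validation routine. The following is a minimal sketch, not a definitive implementation: the function name `utf8_is_well_formed` and its interface are assumptions made for illustration, and it only reports whether a buffer is well formed without decoding code points.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper (name and signature are illustrative assumptions).
 * Returns true if buf[0..len) contains only well-formed UTF-8 sequences,
 * using the first-byte and second-byte ranges from the table above. */
bool utf8_is_well_formed(const unsigned char *buf, size_t len) {
  size_t i = 0;
  while (i < len) {
    unsigned char b1 = buf[i];
    size_t trail;            /* continuation bytes that must follow b1 */
    unsigned char lo = 0x80; /* permitted range for the second byte */
    unsigned char hi = 0xBF;

    if (b1 <= 0x7F) {        /* single-byte sequence, 00..7F */
      i++;
      continue;
    } else if (b1 >= 0xC2 && b1 <= 0xDF) {
      trail = 1;
    } else if (b1 >= 0xE0 && b1 <= 0xEF) {
      trail = 2;
      if (b1 == 0xE0) {
        lo = 0xA0;           /* E0 80..9F would be an overlong form */
      } else if (b1 == 0xED) {
        hi = 0x9F;           /* ED A0..BF would encode U+D800..U+DFFF */
      }
    } else if (b1 >= 0xF0 && b1 <= 0xF4) {
      trail = 3;
      if (b1 == 0xF0) {
        lo = 0x90;           /* F0 80..8F would be an overlong form */
      } else if (b1 == 0xF4) {
        hi = 0x8F;           /* F4 90..BF would exceed U+10FFFF */
      }
    } else {
      return false;          /* 80..C1 and F5..FF never start a sequence */
    }

    if (len - i <= trail) {
      return false;          /* sequence is truncated */
    }
    if (buf[i + 1] < lo || buf[i + 1] > hi) {
      return false;          /* second byte outside its restricted range */
    }
    for (size_t k = 2; k <= trail; k++) {
      if (buf[i + k] < 0x80 || buf[i + k] > 0xBF) {
        return false;        /* remaining bytes must be 80..BF */
      }
    }
    i += trail + 1;
  }
  return true;
}
```

Under these assumptions, a caller would run this check on untrusted input first and reject the data outright on failure, rather than attempting to repair it, before applying any other checks.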

Although UTF-8 originated from the Plan 9 developers [Pike 1993], Plan 9's own support covers only the low 16-bit range. In general, many "Unicode" systems support only the low 16-bit range, not the full 21-bit ISO 10646 code space [ISO/IEC 10646:2012].

...