...
Generally, programs should validate UTF-8 data before performing other checks. The following table lists the well-formed UTF-8 byte sequences.
| Code points | First byte | Second byte | Third byte | Fourth byte |
|---|---|---|---|---|
| U+0000..U+007F | 00..7F | | | |
| U+0080..U+07FF | C2..DF | 80..BF | | |
| U+0800..U+0FFF | E0 | A0..BF | 80..BF | |
| U+1000..U+CFFF | E1..EC | 80..BF | 80..BF | |
| U+D000..U+D7FF | ED | 80..9F | 80..BF | |
| U+E000..U+FFFF | EE..EF | 80..BF | 80..BF | |
| U+10000..U+3FFFF | F0 | 90..BF | 80..BF | 80..BF |
| U+40000..U+FFFFF | F1..F3 | 80..BF | 80..BF | 80..BF |
| U+100000..U+10FFFF | F4 | 80..8F | 80..BF | 80..BF |

The bit layout behind each sequence length is as follows, where x marks a bit taken from the code point:

| Bits of code point | First code point | Last code point | Bytes in sequence | Byte 1 | Byte 2 | Byte 3 | Byte 4 |
|---|---|---|---|---|---|---|---|
| 7 | U+0000 | U+007F | 1 | 0xxxxxxx | | | |
| 11 | U+0080 | U+07FF | 2 | 110xxxxx | 10xxxxxx | | |
| 16 | U+0800 | U+FFFF | 3 | 1110xxxx | 10xxxxxx | 10xxxxxx | |
| 21 | U+10000 | U+1FFFFF | 4 | 11110xxx | 10xxxxxx | 10xxxxxx | 10xxxxxx |
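
The well-formed byte ranges listed above translate fairly directly into a validator. The following is a minimal Python 3 sketch, not a definitive implementation; the names `is_well_formed_utf8` and `_second_byte_range` are hypothetical and chosen only for illustration. It assumes the whole input is available as a `bytes` object and reports only whether it is well-formed, not where the first error occurs.

```python
def _second_byte_range(lead):
    """Allowed (low, high) range for the second byte, given the lead byte.

    Only four lead bytes narrow the usual 80..BF continuation range.
    """
    if lead == 0xE0:
        return 0xA0, 0xBF  # rules out overlong 3-byte forms (below U+0800)
    if lead == 0xED:
        return 0x80, 0x9F  # rules out surrogates U+D800..U+DFFF
    if lead == 0xF0:
        return 0x90, 0xBF  # rules out overlong 4-byte forms (below U+10000)
    if lead == 0xF4:
        return 0x80, 0x8F  # rules out code points above U+10FFFF
    return 0x80, 0xBF


def is_well_formed_utf8(data):
    """Return True if `data` consists only of well-formed UTF-8 sequences."""
    i, n = 0, len(data)
    while i < n:
        lead = data[i]
        if lead <= 0x7F:
            i += 1  # single-byte sequence (U+0000..U+007F)
            continue
        if 0xC2 <= lead <= 0xDF:
            length = 2
        elif 0xE0 <= lead <= 0xEF:
            length = 3
        elif 0xF0 <= lead <= 0xF4:
            length = 4
        else:
            return False  # 80..C1 and F5..FF never start a sequence
        if i + length > n:
            return False  # sequence is truncated
        low, high = _second_byte_range(lead)
        if not low <= data[i + 1] <= high:
            return False
        # Any third and fourth bytes must be plain continuation bytes.
        for k in range(i + 2, i + length):
            if not 0x80 <= data[k] <= 0xBF:
                return False
        i += length
    return True


print(is_well_formed_utf8("caf\u00e9".encode("utf-8")))  # True
print(is_well_formed_utf8(b"\xc0\xaf"))                  # False: overlong "/"
print(is_well_formed_utf8(b"\xed\xa0\x80"))              # False: surrogate
```

In practice, Python's built-in strict decoder (`data.decode("utf-8")`) rejects these same inputs by raising `UnicodeDecodeError`; the sketch is only meant to make the rules in the table explicit.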
Although UTF-8 originated with the Plan 9 developers [Pike 1993], Plan 9's own support covers only the low 16-bit range. More generally, many "Unicode" systems support only that range rather than the full 21-bit ISO 10646 code space [ISO/IEC 10646:2012].
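
To see whether a given string would be affected by such a 16-bit limit, one can look for code points above U+FFFF. A minimal Python 3 sketch follows; the helper name `code_points_beyond_16_bits` is hypothetical and not part of any library.

```python
def code_points_beyond_16_bits(text):
    """List the code points in `text` that do not fit in 16 bits."""
    return ["U+%04X" % ord(ch) for ch in text if ord(ch) > 0xFFFF]


sample = "a\U0001F600"  # "a" followed by U+1F600
print(code_points_beyond_16_bits(sample))     # ['U+1F600']
print(len("\U0001F600".encode("utf-8")))      # 4 (a four-byte sequence)
```

Any code point reported here needs a four-byte UTF-8 sequence, so a system restricted to the low 16-bit range cannot represent it as a single character.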
...