UTF-8 is a variable-width encoding for Unicode that uses one to four bytes per character, depending on the code point being encoded. It has the following properties:
- The classical US-ASCII characters (0 to 0x7f) encode as themselves, so files and strings that contain only ASCII characters have the same encoding under both ASCII and UTF-8.
- All UCS characters beyond 0x7f are encoded as a multibyte sequence consisting only of bytes with the high bit set: 0x80 to 0xfd in the original definition, and 0x80 to 0xf4 in the four-byte form standardized by RFC 3629. This means that no ASCII byte can appear as part of another character.
- It's easy to convert between UTF-8 and the UCS-2 and UCS-4 fixed-width representations of characters.
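- The lexicographic sorting order of UCS-4 strings is preserved: comparing two UTF-8 strings byte by byte gives the same result as comparing them code point by code point.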
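- All possible 2^31 UCS codes can be encoded using the original definition of UTF-8, which allows sequences of up to six bytes. The current standard (RFC 3629) restricts UTF-8 to the Unicode range U+0000 to U+10FFFF, which needs at most four bytes.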
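
Several of the properties above can be checked directly. The following is a minimal sketch in Python that relies on the built-in `utf-8` codec, which implements the four-byte form of UTF-8, so it cannot demonstrate the full 2^31 range. The function names `encoded_length`, `is_ascii_transparent`, and `sort_order_preserved` are made up for this illustration.

```python
def encoded_length(ch: str) -> int:
    """Return how many bytes the UTF-8 codec uses for one character."""
    return len(ch.encode("utf-8"))


def is_ascii_transparent(s: str) -> bool:
    """Check that ASCII characters encode as themselves and that
    non-ASCII characters never produce a byte below 0x80."""
    for ch in s:
        encoded = ch.encode("utf-8")
        if ord(ch) <= 0x7F:
            if encoded != bytes([ord(ch)]):
                return False
        elif any(byte < 0x80 for byte in encoded):
            return False
    return True


def sort_order_preserved(strings) -> bool:
    """Check that sorting by code point and sorting by raw UTF-8
    bytes produce the same order."""
    by_code_point = sorted(strings, key=lambda s: [ord(c) for c in s])
    by_utf8_bytes = sorted(strings, key=lambda s: s.encode("utf-8"))
    return by_code_point == by_utf8_bytes


# One sample character for each encoded length (hypothetical test data).
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]
print([encoded_length(c) for c in samples])                 # [1, 2, 3, 4]
print(is_ascii_transparent("na\u00efve \u20ac5"))           # True
print(sort_order_preserved(["z", "\u20ac", "a", "\u00e9"]))  # True
```

The checks only sample a few strings, so they illustrate the properties rather than prove them.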