...

  • The classical US-ASCII characters (0 to 0x7f) encode as themselves, so files and strings that are encoded with ASCII values have the same encoding under both ASCII and UTF-8.
  • All UCS characters beyond 0x7f are encoded as a multibyte sequence consisting only of bytes in the range of 0x80 to 0xfd. This means that no ASCII byte (including a null byte) can appear as part of another character. This property allows byte-oriented string handling functions to be used on UTF-8 data.
  • It's easy to convert between UTF-8 and UCS-2 and UCS-4 fixed-width representations of characters.
  • The lexicographic sorting order of UCS-4 strings is preserved.
  • All possible 2^31 UCS codes can be encoded using UTF-8.

Generally, programs should validate UTF-8 data before performing other checks. The table below lists all valid UTF-8 sequences.

|| UCS Code (HEX) || Binary UTF-8 Format || Valid UTF-8 Values (HEX) ||
| 00-7F | 0xxxxxxx | 00-7F |
| 80-7FF | 110xxxxx 10xxxxxx | C2-DF 80-BF |
| 800-FFF | 1110xxxx 10xxxxxx 10xxxxxx | E0 A0*-BF 80-BF |
| 1000-FFFF | 1110xxxx 10xxxxxx 10xxxxxx | E1-EF 80-BF 80-BF |
| 10000-3FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx | F0 90*-BF 80-BF 80-BF |
| 40000-FFFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx | F1-F3 80-BF 80-BF 80-BF |
| 100000-10FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx | F4 80-8F* 80-BF 80-BF |
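The binary formats in the table can be illustrated with a small encoder. This is a minimal sketch, not a definitive implementation: the name utf8_encode is hypothetical, the function covers only the code range listed in the table (up to 0x10FFFF), and it also refuses surrogate halves (0xD800 to 0xDFFF), which is an assumption made here rather than something the table itself states.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Encode one UCS code point as UTF-8 into out (room for 4 bytes).
 * Returns the number of bytes written, or 0 if cp is a surrogate
 * half or lies above 0x10FFFF.
 * utf8_encode is a hypothetical name used for illustration only.
 */
size_t utf8_encode(uint32_t cp, unsigned char out[4]) {
  if (cp <= 0x7F) {                       /* 0xxxxxxx */
    out[0] = (unsigned char)cp;
    return 1;
  }
  if (cp <= 0x7FF) {                      /* 110xxxxx 10xxxxxx */
    out[0] = (unsigned char)(0xC0 | (cp >> 6));
    out[1] = (unsigned char)(0x80 | (cp & 0x3F));
    return 2;
  }
  if (cp <= 0xFFFF) {                     /* 1110xxxx 10xxxxxx 10xxxxxx */
    if (cp >= 0xD800 && cp <= 0xDFFF) {
      return 0;                           /* assumption: reject surrogates */
    }
    out[0] = (unsigned char)(0xE0 | (cp >> 12));
    out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[2] = (unsigned char)(0x80 | (cp & 0x3F));
    return 3;
  }
  if (cp <= 0x10FFFF) {                   /* 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
    out[0] = (unsigned char)(0xF0 | (cp >> 18));
    out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (unsigned char)(0x80 | (cp & 0x3F));
    return 4;
  }
  return 0;                               /* outside the range in the table */
}
```

Because the encoder always picks the first (smallest) format that fits the code point, it can only produce the shortest form of each character.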

Although UTF-8 originated from the Plan 9 developers [Pike 93], Plan 9's own support only covers the low 16-bit range. In general, many "Unicode" systems only support the low 16-bit range, not the full 31-bit ISO 10646 code space [ISO/IEC 10646:2003(E)].

Security-Related Issues

According to [Yergeau 98]:

...

Only the "shortest" form of UTF-8 should be permitted. Naive decoders might accept encodings that are longer than necessary, allowing potentially dangerous input to have multiple representations. For example:

  1. Process A performs security checks, but does not check for non-shortest UTF-8 forms.
  2. Process B accepts the byte sequence from process A and transforms it into UTF-16 while interpreting possible non-shortest forms.
  3. The UTF-16 text may contain characters that should have been filtered out by process A and could potentially be dangerous. These non-"shortest" UTF-8 attacks have been used to bypass security validations in high-profile products, such as Microsoft's IIS web server.
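To make the non-shortest-form problem concrete, the sketch below checks a byte string against the ranges in the table of valid UTF-8 sequences. It is an illustration under stated assumptions, not the function from any cited reference: the names utf8_is_shortest_form and in_range are hypothetical, and, like the table, it does not single out encoded surrogate halves. The classic overlong encoding of "/" (0x2F) as the two bytes C0 AF is rejected because C0 is never a valid lead byte.

```c
#include <stddef.h>

/* Returns 1 if b lies within [lo, hi], otherwise 0. */
static int in_range(unsigned char b, unsigned char lo, unsigned char hi) {
  return b >= lo && b <= hi;
}

/*
 * Returns 1 if the len bytes at s consist only of sequences listed in
 * the table of valid UTF-8 values (so no non-shortest forms), else 0.
 * utf8_is_shortest_form is a hypothetical name used for illustration.
 */
int utf8_is_shortest_form(const unsigned char *s, size_t len) {
  size_t i = 0;
  while (i < len) {
    unsigned char b0 = s[i];
    size_t need;                 /* continuation bytes that must follow */
    unsigned char lo = 0x80;     /* allowed range for the second byte */
    unsigned char hi = 0xBF;

    if (b0 <= 0x7F) {
      i++;
      continue;
    } else if (b0 >= 0xC2 && b0 <= 0xDF) {
      need = 1;
    } else if (b0 == 0xE0) {
      need = 2; lo = 0xA0;       /* E0 80-9F would be a non-shortest form */
    } else if (b0 >= 0xE1 && b0 <= 0xEF) {
      need = 2;
    } else if (b0 == 0xF0) {
      need = 3; lo = 0x90;       /* F0 80-8F would be a non-shortest form */
    } else if (b0 >= 0xF1 && b0 <= 0xF3) {
      need = 3;
    } else if (b0 == 0xF4) {
      need = 3; hi = 0x8F;       /* stay at or below 0x10FFFF */
    } else {
      return 0;                  /* 80-C1 and F5-FF never start a sequence */
    }

    if (len - i <= need) {
      return 0;                  /* sequence is truncated */
    }
    if (!in_range(s[i + 1], lo, hi)) {
      return 0;
    }
    for (size_t k = 2; k <= need; k++) {
      if (!in_range(s[i + k], 0x80, 0xBF)) {
        return 0;
      }
    }
    i += need + 1;
  }
  return 1;
}
```

In the scenario above, process A could call such a check before its other security checks, so that process B never sees a second representation of a filtered character.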

...

  1. Insert a replacement character (e.g., "?", the "wild-card" character).
  2. Ignore the bytes.
  3. Interpret the bytes according to a different character encoding (often the ISO-8859-1 character map).
  4. Fail to notice the problem and decode the bytes as if they were some similar bit of UTF-8.
  5. Stop decoding and report an error.

The following function from [Viega 03] detects invalid character sequences in a string but does not reject non-minimal forms. It returns 1 if the string is composed only of legitimate sequences; otherwise it returns 0.

...

Encoding of individual or out-of-order surrogate halves should not be permitted. Broken surrogates are invalid in Unicode and introduce ambiguity when they appear in Unicode data. Broken surrogates are often signs of bad data transmission. They could also indicate internal bugs in an application or intentional efforts to find security vulnerabilities.
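One way to detect broken surrogates is to scan the UTF-16 code units after conversion. The following is a minimal sketch under that assumption; the name utf16_has_broken_surrogate is hypothetical. A high half (0xD800 to 0xDBFF) must be followed immediately by a low half (0xDC00 to 0xDFFF), and a low half must never appear on its own.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Returns 1 if the n UTF-16 code units at s contain an individual or
 * out-of-order surrogate half, otherwise 0.
 * utf16_has_broken_surrogate is a hypothetical name for illustration.
 */
int utf16_has_broken_surrogate(const uint16_t *s, size_t n) {
  for (size_t i = 0; i < n; i++) {
    uint16_t u = s[i];
    if (u >= 0xD800 && u <= 0xDBFF) {
      /* High half: the next unit must exist and be a low half. */
      if (i + 1 >= n || s[i + 1] < 0xDC00 || s[i + 1] > 0xDFFF) {
        return 1;
      }
      i++;  /* skip the low half of this well-formed pair */
    } else if (u >= 0xDC00 && u <= 0xDFFF) {
      return 1;  /* low half without a preceding high half */
    }
  }
  return 0;
}
```

A caller would typically treat a return value of 1 as a reason to reject the input rather than attempt to repair it.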

...

Failing to properly handle UTF-8-encoded data can result in a data integrity violation or denial-of-service attack.

...

Search for vulnerabilities resulting from the violation of this rule on the CERT website.

...

References

[ISO/IEC 10646:2003] Information technology - Universal Multiple-Octet Coded Character Set (UCS), First Edition. December 2003.
[Kuhn 06] UTF-8 and Unicode FAQ for Unix/Linux
[Pike 93]
[Viega 03] Section 3.12, "Detecting Illegal UTF-8 Characters"
[Wheeler 03] Secure Programming for Linux and Unix HOWTO
[Yergeau 98] RFC 2279 - UTF-8, a transformation format of ISO 10646

...
