...

UCS Code (HEX)	Binary UTF-8 Format	Legal UTF-8 Values (HEX)
00-7F	0xxxxxxx	00-7F
80-7FF	110xxxxx 10xxxxxx	C2-DF 80-BF
800-FFF	1110xxxx 10xxxxxx 10xxxxxx	E0 A0*-BF 80-BF
1000-FFFF	1110xxxx 10xxxxxx 10xxxxxx	E1-EF 80-BF 80-BF
10000-3FFFF	11110xxx 10xxxxxx 10xxxxxx 10xxxxxx	F0 90*-BF 80-BF 80-BF
40000-FFFFF	11110xxx 10xxxxxx 10xxxxxx 10xxxxxx	F1-F3 80-BF 80-BF 80-BF
100000-10FFFF	11110xxx 10xxxxxx 10xxxxxx 10xxxxxx	F4 80-8F* 80-BF 80-BF
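
The "Legal UTF-8 Values" column can be turned into a byte-range check almost mechanically. The following is a minimal sketch, not a vetted implementation; the names `in_range` and `utf8_is_legal` are hypothetical and do not come from the original guideline. It follows the table row by row, so encoded surrogate halves (ED A0-BF 80-BF), which fall inside the E1-EF row, are not rejected by this check alone.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: true if b lies in the inclusive range [lo, hi]. */
static bool in_range(unsigned char b, unsigned char lo, unsigned char hi) {
  return b >= lo && b <= hi;
}

/*
 * Hypothetical validator: returns true if the len bytes at s consist only
 * of sequences listed in the "Legal UTF-8 Values" column of the table.
 */
bool utf8_is_legal(const unsigned char *s, size_t len) {
  assert(s != NULL || len == 0);
  size_t i = 0;
  while (i < len) {
    unsigned char b0 = s[i];
    size_t need;                         /* continuation bytes expected */
    unsigned char lo = 0x80, hi = 0xBF;  /* range of the first continuation byte */

    if (b0 <= 0x7F)                    { need = 0; }
    else if (in_range(b0, 0xC2, 0xDF)) { need = 1; }
    else if (b0 == 0xE0)               { need = 2; lo = 0xA0; }
    else if (in_range(b0, 0xE1, 0xEF)) { need = 2; }
    else if (b0 == 0xF0)               { need = 3; lo = 0x90; }
    else if (in_range(b0, 0xF1, 0xF3)) { need = 3; }
    else if (b0 == 0xF4)               { need = 3; hi = 0x8F; }
    else { return false; }  /* 80-C1 and F5-FF never start a legal sequence */

    if (len - i - 1 < need) {
      return false;         /* sequence is truncated */
    }
    for (size_t k = 1; k <= need; k++) {
      unsigned char b = s[i + k];
      /* Only the first continuation byte has a restricted range. */
      if (k == 1 ? !in_range(b, lo, hi) : !in_range(b, 0x80, 0xBF)) {
        return false;
      }
    }
    i += need + 1;
  }
  return true;
}
```

Under these assumptions, `utf8_is_legal` accepts the four octets 2F 2E 2E 2F and rejects C0 80, because C0 is not a legal lead byte in any row of the table.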

Security Related Issues

According to [Yergeau 98]:

Implementors of UTF-8 need to consider the security aspects of how they handle illegal UTF-8 sequences. It is conceivable that in some circumstances an attacker would be able to exploit an incautious UTF-8 parser by sending it an octet sequence that is not permitted by the UTF-8 syntax.
A particularly subtle form of this attack could be carried out against a parser which performs security-critical validity checks against the UTF-8 encoded form of its input, but interprets certain illegal octet sequences as characters. For example, a parser might prohibit the NUL character when encoded as the single-octet sequence 00, but allow the illegal two-octet sequence C0 80 and interpret it as a NUL character. Another example might be a parser which prohibits the octet sequence 2F 2E 2E 2F ("/../"), yet permits the illegal octet sequence 2F C0 AE 2E 2F.
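
The two examples in the quoted passage can be reproduced with a deliberately insecure sketch. Both function names below are hypothetical and exist only for illustration: `naive_decode2` masks off the payload bits of a two-octet sequence without checking that two octets were needed, and `contains_dotdot` stands in for a filter that inspects the encoded octets.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Deliberately INSECURE (hypothetical name): decodes a two-octet sequence
 * by masking the payload bits and never checks that the resulting value
 * actually required two octets.
 */
unsigned long naive_decode2(const unsigned char *s) {
  assert(s != NULL);
  return ((unsigned long)(s[0] & 0x1F) << 6) | (unsigned long)(s[1] & 0x3F);
}

/*
 * Hypothetical octet-level filter: returns 1 if buf contains the octet
 * sequence 2F 2E 2E 2F ("/../"), 0 otherwise.
 */
int contains_dotdot(const unsigned char *buf, size_t len) {
  static const unsigned char pat[] = { 0x2F, 0x2E, 0x2E, 0x2F };
  assert(buf != NULL || len == 0);
  for (size_t i = 0; i + sizeof pat <= len; i++) {
    if (memcmp(buf + i, pat, sizeof pat) == 0) {
      return 1;
    }
  }
  return 0;
}
```

With these definitions, `naive_decode2` maps C0 80 to 0 (NUL) and C0 AE to 2E ("."), while `contains_dotdot` finds nothing in 2F C0 AE 2E 2F. A program that filters first and decodes naively afterwards would therefore let "/../" through.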

Below are more specific recommendations.

Only Accept the "Shortest" Form

The UTF-8 encoding scheme is fairly simple, but a few clarifications are important for security reasons. One of the most important is the requirement that only the "shortest" form of UTF-8 be permitted. A naive decoder may accept encodings that are longer than necessary, so potentially dangerous input can be represented in multiple ways, which defeats security checks that look for dangerous input. For example:

...

Another requirement is that encodings of individual or out-of-order surrogate halves (code points D800-DFFF) should not be permitted.
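
Both requirements can be enforced after decoding by comparing the decoded value against the smallest code point that needs the observed number of bytes, and by rejecting the surrogate range. The following is a minimal sketch under that assumption; `utf8_decode_strict` is a hypothetical name, not a function from the original guideline or from any library.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical strict decoder: decodes one code point from s (at most len
 * bytes) into *cp. Returns the number of bytes consumed, or 0 if the
 * sequence is truncated, malformed, not in shortest form, a surrogate
 * half, or above 10FFFF.
 */
size_t utf8_decode_strict(const unsigned char *s, size_t len,
                          unsigned long *cp) {
  /* Smallest code point that legitimately needs 1, 2, 3, or 4 bytes. */
  static const unsigned long min_cp[5] = { 0, 0x0, 0x80, 0x800, 0x10000 };
  size_t n;
  unsigned long v;

  assert(s != NULL && cp != NULL);
  if (len == 0) {
    return 0;
  }

  if (s[0] < 0x80)                { n = 1; v = s[0]; }
  else if ((s[0] & 0xE0) == 0xC0) { n = 2; v = s[0] & 0x1F; }
  else if ((s[0] & 0xF0) == 0xE0) { n = 3; v = s[0] & 0x0F; }
  else if ((s[0] & 0xF8) == 0xF0) { n = 4; v = s[0] & 0x07; }
  else { return 0; }  /* stray continuation byte or invalid lead byte */

  if (len < n) {
    return 0;         /* truncated sequence */
  }
  for (size_t i = 1; i < n; i++) {
    if ((s[i] & 0xC0) != 0x80) {
      return 0;       /* not a continuation byte */
    }
    v = (v << 6) | (unsigned long)(s[i] & 0x3F);
  }

  if (v < min_cp[n]) {
    return 0;         /* longer than the shortest form */
  }
  if (v >= 0xD800 && v <= 0xDFFF) {
    return 0;         /* individual surrogate half */
  }
  if (v > 0x10FFFF) {
    return 0;         /* beyond the last row of the table */
  }
  *cp = v;
  return n;
}
```

Rejecting every encoded surrogate half individually also covers the out-of-order case, since no surrogate is accepted in any position. Callers would apply security checks to the decoded code points only after this function has accepted the input.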

References

...

[Kuhn 06] UTF-8 and Unicode FAQ for Unix/Linux
[Viega 03] Section 3.12, "Detecting Illegal UTF-8 Characters"
[Wheeler 06] Secure Programming for Linux and Unix HOWTO
[Yergeau 98] RFC 2279 - UTF-8, a transformation format of ISO 10646
