...
- The RFC describes the problem this way: Implementers of UTF-8 need to consider the security aspects of how they handle illegal UTF-8 sequences. It is conceivable that in some circumstances an attacker would be able to exploit an incautious UTF-8 parser by sending it an octet (byte) sequence that is not permitted by the UTF-8 syntax. A particularly subtle form of this attack could be carried out against a parser which performs security-critical validity checks against the UTF-8 encoded form of its input, but interprets certain illegal octet sequences as a character. For example, a parser might prohibit the NUL character when encoded as the single-octet sequence 00, but allow the illegal two-octet sequence C0 80 (illegal because it's longer than necessary) and interpret it as a NUL character (00). Another example might be a parser which prohibits the octet sequence 2F 2E 2E 2F ("/../"), yet permits the illegal octet sequence 2F C0 AE 2E 2F.
- http://www.dwheeler.com/secure-programs/Secure-Programs-HOWTO/character-encoding.html
- http://en.wikipedia.org/wiki/UTF-8 Viega 03
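
The two examples in the RFC excerpt above can be reproduced in a few lines. The following is a minimal sketch, not a reference implementation: `naive_decode` and `is_strict_utf8` are hypothetical names introduced here for illustration. `naive_decode` is deliberately flawed (it handles only one- and two-octet sequences and never checks for overlong forms), while `is_strict_utf8` relies on Python's built-in UTF-8 codec, which rejects overlong sequences in its default strict mode.

```python
def naive_decode(data: bytes) -> str:
    """Hypothetical, deliberately lenient decoder (1- and 2-octet forms only)."""
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:
            # Single-octet (ASCII) sequence.
            out.append(chr(b))
            i += 1
        elif 0xC0 <= b <= 0xDF and i + 1 < len(data) and (data[i + 1] & 0xC0) == 0x80:
            # The flaw: nothing verifies that the code point actually needs
            # two octets, so overlong forms such as C0 80 or C0 AE slip through.
            out.append(chr(((b & 0x1F) << 6) | (data[i + 1] & 0x3F)))
            i += 2
        else:
            raise ValueError(f"unsupported or malformed octet at offset {i}")
    return "".join(out)


def is_strict_utf8(data: bytes) -> bool:
    """Return True only if data is well-formed UTF-8 per Python's strict codec."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# The illegal sequence from the RFC excerpt: 2F C0 AE 2E 2F.
attack = bytes([0x2F, 0xC0, 0xAE, 0x2E, 0x2F])

# A filter that inspects the encoded octets does not see "/../" ...
print(b"/../" in attack)
# ... but the lenient decoder turns the same octets into "/../".
print(naive_decode(attack) == "/../")
# Validating the encoding first rejects the input outright.
print(is_strict_utf8(attack))
```

The design point is ordering: validate (or strictly decode) the input first, then apply security checks to the decoded characters, so that the check and the interpretation operate on the same data.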
Kuhn 06 UTF-8 and Unicode FAQ for Unix/Linux
Viega 03 Section 3.12. "Detecting Illegal UTF-8 Characters"
Wheeler 06 Secure Programming for Linux and Unix HOWTO
Yergeau 98 RFC 2279 - UTF-8, a transformation format of ISO 10646