 
                            ...
Generally, programs should validate any UTF-8 data before performing other checks on it. The table below lists all valid UTF-8 sequences.
| UCS Code (HEX) | Binary UTF-8 Format | Valid UTF-8 Values (HEX) |
|---|---|---|
| 00-7F |  0xxxxxxx  | 00-7F | 
| 80-7FF | 110xxxxx 10xxxxxx | C2-DF 80-BF | 
| 800-FFF | 1110xxxx 10xxxxxx 10xxxxxx | E0 A0*-BF 80-BF | 
| 1000-FFFF | 1110xxxx 10xxxxxx 10xxxxxx | E1-EF 80-BF 80-BF | 
| 10000-3FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx | F0 90*-BF 80-BF 80-BF | 
| 40000-FFFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx | F1-F3 80-BF 80-BF 80-BF | 
| 100000-10FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx | F4 80-8F* 80-BF 80-BF | 
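
The octet ranges in the last column of the table translate fairly directly into a range check on the first and second octets of each sequence. The following is a minimal sketch under that reading, not a function taken from any cited source; the names `utf8_seq_len` and `utf8_is_valid` are hypothetical. It follows the table rows literally, so it does not additionally reject encoded surrogates (ED A0-BF 80-BF).

```c
#include <stddef.h>

/* Hypothetical helper: returns the length (1-4) of the valid UTF-8
 * sequence that starts at s, or 0 if the first n octets do not begin
 * with a sequence listed in the table. */
size_t utf8_seq_len(const unsigned char *s, size_t n) {
    unsigned char lo = 0x80, hi = 0xBF; /* allowed range for the second octet */
    size_t len, i;

    if (n == 0) return 0;
    if (s[0] <= 0x7F) return 1;                           /* 00-7F */
    if (s[0] >= 0xC2 && s[0] <= 0xDF) len = 2;            /* C2-DF */
    else if (s[0] == 0xE0) { len = 3; lo = 0xA0; }        /* E0 A0-BF */
    else if (s[0] >= 0xE1 && s[0] <= 0xEF) len = 3;       /* E1-EF */
    else if (s[0] == 0xF0) { len = 4; lo = 0x90; }        /* F0 90-BF */
    else if (s[0] >= 0xF1 && s[0] <= 0xF3) len = 4;       /* F1-F3 */
    else if (s[0] == 0xF4) { len = 4; hi = 0x8F; }        /* F4 80-8F */
    else return 0; /* 80-C1 and F5-FF never start a valid sequence */

    if (n < len) return 0;                    /* truncated sequence */
    if (s[1] < lo || s[1] > hi) return 0;     /* restricted second octet */
    for (i = 2; i < len; i++)
        if (s[i] < 0x80 || s[i] > 0xBF) return 0;
    return len;
}

/* Hypothetical helper: returns 1 if all n octets form valid sequences, else 0. */
int utf8_is_valid(const unsigned char *s, size_t n) {
    while (n > 0) {
        size_t len = utf8_seq_len(s, n);
        if (len == 0) return 0;
        s += len;
        n -= len;
    }
    return 1;
}
```

Called on the two octets C0 80, `utf8_is_valid` returns 0 because C0 is not a permitted first octet; called on C3 A9 it returns 1.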
...
Implementors of UTF-8 need to consider the security aspects of how they handle invalid UTF-8 sequences. It is conceivable that in some circumstances an attacker would be able to exploit an incautious UTF-8 parser by sending it an octet sequence that is not permitted by the UTF-8 syntax.
A particularly subtle form of this attack could be carried out against a parser that performs security-critical validity checks against the UTF-8 encoded form of its input but interprets certain invalid octet sequences as characters. For example, a parser might prohibit the NUL character when encoded as the single-octet sequence 00 but allow the invalid two-octet sequence C0 80 and interpret it as a NUL character. Another example might be a parser that prohibits the octet sequence 2F 2E 2E 2F ("/../") yet permits the invalid octet sequence 2F C0 AE 2E 2F.
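To make the two examples concrete, the sketch below contrasts a decoder that merely combines payload bits with one that first checks the permitted octet ranges. Both function names (`naive_decode2`, `strict_decode2`) are hypothetical and cover only two-octet sequences; this is an illustration of the failure mode, not a complete decoder.

```c
/* Hypothetical naive decoder: combines the payload bits of a two-octet
 * sequence without checking that the first octet is in C2-DF. */
unsigned int naive_decode2(unsigned char b0, unsigned char b1) {
    return ((unsigned int)(b0 & 0x1F) << 6) | (unsigned int)(b1 & 0x3F);
}

/* Hypothetical strict decoder: returns the code point, or -1 if the
 * two octets are not a valid two-octet UTF-8 sequence. */
long strict_decode2(unsigned char b0, unsigned char b1) {
    if (b0 < 0xC2 || b0 > 0xDF) return -1;   /* C0 and C1 yield only overlong forms */
    if ((b1 & 0xC0) != 0x80) return -1;      /* second octet must be 80-BF */
    return ((long)(b0 & 0x1F) << 6) | (long)(b1 & 0x3F);
}
```

With the naive version, C0 80 decodes to 0x00 (NUL) and C0 AE decodes to 0x2E ("."), so a filter that inspected only the raw octets for 00 or 2E would be bypassed. The strict version returns -1 for both inputs.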
...
The following function from [Viega 03] detects invalid character sequences in a string. It returns `1` if the string consists only of valid sequences; otherwise, it returns `0`:
...
Encoding of individual or out-of-order surrogate halves should not be permitted. Broken surrogates are invalid in Unicode and introduce ambiguity when they appear in Unicode data. Broken surrogates are often signs of bad data transmission. They can also indicate internal bugs in an application or intentional efforts to find security vulnerabilities.
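In UTF-8 data, a surrogate code point (U+D800 through U+DFFF) would appear as the octet ED followed by an octet in the range A0-BF, so one way to reject them is a direct scan for that pattern. The sketch below assumes the input is a UTF-8 octet buffer; the function name `utf8_has_surrogate` is hypothetical. Because ED lies outside the continuation range 80-BF, it can only occur as a first octet, which is why a plain scan is enough here.

```c
#include <stddef.h>

/* Hypothetical helper: returns 1 if the buffer contains the start of an
 * encoded surrogate (ED A0-BF ...), else 0. */
int utf8_has_surrogate(const unsigned char *s, size_t n) {
    size_t i;
    for (i = 0; i + 1 < n; i++) {
        if (s[i] == 0xED && s[i + 1] >= 0xA0 && s[i + 1] <= 0xBF)
            return 1;
    }
    return 0;
}
```

For example, ED A0 80 (an encoded U+D800) is flagged, while ED 9F BF (U+D7FF) and EE 80 80 (U+E000) are not.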
...