UTF-8 is a variable-width encoding for Unicode. It uses one to four bytes per character, depending on the character's Unicode code point.
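As a small illustration of the one-to-four-byte range, the sketch below encodes one sample character from each length class with Python's built-in codec. The sample characters are arbitrary choices made for this example, not part of the original text.

```python
# One sample code point per UTF-8 length class (chosen arbitrarily for illustration).
samples = ["\u0041", "\u00e9", "\u20ac", "\U0001F600"]  # A, e-acute, euro sign, emoji

# Map each character to the number of bytes its UTF-8 encoding needs.
lengths = {ch: len(ch.encode("utf-8")) for ch in samples}

for ch, n in lengths.items():
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X} -> {encoded.hex(' ')} ({n} byte(s))")
```

ASCII characters keep their single-byte values, so ASCII text is already valid UTF-8.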

In general, programs should check any UTF-8 data for legality before performing other checks. The table below lists all legal UTF-8 sequences.

UCS Code (HEX)    Binary UTF-8 Format                    Legal UTF-8 Values (HEX)
--------------    -----------------------------------    ------------------------
00-7F             0xxxxxxx                               00-7F
80-7FF            110xxxxx 10xxxxxx                      C2-DF 80-BF
800-FFF           1110xxxx 10xxxxxx 10xxxxxx             E0 A0*-BF 80-BF
1000-FFFF         1110xxxx 10xxxxxx 10xxxxxx             E1-EF 80-BF 80-BF
10000-3FFFF       11110xxx 10xxxxxx 10xxxxxx 10xxxxxx    F0 90*-BF 80-BF 80-BF
40000-FFFFF       11110xxx 10xxxxxx 10xxxxxx 10xxxxxx    F1-F3 80-BF 80-BF 80-BF
100000-10FFFF     11110xxx 10xxxxxx 10xxxxxx 10xxxxxx    F4 80-8F* 80-BF 80-BF

* An asterisk marks a second byte whose legal range is narrower than the usual 80-BF.
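The table can be turned into a table-driven check. The following is a minimal sketch, not a reference implementation: the names `RULES` and `is_legal_utf8` are hypothetical, and the sketch makes one assumption beyond the table by splitting the E1-EF row so that ED is limited to a second byte of 80-9F, which excludes encoded surrogate halves.

```python
# Each rule: (lead byte low, lead byte high, allowed second-byte range, total length).
RULES = [
    (0x00, 0x7F, None, 1),
    (0xC2, 0xDF, (0x80, 0xBF), 2),
    (0xE0, 0xE0, (0xA0, 0xBF), 3),
    (0xE1, 0xEC, (0x80, 0xBF), 3),
    (0xED, 0xED, (0x80, 0x9F), 3),  # assumption beyond the table: no surrogates
    (0xEE, 0xEF, (0x80, 0xBF), 3),
    (0xF0, 0xF0, (0x90, 0xBF), 4),
    (0xF1, 0xF3, (0x80, 0xBF), 4),
    (0xF4, 0xF4, (0x80, 0x8F), 4),
]


def is_legal_utf8(data: bytes) -> bool:
    """Return True if every sequence in data matches one row of RULES."""
    i, n = 0, len(data)
    while i < n:
        lead = data[i]
        for lo, hi, second, length in RULES:
            if lo <= lead <= hi:
                break
        else:
            return False  # 80-C1 and F5-FF never start a legal sequence
        if i + length > n:
            return False  # sequence is truncated
        if second is not None:
            if not (second[0] <= data[i + 1] <= second[1]):
                return False  # second byte outside its restricted range
            for b in data[i + 2 : i + length]:
                if not (0x80 <= b <= 0xBF):
                    return False  # remaining bytes must be 80-BF
        i += length
    return True
```

For instance, `is_legal_utf8(b"\xe2\x82\xac")` accepts the three-byte euro sign, while `is_legal_utf8(b"\xc0\xaf")` rejects a two-byte sequence whose lead byte is below C2.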


Security Related Issues

The UTF-8 encoding scheme is fairly simple, but a few clarifications matter for security. One of the most important is the requirement that only the "shortest" form of a UTF-8 sequence be permitted. A naive decoder may accept encodings that are longer than necessary. Potentially dangerous input can then be represented in multiple ways, which defeats security checks that look for only one representation. For example:

  1. Process A performs security checks, but does not check for non-shortest UTF-8 forms.
  2. Process B accepts the byte sequence from process A and transforms it into UTF-16, interpreting any non-shortest forms along the way.
  3. The UTF-16 text may then contain characters that should have been filtered out by process A and that could be dangerous.
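The three steps above can be sketched as follows. Both `process_a_filter` and `naive_decode` are hypothetical stand-ins written for this illustration: the filter only looks for the one-byte form of "/", and the decoder trusts the lead byte's length without rejecting non-shortest forms. The payload uses C0 AF, a two-byte non-shortest encoding of "/".

```python
def naive_decode(data: bytes) -> str:
    """Hypothetical lenient decoder: masks and shifts bits, never checks shortest form."""
    out, i = [], 0
    while i < len(data):
        b = data[i]
        if b < 0x80:
            cp, extra = b, 0
        elif b >> 5 == 0b110:
            cp, extra = b & 0x1F, 1
        elif b >> 4 == 0b1110:
            cp, extra = b & 0x0F, 2
        else:
            cp, extra = b & 0x07, 3
        for cont in data[i + 1 : i + 1 + extra]:
            cp = (cp << 6) | (cont & 0x3F)
        out.append(chr(cp))
        i += 1 + extra
    return "".join(out)


def process_a_filter(data: bytes) -> bool:
    """Process A: byte-level check that only knows the shortest form of '/'."""
    return b"/" not in data


def strict_rejects(data: bytes) -> bool:
    """True if Python's built-in strict UTF-8 decoder refuses the input."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return True
    return False


payload = b"..\xc0\xaf..\xc0\xafetc"      # C0 AF is a non-shortest form of '/'
passed_filter = process_a_filter(payload)  # step 1: the check sees no '/'
decoded_by_b = naive_decode(payload)       # step 2: process B interprets C0 AF
as_utf16 = decoded_by_b.encode("utf-16-le")  # step 3: the UTF-16 text now holds '/'
```

Here the payload passes process A's filter, yet process B's output contains the path separators the filter was meant to block. A strict decoder, such as Python's built-in one, rejects the payload outright.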

Another requirement is that encodings of individual (unpaired) or out-of-order surrogate halves must not be permitted.
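As a hedged illustration of the surrogate requirement, the sketch below feeds byte sequences that would decode to surrogate halves into Python's built-in strict UTF-8 codec. The helper `is_strict_utf8` and the sample byte strings are assumptions made for this example.

```python
def is_strict_utf8(data: bytes) -> bool:
    """True if Python's built-in strict UTF-8 decoder accepts the input."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


lone_high = b"\xed\xa0\x80"       # would decode to the high surrogate U+D800
lone_low = b"\xed\xb0\x80"        # would decode to the low surrogate U+DC00
out_of_order = lone_low + lone_high

results = {
    "lone_high": is_strict_utf8(lone_high),
    "lone_low": is_strict_utf8(lone_low),
    "out_of_order": is_strict_utf8(out_of_order),
}

# Encoding is refused as well: a lone surrogate in a str cannot become UTF-8.
try:
    "\ud800".encode("utf-8")
    encode_refused = False
except UnicodeEncodeError:
    encode_refused = True
```

A decoder that instead passes these sequences through hands surrogate code points to later processing stages that may not expect them.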

 Reference