...

The classical US-ASCII characters (0 to 0x7f) encode as themselves, so files and strings that are encoded with ASCII values have the same encoding under both ASCII and UTF-8.
All UCS characters beyond 0x7f are encoded as a multibyte sequence consisting only of bytes in the range 0x80 to 0xfd. This means that no ASCII byte (including a null byte) can appear as part of another character. This property allows standard byte-oriented string handling functions to be used on UTF-8 strings.
It's easy to convert between UTF-8 and the fixed-width UCS-2 and UCS-4 representations of characters.
The lexicographic sorting order of UCS-4 strings is preserved.
All possible 2^31 UCS codes can be encoded using UTF-8.
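
The properties above can be illustrated with a minimal encoder sketch. The function name `utf8_encode` and its signature are hypothetical and do not come from the original text. It follows the 31-bit, up-to-six-byte scheme described here; note that the current UTF-8 definition (RFC 3629) restricts code points to U+10FFFF and sequences to four bytes, so a production encoder would likely enforce that tighter limit.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (an illustration, not part of the original text):
   encodes one UCS code point below 2^31 into out[] and returns the number
   of bytes written (1 to 6), or 0 if cp is out of range. */
size_t utf8_encode(uint32_t cp, unsigned char out[6]) {
  size_t n, i;

  if (cp < 0x80) {               /* ASCII encodes as itself */
    out[0] = (unsigned char)cp;
    return 1;
  }
  else if (cp < 0x800)       n = 2;
  else if (cp < 0x10000)     n = 3;
  else if (cp < 0x200000)    n = 4;
  else if (cp < 0x4000000)   n = 5;
  else if (cp < 0x80000000u) n = 6;
  else return 0;

  /* Continuation bytes, filled from the end: 10xxxxxx, six payload bits each. */
  for (i = n - 1; i > 0; i--) {
    out[i] = (unsigned char)(0x80 | (cp & 0x3f));
    cp >>= 6;
  }
  /* Lead byte: n high bits set, one zero bit, then the remaining payload bits. */
  out[0] = (unsigned char)(((0xffu << (8 - n)) & 0xffu) | cp);
  return n;
}
```

For example, U+0041 is written as the single byte 0x41, while U+20AC is written as the three bytes 0xE2 0x82 0xAC, none of which falls in the ASCII range.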

Generally, programs should validate any UTF-8 data before performing other checks on it. The table below lists all valid UTF-8 sequences.

...

Below are more specific recommendations.

Accept Only the "Shortest" Form

Only the "shortest" form of UTF-8 should be permitted. Naive decoders might accept encodings that are longer than necessary, allowing potentially dangerous input to have multiple representations. For example:

Process A performs security checks, but does not check for non-shortest UTF-8 forms.
Process B accepts the byte sequence from process A and transforms it into UTF-16 while interpreting possible non-shortest forms.
The UTF-16 text may then contain characters that should have been filtered out by process A and could potentially be dangerous. These non-"shortest" UTF-8 attacks have been used to bypass security validations in high-profile products, such as Microsoft's IIS web server.
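
The check that process A omits in the scenario above can be sketched as follows. This is a minimal illustration under stated assumptions, not a definitive implementation: the function name `utf8_is_shortest_form` is hypothetical, the input is assumed to be a NUL-terminated byte string, and the sequence lengths follow the up-to-six-byte scheme described in this document. The idea is to decode each sequence and reject it when the resulting code point is smaller than the minimum that requires that many bytes.

```c
#include <stdint.h>

/* Hypothetical helper (an illustration, not part of the original text):
   returns 1 if the NUL-terminated string s is structurally valid UTF-8 and
   every sequence uses the shortest possible form; otherwise returns 0. */
int utf8_is_shortest_form(const unsigned char *s) {
  /* Smallest code point that legitimately needs n bytes, indexed by n. */
  static const uint32_t min_cp[7] =
    { 0, 0, 0x80, 0x800, 0x10000, 0x200000, 0x4000000 };

  while (*s) {
    uint32_t cp;
    int n, i;

    if (*s < 0x80)                { n = 1; cp = *s; }
    else if ((*s & 0xe0) == 0xc0) { n = 2; cp = *s & 0x1f; }
    else if ((*s & 0xf0) == 0xe0) { n = 3; cp = *s & 0x0f; }
    else if ((*s & 0xf8) == 0xf0) { n = 4; cp = *s & 0x07; }
    else if ((*s & 0xfc) == 0xf8) { n = 5; cp = *s & 0x03; }
    else if ((*s & 0xfe) == 0xfc) { n = 6; cp = *s & 0x01; }
    else return 0;  /* stray continuation byte, 0xfe, or 0xff */

    /* Ascending order: a premature NUL fails the test before any byte
       past the terminator is read. */
    for (i = 1; i < n; i++) {
      if ((s[i] & 0xc0) != 0x80) return 0;
      cp = (cp << 6) | (s[i] & 0x3f);
    }
    if (cp < min_cp[n]) return 0;  /* overlong (non-shortest) encoding */
    s += n;
  }
  return 1;
}
```

With this sketch, the single byte 0x2F ('/') is accepted, while the two-byte sequence 0xC0 0xAF, which a naive decoder would also interpret as '/', is rejected.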

Handling Invalid Inputs

UTF-8 decoders have no uniformly defined behavior upon encountering an invalid input. Below are several ways a UTF-8 decoder might behave in the event of an invalid byte sequence:

...

The following function from [Viega 03] detects invalid character sequences in a string but does not reject non-minimal forms. It returns 1 if the string consists only of legitimate sequences; otherwise, it returns 0.

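/* Adapted from [Viega 03]. */
int spc_utf8_isvalid(const unsigned char *input) {
  int nb;  /* number of continuation bytes expected after the lead byte */
  int i;
  const unsigned char *c;

  for (c = input;  *c;  c += (nb + 1)) {
    if (!(*c & 0x80)) nb = 0;                /* ASCII byte */
    else if ((*c & 0xc0) == 0x80) return 0;  /* stray continuation byte */
    else if ((*c & 0xe0) == 0xc0) nb = 1;
    else if ((*c & 0xf0) == 0xe0) nb = 2;
    else if ((*c & 0xf8) == 0xf0) nb = 3;
    else if ((*c & 0xfc) == 0xf8) nb = 4;
    else if ((*c & 0xfe) == 0xfc) nb = 5;
    else return 0;                           /* 0xfe and 0xff are never valid */
    /* Check continuation bytes in ascending order so that a premature
       NUL fails the test before any byte past the terminator is read. */
    for (i = 1; i <= nb; i++)
      if ((*(c + i) & 0xc0) != 0x80) return 0;
  }
  return 1;
}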

...

Failing to properly handle UTF-8 encoded data can result in a data integrity violation or a denial-of-service attack.

Recommendation	Severity	Likelihood	Remediation Cost	Priority	Level
MSC10-A	medium	unlikely	high	P2	L3

...
