...

Insert a replacement character (e.g. "?", the wildcard character).
Ignore the bytes.
Interpret the bytes according to a different character encoding (often the ISO-8859-1 character map).
Fail to notice the problem and decode the bytes as if they were a similar, valid UTF-8 sequence.
Stop decoding and report an error.
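
Two of the behaviors listed above, inserting a replacement character and stopping with an error, can be contrasted in a short sketch. The function names (seq_len, utf8_replace_invalid, utf8_first_invalid) are hypothetical and do not come from the original text, and the check is structural only: it does not reject overlong forms or surrogates.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: length (1..4) of the structurally well-formed
 * sequence starting at s, or 0 if s[0] cannot start a sequence or a
 * continuation byte is missing. Structural check only. */
static size_t seq_len(const unsigned char *s, size_t avail) {
  size_t need, i;
  if (s[0] < 0x80) need = 1;
  else if ((s[0] & 0xe0) == 0xc0) need = 2;
  else if ((s[0] & 0xf0) == 0xe0) need = 3;
  else if ((s[0] & 0xf8) == 0xf0) need = 4;
  else return 0;                      /* stray continuation or invalid lead */
  if (need > avail) return 0;         /* sequence truncated by end of buffer */
  for (i = 1; i < need; i++)
    if ((s[i] & 0xc0) != 0x80) return 0;
  return need;
}

/* Policy "replace": overwrite each offending byte with '?' in place.
 * Returns the number of bytes replaced. */
size_t utf8_replace_invalid(unsigned char *buf, size_t len) {
  size_t i = 0, replaced = 0;
  assert(buf != NULL);
  while (i < len) {
    size_t n = seq_len(buf + i, len - i);
    if (n == 0) {
      buf[i] = '?';
      replaced++;
      i++;
    } else {
      i += n;
    }
  }
  return replaced;
}

/* Policy "stop and report": return the offset of the first offending
 * byte, or len if the whole buffer is structurally well formed. */
size_t utf8_first_invalid(const unsigned char *buf, size_t len) {
  size_t i = 0;
  assert(buf != NULL);
  while (i < len) {
    size_t n = seq_len(buf + i, len - i);
    if (n == 0) return i;
    i += n;
  }
  return len;
}
```

The replacing variant keeps processing and hides the problem from later code, while the reporting variant lets the caller reject the input outright, which is usually the safer choice for security-sensitive data.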

The following function, based on Viega 03, detects malformed byte sequences in a NUL-terminated string. It checks only the structure of lead and continuation bytes. It returns 1 if the string consists only of structurally legitimate sequences; otherwise it returns 0:

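int spc_utf8_isvalid(const unsigned char *input) {
  int nb;  /* number of continuation bytes expected after the lead byte */
  int i;
  const unsigned char *c;

  for (c = input;  *c;  c += (nb + 1)) {
    if (!(*c & 0x80)) nb = 0;                 /* 0xxxxxxx: single-byte (ASCII) */
    else if ((*c & 0xc0) == 0x80) return 0;   /* stray continuation byte */
    else if ((*c & 0xe0) == 0xc0) nb = 1;
    else if ((*c & 0xf0) == 0xe0) nb = 2;
    else if ((*c & 0xf8) == 0xf0) nb = 3;
    else if ((*c & 0xfc) == 0xf8) nb = 4;
    else if ((*c & 0xfe) == 0xfc) nb = 5;
    else return 0;                            /* 0xfe and 0xff are never valid lead bytes */
    /* Each of the next nb bytes must have the form 10xxxxxx. nb is left
       unchanged so that the loop increment skips the whole sequence. */
    for (i = 1; i <= nb; i++)
      if ((*(c + i) & 0xc0) != 0x80) return 0;
  }
  return 1;
}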

Broken Surrogates

The most recent requirement for UTF-8 encoding is that encodings of individual or out-of-order surrogate halves should not be permitted. Broken surrogates are illegal in Unicode, so they introduce ambiguity when they appear in Unicode data. As with other ill-formed input, they could be used to create strings that appear identical but are not, particularly when applications ignore the bad data. Broken surrogates may be a sign of bad data transmission. They may also indicate internal bugs in an application or intentional efforts to find security problems.
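
As a minimal sketch of how such data might be detected, the function below scans a buffer for UTF-8-encoded surrogate code points (U+D800 through U+DFFF), which appear as the lead byte ED followed by a byte in the range A0 through BF. The name utf8_has_surrogate is hypothetical. The sketch flags every encoded surrogate, paired or not, on the assumption that the input is meant to be strict UTF-8, which encodes supplementary characters directly as four-byte sequences rather than as surrogate pairs.

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 if buf contains the start of a UTF-8-encoded surrogate
 * (ED followed by A0..BF), else 0. In well-formed UTF-8, ED may only
 * be followed by 80..9F, so two bytes are enough to decide. */
int utf8_has_surrogate(const unsigned char *buf, size_t len) {
  size_t i;
  assert(buf != NULL);
  for (i = 0; i + 1 < len; i++) {
    if (buf[i] == 0xed && (buf[i + 1] & 0xe0) == 0xa0)
      return 1;
  }
  return 0;
}
```

For example, the bytes ED A0 80 (an encoded U+D800) are flagged, while ED 9F BF (U+D7FF) and EE 80 80 (U+E000), the code points on either side of the surrogate range, are not.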

...

The RFC describes the problem this way: Implementers of UTF-8 need to consider the security aspects of how they handle illegal UTF-8 sequences. It is conceivable that in some circumstances an attacker would be able to exploit an incautious UTF-8 parser by sending it an octet (byte) sequence that is not permitted by the UTF-8 syntax. A particularly subtle form of this attack could be carried out against a parser which performs security-critical validity checks against the UTF-8 encoded form of its input, but interprets certain illegal octet sequences as characters. For example, a parser might prohibit the NUL character when encoded as the single-octet sequence 00, but allow the illegal two-octet sequence C0 80 (illegal because it's longer than necessary) and interpret it as a NUL character (00). Another example might be a parser which prohibits the octet sequence 2F 2E 2E 2F ("/../"), yet permits the illegal octet sequence 2F C0 AE 2E 2F.
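
The overlong sequences in the passage above can be caught before any security check is applied. The following is a minimal sketch, assuming the input has otherwise been checked for structural validity. The name utf8_has_overlong is hypothetical. It relies on the fact that a two-byte sequence with lead byte C0 or C1, a three-byte sequence starting E0 80..9F, and a four-byte sequence starting F0 80..8F all encode values that fit in a shorter form.

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 if buf contains the start of an overlong (non-shortest-form)
 * UTF-8 sequence, else 0. */
int utf8_has_overlong(const unsigned char *buf, size_t len) {
  size_t i;
  assert(buf != NULL);
  for (i = 0; i < len; i++) {
    unsigned char b = buf[i];
    /* C0 and C1 can only encode values below 0x80 in two bytes. */
    if (b == 0xc0 || b == 0xc1) return 1;
    if (i + 1 < len) {
      unsigned char next = buf[i + 1];
      /* E0 80..9F encodes a value below 0x800 in three bytes. */
      if (b == 0xe0 && (next & 0xe0) == 0x80) return 1;
      /* F0 80..8F encodes a value below 0x10000 in four bytes. */
      if (b == 0xf0 && (next & 0xf0) == 0x80) return 1;
    }
  }
  return 0;
}
```

With this check, both C0 80 and 2F C0 AE 2E 2F are reported as overlong, while the plain 2F 2E 2E 2F ("/../") is not and is left for the ordinary path check to reject.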
http://www.dwheeler.com/secure-programs/Secure-Programs-HOWTO/character-encoding.html
http://en.wikipedia.org/wiki/UTF-8
Viega 03
