...

  1. Insert a replacement character (e.g. "?", the wildcard character).
  2. Ignore the bytes.
  3. Interpret the bytes according to a different character encoding (often ISO-8859-1).
  4. Fail to notice the problem and decode the bytes as if they were a similar-looking UTF-8 sequence.
  5. Stop decoding and report an error.
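
As a rough illustration of how some of these behaviors differ, here is a minimal Python sketch. The function name `decode_options` and the sample bytes are hypothetical, chosen only for this example. Python's built-in "replace" handler inserts U+FFFD rather than "?", and option 4 (silently misdecoding) is not shown because Python's decoder does not do it.

```python
def decode_options(raw: bytes) -> dict:
    """Decode the same bytes under several error-handling policies."""
    results = {}
    # Option 1: substitute a replacement character (U+FFFD in Python).
    results["replace"] = raw.decode("utf-8", errors="replace")
    # Option 2: silently drop the invalid bytes.
    results["ignore"] = raw.decode("utf-8", errors="ignore")
    # Option 3: reinterpret all the bytes as ISO-8859-1, which never fails.
    results["latin-1"] = raw.decode("iso-8859-1")
    # Option 5: stop and report an error (Python's default, "strict").
    try:
        raw.decode("utf-8")
        results["strict"] = "ok"
    except UnicodeDecodeError as exc:
        results["strict"] = f"error at byte {exc.start}"
    return results


# Hypothetical input: 0xE9 is "é" in ISO-8859-1 but is not valid UTF-8 here.
for policy, outcome in decode_options(b"caf\xe9!").items():
    print(policy, repr(outcome))
```

Note that the "ignore" and "replace" policies yield different strings from the same input, which is exactly the kind of inconsistency between applications that makes invalid sequences a concern.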

Broken Surrogates

The most recent requirement for UTF-8 encoding is that individual or out-of-order surrogate halves must not be encoded. Broken surrogates are illegal in Unicode, so they introduce ambiguity when they appear in Unicode data. Again, they could be used to create strings that appear similar but are not actually the same, particularly when applications ignore the bad data. Broken surrogates may be a sign of bad data transmission. They may also indicate internal bugs in an application or intentional attempts to probe for security problems.
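
A minimal Python sketch of checking for broken surrogates follows. The helper names are hypothetical and exist only for this example. It relies on the fact that Python 3 strings hold code points, so any code point in the range U+D800 to U+DFFF found in a string is an unpaired surrogate half, and on the fact that Python's strict UTF-8 codec rejects surrogates in both directions.

```python
def has_broken_surrogate(text: str) -> bool:
    """Return True if the string contains a lone surrogate code point."""
    return any(0xD800 <= ord(ch) <= 0xDFFF for ch in text)


def encodes_cleanly(text: str) -> bool:
    """Return True if the string can be encoded as strict UTF-8."""
    try:
        text.encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False


def decodes_cleanly(raw: bytes) -> bool:
    """Return True if the bytes are well-formed strict UTF-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# A lone high surrogate is rejected; a full supplementary character is fine.
print(has_broken_surrogate("a\ud800b"))    # lone surrogate half
print(encodes_cleanly("\U0001F600"))       # valid supplementary character
# ED A0 80 is the three-byte form of U+D800, which strict UTF-8 forbids.
print(decodes_cleanly(b"\xed\xa0\x80"))
```

Rejecting such input outright, rather than ignoring or repairing it, avoids the ambiguity described above, though whether to reject or replace is a policy choice for each application.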


...