...
The UTF-8 encoding scheme is fairly simple, but a few clarifications are important for security. One of the most important is the requirement that only the "shortest" form of a UTF-8 sequence be permitted. A naive decoder may accept encodings that are longer than necessary, which means that potentially dangerous input could be represented in multiple ways, defeating security checks that look for that input. For example:
- Process A performs security checks, but does not check for non-shortest UTF-8 forms.
- Process B accepts the byte sequence from process A and transforms it into UTF-16, interpreting any non-shortest forms along the way.
- The UTF-16 text may then contain characters that process A should have filtered out, and that could be dangerous.
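The scenario above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `process_a_allows` and `naive_decode` are hypothetical names invented here, the "security check" is reduced to looking for a literal `/` byte, and the naive decoder handles only one- to three-byte forms. The built-in `bytes.decode("utf-8")` is shown for contrast because it rejects non-shortest forms.

```python
def process_a_allows(data: bytes) -> bool:
    """Hypothetical process A: rejects any input containing a literal '/' byte."""
    return b"/" not in data


def naive_decode(data: bytes) -> str:
    """Hypothetical naive decoder for 1- to 3-byte forms.

    It never checks for the shortest form and never validates
    continuation bytes, which is exactly the flaw being illustrated.
    """
    out = []
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:
            cp, length = lead, 1
        elif lead & 0xE0 == 0xC0:
            cp, length = lead & 0x1F, 2
        elif lead & 0xF0 == 0xE0:
            cp, length = lead & 0x0F, 3
        else:
            raise ValueError("lead byte not handled by this sketch")
        # Fold in the six payload bits of each continuation byte.
        for cont in data[i + 1 : i + length]:
            cp = (cp << 6) | (cont & 0x3F)
        out.append(chr(cp))
        i += length
    return "".join(out)


# 0xC0 0xAF is a two-byte (non-shortest) encoding of U+002F, '/'.
payload = b"..\xc0\xafetc\xc0\xafpasswd"

print(process_a_allows(payload))  # the byte-level check sees no '/'
print(naive_decode(payload))      # the naive decoder reconstructs the slashes

try:
    payload.decode("utf-8")       # Python's built-in decoder is strict
except UnicodeDecodeError as exc:
    print("rejected:", exc.reason)
```

The point of the sketch is the ordering: the check runs on bytes, the lenient decoding happens afterwards, and the filtered character reappears. Rejecting non-shortest forms at the first point of entry closes that gap.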
Another requirement is that encodings of individual (unpaired) or out-of-order surrogate halves should not be permitted.
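As a rough sketch of what such a check might look like, the Python below scans a byte string for UTF-8-style encodings of surrogate code points (U+D800 through U+DFFF). In the three-byte form these always begin with 0xED followed by a byte in the range 0xA0 to 0xBF. The function name `contains_encoded_surrogate` is a hypothetical helper introduced for illustration; it flags any encoded surrogate, paired or not, since none belongs in well-formed UTF-8.

```python
def contains_encoded_surrogate(data: bytes) -> bool:
    """Hypothetical helper: True if data holds a 3-byte encoding of U+D800..U+DFFF.

    0xED is always a lead byte, so a following byte in 0xA0..0xBF
    can only mean a surrogate code point.
    """
    return any(
        data[i] == 0xED and 0xA0 <= data[i + 1] <= 0xBF
        for i in range(len(data) - 1)
    )


lone_high_surrogate = b"\xed\xa0\x80"      # encodes U+D800
ordinary_text = "\ud55c".encode("utf-8")   # U+D55C, a valid character just below the surrogate range

print(contains_encoded_surrogate(lone_high_surrogate))
print(contains_encoded_surrogate(ordinary_text))

try:
    lone_high_surrogate.decode("utf-8")    # Python's built-in decoder refuses surrogates
except UnicodeDecodeError as exc:
    print("rejected:", exc.reason)
```

In practice a strict decoder such as Python's built-in one already enforces this rule; the explicit scan is mainly useful when bytes are passed along to another component without being decoded first.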
...