
...
UTF-8 is an example of a variable-width encoding for Unicode. UTF-8 uses 1 to 4 bytes per character, depending on the Unicode symbol. All possible 221 Unicode code points can be encoded using UTF-8. The following table lists the well-formed UTF-8 byte sequences.
Bits of | First | Last | Bytes in | Byte 1 | Byte 2 | Byte 3 | Byte 4 |
---|---|---|---|---|---|---|---|
7 | U+0000 | U+007F | 1 | 0xxxxxxx |
11 | U+0080 | U+07FF | 2 | 110xxxxx | 10xxxxxx |
---|
16 | U+0800 | U+FFFF | 3 | 1110xxxx | 10xxxxxx | 10xxxxxx |
---|
21 | U+0800 | U+FFFF | 3 | 1110xxxx | 10xxxxxx | 10xxxxxx | 10xxxxxx |
---|
UTF-8 is self-synchronizing: character boundaries are easily identified by scanning for well-defined bit patterns in either direction. If bytes are lost due to error or corruption, the beginning of the next valid character can be located and processing resumed. Many variable length encodings are harder to resynchronize. In some older variable-length encodings (such as Shift JIS), the end byte of a character and the first byte of the next character could look like another valid character [Phillips 2005].
...
Forming strings from character data containing partial characters can result in data corruption.
Rule | Severity | Likelihood | Remediation Cost | Priority | Level |
---|---|---|---|---|---|
STR00-J | Low | Unlikely | Medium | P2 | L3 |
Automated Detection
Tool | Version | Checker | Description |
---|---|---|---|
Parasoft Jtest |
|
|
| INTER.COS |
Do not use String concatenation in an Internationalized environment |
Bibliography
...
...