Comment: Parasoft Jtest 2020.2

...

UTF-8 is an example of a variable-width encoding for Unicode. UTF-8 uses 1 to 4 bytes per character, depending on the Unicode symbol. All possible 2^21 Unicode code points can be encoded using UTF-8. The following table lists the well-formed UTF-8 byte sequences.

Bits of      First        Last         Bytes in
Code Point   Code Point   Code Point   Sequence   Byte 1     Byte 2     Byte 3     Byte 4
7            U+0000       U+007F       1          0xxxxxxx
11           U+0080       U+07FF       2          110xxxxx   10xxxxxx
16           U+0800       U+FFFF       3          1110xxxx   10xxxxxx   10xxxxxx
21           U+10000      U+10FFFF     4          11110xxx   10xxxxxx   10xxxxxx   10xxxxxx
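The byte counts in the table can be checked from Java. The following is a minimal sketch, not part of the original rule; the class name Utf8LengthDemo and the method utf8Length are hypothetical names chosen for illustration. It encodes one sample code point from each range and reports how many bytes UTF-8 needs.

```java
import java.nio.charset.StandardCharsets;

final class Utf8LengthDemo {
    /** Returns the number of bytes UTF-8 uses to encode a single code point. */
    static int utf8Length(int codePoint) {
        // Character.toChars yields one char, or a surrogate pair for code points above U+FFFF.
        String s = new String(Character.toChars(codePoint));
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // One sample per row of the table: 7, 11, 16, and 21 bits of code point.
        int[] samples = {0x41, 0xE9, 0x20AC, 0x1F600};
        for (int cp : samples) {
            System.out.printf("U+%04X -> %d byte(s)%n", cp, utf8Length(cp));
        }
    }
}
```

Note that a code point above U+FFFF occupies two char values in a Java String but still encodes to a single 4-byte UTF-8 sequence.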

UTF-8 is self-synchronizing: character boundaries are easily identified by scanning for well-defined bit patterns in either direction. If bytes are lost due to error or corruption, the beginning of the next valid character can be located and processing resumed. Many variable-length encodings are harder to resynchronize. In some older variable-length encodings (such as Shift JIS), the end byte of one character and the first byte of the next character could together look like another valid character [Phillips 2005].
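Self-synchronization follows from the bit patterns: every continuation byte has the form 10xxxxxx, and no lead byte does. The sketch below illustrates one way to resynchronize under that assumption; the class Utf8Resync and its methods are hypothetical helpers, not an API from the rule or the JDK.

```java
import java.nio.charset.StandardCharsets;

final class Utf8Resync {
    /** True if b is a UTF-8 continuation byte (bit pattern 10xxxxxx). */
    static boolean isContinuation(byte b) {
        return (b & 0xC0) == 0x80;
    }

    /**
     * Returns the index of the first byte at or after 'from' that is not a
     * continuation byte, or bytes.length if no such byte exists.
     */
    static int nextBoundary(byte[] bytes, int from) {
        int i = from;
        while (i < bytes.length && isContinuation(bytes[i])) {
            i++;
        }
        return i;
    }

    public static void main(String[] args) {
        // "a", the euro sign (3 bytes: E2 82 AC), then "b".
        byte[] data = "a\u20ACb".getBytes(StandardCharsets.UTF_8);
        // Starting in the middle of the euro sign, skip ahead to the next character start.
        System.out.println(nextBoundary(data, 2)); // prints 4, the index of "b"
    }
}
```

The scan only locates a candidate boundary; it does not validate that the sequence starting there is well formed.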

...

Forming strings from character data containing partial characters can result in data corruption.

Rule      Severity   Likelihood   Remediation Cost   Priority   Level
STR00-J   Low        Unlikely     Medium             P2         L3
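The corruption risk described above can be reproduced in a few lines. This is an illustrative sketch rather than the rule's own example: the class PartialCharDemo, its method names, and the fixed chunk size are assumptions that stand in for reading data in buffers. Decoding each chunk on its own corrupts a character that straddles a chunk boundary, whereas collecting all bytes before decoding preserves it.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

final class PartialCharDemo {
    /** Decodes every chunk separately; a character split across chunks is corrupted. */
    static String decodePerChunk(byte[] data, int chunkSize) {
        StringBuilder sb = new StringBuilder();
        for (int off = 0; off < data.length; off += chunkSize) {
            int len = Math.min(chunkSize, data.length - off);
            // Malformed (partial) sequences are replaced with U+FFFD by this constructor.
            sb.append(new String(data, off, len, StandardCharsets.UTF_8));
        }
        return sb.toString();
    }

    /** Collects all chunks first, then decodes once over the complete byte sequence. */
    static String decodeWhole(byte[] data, int chunkSize) {
        ByteArrayOutputStream all = new ByteArrayOutputStream();
        for (int off = 0; off < data.length; off += chunkSize) {
            int len = Math.min(chunkSize, data.length - off);
            all.write(data, off, len);
        }
        return new String(all.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "na\u00EFve"; // U+00EF encodes as two bytes: C3 AF
        byte[] data = original.getBytes(StandardCharsets.UTF_8);
        // A chunk size of 3 splits the two-byte character between chunks.
        System.out.println(original.equals(decodePerChunk(data, 3))); // false
        System.out.println(original.equals(decodeWhole(data, 3)));    // true
    }
}
```

For streamed input, wrapping the stream in a Reader with an explicit charset is another common way to avoid decoding at arbitrary byte boundaries.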

Automated Detection

Tool             Version   Checker     Description
Parasoft Jtest   2020.2    INTER.COS   Implemented. Do not use String concatenation in an Internationalized environment

Bibliography

...



...