...
The trailing byte ranges overlap the range of both the single-byte and lead-byte characters. When a multibyte character is separated across a buffer boundary, it can be interpreted differently than if it were not separated across the buffer boundary; this difference arises because of the ambiguity of its composing bytes [Phillips 2005].
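The buffer-boundary problem can be sketched as follows. This is a minimal illustration, assuming UTF-8 input and fixed-size chunks; the class and method names are hypothetical. In UTF-8 a split character typically shows up as replacement characters, whereas in encodings with overlapping byte ranges, such as those described above, the fragments may decode as different valid characters. The second method keeps one stateful `CharsetDecoder` and carries incomplete trailing bytes into the next chunk.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.StandardCharsets;

public class SplitBufferDemo {
    // Decodes every chunk on its own, so a character whose bytes straddle
    // two chunks is decoded from incomplete fragments.
    static String decodeNaively(byte[] data, int chunkSize) {
        StringBuilder sb = new StringBuilder();
        for (int off = 0; off < data.length; off += chunkSize) {
            int len = Math.min(chunkSize, data.length - off);
            sb.append(new String(data, off, len, StandardCharsets.UTF_8));
        }
        return sb.toString();
    }

    // Uses a single decoder for all chunks; bytes of an incomplete character
    // stay in the input buffer until the rest of the character arrives.
    static String decodeStatefully(byte[] data, int chunkSize) {
        if (data.length == 0) {
            return "";
        }
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder();
        // Room for one chunk plus up to 3 leftover bytes of a partial character.
        ByteBuffer in = ByteBuffer.allocate(chunkSize + 4);
        // UTF-8 never yields more chars than it has bytes.
        CharBuffer out = CharBuffer.allocate(data.length);
        for (int off = 0; off < data.length; off += chunkSize) {
            int len = Math.min(chunkSize, data.length - off);
            in.put(data, off, len);
            in.flip();
            boolean endOfInput = off + len >= data.length;
            decoder.decode(in, out, endOfInput);
            in.compact(); // keep undecoded bytes for the next round
        }
        decoder.flush(out);
        out.flip();
        return out.toString();
    }
}
```

For example, with the text `"a\u00e9b"` encoded as UTF-8 and a chunk size of 2, the two bytes of `é` land in different chunks: `decodeNaively` no longer returns the original text, while `decodeStatefully` does.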
Supplementary Characters
The char data type is based on the original Unicode specification, which defined characters as fixed-width 16-bit entities. The Unicode Standard has since been changed to allow for characters whose representation requires more than 16 bits. The range of legal code points is now U+0000 to U+10FFFF, known as Unicode scalar value. Characters whose code points are greater than U+FFFF are called supplementary characters. Such characters are generally rare, but some are used, for example, as part of Chinese and Japanese personal names. To support supplementary characters without changing the char primitive data type and causing incompatibility with previous Java programs, supplementary characters are represented by a pair of 16-bit char values that are called surrogates. According to the Java API [API 2014] class Character documentation (Unicode Character Representations):
...
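To make the surrogate representation concrete, the following small sketch builds a string from one supplementary code point and inspects it. The choice of U+1D11E (MUSICAL SYMBOL G CLEF) and the class and method names are illustrative assumptions; only standard `Character` and `String` methods are used.

```java
public class SurrogateDemo {
    // U+1D11E lies above U+FFFF, so it is a supplementary character.
    static final int G_CLEF = 0x1D11E;

    // Builds a String holding exactly one code point.
    static String fromCodePoint(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    public static void main(String[] args) {
        String s = fromCodePoint(G_CLEF);

        // length() counts char values (UTF-16 code units), not characters.
        System.out.println(s.length());                              // 2
        System.out.println(s.codePointCount(0, s.length()));         // 1

        // The two char values are a high surrogate followed by a low surrogate.
        System.out.println(Character.isHighSurrogate(s.charAt(0)));  // true
        System.out.println(Character.isLowSurrogate(s.charAt(1)));   // true

        // codePointAt() reassembles the pair into the original code point.
        System.out.printf("U+%X%n", s.codePointAt(0));               // U+1D11E
    }
}
```

Code that indexes such a string with `charAt` sees the surrogates individually, which is the source of the failures discussed in this section.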
This noncompliant code example attempts to trim leading letters from a string. However, the method may fail because methods that accept only a char value cannot support supplementary characters. According to the Java API [API 2014] class Character documentation:
...
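The kind of failure described here can be reproduced with a short sketch. It is an illustration under stated assumptions, not necessarily the code this section refers to: the helper name is hypothetical, and U+20000 (a CJK ideograph outside the Basic Multilingual Plane) is chosen as a sample supplementary letter.

```java
public class CharTrimDemo {
    // Hypothetical helper: drops leading letters, testing one char value at a time.
    static String trimLeadingLettersByChar(String s) {
        int i = 0;
        while (i < s.length() && Character.isLetter(s.charAt(i))) {
            i++;
        }
        return s.substring(i);
    }

    public static void main(String[] args) {
        // U+20000 is stored as two surrogate char values.
        String ideograph = new String(Character.toChars(0x20000));
        String input = ideograph + "abc123";

        // The int overload sees the whole code point and reports a letter.
        System.out.println(Character.isLetter(0x20000));              // true
        // The char overload sees only a lone surrogate, which is not a letter.
        System.out.println(Character.isLetter(ideograph.charAt(0)));  // false

        // Plain BMP input is trimmed as expected.
        System.out.println(trimLeadingLettersByChar("abc123"));       // 123
        // With a leading supplementary letter, nothing is trimmed at all.
        System.out.println(trimLeadingLettersByChar(input).equals(input)); // true
    }
}
```

The loop stops at index 0 because the high surrogate on its own is not classified as a letter, so the leading ideograph and the letters after it are all left in place.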
This compliant solution works both for supplementary and for combining characters [Tutorials 2008]. According to the Java API [API 2006] class java.text.BreakIterator documentation:
...
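One way to trim leading letters along user-perceived character boundaries is sketched below. It is a minimal example, assuming that `BreakIterator.getCharacterInstance` with `Locale.ROOT` is acceptable and that a user-perceived character counts as a letter when its first code point is a letter; the class and method names are hypothetical, and this is not presented as the standard's own solution.

```java
import java.text.BreakIterator;
import java.util.Locale;

public class BreakIteratorTrim {
    // Drops leading letters, stepping over whole user-perceived characters
    // (a surrogate pair, or a base character plus its combining marks).
    static String trimLeadingLetters(String s) {
        BreakIterator it = BreakIterator.getCharacterInstance(Locale.ROOT);
        it.setText(s);
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE;
                start = end, end = it.next()) {
            // Assumption: classify each character by its first code point.
            if (!Character.isLetter(s.codePointAt(start))) {
                break;
            }
        }
        // start is either the first non-letter boundary or s.length().
        return s.substring(start);
    }

    public static void main(String[] args) {
        String ideograph = new String(Character.toChars(0x20000));
        System.out.println(trimLeadingLetters("abc123"));            // 123
        System.out.println(trimLeadingLetters(ideograph + "a1"));    // 1
    }
}
```

Because `codePointAt` is only called at boundaries strictly before the end of the text, the loop does not index past the string, and an all-letter or empty input yields the empty string.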
| Rule | Severity | Likelihood | Remediation Cost | Priority | Level |
|---|---|---|---|---|---|
| STR50-J | low | unlikely | medium | P2 | L3 |
Bibliography
[API 2014] | Classes |
Character Boundaries |