...

UTF-8 is a variable-width encoding for Unicode. It uses 1 to 4 bytes per character, depending on the code point being encoded. UTF-8 can represent the full Unicode code space (U+0000 through U+10FFFF), whose code points need at most 21 bits. The following table lists the well-formed UTF-8 byte sequences.

Bits of Code Point	First Code Point	Last Code Point	Bytes in Sequence	Byte 1	Byte 2	Byte 3	Byte 4
7	U+0000	U+007F	1	`0xxxxxxx`
11	U+0080	U+07FF	2	`110xxxxx`	`10xxxxxx`
16	U+0800	U+FFFF	3	`1110xxxx`	`10xxxxxx`	`10xxxxxx`
21	U+10000	U+10FFFF	4	`11110xxx`	`10xxxxxx`	`10xxxxxx`	`10xxxxxx`
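
The byte counts in the table can be checked against the Java platform's own UTF-8 encoder. The following is a minimal sketch, not part of the original rule; the class name `Utf8Lengths` and the helper `encodedLength` are hypothetical names chosen for illustration.

```java
import java.nio.charset.StandardCharsets;

final class Utf8Lengths {
    /** Returns the number of bytes the UTF-8 encoder produces for one code point. */
    static int encodedLength(int codePoint) {
        // Character.toChars yields one char, or a surrogate pair for code points above U+FFFF.
        String s = new String(Character.toChars(codePoint));
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // One sample code point from each row of the table.
        int[] samples = {0x0041, 0x00E9, 0x20AC, 0x1F600};
        for (int cp : samples) {
            System.out.printf("U+%04X uses %d byte(s)%n", cp, encodedLength(cp));
        }
    }
}
```

Surrogate code points (U+D800 through U+DFFF) are deliberately not sampled, because they cannot appear on their own in well-formed UTF-8.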

UTF-8 is self-synchronizing: character boundaries are easily identified by scanning for well-defined bit patterns in either direction. If bytes are lost due to error or corruption, the beginning of the next valid character can be located and processing resumed. Many variable-length encodings are harder to resynchronize. In some older variable-length encodings (such as Shift JIS), the trailing byte of one character and the leading byte of the next could together look like another valid character [Phillips 2005].
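
As an illustration of self-synchronization, the sketch below skips forward from an arbitrary offset to the next byte that is not a continuation byte (`10xxxxxx`). It is a simplified example under the assumption that the input is otherwise well-formed UTF-8; `Utf8Resync`, `isContinuation`, and `nextBoundary` are hypothetical names, not part of any library.

```java
import java.nio.charset.StandardCharsets;

final class Utf8Resync {
    /** True if the byte has the form 10xxxxxx, meaning it continues a multibyte character. */
    static boolean isContinuation(byte b) {
        return (b & 0xC0) == 0x80;
    }

    /**
     * Returns the index of the first byte at or after 'from' that starts a character,
     * or bytes.length if only continuation bytes remain.
     */
    static int nextBoundary(byte[] bytes, int from) {
        int i = from;
        while (i < bytes.length && isContinuation(bytes[i])) {
            i++;
        }
        return i;
    }

    public static void main(String[] args) {
        // "a", EURO SIGN, "b" encodes as 61 E2 82 AC 62.
        byte[] bytes = "a\u20ACb".getBytes(StandardCharsets.UTF_8);
        // Offset 2 lands in the middle of the euro sign; resume at the next boundary.
        int resume = nextBoundary(bytes, 2);
        System.out.println("Resume decoding at byte index " + resume);
    }
}
```

The same bit test works when scanning backward, which is why a reader that lands in the middle of a character can recover in either direction.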

...

Forming strings from character data containing partial characters can result in data corruption.
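
The corruption can be reproduced in a few lines. The sketch below is illustrative only: `PartialCharacters`, `decodeInChunks`, and `decodeWhole` are hypothetical names, and the fixed chunk size stands in for whatever buffer boundary a real reader would impose. Decoding each chunk separately splits a multibyte character, and the `String` constructor substitutes the replacement character U+FFFD for each malformed piece.

```java
import java.nio.charset.StandardCharsets;

final class PartialCharacters {
    /** Flawed approach: decodes each fixed-size chunk on its own and concatenates the results. */
    static String decodeInChunks(byte[] data, int chunkSize) {
        StringBuilder sb = new StringBuilder();
        for (int off = 0; off < data.length; off += chunkSize) {
            int len = Math.min(chunkSize, data.length - off);
            // A chunk boundary may fall inside a multibyte character.
            sb.append(new String(data, off, len, StandardCharsets.UTF_8));
        }
        return sb.toString();
    }

    /** Safer approach: decodes only after all bytes are available. */
    static String decodeWhole(byte[] data) {
        return new String(data, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] data = "\u20AC".getBytes(StandardCharsets.UTF_8); // E2 82 AC
        String broken = decodeInChunks(data, 2);
        String intact = decodeWhole(data);
        System.out.println("Chunked decode preserved the text: " + broken.equals(intact));
    }
}
```

Accumulating all bytes before constructing the string, as `decodeWhole` does, avoids the problem for data that fits in memory.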

Rule	Severity	Likelihood	Remediation Cost	Priority	Level
STR00-J	Low	Unlikely	Medium	P2	L3

Automated Detection

Tool	Version	Checker	Description
Parasoft Jtest	java:Parasoft_V	INTER.COS	Implemented. Do not use String concatenation in an Internationalized environment

Bibliography

[API 2014]	Classes `Character` and `BreakIterator`
[Java Tutorials]	Character Boundaries
[Phillips 2005]
[Seacord 2015]	STR00-J. Don't form strings containing partial characters from variable-width encodings LiveLesson
[Unicode 2012]

...

