
UTF-8 is a variable-width encoding for Unicode. UTF-8 uses 1 to 4 bytes per character, depending on the Unicode symbol. UTF-8 has the following properties:

The classical US-ASCII characters (0 to 0x7f) encode as themselves, so files and strings that are encoded with ASCII values have the same encoding under both ASCII and UTF-8.
All UCS characters beyond 0x7f are encoded as a multibyte sequence consisting only of bytes in the range of 0x80 to 0xfd. This means that no ASCII byte can appear as part of another character.
It is easy to convert between UTF-8 and UCS-2 and UCS-4 fixed-width representations of characters.
The lexicographic sorting order of UCS-4 strings is preserved.
All possible 2^21 UCS codes can be encoded using UTF-8.

Generally, programs should validate UTF-8 data before performing other checks. The following tables list the well-formed UTF-8 byte sequences.

UCS Code (HEX)	Binary UTF-8 Format	Legal UTF-8 Values (HEX)
00-7F	0xxxxxxx	00-7F
80-7FF	110xxxxx 10xxxxxx	C2-DF 80-BF
800-FFF	1110xxxx 10xxxxxx 10xxxxxx	E0 A0*-BF 80-BF
1000-FFFF	1110xxxx 10xxxxxx 10xxxxxx	E1-EF 80-BF 80-BF
10000-3FFFF	11110xxx 10xxxxxx 10xxxxxx 10xxxxxx	F0 90*-BF 80-BF 80-BF
40000-FFFFF	11110xxx 10xxxxxx 10xxxxxx 10xxxxxx	F1-F3 80-BF 80-BF 80-BF
100000-10FFFF	11110xxx 10xxxxxx 10xxxxxx 10xxxxxx	F4 80-8F* 80-BF 80-BF


Bits of code point	First code point	Last code point	Bytes in sequence	Byte 1	Byte 2	Byte 3	Byte 4
7	U+0000	U+007F	1	`0xxxxxxx`
11	U+0080	U+07FF	2	`110xxxxx`	`10xxxxxx`
16	U+0800	U+FFFF	3	`1110xxxx`	`10xxxxxx`	`10xxxxxx`
21	U+10000	U+1FFFFF	4	`11110xxx`	`10xxxxxx`	`10xxxxxx`	`10xxxxxx`
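
The bit layout in the preceding table can be turned into an encoder fairly directly. The following is a minimal sketch, not a definitive implementation. The name utf8_encode and its signature are hypothetical, and the sketch assumes the caller wants surrogate halves rejected and the Unicode limit of U+10FFFF enforced, which is stricter than the 21-bit upper bound shown in the table.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encodes code point cp into out (room for 4 bytes).
   Returns the number of bytes written, or 0 if cp is a surrogate half
   (U+D800..U+DFFF) or lies above U+10FFFF (assumed limits, see lead-in). */
size_t utf8_encode(uint32_t cp, unsigned char out[4]) {
  assert(out != NULL);
  if (cp <= 0x7F) {                                   /* 0xxxxxxx */
    out[0] = (unsigned char)cp;
    return 1;
  }
  if (cp <= 0x7FF) {                                  /* 110xxxxx 10xxxxxx */
    out[0] = (unsigned char)(0xC0 | (cp >> 6));
    out[1] = (unsigned char)(0x80 | (cp & 0x3F));
    return 2;
  }
  if (cp <= 0xFFFF) {                                 /* 1110xxxx + 2 more */
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0;       /* surrogate half */
    out[0] = (unsigned char)(0xE0 | (cp >> 12));
    out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[2] = (unsigned char)(0x80 | (cp & 0x3F));
    return 3;
  }
  if (cp <= 0x10FFFF) {                               /* 11110xxx + 3 more */
    out[0] = (unsigned char)(0xF0 | (cp >> 18));
    out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (unsigned char)(0x80 | (cp & 0x3F));
    return 4;
  }
  return 0;                                           /* out of range */
}
```

Because each branch picks the smallest row of the table that can hold the value, an encoder written this way only ever produces the shortest form. For example, U+20AC yields the three bytes E2 82 AC.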

Although UTF-8 originated from the Plan 9 developers [Pike 1993], Plan 9's own support covers only the low 16-bit range. In general, many "Unicode" systems support only the low 16-bit range, not the full 21-bit ISO 10646 code space [ISO/IEC 10646:2012].

Security-Related Issues

According to RFC 2279: UTF-8, a transformation format of ISO 10646 [Yergeau 1998],

Implementors of UTF-8 need to consider the security aspects of how they handle invalid UTF-8 sequences. It is conceivable that, in some circumstances, an attacker would be able to exploit an incautious UTF-8 parser by sending it an octet sequence that is not permitted by the UTF-8 syntax.
A particularly subtle form of this attack can be carried out against a parser that performs security-critical validity checks against the UTF-8 encoded form of its input, but interprets certain invalid octet sequences as characters. For example, a parser might prohibit the null character when encoded as the single-octet sequence 00, but allow the invalid two-octet sequence C0 80 and interpret it as a null character. Another example might be a parser which prohibits the octet sequence 2F 2E 2E 2F ("/../"), yet permits the invalid octet sequence 2F C0 AE 2E 2F.

Following are more specific recommendations.

Accept Only the Shortest Form

The UTF-8 encoding scheme is fairly simple, but a few clarifications are important for security reasons. One of the most important is the requirement that only the "shortest" form of UTF-8 be permitted. Naive decoders might accept encodings that are longer than necessary, allowing potentially dangerous input to have multiple representations. For example,

Process A performs security checks but does not check for nonshortest UTF-8 forms.
Process B accepts the byte sequence from process A and transforms it into UTF-16 while interpreting possible nonshortest forms.
The UTF-16 text may then contain characters that should have been filtered out by process A and can potentially be dangerous.

These "nonshortest" UTF-8 attacks have been used to bypass security validations in high-profile products, such as Microsoft's IIS Web server.

Corrigendum #1: UTF-8 Shortest Form to the Unicode Standard [Unicode 2006] describes modifications made to version 3.0 of the Unicode Standard to forbid the interpretation of the nonshortest forms.
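
To make the nonshortest-form problem concrete, the following sketch pairs a deliberately naive decoder with a shortest-form check. Both function names are hypothetical and are not taken from any of the cited sources. The sketch assumes the caller has already determined the sequence length (1 to 4) from the lead byte.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical naive decoder: strips the marker bits and concatenates the
   payload bits without validating the result. s must hold len bytes. */
uint32_t utf8_decode_naive(const unsigned char *s, size_t len) {
  static const unsigned char lead_mask[5] = {0, 0x7F, 0x1F, 0x0F, 0x07};
  uint32_t cp;
  size_t i;
  assert(s != NULL && len >= 1 && len <= 4);
  cp = s[0] & lead_mask[len];
  for (i = 1; i < len; i++)
    cp = (cp << 6) | (s[i] & 0x3F);
  return cp;
}

/* Hypothetical check: returns 1 if a code point decoded from len bytes
   could not have been encoded in fewer bytes, 0 for a nonshortest form. */
int utf8_is_shortest(uint32_t cp, size_t len) {
  /* Smallest code point that legitimately needs len bytes. */
  static const uint32_t min_cp[5] = {0, 0x0, 0x80, 0x800, 0x10000};
  assert(len >= 1 && len <= 4);
  return cp >= min_cp[len];
}
```

With these definitions, the two-byte sequence C0 AE from the RFC example decodes naively to 0x2E ("."), and utf8_is_shortest(0x2E, 2) returns 0 because 0x2E fits in a single byte. A decoder that runs such a check before handing data onward would reject the sequence instead of producing a "." that an earlier filter never saw.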

Handling Invalid Inputs

UTF-8 decoders have no uniformly defined behavior upon encountering an invalid input. Following are several ways that a UTF-8 decoder might behave in the event of an invalid byte sequence. Note that implementing these behaviors requires careful security considerations.

Substitute the replacement character U+FFFD, or a wildcard character such as "?" when U+FFFD is not available.
Ignore the bytes (for example, delete the invalid byte before the validation process; see "Unicode Technical Report #36, 3.5 Deletion of Code Points" for more information).
Interpret the bytes according to a different character encoding (often the ISO-8859-1 character map; other encodings, such as Shift_JIS, are known to trigger self-XSS and so are potentially dangerous).
Fail to notice the problem and decode as if the bytes were some similar bit of UTF-8.
Stop decoding and report an error.
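
As an illustration of the first behavior in the list, the following sketch copies a buffer and substitutes U+FFFD (encoded as EF BF BD) for each byte that does not begin a well-formed sequence. The names seq_len and utf8_replace_invalid are hypothetical. Emitting one replacement character per offending byte is a simplifying assumption; other substitution policies are possible. The byte ranges follow the usual well-formed ranges for the second byte, including the exclusion of surrogate halves after an ED lead byte.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: returns the length (1..4) of the well-formed
   sequence starting at s, or 0 if none starts there. n >= 1 is the
   number of bytes available at s. */
static size_t seq_len(const unsigned char *s, size_t n) {
  unsigned char lo = 0x80, hi = 0xBF;   /* allowed range for byte 2 */
  size_t len, i;
  if (s[0] <= 0x7F) return 1;
  if (s[0] >= 0xC2 && s[0] <= 0xDF) len = 2;
  else if (s[0] >= 0xE0 && s[0] <= 0xEF) len = 3;
  else if (s[0] >= 0xF0 && s[0] <= 0xF4) len = 4;
  else return 0;                        /* C0, C1, F5..FF, or stray 80..BF */
  if (s[0] == 0xE0) lo = 0xA0;          /* excludes nonshortest 3-byte forms */
  else if (s[0] == 0xED) hi = 0x9F;     /* excludes surrogate halves */
  else if (s[0] == 0xF0) lo = 0x90;     /* excludes nonshortest 4-byte forms */
  else if (s[0] == 0xF4) hi = 0x8F;     /* excludes values above U+10FFFF */
  if (n < len || s[1] < lo || s[1] > hi) return 0;
  for (i = 2; i < len; i++)
    if ((s[i] & 0xC0) != 0x80) return 0;
  return len;
}

/* Hypothetical sanitizer: copies in[0..n) to out, writing EF BF BD in
   place of each invalid byte. out must have room for 3 * n bytes.
   Returns the number of bytes written. */
size_t utf8_replace_invalid(const unsigned char *in, size_t n,
                            unsigned char *out) {
  size_t i = 0, o = 0, len, k;
  assert(in != NULL && out != NULL);
  while (i < n) {
    len = seq_len(in + i, n - i);
    if (len == 0) {
      out[o++] = 0xEF; out[o++] = 0xBF; out[o++] = 0xBD;
      i++;                              /* resynchronize at the next byte */
    } else {
      for (k = 0; k < len; k++) out[o++] = in[i++];
    }
  }
  return o;
}
```

Substitution keeps the length of the damage visible in the output, whereas silently deleting bytes can join two harmless fragments into a dangerous one. Any security check should still run on the sanitized output rather than on the original bytes.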

The following function from John Viega's "Protecting Sensitive Data in Memory" [Viega 2003] detects invalid character sequences in a string but does not reject nonminimal forms. It returns 1 if the string is composed only of legitimate sequences; otherwise, it returns 0.

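int spc_utf8_isvalid(const unsigned char *input) {
  int nb, i;
  const unsigned char *c;

  /* input must be NUL-terminated; nb is the number of continuation
     bytes expected after the current lead byte. */
  for (c = input;  *c;  c += (nb + 1)) {
    if (!(*c & 0x80)) nb = 0;
    else if ((*c & 0xc0) == 0x80) return 0;  /* stray continuation byte */
    else if ((*c & 0xe0) == 0xc0) nb = 1;
    else if ((*c & 0xf0) == 0xe0) nb = 2;
    else if ((*c & 0xf8) == 0xf0) nb = 3;
    /* 5- and 6-byte forms date from RFC 2279; RFC 3629 forbids them. */
    else if ((*c & 0xfc) == 0xf8) nb = 4;
    else if ((*c & 0xfe) == 0xfc) nb = 5;
    else return 0;  /* 0xfe and 0xff never appear in UTF-8 */
    /* Check the continuation bytes without modifying nb, so that the
       loop increment still advances past the whole sequence. */
    for (i = 1; i <= nb; i++)
      if ((c[i] & 0xc0) != 0x80) return 0;
  }
  return 1;
}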

Broken Surrogates

Encoding of individual or out-of-order surrogate halves should not be permitted. Broken surrogates are invalid in Unicode and introduce ambiguity when they appear in Unicode data. Broken surrogates are often signs of bad data transmission. They can also indicate internal bugs in an application or intentional efforts to find security vulnerabilities.
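
The same rule can be checked on UTF-16 data, for example on the output of a process that transforms UTF-8 into UTF-16. The following is a minimal sketch; the name utf16_surrogates_ok is hypothetical. It treats U+D800 to U+DBFF as high halves and U+DC00 to U+DFFF as low halves, and accepts a buffer only if every high half is immediately followed by a low half and no low half appears on its own.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical check: returns 1 if every surrogate in s[0..n) belongs to
   a well-ordered pair (high half, then low half), and 0 otherwise. */
int utf16_surrogates_ok(const uint16_t *s, size_t n) {
  size_t i;
  assert(s != NULL || n == 0);
  for (i = 0; i < n; i++) {
    if (s[i] >= 0xD800 && s[i] <= 0xDBFF) {           /* high half */
      if (i + 1 >= n || s[i + 1] < 0xDC00 || s[i + 1] > 0xDFFF)
        return 0;                                     /* unpaired high half */
      i++;                                            /* skip its low half */
    } else if (s[i] >= 0xDC00 && s[i] <= 0xDFFF) {
      return 0;                                       /* low half with no lead */
    }
  }
  return 1;
}
```

In UTF-8 input, a surrogate half would show up as a three-byte sequence whose first byte is ED and whose second byte is in the range A0 to BF, so a byte-level validator can reject it before any conversion to UTF-16 takes place.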


Risk Assessment

Failing to properly handle UTF-8-encoded data can result in a data integrity violation or denial-of-service attack.

Recommendation	Severity	Likelihood	Detectable	Repairable	Priority	Level
MSC10-C	Medium	Unlikely	No	No	P2	L3

Automated Detection

Tool	Version	Checker	Description
LDRA tool suite		176 S, 376 S	Partially implemented

Related Vulnerabilities

Search for vulnerabilities resulting from the violation of this rule on the CERT website.

Related Guidelines

SEI CERT C++ Coding Standard	VOID MSC10-CPP. Character encoding: UTF8-related issues
MITRE CWE	CWE-176, Failure to handle Unicode encoding
MITRE CWE	CWE-116, Improper encoding or escaping of output

Bibliography

[ISO/IEC 10646:2012]
[Kuhn 2006]	UTF-8 and Unicode FAQ for Unix/Linux
[Pike 1993]	"Hello World"
[Unicode 2006]
[Viega 2003]	Section 3.12, "Detecting Illegal UTF-8 Characters"
[Wheeler 2003]	Secure Programmer: Call Components Safely
[Yergeau 1998]	RFC 2279
