UTF-8 is a variable-width encoding for Unicode that uses one to four bytes per character, depending on the Unicode symbol. It has the following properties:

The classical US-ASCII characters (0 to 0x7f) encode as themselves, so files and strings that are encoded with ASCII values have the same encoding under both ASCII and UTF-8.
All UCS characters beyond 0x7f are encoded as multibyte sequences consisting only of bytes in the range 0x80 to 0xfd. This means that no ASCII byte (including a null byte) can appear as part of another character, so byte-oriented string handling functions continue to work on UTF-8 strings.
It's easy to convert between UTF-8 and UCS-2 and UCS-4 fixed-width representations of characters.
The lexicographic sorting order of UCS-4 strings is preserved.
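All possible 2^31 UCS codes can be encoded using the original UTF-8 scheme, which allows sequences of up to six bytes. RFC 3629 restricts UTF-8 to the Unicode range U+0000 to U+10FFFF, which needs at most four bytes.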
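
The first two properties can be illustrated with a minimal encoder sketch. The function name `utf8_encode` is hypothetical and does not come from the cited sources. It assumes the four-byte limit of RFC 3629 (code points up to U+10FFFF, surrogates excluded) rather than the full 2^31 code space.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not from the cited sources): encode one code point
 * as UTF-8. Writes 1 to 4 bytes to out and returns the number of bytes
 * written, or 0 if cp is a surrogate or lies above U+10FFFF.
 */
size_t utf8_encode(uint32_t cp, unsigned char out[4]) {
  if (cp <= 0x7F) {
    /* ASCII encodes as itself. */
    out[0] = (unsigned char)cp;
    return 1;
  }
  if (cp <= 0x7FF) {
    out[0] = (unsigned char)(0xC0 | (cp >> 6));
    out[1] = (unsigned char)(0x80 | (cp & 0x3F));
    return 2;
  }
  if (cp <= 0xFFFF) {
    if (cp >= 0xD800 && cp <= 0xDFFF) {
      return 0; /* UTF-16 surrogates are not valid scalar values */
    }
    out[0] = (unsigned char)(0xE0 | (cp >> 12));
    out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[2] = (unsigned char)(0x80 | (cp & 0x3F));
    return 3;
  }
  if (cp <= 0x10FFFF) {
    out[0] = (unsigned char)(0xF0 | (cp >> 18));
    out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (unsigned char)(0x80 | (cp & 0x3F));
    return 4;
  }
  return 0;
}
```

Every byte of a multibyte sequence produced this way has its high bit set, which is why no ASCII byte can appear inside the encoding of another character.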

...

Although UTF-8 originated from the Plan 9 developers \[[Pike 1993|AA. Bibliography#Pike 93]\], Plan 9's own support only covers the low 16-bit range. In general, many "Unicode" systems only support the low 16-bit range, not the full 31-bit ISO 10646 code space \[[ISO/IEC 10646:2003(E)|AA. Bibliography#ISO/IEC 10646-2003]\].

...

According to \[[Yergeau 1998|AA. Bibliography#Yergeau 98]\]:

Implementors of UTF-8 need to consider the security aspects of how they handle invalid UTF-8 sequences. It is conceivable that, in some circumstances, an attacker would be able to exploit an incautious UTF-8 parser by sending it an octet sequence that is not permitted by the UTF-8 syntax.
A particularly subtle form of this attack can be carried out against a parser which performs security-critical validity checks against the UTF-8 encoded form of its input, but interprets certain invalid octet sequences as characters. For example, a parser might prohibit the null character when encoded as the single-octet sequence 00, but allow the invalid two-octet sequence C0 80 and interpret it as a null character. Another example might be a parser which prohibits the octet sequence 2F 2E 2E 2F ("/../"), yet permits the invalid octet sequence 2F C0 AE 2E 2F.
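
To see why the two invalid sequences in the quotation are dangerous, consider a deliberately incautious two-byte decoder. This is an illustrative sketch only: `naive_decode2` is a hypothetical name, and the code shows what not to do rather than code taken from any real parser.

```c
/*
 * Hypothetical, intentionally unsafe decoder: combines the payload bits of
 * a two-byte sequence without checking that the result needs two bytes.
 */
unsigned int naive_decode2(unsigned char lead, unsigned char cont) {
  /* 5 payload bits from the lead byte, 6 from the continuation byte. */
  return ((unsigned int)(lead & 0x1F) << 6) | (unsigned int)(cont & 0x3F);
}
```

With this decoder, `naive_decode2(0xC0, 0x80)` yields 0x00 (the null character) and `naive_decode2(0xC0, 0xAE)` yields 0x2E ("."), so a byte-level filter that looks only for 00 or 2E would miss both.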

...

Only the "shortest" form of UTF-8 should be permitted. Naive decoders might accept encodings that are longer than necessary, allowing for potentially dangerous input to have multiple representations. For example:

Process A performs security checks, but does not check for non-shortest UTF-8 forms.
Process B accepts the byte sequence from process A and transforms it into UTF-16 while interpreting possible non-shortest forms.
The UTF-16 text may contain characters that should have been filtered out by process A and can potentially be dangerous. These non-"shortest" UTF-8 attacks have been used to bypass security validations in high-profile products, such as Microsoft's IIS web server.

[Corrigendum #1: UTF-8 Shortest Form|http://www.unicode.org/versions/corrigendum1.html] to the Unicode Standard \[[Unicode 2006|AA. Bibliography#Unicode 06]\] describes modifications to Version 3.0 of The Unicode Standard necessary to define what is meant by the shortest form.
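
A shortest-form check can be sketched as follows. This is a minimal illustration, not the function from the cited references: the name `utf8_is_shortest_form` is hypothetical, and the code assumes the RFC 3629 limits (at most four bytes per character, no surrogates, nothing above U+10FFFF). Each decoded value is compared against the smallest code point that legitimately needs that sequence length.

```c
#include <stddef.h>

/*
 * Hypothetical validator: returns 1 if the n bytes at s are well-formed,
 * shortest-form UTF-8, and 0 otherwise.
 */
int utf8_is_shortest_form(const unsigned char *s, size_t n) {
  size_t i = 0;
  while (i < n) {
    unsigned char b = s[i];
    size_t len;
    unsigned long cp;
    unsigned long min; /* smallest code point that needs len bytes */

    if (b < 0x80) {
      i++; /* ASCII byte */
      continue;
    } else if ((b & 0xE0) == 0xC0) {
      len = 2; cp = b & 0x1F; min = 0x80;
    } else if ((b & 0xF0) == 0xE0) {
      len = 3; cp = b & 0x0F; min = 0x800;
    } else if ((b & 0xF8) == 0xF0) {
      len = 4; cp = b & 0x07; min = 0x10000;
    } else {
      return 0; /* stray continuation byte or 5/6-byte lead byte */
    }

    if (n - i < len) {
      return 0; /* sequence is truncated */
    }
    for (size_t k = 1; k < len; k++) {
      if ((s[i + k] & 0xC0) != 0x80) {
        return 0; /* not a continuation byte */
      }
      cp = (cp << 6) | (unsigned long)(s[i + k] & 0x3F);
    }

    if (cp < min) {
      return 0; /* non-shortest (overlong) form, such as C0 80 */
    }
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
      return 0; /* outside the Unicode range, or a surrogate */
    }
    i += len;
  }
  return 1;
}
```

Running a check of this kind before any security filtering means that process A and process B in the scenario above see the same characters, because each character then has exactly one accepted byte representation.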

...

The following function from \[[Viega 2003|AA. Bibliography#Viega 03]\] detects invalid character sequences in a string but does not reject non-minimal forms. It returns {{1}} if the string is composed only of legitimate sequences; otherwise it returns {{0}}.

...

Recommendation	Severity	Likelihood	Remediation Cost	Priority	Level
MSC10-C	medium	unlikely	high	P2	L3

Automated Detection

...

Tool
Version
Checker
Description
Section
LDRA tool suite
...
Include Page
c:LDRA_V
c:LDRA_V

Related Vulnerabilities

Search for vulnerabilities resulting from the violation of this rule on the CERT website.

Related Guidelines

This rule appears in the C++ Secure Coding Standard as MSC10-CPP. Character Encoding - UTF8 Related Issues.

Bibliography

\[[ISO/IEC 10646:2003|AA. Bibliography#ISO/IEC 10646-2003]\]
\[[ISO/IEC PDTR 24772|AA. Bibliography#ISO/IEC PDTR 24772]\] "AJN Choice of Filenames and other External Identifiers"
\[[Kuhn 2006|AA. Bibliography#Kuhn 06]\]
\[[MITRE 2007|AA. Bibliography#MITRE 07]\] [CWE ID 176|http://cwe.mitre.org/data/definitions/176.html], "Failure to Handle Unicode Encoding," [CWE ID 116|http://cwe.mitre.org/data/definitions/116.html], "Improper Encoding or Escaping of Output"
\[[Pike 1993|AA. Bibliography#Pike 93]\]
\[[Unicode 2006|AA. Bibliography#Unicode 06]\]
\[[Viega 2003|AA. Bibliography#Viega 03]\] Section 3.12, "Detecting Illegal UTF-8 Characters"
\[[Wheeler 2003|AA. Bibliography#Wheeler 03]\]
\[[Yergeau 1998|AA. Bibliography#Yergeau 98]\]

...
