UTF-8 is a variable-width encoding for Unicode. UTF-8 uses one to four bytes per character, depending on the Unicode symbol. UTF-8 has the following properties:

  • The classical US-ASCII characters (0 to 0x7f) encode as themselves, so files and strings that are encoded with ASCII values have the same encoding under both ASCII and UTF-8.
  • All UCS characters beyond 0x7f are encoded as a multibyte sequence consisting only of bytes in the range 0x80 to 0xfd. This means that no ASCII byte (including a null byte) can appear as part of another character, so string-handling functions that search for or stop at ASCII bytes continue to work on UTF-8 data.
  • It's easy to convert between UTF-8 and UCS-2 and UCS-4 fixed-width representations of characters.
  • The lexicographic sorting order of UCS-4 strings is preserved.
  • All possible 2^31 UCS codes can be encoded using UTF-8.
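
The variable-width property can be illustrated with a small encoder. The following is a minimal sketch, not part of the original recommendation: `utf8_encode` is a hypothetical helper name, and the sketch assumes the one-to-four-byte forms described above, so it rejects surrogates and values above 0x10FFFF rather than emitting longer sequences.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (an illustration only): encode one code point as
 * UTF-8 into out[], which must have room for at least 4 bytes.
 * Returns the number of bytes written, or 0 if cp is a surrogate
 * (0xD800 to 0xDFFF) or is larger than 0x10FFFF.
 */
size_t utf8_encode(uint32_t cp, unsigned char out[4]) {
  if (cp <= 0x7F) {                 /* ASCII encodes as itself */
    out[0] = (unsigned char)cp;
    return 1;
  }
  if (cp <= 0x7FF) {                /* 110xxxxx 10xxxxxx */
    out[0] = (unsigned char)(0xC0 | (cp >> 6));
    out[1] = (unsigned char)(0x80 | (cp & 0x3F));
    return 2;
  }
  if (cp <= 0xFFFF) {               /* 1110xxxx 10xxxxxx 10xxxxxx */
    if (cp >= 0xD800 && cp <= 0xDFFF) {
      return 0;                     /* surrogates are not characters */
    }
    out[0] = (unsigned char)(0xE0 | (cp >> 12));
    out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[2] = (unsigned char)(0x80 | (cp & 0x3F));
    return 3;
  }
  if (cp <= 0x10FFFF) {             /* 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
    out[0] = (unsigned char)(0xF0 | (cp >> 18));
    out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (unsigned char)(0x80 | (cp & 0x3F));
    return 4;
  }
  return 0;                         /* outside the range this sketch handles */
}
```

Under these assumptions, 0x41 encodes as the single byte 0x41, and 0x20AC encodes as the three bytes E2 82 AC. Every byte of a multibyte sequence has its high bit set, which is why no ASCII byte can appear inside another character.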

...

Although UTF-8 originated from the Plan 9 developers \[[Pike 1993|AA. Bibliography#Pike 93]\], Plan 9's own support only covers the low 16-bit range. In general, many "Unicode" systems only support the low 16-bit range, not the full 31-bit ISO 10646 code space \[[ISO/IEC 10646:2003(E)|AA. Bibliography#ISO/IEC 10646-2003]\].

...

According to \[[Yergeau 1998|AA. Bibliography#Yergeau 98]\]:

Implementors of UTF-8 need to consider the security aspects of how they handle invalid UTF-8 sequences. It is conceivable that, in some circumstances, an attacker would be able to exploit an incautious UTF-8 parser by sending it an octet sequence that is not permitted by the UTF-8 syntax.

A particularly subtle form of this attack can be carried out against a parser that performs security-critical validity checks against the UTF-8 encoded form of its input, but interprets certain invalid octet sequences as characters. For example, a parser might prohibit the null character when encoded as the single-octet sequence 00, but allow the invalid two-octet sequence C0 80 and interpret it as a null character. Another example might be a parser which prohibits the octet sequence 2F 2E 2E 2F ("/../"), yet permits the invalid octet sequence 2F C0 AE 2E 2F.
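
To see why such sequences are dangerous, consider how an incautious decoder might handle a two-octet sequence. The following deliberately unsafe sketch is an assumption about what such a decoder could look like, not code from any particular product; `naive_decode2` is a hypothetical name.

```c
#include <stdint.h>

/*
 * Deliberately UNSAFE illustration: decode a two-octet sequence by
 * masking and shifting only. It never checks that the resulting code
 * point actually requires two octets, so it accepts overlong forms.
 */
uint32_t naive_decode2(unsigned char b0, unsigned char b1) {
  return ((uint32_t)(b0 & 0x1F) << 6) | (uint32_t)(b1 & 0x3F);
}
```

With this decoder, the invalid sequence C0 80 yields 0x00 (the null character) and C0 AE yields 0x2E ("."), so a filter that only looks for the bytes 00 or 2E in the encoded input would miss both.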

...

Only the "shortest" form of UTF-8 should be permitted. Naive decoders might accept encodings that are longer than necessary, allowing for potentially dangerous input to have multiple representations. For example:

  1. Process A performs security checks, but does not check for non-shortest UTF-8 forms.
  2. Process B accepts the byte sequence from process A and transforms it into UTF-16 while interpreting possible non-shortest forms.
  3. The UTF-16 text may contain characters that should have been filtered out by process A and can potentially be dangerous. These non-"shortest" UTF-8 attacks have been used to bypass security validations in high-profile products, such as Microsoft's IIS web server.

[Corrigendum #1: UTF-8 Shortest Form|http://www.unicode.org/versions/corrigendum1.html] to the Unicode Standard \[[Unicode 2006|AA. Bibliography#Unicode 06]\] describes modifications to Version 3.0 of The Unicode Standard necessary to define what is meant by the shortest form.
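
One way to enforce the shortest-form requirement is to decode each sequence and compare the result with the smallest code point that needs that many bytes. The following is a minimal sketch under stated assumptions: `utf8_is_shortest_form` is a hypothetical name, it is not the function from Viega 2003, and it assumes sequences of at most four bytes, with surrogates and values above 0x10FFFF treated as invalid.

```c
#include <stddef.h>

/*
 * Hypothetical validator (an illustration only): returns 1 if the len
 * bytes at s are well-formed, shortest-form UTF-8; otherwise returns 0.
 */
int utf8_is_shortest_form(const unsigned char *s, size_t len) {
  size_t i = 0;
  while (i < len) {
    unsigned char b = s[i];
    size_t n;           /* number of continuation bytes expected */
    unsigned long cp;   /* code point decoded so far */
    unsigned long min;  /* smallest code point that needs this length */

    if (b <= 0x7F)               { n = 0; cp = b;        min = 0; }
    else if ((b & 0xE0) == 0xC0) { n = 1; cp = b & 0x1F; min = 0x80; }
    else if ((b & 0xF0) == 0xE0) { n = 2; cp = b & 0x0F; min = 0x800; }
    else if ((b & 0xF8) == 0xF0) { n = 3; cp = b & 0x07; min = 0x10000; }
    else return 0;      /* stray continuation byte or invalid lead byte */

    if (n > len - i - 1) {
      return 0;         /* sequence is truncated */
    }
    for (size_t k = 1; k <= n; k++) {
      unsigned char c = s[i + k];
      if ((c & 0xC0) != 0x80) {
        return 0;       /* not a continuation byte */
      }
      cp = (cp << 6) | (c & 0x3F);
    }
    if (cp < min) {
      return 0;         /* non-shortest (overlong) form */
    }
    if ((cp >= 0xD800 && cp <= 0xDFFF) || cp > 0x10FFFF) {
      return 0;         /* surrogate or out of range */
    }
    i += n + 1;
  }
  return 1;
}
```

Applied to the earlier examples, this sketch accepts 2F 2E 2E 2F but rejects both C0 80 and 2F C0 AE 2E 2F, because the decoded values 0x00 and 0x2E are below the two-byte minimum of 0x80. Running such a check before any security filtering means later stages never see a second representation of the same character.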

...

The following function from \[[Viega 2003|AA. Bibliography#Viega 03]\] detects invalid character sequences in a string but does not reject non-minimal forms. It returns {{1}} if the string is composed only of legitimate sequences; otherwise it returns {{0}}.

...

|| Recommendation || Severity || Likelihood || Remediation Cost || Priority || Level ||
| MSC10-C | medium | unlikely | high | P2 | L3 |

Automated Detection

...

Tool

Version

Checker

Description

Section

LDRA tool suite

...

Include Page
c:LDRA_V

Related Vulnerabilities

Search for vulnerabilities resulting from the violation of this rule on the CERT website.

Related Guidelines

This rule appears in the C++ Secure Coding Standard as MSC10-CPP. Character Encoding - UTF8 Related Issues.

Bibliography

\[[ISO/IEC 10646:2003|AA. Bibliography#ISO/IEC 10646-2003]\]
\[[ISO/IEC PDTR 24772|AA. Bibliography#ISO/IEC PDTR 24772]\] "AJN Choice of Filenames and other External Identifiers"
\[[Kuhn 2006|AA. Bibliography#Kuhn 06]\]
\[[MITRE 2007|AA. Bibliography#MITRE 07]\] [CWE ID 176|http://cwe.mitre.org/data/definitions/176.html], "Failure to Handle Unicode Encoding," [CWE ID 116|http://cwe.mitre.org/data/definitions/116.html], "Improper Encoding or Escaping of Output"
\[[Pike 1993|AA. Bibliography#Pike 93]\]
\[[Unicode 2006|AA. Bibliography#Unicode 06]\]
\[[Viega 2003|AA. Bibliography#Viega 03]\] Section 3.12, "Detecting Illegal UTF-8 Characters"
\[[Wheeler 2003|AA. Bibliography#Wheeler 03]\]
\[[Yergeau 1998|AA. Bibliography#Yergeau 98]\]

...