 
Many applications employ input filtering and validation mechanisms that blacklist characters. For example, an application may refuse to accept <script> tags to avoid vulnerabilities such as cross-site scripting (XSS). Such validation must be performed after normalizing the input.
According to the Unicode Standard [[Unicode 08]], annex #15, Unicode Normalization Forms:
When implementations keep strings in a normalized form, they can be assured that equivalent strings have a unique binary representation.
Normalization Forms KC and KD must not be blindly applied to arbitrary text. Because they erase many formatting distinctions, they will prevent round-trip conversion to and from many legacy character sets, and unless supplanted by formatting markup, they may remove distinctions that are important to the semantics of the text. It is best to think of these Normalization Forms as being like uppercase or lowercase mappings: useful in certain contexts for identifying core meanings, but also performing modifications to the text that may not always be appropriate. They can be applied more freely to domains with restricted character sets ...
The normalization form KC (NFKC) is the most suitable for performing input validation because the input is transformed into an equivalent canonical form that can be safely compared with the required form.
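As a minimal sketch of what NFKC does to compatibility characters, the following example compares NFC and NFKC on a string that uses the small-form angle brackets U+FE64 and U+FE65. The class name NfkcDemo and the helper toNfkc are illustrative names chosen for this sketch; only java.text.Normalizer comes from the standard library.

```java
import java.text.Normalizer;
import java.text.Normalizer.Form;

public class NfkcDemo {
    // Hypothetical helper for this sketch: applies compatibility composition (NFKC).
    public static String toNfkc(String input) {
        return Normalizer.normalize(input, Form.NFKC);
    }

    public static void main(String[] args) {
        // Small-form variants of the angle brackets around "script"
        String raw = "\uFE64script\uFE65";

        // NFC handles only canonical equivalences, so the small forms survive
        String nfc = Normalizer.normalize(raw, Form.NFC);
        // NFKC also folds compatibility characters to their ASCII counterparts
        String nfkc = toNfkc(raw);

        System.out.println(nfc.equals(raw));         // prints true
        System.out.println(nfkc.equals("<script>")); // prints true
    }
}
```

Because NFC leaves these characters untouched, a filter that normalizes with NFC alone would still miss them, which is why the compatibility form is used for validation here.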
Noncompliant Code Example
This noncompliant code example validates the String before normalizing it. Consequently, an attacker can get past the validation logic because the angle brackets being checked for have alternative Unicode representations that must be normalized before any validation is performed.
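import java.text.Normalizer;
import java.text.Normalizer.Form;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// String s may be user controllable
// \uFE64 is normalized to < and \uFE65 is normalized to > using NFKC
String s = "\uFE64" + "script" + "\uFE65";

// Validate (too early: the small-form brackets do not match [<>])
Pattern pattern = Pattern.compile("[<>]"); // check for angle brackets
Matcher matcher = pattern.matcher(s);
if (matcher.find()) {
  System.out.println("found black listed tag");
} else {
  // ...
}

// Normalize (too late: s now contains <script>, which was never checked)
s = Normalizer.normalize(s, Form.NFKC);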
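To make the bypass concrete, the sketch below runs the same angle-bracket check on the string before and after NFKC normalization. The class BypassDemo and the helper containsAngleBrackets are hypothetical names introduced for illustration only.

```java
import java.text.Normalizer;
import java.text.Normalizer.Form;
import java.util.regex.Pattern;

public class BypassDemo {
    private static final Pattern ANGLE_BRACKETS = Pattern.compile("[<>]");

    // Hypothetical helper: true if the text contains a literal < or >
    public static boolean containsAngleBrackets(String text) {
        return ANGLE_BRACKETS.matcher(text).find();
    }

    public static void main(String[] args) {
        String s = "\uFE64script\uFE65";

        // The check runs on the raw input and finds nothing to reject
        boolean flaggedBefore = containsAngleBrackets(s);

        // After normalization the same string contains real angle brackets
        String normalized = Normalizer.normalize(s, Form.NFKC);
        boolean flaggedAfter = containsAngleBrackets(normalized);

        System.out.println("flagged before normalization: " + flaggedBefore); // false
        System.out.println("flagged after normalization:  " + flaggedAfter);  // true
    }
}
```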
Compliant Solution
This compliant solution normalizes the string before validating it. Alternative representations of the angle brackets are normalized to their canonical forms, so the validation check detects the blacklisted characters and an IllegalStateException is thrown.
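import java.text.Normalizer;
import java.text.Normalizer.Form;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// String s may be user controllable
String s = "\uFE64" + "script" + "\uFE65";

// Normalize first, so alternative representations become < and >
s = Normalizer.normalize(s, Form.NFKC);

// Validate the normalized string
Pattern pattern = Pattern.compile("[<>]"); // check for angle brackets
Matcher matcher = pattern.matcher(s);
if (matcher.find()) {
  System.out.println("found black listed tag");
  throw new IllegalStateException();
} else {
  // ...
}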
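One way to keep the two steps in the right order is to put them behind a single method, so callers cannot validate without normalizing first. The following is a sketch under that assumption; the class InputValidator and the method normalizeAndValidate are hypothetical names, and the exception type simply mirrors the example above.

```java
import java.text.Normalizer;
import java.text.Normalizer.Form;
import java.util.regex.Pattern;

public final class InputValidator {
    private static final Pattern ANGLE_BRACKETS = Pattern.compile("[<>]");

    private InputValidator() {
        // utility class, not meant to be instantiated
    }

    // Hypothetical helper: normalizes to NFKC, then rejects angle brackets.
    // Returns the normalized string so that later code uses the validated form.
    public static String normalizeAndValidate(String input) {
        String normalized = Normalizer.normalize(input, Form.NFKC);
        if (ANGLE_BRACKETS.matcher(normalized).find()) {
            throw new IllegalStateException("Input contains blacklisted characters");
        }
        return normalized;
    }
}
```

Callers should continue with the returned string rather than the original input; otherwise the value that was validated and the value that is used could differ.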
Risk Assessment
Validating input before normalization can allow attackers to bypass filters and other security mechanisms. This can result in the execution of arbitrary code.
| Rule | Severity | Likelihood | Remediation Cost | Priority | Level | 
|---|---|---|---|---|---|
| MSC41-J | high | probable | medium | P12 | L1 | 
Automated Detection
TODO
Related Vulnerabilities
Search for vulnerabilities resulting from the violation of this rule on the CERT website.
References
[[API 06]] 
[[Unicode 08]]
[[Weber 09]]
[[MITRE 09]] CWE ID 289 "Authentication Bypass by Alternate Name" and CWE ID 180 "Incorrect Behavior Order: Validate Before Canonicalize"