 
Many applications that accept untrusted input strings employ input filtering and validation mechanisms based on the strings' character data. For example, an application's strategy for avoiding cross-site scripting (XSS) vulnerabilities may include forbidding <script> tags in inputs. Such blacklisting mechanisms are a useful part of a security strategy, even though they are insufficient for complete input validation and sanitization. When implemented, this form of validation must be performed only after normalizing the input.
Character information in Java is based on the Unicode Standard. The following table shows the version of Unicode supported by the latest three releases of Java SE.
| Java Version | Unicode Version | 
|---|---|
| Java SE 6 | Unicode Standard, version 4.0 [Unicode 2003] | 
| Java SE 7 | Unicode Standard, version 6.0.0 [Unicode 2011] | 
| Java SE 8 | Unicode Standard, version 6.2.0 [Unicode 2012] | 
Applications that accept untrusted input should normalize the input before validating it. Normalization is important because in Unicode, the same string can have many different representations. According to the Unicode Standard [Davis 2008], Annex #15, Unicode Normalization Forms:
When implementations keep strings in a normalized form, they can be assured that equivalent strings have a unique binary representation.
Normalization Forms KC and KD must not be blindly applied to arbitrary text. Because they erase many formatting distinctions, they will prevent round-trip conversion to and from many legacy character sets, and unless supplanted by formatting markup, they may remove distinctions that are important to the semantics of the text. It is best to think of these Normalization Forms as being like uppercase or lowercase mappings: useful in certain contexts for identifying core meanings, but also performing modifications to the text that may not always be appropriate. They can be applied more freely to domains with restricted character sets ...
Frequently, the most suitable normalization form for performing input validation is KC (NFKC), because normalizing to NFKC transforms the input into an equivalent canonical form that can be safely compared with the required form.
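The difference between the forms can be sketched in a few lines. The class name `NormalizationForms`, the helper `equalAfter`, and the sample strings below are assumptions chosen for illustration; only `Normalizer.normalize()` and the `Form` constants are standard `java.text` API. The sketch shows two canonically equivalent spellings of the same character comparing unequal until both are normalized, and a compatibility character that NFC preserves but NFKC folds to plain ASCII.

```java
import java.text.Normalizer;
import java.text.Normalizer.Form;

public class NormalizationForms {

  // Hypothetical helper: true if a and b are identical once both
  // are normalized to the given form.
  public static boolean equalAfter(String a, String b, Form form) {
    return Normalizer.normalize(a, form).equals(Normalizer.normalize(b, form));
  }

  public static void main(String[] args) {
    String precomposed = "\u00E9"; // e with acute as a single code point
    String decomposed = "e\u0301"; // 'e' followed by a combining acute accent

    // Canonically equivalent text, yet different char sequences
    System.out.println(precomposed.equals(decomposed));                // false
    System.out.println(equalAfter(precomposed, decomposed, Form.NFC)); // true

    String ligature = "\uFB01";    // the "fi" ligature as a single code point

    // NFC keeps the compatibility character; NFKC folds it to plain "fi"
    System.out.println(Normalizer.normalize(ligature, Form.NFC).equals("fi"));  // false
    System.out.println(Normalizer.normalize(ligature, Form.NFKC).equals("fi")); // true
  }
}
```

The ligature case illustrates why NFKC suits validation (the comparison sees the plain letters) and also why it should not be applied blindly to text whose formatting distinctions matter.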
Noncompliant Code Example
The Normalizer.normalize() method transforms Unicode text into one of the standard normalization forms described in Unicode Standard Annex #15, Unicode Normalization Forms.
This noncompliant code example attempts to validate the String before performing normalization. Consequently, an attacker can get past the validation logic because the angle brackets being checked for have alternative Unicode representations that must be normalized before any validation can be performed.

    import java.text.Normalizer;
    import java.text.Normalizer.Form;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // String s may be user controllable
    // \uFE64 is normalized to < and \uFE65 is normalized to > using the NFKC normalization form
    String s = "\uFE64" + "script" + "\uFE65";

    // Validate
    Pattern pattern = Pattern.compile("[<>]"); // Check for angle brackets
    Matcher matcher = pattern.matcher(s);
    if (matcher.find()) {
      // Found blacklisted tag
      throw new IllegalStateException();
    } else {
      // ...
    }

    // Normalize
    s = Normalizer.normalize(s, Form.NFKC);

The validation logic fails to detect the <script> tag because the input has not yet been normalized when the check runs. Therefore, the system accepts the invalid input.
Compliant Solution
This compliant solution normalizes the string before validating it. Alternative representations of the string are normalized to the canonical angle brackets. Consequently, input validation correctly detects the malicious input and throws an IllegalStateException.

    import java.text.Normalizer;
    import java.text.Normalizer.Form;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    String s = "\uFE64" + "script" + "\uFE65";

    // Normalize
    s = Normalizer.normalize(s, Form.NFKC);

    // Validate
    Pattern pattern = Pattern.compile("[<>]"); // Check for angle brackets
    Matcher matcher = pattern.matcher(s);
    if (matcher.find()) {
      // Found blacklisted tag
      throw new IllegalStateException();
    } else {
      // ...
    }

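The two orderings can be compared side by side in a small self-contained sketch. The class name `NormalizeThenValidate` and the helpers `containsAngleBracket` and `isRejected` are hypothetical names introduced for illustration; only `Normalizer.normalize()`, `Form.NFKC`, and the `[<>]` pattern come from the examples in this rule. The fullwidth characters U+FF1C and U+FF1E are an added assumption, included to show that more than one alternative representation folds to the ASCII angle brackets under NFKC.

```java
import java.text.Normalizer;
import java.text.Normalizer.Form;
import java.util.regex.Pattern;

public class NormalizeThenValidate {
  private static final Pattern ANGLE_BRACKETS = Pattern.compile("[<>]");

  // Blacklist check applied to the string exactly as given
  public static boolean containsAngleBracket(String s) {
    return ANGLE_BRACKETS.matcher(s).find();
  }

  // Normalizes to NFKC first, then applies the blacklist check
  public static boolean isRejected(String untrusted) {
    String normalized = Normalizer.normalize(untrusted, Form.NFKC);
    return containsAngleBracket(normalized);
  }

  public static void main(String[] args) {
    String small = "\uFE64" + "script" + "\uFE65";     // small form variants
    String fullwidth = "\uFF1C" + "script" + "\uFF1E"; // fullwidth forms

    // Checking the raw input misses the disguised brackets
    System.out.println(containsAngleBracket(small)); // false

    // Checking after normalization catches both disguises
    System.out.println(isRejected(small));           // true
    System.out.println(isRejected(fullwidth));       // true

    // Harmless input is still accepted
    System.out.println(isRejected("script"));        // false
  }
}
```

Wrapping normalization and validation in one method, as `isRejected` does here, is one way to make it harder for a caller to apply the check to an unnormalized string by accident.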
Risk Assessment
Validating input before normalization affords attackers the opportunity to bypass filters and other security mechanisms. It can result in the execution of arbitrary code.
| Rule | Severity | Likelihood | Detectable | Repairable | Priority | Level |
|---|---|---|---|---|---|---|
| IDS01-J | High | Probable | No | No | P6 | L2 |
Automated Detection
...
TODO
Related Vulnerabilities
Search for vulnerabilities resulting from the violation of this rule on the CERT website.
References
[API 06]
[Unicode 08]
[Weber 09]
[MITRE 09] CWE ID 289 (http://cwe.mitre.org/data/definitions/289.html), "Authentication Bypass by Alternate Name," and CWE ID 180 (http://cwe.mitre.org/data/definitions/180.html), "Incorrect Behavior Order: Validate Before Canonicalize"
| Tool | Version | Checker | Description |
|---|---|---|---|
| The Checker Framework | | Tainting Checker | Trust and security errors (see Chapter 8) |
| Fortify | 1.0 | Process_Control | Implemented |
| Klocwork | | SV.TAINT | |
Related Guidelines
| Cross-site Scripting [XYT] |
| CWE-289, Authentication bypass by alternate name | 
Android Implementation Details
Android apps can receive string data from the outside and normalize it.
Bibliography
...