
Using locale-dependent methods on locale-dependent data can produce unexpected results when the locale is unspecified. Programming language identifiers, protocol keys, and HTML tags are often specified in a particular locale, usually Locale.ENGLISH. Running a program in a different locale may result in unexpected program behavior or even allow an attacker to bypass input filters. For these reasons, any program that inspects data generated by a locale-dependent function must specify the locale used to generate that data.

For example, the following program:

public class Example {
  public static void main(String[] args) {
    System.out.println("Title".toUpperCase());
  }
}

behaves as expected in an English locale:

% java Example
TITLE
% 

Most languages that use the Latin alphabet treat the letter I as the uppercase version of i. Turkish is an exception: it has a dotted i whose uppercase version is also dotted (İ) and an undotted ı whose uppercase version is undotted (I). Consequently, changing the capitalization of a string that contains these letters in the Turkish locale [API 2006] may produce unexpected results:

% java -Duser.language=tr Example
TİTLE
% 
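The same effect can be reproduced without JVM flags by passing a locale explicitly. The following is a minimal sketch; the class name TurkishCaseDemo and its two helper methods are illustrative names, not part of the original example.

```java
import java.util.Locale;

class TurkishCaseDemo {
  // Hypothetical helper: uppercases using Turkish casing rules.
  static String upperTurkish(String s) {
    return s.toUpperCase(Locale.forLanguageTag("tr"));
  }

  // Hypothetical helper: uppercases using the locale-neutral root locale.
  static String upperRoot(String s) {
    return s.toUpperCase(Locale.ROOT);
  }

  public static void main(String[] args) {
    // Plain ASCII result: TITLE
    System.out.println(upperRoot("Title"));
    // The second character is U+0130 (capital I with dot above), not ASCII 'I'.
    System.out.println(upperTurkish("Title").equals("T\u0130TLE")); // true
  }
}
```

Comparing against an escaped string rather than printing the result avoids depending on the console's own character encoding.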

Many programs use locale-dependent methods only for outputting information, such as dates. Provided that the locale-dependent data is not inspected by the program, it may safely rely on the default locale.

Noncompliant Code Example (toUpperCase())

Many web apps, such as forum or blogging software, input HTML and then display it. Displaying untrusted HTML can subject a web app to cross-site scripting (XSS) or HTML injection vulnerabilities. Therefore, it is vital that HTML be sanitized before sending it to a web browser.

One common step in sanitization is identifying tags that may contain malicious content. The <SCRIPT> tag typically contains JavaScript code that is executed by a client's browser. Consequently, HTML input is commonly filtered for <SCRIPT> tags. However, identifying <SCRIPT> tags is not as simple as it appears.

In HTML, tags are case-insensitive and consequently can be specified using uppercase, lowercase, or any mixture of cases. This noncompliant code example uses the locale-dependent String.toUpperCase() method to convert an HTML tag to uppercase to check it for further processing. The code must ignore <SCRIPT> tags, as they indicate code that is to be discarded. Whereas the English locale would convert "script" to "SCRIPT", the Turkish locale will convert "script" to "SCRİPT", and the check will fail to detect the <SCRIPT> tag.

public static void processTag(String tag) {
  if (tag.toUpperCase().equals("SCRIPT")) {
    return;
  } 
  // Process tag
}

Compliant Solution (Explicit Locale)

This compliant solution explicitly sets the locale to English to avoid unexpected results:

public static void processTag(String tag) {
  if (tag.toUpperCase(Locale.ENGLISH).equals("SCRIPT")) {
    return;
  }
  // Process tag
}

Specifying Locale.ROOT is a suitable alternative when an English-specific locale would not be appropriate.
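As a rough illustration of the Locale.ROOT alternative, the sketch below runs both forms of the check with the default locale temporarily switched to Turkish. The class name and the two isScriptTag helpers are hypothetical. The default locale is restored afterward because Locale.setDefault() affects the whole JVM.

```java
import java.util.Locale;

class RootLocaleCheck {
  // Depends on the JVM's default locale (the noncompliant form).
  static boolean isScriptTagDefault(String tag) {
    return tag.toUpperCase().equals("SCRIPT");
  }

  // Uses Locale.ROOT, so the result does not depend on the default locale.
  static boolean isScriptTagRoot(String tag) {
    return tag.toUpperCase(Locale.ROOT).equals("SCRIPT");
  }

  public static void main(String[] args) {
    Locale saved = Locale.getDefault();
    try {
      Locale.setDefault(Locale.forLanguageTag("tr"));
      System.out.println(isScriptTagDefault("script")); // false: 'i' maps to U+0130
      System.out.println(isScriptTagRoot("script"));    // true
    } finally {
      Locale.setDefault(saved); // restore the JVM-wide default
    }
  }
}
```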

Compliant Solution (Default Locale)

This compliant solution sets the default locale to English before performing string comparisons:

public static void processTag(String tag) {
  Locale.setDefault(Locale.ENGLISH);

  if (tag.toUpperCase().equals("SCRIPT")) {
    return;
  }
  // Process tag
}

Compliant Solution (String.equalsIgnoreCase())

This compliant solution bypasses locales entirely by performing a case-insensitive match. The String.equalsIgnoreCase() method creates temporary canonical forms of both strings, which may render them unreadable, but it performs proper comparison without making them dependent on the current locale [Schindler 12].

public static void processTag(String tag) {
  if (tag.equalsIgnoreCase("SCRIPT")) {
    return;
  }
  // Process tag
}

This solution is compliant because equalsIgnoreCase() compares two strings, one of which is plain ASCII, and therefore its behavior is well understood, even if the other string is not plain ASCII. Calling equalsIgnoreCase() where both strings may not be ASCII is not recommended, simply because equalsIgnoreCase() may not behave as the developer expects.
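A brief sketch of both points, using a hypothetical class named IgnoreCaseDemo: the comparison against an ASCII constant gives the same answer whatever the default locale is, while an accented letter is not treated as equivalent to its unaccented counterpart.

```java
import java.util.Locale;

class IgnoreCaseDemo {
  // Locale-independent check against an ASCII constant.
  static boolean isScriptTag(String tag) {
    return tag.equalsIgnoreCase("SCRIPT");
  }

  public static void main(String[] args) {
    Locale saved = Locale.getDefault();
    try {
      // Even with a Turkish default locale, the ASCII comparison succeeds.
      Locale.setDefault(Locale.forLanguageTag("tr"));
      System.out.println(isScriptTag("script")); // true
      System.out.println(isScriptTag("ScRiPt")); // true
    } finally {
      Locale.setDefault(saved);
    }

    // Accents are not folded away: U+00EF (i with diaeresis) is not 'I'.
    System.out.println("NAIVE".equalsIgnoreCase("na\u00EFve")); // false
  }
}
```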

Noncompliant Code Example (FileReader)

Java provides classes for handling input and output, which can be based on either bytes or characters. The byte I/O families derive from the InputStream and OutputStream abstract classes and are independent of locale or character encoding. However, the character I/O families derive from Reader and Writer, and they must convert byte sequences into strings and back, so they rely on a specified character encoding to do their conversion. This encoding is indicated by the file.encoding system property, which is part of the current locale. Consequently, a file encoded with one encoding, such as UTF-8, must not be read by a character input method using a different encoding, such as UTF-16.

Programs that read character data (whether directly using a Reader or indirectly using some method such as constructing a String from a byte array) must be aware of the source of the data. If the encoding of the data is fixed (such as if the data comes from a file resource that is shipped with the program), then that encoding must be specified by the program. Failure to specify the encoding enables an attacker to change the default encoding and force the program to read the data using the wrong encoding.

This risk does not apply to programs that read data known to be in the encoding specified by the platform running the program. For example, if the program must open a file provided by the user, it is reasonable to rely on the default encoding, expecting that it will be set correctly.
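For the indirect case mentioned above (constructing a String from a byte array), a minimal sketch of specifying a fixed encoding might look like the following. The class and method names are hypothetical, and UTF-8 is assumed only for illustration.

```java
import java.nio.charset.StandardCharsets;

class ExplicitCharsetDemo {
  // Decodes bytes known to be UTF-8, independent of file.encoding.
  static String decodeUtf8(byte[] data) {
    return new String(data, StandardCharsets.UTF_8);
  }

  // Encodes text as UTF-8, independent of file.encoding.
  static byte[] encodeUtf8(String text) {
    return text.getBytes(StandardCharsets.UTF_8);
  }

  public static void main(String[] args) {
    String original = "na\u00EFve"; // contains U+00EF
    byte[] data = encodeUtf8(original);
    System.out.println(data.length);                        // 6: U+00EF takes two bytes
    System.out.println(decodeUtf8(data).equals(original));  // true
  }
}
```

By contrast, the single-argument forms new String(byte[]) and String.getBytes() use the platform default charset.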

This noncompliant code example reads its own source code and prints it out, prepending each line with a line number. If the program is run with the argument -Dfile.encoding=UTF16 while its source file is stored as UTF-8, the program will write garbage to the output file.

import java.io.*;

public class PrintMyself {
  private static String inputFile = "PrintMyself.java";
  private static String outputFile = "PrintMyself.txt";

  public static void main(String[] args) throws IOException {
    BufferedReader reader = new BufferedReader(new FileReader(inputFile));
    PrintWriter writer = new PrintWriter(new FileWriter(outputFile));
    int line = 0;
    while (reader.ready()) {
      line++;
      writer.println(line + ": " + reader.readLine());
    }
    reader.close();
    writer.close();
  }
}

Compliant Solution (Charset)

In this compliant solution, both the input and output files are explicitly encoded using UTF-8. This program behaves correctly regardless of the default encoding.

import java.nio.charset.Charset;

// ...

  public static void main(String[] args) throws IOException {
    Charset encoding = Charset.forName("UTF8");
    BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream(inputFile), encoding));
    PrintWriter writer = new PrintWriter(new OutputStreamWriter(new FileOutputStream(outputFile), encoding));

    int line = 0;

    /* Rest of code unchanged */
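For reference, a complete variant of the same idea is sketched below. It is not the original program: it swaps in StandardCharsets.UTF_8 and the java.nio.file.Files factory methods in place of Charset.forName() and the stream constructors, uses try-with-resources, and reads until readLine() returns null. The class name NumberLines and the method numberLines are hypothetical.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.PrintWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

class NumberLines {
  // Copies 'input' to 'output' as UTF-8, prefixing each line with its number.
  static void numberLines(Path input, Path output) throws IOException {
    try (BufferedReader reader =
             Files.newBufferedReader(input, StandardCharsets.UTF_8);
         PrintWriter writer = new PrintWriter(
             Files.newBufferedWriter(output, StandardCharsets.UTF_8))) {
      int line = 0;
      String text;
      // readLine() returns null at end of input.
      while ((text = reader.readLine()) != null) {
        line++;
        writer.println(line + ": " + text);
      }
    }
  }

  public static void main(String[] args) throws IOException {
    // Assumes the source file is in the working directory.
    numberLines(Paths.get("NumberLines.java"), Paths.get("NumberLines.txt"));
  }
}
```

One difference to be aware of: a reader obtained from Files.newBufferedReader() throws an exception on malformed input, whereas InputStreamReader substitutes replacement characters.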

Noncompliant Code Example (Date)

The concepts of days and years are universal, but the way in which dates are represented varies across cultures and is therefore specific to locales. This noncompliant code example examines the given date and prints one of two messages, depending on whether or not the month is June:

import java.util.Date;
import java.text.DateFormat;
import java.util.Locale;

// ...

public static void isJune(Date date) {
  String myString = DateFormat.getDateInstance().format(date);
  System.out.println("The date is " + myString);
  if (myString.startsWith("Jun ")) {
    System.out.println("Enjoy June!");
  } else {
    System.out.println("It's not June.");
  }
}

This program behaves as expected on platforms with a US locale:

The date is Jun 20, 2014
Enjoy June!

but fails on other locales. For example, the output for a German locale (specified by -Duser.language=de) is

The date is 20.06.2014
It's not June.

Compliant Solution (Explicit Locale)

This compliant solution forces the date to be printed in an English format, regardless of the current locale:

String myString = DateFormat.getDateInstance(DateFormat.MEDIUM, Locale.US).format(date);
/* Rest of code unchanged */

Compliant Solution (Bypass Locale)

This compliant solution checks the date's MONTH attribute without formatting it. Although date representations vary by culture, the contents of a Calendar date do not. Consequently, this code works in any locale.

Calendar calendar = Calendar.getInstance();
calendar.setTime(date);
if (calendar.get(Calendar.MONTH) == Calendar.JUNE) {
/* Rest of code unchanged */
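A self-contained sketch of the locale-bypassing check follows. The class name JuneCheck and the boolean return type are assumptions made so the result can be inspected directly; the original method prints its messages instead.

```java
import java.util.Calendar;
import java.util.Date;
import java.util.GregorianCalendar;

class JuneCheck {
  // Inspects the MONTH field directly instead of parsing formatted text.
  static boolean isJune(Date date) {
    Calendar calendar = Calendar.getInstance();
    calendar.setTime(date);
    return calendar.get(Calendar.MONTH) == Calendar.JUNE;
  }

  public static void main(String[] args) {
    // Calendar months are zero-based, so use the named constant.
    Date date = new GregorianCalendar(2014, Calendar.JUNE, 20).getTime();
    System.out.println(isJune(date) ? "Enjoy June!" : "It's not June.");
  }
}
```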

Risk Assessment

Failure to specify the appropriate locale when using locale-dependent methods on locale-dependent data may result in unexpected behavior.

Rule     Severity  Likelihood  Remediation Cost  Priority  Level
STR02-J  Medium    Probable    Medium            P8        L2

Automated Detection

Tool                   Version  Checker                                  Description
The Checker Framework  2.1.3    Tainting Checker                         Trust and security errors (see Chapter 8)
CodeSonar              5.1p0    FB.I18N.DM_CONVERT_CASE                  Consider using Locale parameterized version of invoked method
                                FB.I18N.DM_DEFAULT_ENCODING              Reliance on default encoding
                                PMD.Design.SimpleDateFormatNeedsLocale   Simple date format needs Locale
                                PMD.Design.UseLocaleWithCaseConversions  Use Locale with case conversions
Parasoft Jtest         10.3     INTER.{CCL,CTLC}                         Implemented
SonarQube              6.7      S1449                                    Locale should be used in String operations

Android Implementation Details

A developer can specify locale on Android using java.util.Locale.

Bibliography



13 Comments

  1. I see that the Turkish capitalization issue described in the code examples here also appears in the Java tutorial about localization. I wonder if there are any other examples where this is a problem?

    Forcing people to specify the locale in every locale-dependent function is overkill. A better solution is to explicitly specify the locale at the beginning of the program. Or give the program a constraint "this program won't work in Turkish" :)

    At any rate, the error is simply assuming that you'll get a specific result from a locale-dependent function. If you don't compare the uppercasing of "title" to anything (suppose you only print it to a window), then it doesn't matter what locale you use.

    So I suspect the normative text here should be to either specify the locale you use *somewhere* in the code, or to make no assumptions of what a locale-dependent function like toUpperCase() returns.

  2. This is a very good recommendation. Uwe Schindler has a whole project called Forbidden-APIs that helps stop this stuff (https://code.google.com/p/forbidden-apis/) and a cool blog post at http://blog.thetaphi.de/2012/07/default-locales-default-charsets-and.html, which describes how a cellphone company not supporting a particular letter resulted in a murder-suicide. Incidentally, like the example in the Java tutorial you mention, that arose from Turkish letters.

    I don’t really know what the solution is here, but I’m wary that forcing developers to provide it all the time may actually make the situation worse in cases where people start specifying the wrong one or say that some file is in a charset when it isn’t. There's also going to be an annoying race condition if different libraries call Locale.setDefault and keep overriding each other.

    In the example above, "title".toUpperCase(Locale.ENGLISH) is easy, but what about a non-constant value coming from somewhere else?

    Maybe people specify this at runtime, maybe it’s better somewhere else.

  3. Is it correct to refer to these languages as "European"?  Shouldn't we say "languages which use the Latin alphabet"?

    1. You can make that change if you wish. ISO-8859-1 covers special characters in most of these languages. They are predominantly European, but there are a few non-European outliers (Malay, Swahili). And of course, plenty of European cultures use a different alphabet.

      1. I made this change.  On a somewhat related note, I'm wondering about this sentence:

        Specifying Locale.ROOT is a suitable alternative under conditions where an English-specific locale would not be appropriate.

        Is this too Anglocentric?  The description of Locale.ROOT from the API documentation makes it sound like it is custom made for this purpose:

        The root locale is the locale whose language, country, and variant are empty ("") strings. This is regarded as the base locale of all locales, and is used as the language/country neutral locale for the locale sensitive operations.   

        I think the key to this rule is that the locale needs to be set to something so that locale-dependent methods are constrained in what results they may find.  So for example, if the locale is explicitly set to French (or any language) this would not be flagged as a violation of this rule.  The same holds even if the locale is set to Turkish, provided the comparison is modified accordingly.

        1. Is this too Anglocentric?  The description of Locale.ROOT from the API documentation makes it sound like it is custom made for this purpose:

          Only when taken out of context. The context was a code sample that used Locale.ENGLISH; it could have easily used French or some other locale.

          I think the key to this rule is that the locale needs to be set to something so that locale-dependent methods are constrained in what results they may find. 

          The key to this rule is that the locale is a system property. (On some systems it is an environment variable.) We have a rule, ENV02-J. Do not trust the values of environment variables; this rule should be considered an instance of that one. 

  4. Hi..I just wanted a re-confirmation on one of the compliant solutions listed....usage of String.equalsIgnoreCase().  This one does case mapping using Character.toUpperCase()/toLowerCase(). 

    When you read the JavaDoc for these it states the following warning shown below.  Given this warning, I just wanted a reconfirmation that this would work for all Locales.

    Apologies if the question sounds naive as I am not an expert on the subject.

    * Note that
    * {@code Character.isUpperCase(Character.toUpperCase(ch))}
    * does not always return {@code true} for some ranges of
    * characters, particularly those that are symbols or ideographs.
    *
    * <p>In general, {@link String#toUpperCase()} should be used to map
    * characters to uppercase. {@code String} case mapping methods
    * have several benefits over {@code Character} case mapping methods.
    * {@code String} case mapping methods can perform locale-sensitive
    * mappings, context-sensitive mappings, and 1:M character mappings, whereas
    * the {@code Character} case mapping methods cannot.
    *
    * <p><b>Note:</b> This method cannot handle <a
    * href="#supplementary"> supplementary characters</a>. To support
    * all Unicode characters, including supplementary characters, use
    * the {@link #toUpperCase(int)} method.

    Some additional background

    Since Java does not provide equivalent methods like startsWithIgnoreCase() & endsWithIgnoreCase() we were looking to implement something for this by looking at equalsIgnoreCase(). 

    This uses String.regionMatches().  So technically we could implement the required methods.  In fact Apache Commons Lang already provides this as part of their StringUtils & CharSequenceUtils.

    The String.regionMatches() uses Character.toUpperCase()/toLowerCase() and that's when we came across this.

    Thanks in advance.

    1. I'm not sure which Javadoc you are quoting. But the Java 9 String.equalsIgnoreCase() API doc says:

      Note that this method does not take locale into account, and will result in unsatisfactory results for certain locales. The Collator class provides locale-sensitive comparison.

      This paragraph is notably absent in the Java 8 String.equalsIgnoreCase() doc.

      The method does use the same underlying mechanism as regionMatches() so if you trust one you should trust the other.

      1. Thanks for the quick response!

        I was quoting the JavaDoc of Character.toUpperCase()/toLowerCase() which the String.regionMatches() method uses. So the question really is, given the statement above from your quote "...and will result in unsatisfactory results for certain locales" is using equalsIgnoreCase() a truly compliant solution, given that they state it would give unsatisfactory results for certain locales?

        1. I had assumed that people might give it cases where they wrongly expect equivalence, such as "NAIVE".equalsIgnoreCase("naïve")


          1. David Svoboda maybe I am not understanding what is stated in the equalsIgnoreCase() compliant solution correctly. 

            The example given is for comparison with a constant "SCRIPT".  So is it the case that this is a compliant solution when comparing constants & English locale strings?  If you are comparing Locale sensitive user input data, one should rather normalize using String.toUpperCase/toLowerCase(Locale) methods and call String.equals().

            Besides the JavaDoc warnings above, I was able to locate just one link below which cautions regarding this:

            https://www.globalyzer.com/gzserver/help/reference/localeSensitiveMethods/javaunsafe_equalsIgnoreCase.html

            1. Ranjan George :

              I agree that equalsIgnoreCase() may surprise developers when called with two non-ASCII strings. This is due more to developers' expectations than the actual specification of how equalsIgnoreCase() works. Nonetheless, I added a note to the relevant compliant solution.