The Trap of Case-Insensitive String

July 07, 2011

Let's start from an example. What's the expect result of the following simple code?

// Example One
    String lower = "Simple Smiles!";
    String upper = "SIMPLE SMILES!";

    int lowerHashCode = lower.toUpperCase().hashCode();
    int upperHashCode = upper.toUpperCase().hashCode();
    boolean isEqual = (lowerHashCode == upperHashCode);

    System.out.println("The hash codes of the two case-insensitive " +
        "strings are the same: " + isEqual);

What's the value of "isEqual" variable, "true" or "false"? If you try to compiler and run the above code, I believe, 99.9999 times out of 100, the value of "isEqual" is "true". What's the chance for the remaining 0.0001?

OK, let's run one more example. Copy/past the following code, and run it.

// Example Two

//
// To illustrate the unexpected behaviors of case-insensitive strings
//
import java.util.Locale;

public class InsensitiveCases {

    public static void main(String[] args) throws Exception {
        String lower = "Simple Smiles!";
        String upper = "SIMPLE SMILES!";

        for (Locale locale : Locale.getAvailableLocales()) {
            // reset the default locale
            Locale.setDefault(locale);
            System.out.println("In locale: " + locale);

            // check this local
            System.out.println("\tcomparing in lower case\n\t\t" +
                intuitiveCompare(lower.toLowerCase(), upper.toLowerCase()));
            System.out.println("\tcomparing in upper case\n\t\t" +
                intuitiveCompare(lower.toUpperCase(), upper.toUpperCase()));
            System.out.println("\thashcoding in lower case\n\t\t" +
                intuitiveHashCode(lower.toUpperCase(), upper.toUpperCase()));
            System.out.println("\thashcoding in upper case\n\t\t" +
                intuitiveHashCode(lower.toUpperCase(), upper.toUpperCase()));

        }
    }

    private static String intuitiveCompare(String lower, String upper) {
        int diff = lower.compareTo(upper);
        if (diff == 0) {
            return "lower == upper";
        } else if (diff > 0) {
            return "lower > upper";
        }

        return "lower < upper";
    }

    private static String intuitiveHashCode(String lower, String upper) {
        int lowerCode = lower.hashCode();
        int upperCode = upper.hashCode();
        if (lowerCode == upperCode) {
            return "lower hash code == upper hash code ";
        }

        return "lower hash code != upper hash code ";
    }
}

You may find the record from the log:

In locale: tr_TR
        comparing in lower case
                lower < upper
        comparing in upper case
                lower > upper
        hashcoding in lower case
                lower hash code != upper hash code
        hashcoding in upper case
                lower hash code != upper hash code

It's interesting that, in locale tr_TR, if we convert the strings into lower case, "Simple Smiles!" < "SIMPLE SMILES!"; while we convert the strings into upper case, the result is completely different, "Simple Smiles!" > "SIMPLE SMILES!". Of course, the hash code cannot be equal any more. What's wrong with it? We can get the answer from the spec of String.toUpperCase() or String.toLowerCase().

Note: This method is locale sensitive, and may produce unexpected
    results if used for strings that are intended to be interpreted
    locale independently. Examples are programming language identifiers,
    protocol keys, and HTML tags. For instance, "title".toUpperCase()
    in a Turkish locale returns "T\u0130TLE", where '\u0130' is the
    LATIN CAPITAL LETTER I WITH DOT ABOVE character. To obtain correct
    results for locale insensitive strings, use toUpperCase(Locale.ENGLISH).

It's an important note to operate on case-insensitive strings. However, it's not easy to pay enough attention to this note in practice. So we may be find the following code here and there:

// Example Three
    public int hashCode() {
        return theStringObject.toUpperCase().hashCode();
    }

OR
// Example Four
    theStringObject.toUpperCase().equals("MYSTRING");

OR
// Example Five
    theStringObject.toUpperCase().compareTo("MYSTRING");

From the above note, we know that we cannot expect the above three example always work correctly. To avoid the unexpected behaviors, pay attention to the above note, especially, "To obtain correct results for locale insensitive strings, use toUpperCase(Locale.ENGLISH)." Let's try revise the above three examples:

// Example Three: Revised
    public int hashCode() {
        return theStringObject.toUpperCase(Locale.ENGLISH).hashCode();
    }

OR
// Example Four: Revised
    theStringObject.toUpperCase(Locale.ENGLISH).equals("MYSTRING");

OR
// Example Five: Revised
    theStringObject.toUpperCase(Locale.ENGLISH).compareTo("MYSTRING");

If your program runs into strange behaviors, and you happen to find the String.toUpperCase() or String.toLowerCase() calls in your code, you may think about the trap and above note.

Good luck!

Postscript: You may also want to read more about how to check the trap with a simple shell script.

Search This Blog

Xuelei Fan's Blog

The Trap of Case-Insensitive String

Popular posts from this blog

NIST Security Strength Time Frames

Use Braces Even For Single Line Statement

TLS Server Name Indication Extension and Unrecognized_name