Java tip: How to parse integers quickly

Technologies: Java 5+

Java has several ways to parse integers from strings. Performance differences between these methods can be significant when parsing a large number of integers. Doing your own integer parsing can provide an important speed boost. This tip looks at five ways to parse integers, compares their features, and benchmarks them to see which method is the fastest.

Parsing an integer from a string using the JDK

An integer is a number like "-123". It has an optional negative sign followed by a series of digits. There is no decimal point or exponent.

Java has two principal ways to parse an integer from a string:

  • Call parseInt( ) on java.lang.Integer
  • Call parse( ) on an instance of java.text.NumberFormat

java.io.StreamTokenizer is also available when tokenizing a long list of values.

Here's how to use each one.

Integer.parseInt

try
{
    int value = Integer.parseInt( string );
}
catch ( NumberFormatException e )
{
    ... do something about the error ...
}

The static parseInt( ) method on java.lang.Integer takes a String argument and returns an int.

Integer throws a java.lang.NumberFormatException if it can't parse the string.

NumberFormat.parse

NumberFormat format =
    NumberFormat.getIntegerInstance( );
try
{
    Number number = format.parse( string );
    int value = number.intValue( );
}
catch ( ParseException e )
{
    ... do something about the error ...
}

To use java.text.NumberFormat, create an integer parser by calling the static NumberFormat.getIntegerInstance( ) method. The format's parse( ) method takes a String and returns a java.lang.Number. The intValue( ) method on Number returns the integer.

NumberFormat throws a java.text.ParseException if it can't parse the string.

StreamTokenizer.nextToken

StreamTokenizer t =
    new StreamTokenizer(
        new StringReader( string ) );
t.resetSyntax( );
t.parseNumbers( );
try
{
    if ( t.nextToken( ) != t.TT_NUMBER )
    {
        ... do something about the error ...
    }
    else
    {
        int value = (int)t.nval;
    }
}
catch ( IOException e )
{
    ... do something about the error ...
}

To use java.io.StreamTokenzer, create an instance and pass in a StringReader (or any other reader) to supply the text to parse. Call parseNumbers( ) to tell the tokenizer to parse numbers, then call nextToken( ) to get the next number from the text. If the returned token type is TT_NUMBER, the value (as a double) will be in the tokenizer's public nval field.

StreamTokenizer returns a different token type if the input is not recognized as an integer, and it throws a java.io.IOException on the end of the input.

Using StreamTokenizer to parse a single number from a string is silly, of course. I provide it here only for comparison purposes.

Parsing an integer from a string with custom code

Parsing an integer isn't difficult. There are only 10 digit characters allowed, plus an optional leading minus sign. To parse one yourself, loop through the characters of a String. Convert each digit character to a number between 0 and 9, and accumulate the digits in an int.

The following custom parser adheres to the same contract as Integer.parseInt( ). Quoting from its javadoc:

"Parses the string argument as a signed decimal integer. The characters in the string must all be decimal digits, except that the first character may be an ASCII minus sign '-' ('\u002D') to indicate a negative value. The resulting integer value is returned, exactly as if the argument and the radix 10 were given as arguments to the parseInt(java.lang.String, int) method."

public static int parseInt( final String s )
{
    if ( string == null )
        throw new NumberFormatException( "Null string" );

    // Check for a sign.
    int num  = 0;
    int sign = -1;
    final int len  = s.length( );
    final char ch  = s.charAt( 0 );
    if ( ch == '-' )
    {
        if ( len == 1 )
            throw new NumberFormatException( "Missing digits:  " + s );
        sign = 1;
    }
    else
    {
        final int d = ch - '0';
        if ( d < 0 || d > 9 )
            throw new NumberFormatException( "Malformed:  " + s );
        num = -d;
    }

    // Build the number.
    final int max = (sign == -1) ?
        -Integer.MAX_VALUE : Integer.MIN_VALUE;
    final int multmax = max / 10;
    int i = 1;
    while ( i < len )
    {
        int d = s.charAt(i++) - '0';
        if ( d < 0 || d > 9 )
            throw new NumberFormatException( "Malformed:  " + s );
        if ( num < multmax )
            throw new NumberFormatException( "Over/underflow:  " + s );
        num *= 10;
        if ( num < (max+d) )
            throw new NumberFormatException( "Over/underflow:  " + s );
        num -= d;
    }

    return sign * num;
}

To convert a character digit to a number, you could call Character.digit( ) and pass in the character and radix (base). For instance, digit('8',10) returns 8, and digit('B',16) returns 11. This is what the JDK's implementation of Integer.parseInt( ) does.

The digit( ) method is general, but for the common radix 10 case there is a faster way. Java characters are encoded using Unicode which assigns digit characters consecutive code points. So, to convert a digit character to a number, simply subtract the '0' character.

int d = s.charAt(i) - '0';

For each character in the string, multiply the accumulated value by 10 and add the next digit. If a digit is not 0 through 9, throw an exception. We also have to watch out for integer overflow and underflow. This works better if we accumulate as negative numbers, then negate afterwards based on the presence or absence of a leading '-'.

All the digit and underflow/overflow checking takes time. If you're parsing numbers where you know they are valid, you can skip all of this and go faster. This is safe to do when parsing well-known file formats and data sets, such as the text files for the US Census. Such files have already been vetted for bad digits and underflow/overflow. So, here's the same function stripped of its exception checking:

public static int parseInt( final String s )
{
    // Check for a sign.
    int num  = 0;
    int sign = -1;
    final int len  = s.length( );
    final char ch  = s.charAt( 0 );
    if ( ch == '-' )
        sign = 1;
    else
        num = '0' - ch;

    // Build the number.
    int i = 1;
    while ( i < len )
        num = num*10 + '0' - s.charAt( i++ );

    return sign * num;
} 

Benchmarks

I benchmarked each of these approaches reading 1,000,000 random integers. The results plotted below are from an unloaded Mac with JVM 5, but similar results were found on a Windows PC and JVM 6. The JVM was run with the following options:

java -server -XX:CompileThreshold=2 -XX:+AggressiveOpts -XX:+UseFastAccessorMethods Test

The -server option uses larger memory defaults and a parallel garbage collector to improve performance. The -XX:CompileThreshold=2 option forced the JVM to compile methods after just two calls, insuring that compilation occurred early, and not in the middle of the benchmarks. The -XX:+AggressiveOpts option enables fancier optimizations, and -XX:+UseFastAccessorMethods inlines some get( ) methods.

Measured times vary from processor to processor, so I've converted times to relative percentages for more general comparisons.

StreamTokenizer 100.0%
NumberFormat 81.1%
Integer 20.1%
Custom 6.7%
Custom nocheck 5.2%

Benchmark results of Java integer parsing

Using a StreamTokenizer to parse a single integer from a String is clearly convoluted. Not surprisingly, it is slower than any of the other methods.

What is perhaps surprising is that Integer takes 1/4th the time to parse an integer compared to NumberFormat. The performance difference comes from NumberFormat's more general parser that uses a lexical pattern to control the number of digits and the presence of special characters for currency signs, percentages, thousdands separators, etc. This extra generality is useful, but expensive.

The custom integer parser, with digit and overflow/underflow checking, is faster still at about 1/3 the time used by Integer. The performance difference is a result of two implementation differences between the custom parser and Integer's parser:

  • Integer.parseInt(string) calls the more general Integer.parseInt(string,radix) with a radix of 10. Unfortunately, the more general method's handling of non-10 radixes slows it down. Since the vast majority of integers are radix 10, Java's developers should have provided an accelerated version of parseInt( ) for radix 10.
  • Integer.parseInt(string,radix) uses Character.digit(ch,radix) to convert character digits to numbers. This is necessary for general handling of non-10 radixes, but it's unnecessarily slow for the simpler radix 10 case.

Together, these two slower implementation choices by Java's developers slow down Integer.parseInt( ) for standard radix 10 numbers. The custom parser above avoids these problems and goes much faster.

For the special case where the numbers to be parsed are known to be valid, the custom parser without digit and overflow/underflow checking goes a little faster. The speed difference is probably not worth the risk, though.

Conclusions

For the best Java integer parsing performance:

  • Integer.parseInt is much faster than NumberFormat.parse.
  • A custom parser for radix 10 numbers is much faster than Integer.parseInt.

If you have a large number of integers to parse, writing your own parser (or using my code above) can provide a significant performance boost. Just because Java's packages provide a lot of useful methods doesn't mean that those methods have been implemented well. Often, with your better knowledge of what you need to accomplish, you can do better with custom code.

Comments

This is a wonderful piece!

This is a wonderful piece! Thank You for explaining!

Good Explaination

this explanation is very good for developer for performance measurement.
thanks

Post new comment

The content of this field is kept private and will not be shown publicly.
  • Allowed HTML tags: <a> <em> <strong> <cite> <code> <ul> <ol> <li> <dl> <dt> <dd>
  • Lines and paragraphs break automatically.
  • Web page addresses and e-mail addresses turn into links automatically.

More information about formatting options

Nadeau software consulting
Nadeau software consulting