An experiment: splitting strings repeatedly

I recently had to process a huge text file: 5 GB, 92 million lines. Each line had to be split into three parts separated by commas. After that, the program parses one of the components as an integer, compares another component, computes a hash code, and increments an integer in a 16M-element array.
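For reference, the per-line work boiled down to something like this. The field positions, the comparison value and the hash-to-index mapping below are illustrative assumptions, not the actual program:

// Sketch of the per-line work; field positions, the compared value and the
// hashing details are assumptions -- the real code is not shown in this post.
int[] counters = new int[16 * 1024 * 1024];   // the 16M-element array

void processLine(String line) {
    String[] parts = line.split(",");          // the splitting step discussed below
    int id = Integer.parseInt(parts[0]);       // parse one component as an integer (its later use is not shown)
    if (parts[1].equals("SOME_VALUE")) {       // compare another component
        int index = (parts[2].hashCode() & 0x7fffffff) % counters.length;  // compute a hash code
        counters[index]++;                     // increment an integer in the array
    }
}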

At first I naively called String.split() on each line with a one-character pattern (line.split(",")), which means the pattern is compiled anew for every line, 92 million times here. Then I compiled the regular expression once and for all:

Pattern commas = Pattern.compile(",");

// and later
String[] components = commas.split(line);

I also tried to use a StringTokenizer:

StringTokenizer t = new StringTokenizer(line, ",");

// and then use t.countTokens() and t.nextToken()
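Spelled out for the three-field case, the usage is roughly as follows (the explicit countTokens() check is my addition):

StringTokenizer t = new StringTokenizer(line, ",");  // java.util.StringTokenizer
if (t.countTokens() == 3) {                          // expect exactly three fields
    String first  = t.nextToken();
    String second = t.nextToken();
    String third  = t.nextToken();
    // ... then the same per-line work as above
}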

And finally I wrote a hand-crafted parser, optimized as much as I could.
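A minimal version of such a parser, based on indexOf() and substring() rather than any regex machinery, looks like this (a sketch of the idea, not necessarily the exact code that ran in the benchmark):

// Splits "a,b,c" into its three fields with indexOf()/substring(), no regex.
// Assumes every line really contains exactly two commas.
static String[] splitThree(String line) {
    int first  = line.indexOf(',');
    int second = line.indexOf(',', first + 1);
    return new String[] {
        line.substring(0, first),
        line.substring(first + 1, second),
        line.substring(second + 1)
    };
}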

The results are as follows, measured on a Unibody MacBook Pro (2.66 GHz Intel Core 2 Duo):

Method                               Average execution time
Naive split()                        160 s
Pre-compiled pattern and split()     130 s
StringTokenizer                      160 s
Hand-crafted parser                  130 s

Conclusions

The pattern compilation that String.split() does behind the scenes is costly. Granted, compiling a one-character pattern takes on average “only” about 0.3 µs. But repeated 92 million times, that adds up to roughly 30 seconds and increased the execution time of the whole program by 23%!

In conclusion, String.split() is perfectly convenient when a pattern is used once or a few times. But for repeated processing of millions of lines, the way to go is to compile the pattern once with Pattern.compile(). Forget about StringTokenizer (it was no faster here, and its use in new code is officially discouraged) and about hand-crafted parsers (they showed no clear benefit over a pre-compiled pattern).