Java’s String class offers a very convenient split() method. In a pinch, you can write for instance
String[] components = myString.split(",");
to split a string into smaller substrings delimited by commas. It is so easy that one is prone to forget that the parameter of split() is not a plain string, but a regular expression. That is powerful, but it comes at a cost: the regular expression must be compiled into a recognizer (a Pattern in Java parlance) before it can be used to actually split the string. And the compilation step is expensive.
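To make the regular-expression point concrete, here is a minimal sketch. The class and method names are made up for illustration. It shows that a delimiter such as "." is interpreted as a pattern, not as a literal string, unless it is quoted with Pattern.quote().

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class SplitRegexDemo {
    // "." is a regex metacharacter that matches any character, so every
    // character acts as a delimiter. All resulting tokens are empty, and
    // split() drops trailing empty strings, leaving an empty array.
    static String[] splitOnRawDot(String s) {
        return s.split(".");
    }

    // Pattern.quote() turns the delimiter into a literal pattern.
    static String[] splitOnLiteralDot(String s) {
        return s.split(Pattern.quote("."));
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitOnRawDot("a.b.c")));     // prints []
        System.out.println(Arrays.toString(splitOnLiteralDot("a.b.c"))); // prints [a, b, c]
    }
}
```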
I recently had to process a huge text file: 5 GB, 92 million lines. Each line had to be split into three parts separated by commas. After that, the program parsed one of the components as an integer, compared another component, computed a hash code and incremented an integer in a 16 M-element array.
At first I naively used String.split() to split the string, with a one-character pattern (","), meaning that the pattern was compiled for each line, here 92 million times. Then I compiled the regular expression once and for all:
Pattern commas = Pattern.compile(",");
// and later
String[] components = commas.split(line);
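Put together, the pre-compiled variant might look like the sketch below. The class name, the helper sumFirstField() and the choice of the first field as the integer are assumptions made for illustration; the actual program is not shown here.

```java
import java.util.List;
import java.util.regex.Pattern;

class PrecompiledSplit {
    // Compiled once, when the class is initialized. Pattern instances are
    // immutable and can safely be shared.
    private static final Pattern COMMA = Pattern.compile(",");

    // Hypothetical helper: sums the integer found in the first field of
    // each comma-separated line.
    static long sumFirstField(List<String> lines) {
        long sum = 0;
        for (String line : lines) {
            String[] components = COMMA.split(line); // no recompilation here
            sum += Integer.parseInt(components[0]);
        }
        return sum;
    }
}
```

The only change compared with the naive version is that Pattern.compile() runs outside the loop.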
I also tried to use a StringTokenizer:
StringTokenizer t = new StringTokenizer(line, ",");
// and then use t.countTokens() and t.nextToken()
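As a rough sketch of the StringTokenizer variant (the class and method names are illustrative, not taken from the original program), the tokens can be collected into an array like this. Note that StringTokenizer silently skips empty fields, unlike split().

```java
import java.util.StringTokenizer;

class TokenizerSplit {
    // Collects all comma-delimited tokens of a line into an array.
    static String[] tokenize(String line) {
        StringTokenizer t = new StringTokenizer(line, ",");
        String[] parts = new String[t.countTokens()];
        for (int i = 0; i < parts.length; i++) {
            parts[i] = t.nextToken();
        }
        return parts;
    }
}
```

With this helper, "a,,c" yields two tokens, whereas split(",") would return three components including the empty one.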
And finally I wrote a hand-crafted parser, optimized as much as I could.
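The hand-crafted parser itself is not reproduced here. One plausible shape for such a parser, assuming each line holds exactly three fields separated by two commas, is a pair of indexOf() calls followed by substring(). The class and method names below are hypothetical.

```java
class ManualSplit {
    // Splits a line into three fields at its first two commas.
    // Assumes the fields themselves contain no quoting or escaping; any
    // further comma would simply remain inside the third field.
    static String[] splitThree(String line) {
        int first = line.indexOf(',');
        int second = line.indexOf(',', first + 1);
        if (first < 0 || second < 0) {
            throw new IllegalArgumentException(
                "expected three comma-separated fields: " + line);
        }
        return new String[] {
            line.substring(0, first),
            line.substring(first + 1, second),
            line.substring(second + 1)
        };
    }
}
```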
The results are as follows, using a Unibody MacBook Pro, 2.66 GHz Intel Core 2 Duo:
Method | Average execution time |
---|---|
Naive split() | 160 s |
Pre-compiled pattern and split() | 130 s |
StringTokenizer | 160 s |
Hand-crafted parser | 130 s |
Pattern compilation done by String.split() behind the scenes is very costly. Okay, here it’s on average “only” 0.3 µs for compiling a one-character pattern. But repeated 92 million times, it increased the execution time of the whole program by 23%!
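These figures can be checked from the table, assuming the whole difference between the naive and the pre-compiled runs is due to pattern compilation:

$$
t_{\text{compile}} \approx \frac{160\ \text{s} - 130\ \text{s}}{92 \times 10^{6}} \approx 0.33\ \mu\text{s},
\qquad
\frac{160\ \text{s} - 130\ \text{s}}{130\ \text{s}} \approx 23\,\%
$$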
In conclusion, String.split() is very convenient when using a pattern once or a few times. But for any repeated processing, the way to go is to compile the pattern with Pattern.compile() first. Forget about StringTokenizer (it’s inefficient and officially discouraged) and hand-crafted parsers (there’s no clear benefit).
© Christophe Jacquet.