Re: [CSV] Performance
See the Line and FastLine classes in org.apache.mahout.classifier.sgd.SimpleCsvExamples in the Mahout Examples module.

You can see an older version of Mahout here. This class hasn't changed in forever.

https://github.com/tdunning/mahout/blob/debian-package/examples/src/main/java/org/apache/mahout/classifier/sgd/SimpleCsvExamples.java

On Thu, Mar 15, 2012 at 3:11 PM, Emmanuel Bourg ebo...@apache.org wrote:
> Thank you for sharing your experience Ted. Do you have a link to the code of your parser? I'd like to take a look.
Re: [CSV] Performance
On Thu, Mar 15, 2012 at 3:11 PM, Emmanuel Bourg ebo...@apache.org wrote:
> ...
> 1. Buffer the data in the BufferedReader
> 2. Accumulate data in a reusable buffer for the current token

Reusable buffers are usually death in terms of subtle bugs and they rarely actually help that much. The key is to avoid copying. Cons'ing up and then collecting pointerized structures isn't that expensive.

> 3. Turn the token buffer into a String

Also EVIL. Mustn't convert to String unless that is really what you want.

> I was also thinking of something similar to reduce the string copies. The token from the CSVLexer could probably contain a CharSequence instead of a String. The CharSequence would be backed by the same array for all the fields of the record. Thus if a field isn't read by the user we don't pay the cost to convert it into a String. But this prevents the reuse of the buffer, and that means more work for the GC.

Just moving around chars costs twice as much as moving around bytes for most CSV data. I would avoid that if possible.

I wouldn't worry about the GC. The experience in Hadoop and Lucene is that the effort made to avoid allocating lightweight structures was very misguided. My own experiments have never shown a big benefit unless you conflate cons'ing the structures with copying lots of data. If you avoid the copy, the construction and collection of ephemeral structures turns out to be very nearly free.
Re: [CSV] Performance
On 15/03/2012 13:34, sebb wrote:
> In my testing, using final class variables for delimiter, escape etc (set in ctor) shaves about 1 sec off the time to read the world town data file compared with accessing these fields inline through the format field.
>
> Average time goes from c. 25.5 to c. 24.5 which is a 4% improvement.
>
> I suspect this is partly because the fetches are currently in loops rather than any getter overhead.

Did you run with the client or the server VM? My tests were all with the server VM.

Emmanuel Bourg
Re: [CSV] Performance
On 15 March 2012 13:50, Gary Gregory garydgreg...@gmail.com wrote:
> Can you put your perf test code and resources in SVN so I do not have to write one please?

Hi Gary,

have a look at http://markmail.org/message/x73i3hl63rjqdyfa (I agree with you that having a clean performance test in SVN would be better).

Regards,
Benedikt
Re: [CSV] Performance
On 15 March 2012 12:43, Emmanuel Bourg ebo...@apache.org wrote:
> Did you run with the client or the server VM? My tests were all with the server VM.

Eclipse, so probably client VM?
Re: [CSV] Performance
On 15/03/2012 14:13, sebb wrote:
> Eclipse, so probably client VM?

Probably. You can print the java.vm.name system property at the beginning of the test; that will tell you which VM is used.

Emmanuel Bourg
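The check suggested in this message can be sketched as follows. The class name VmInfo is only an illustration; java.vm.name and java.vm.version are standard system properties, but the exact text they return depends on the JVM in use.

```java
// Minimal sketch: report which JVM is running before a benchmark starts.
final class VmInfo {

    /** Returns the VM name and version as reported by the running JVM. */
    static String describe() {
        return System.getProperty("java.vm.name") + " " + System.getProperty("java.vm.version");
    }

    public static void main(String[] args) {
        // On HotSpot the name typically mentions "Client VM" or "Server VM".
        System.out.println(describe());
    }
}
```

Calling VmInfo.describe() at the start of the performance test and logging the result makes it possible to compare timings from different machines on an equal footing.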
Re: [CSV] Performance
I built a limited CSV package for parsing data in Mahout at one point. I doubt that it was general enough to be helpful here, but the experience might be.

The thing that *really* made a big difference in speed was to avoid copies and conversions to String. To do that, I built a state machine that operated on bytes to do the parsing from byte arrays. The parser passed around offsets only. Then when converting data, I converted directly from the original byte array into the target type.

For the most common case (in my data) of converting to Integers, this eliminated masses of cons'ing and because the conversion was special purpose (I assumed UTF8 encoding and assumed that numbers could only use ASCII range digits), the conversion to integers was particularly fast.

Overall, this made about a 20x difference in speed. This is not 20%; the final time was 5% of the original.
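The idea of passing offsets around and converting straight from the byte array can be sketched like this. It is not the Mahout code: ByteFields, parseInt and sumFields are hypothetical names, and the sketch assumes unquoted, comma-separated ASCII digits and does no overflow checking.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of offset-based parsing; not the Mahout implementation.
final class ByteFields {

    /** Parses a decimal int from buf[start, end), assuming ASCII digits and an optional leading '-'. */
    static int parseInt(byte[] buf, int start, int end) {
        if (start >= end) throw new NumberFormatException("empty field");
        int i = start;
        boolean negative = buf[i] == '-';
        if (negative) i++;
        if (i == end) throw new NumberFormatException("no digits");
        int value = 0;
        for (; i < end; i++) {
            int digit = buf[i] - '0';
            if (digit < 0 || digit > 9) throw new NumberFormatException("not a digit at offset " + i);
            value = value * 10 + digit; // assumption: the value fits in an int
        }
        return negative ? -value : value;
    }

    /** Sums the integer fields of one unquoted, comma-separated line without creating any String. */
    static long sumFields(byte[] line) {
        long sum = 0;
        int start = 0;
        for (int i = 0; i <= line.length; i++) {
            if (i == line.length || line[i] == ',') {
                sum += parseInt(line, start, i); // only offsets are passed, no copy is made
                start = i + 1;
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        byte[] line = "12,-7,300".getBytes(StandardCharsets.US_ASCII);
        System.out.println(sumFields(line));
    }
}
```

The only allocation in the example is the input array itself; each field is identified by a start and end offset and converted in place.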
Re: [CSV] Performance
Thank you for sharing your experience Ted. Do you have a link to the code of your parser? I'd like to take a look.

Currently the data flow in Commons CSV is:

1. Buffer the data in the BufferedReader
2. Accumulate data in a reusable buffer for the current token
3. Turn the token buffer into a String

I was also thinking of something similar to reduce the string copies. The token from the CSVLexer could probably contain a CharSequence instead of a String. The CharSequence would be backed by the same array for all the fields of the record. Thus if a field isn't read by the user we don't pay the cost of converting it into a String. But this prevents the reuse of the buffer, and that means more work for the GC.

Emmanuel Bourg
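A CharSequence backed by a shared record array, as proposed in this message, might look roughly like the sketch below. FieldView is a hypothetical name and not part of Commons CSV; the sketch assumes the backing array is not modified while views on it are alive, which is exactly why the buffer could no longer be reused.

```java
// Hypothetical sketch of a field view over a shared char[]; not Commons CSV code.
final class FieldView implements CharSequence {
    private final char[] buf;  // shared by all fields of the record
    private final int start;
    private final int length;

    FieldView(char[] buf, int start, int length) {
        this.buf = buf;
        this.start = start;
        this.length = length;
    }

    @Override
    public int length() {
        return length;
    }

    @Override
    public char charAt(int index) {
        if (index < 0 || index >= length) throw new IndexOutOfBoundsException("index " + index);
        return buf[start + index];
    }

    @Override
    public CharSequence subSequence(int from, int to) {
        if (from < 0 || to > length || from > to) throw new IndexOutOfBoundsException();
        return new FieldView(buf, start + from, to - from); // still no copy
    }

    /** The copy into a String happens only when the caller asks for it. */
    @Override
    public String toString() {
        return new String(buf, start, length);
    }
}
```

For a record held in a char[] containing abc,def, new FieldView(record, 4, 3) exposes the second field; a String is created only if toString() is called on it.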
Re: [CSV] Performance
On 15 March 2012 13:17, Emmanuel Bourg ebo...@apache.org wrote:
> Probably. You can print the java.vm.name system property at the beginning of the test, that will tell you the VM used.

It was client. I've now tried with server, and it is slower when using class fields created by the ctor. Curious - perhaps it is a data localisation issue, though I would have thought the optimiser could have fetched those into local variables.

So I then tried hauling the format method calls out of the loops into final local variables. This improves performance (slightly) in both client and server mode.
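The change described here, moving format method calls out of the loops into final local variables, can be illustrated with a small sketch. Format, countInline and countHoisted are hypothetical stand-ins, not the actual lexer code, and the sketch makes no claim about which variant is faster on a given VM; the thread itself reports conflicting measurements.

```java
// Hypothetical illustration of hoisting a getter call out of a loop.
final class HoistingSketch {

    /** Stand-in for a format object that exposes its settings through getters. */
    static final class Format {
        private final char delimiter;

        Format(char delimiter) {
            this.delimiter = delimiter;
        }

        char getDelimiter() {
            return delimiter;
        }
    }

    /** Calls the getter on every iteration. */
    static int countInline(Format format, char[] data) {
        int n = 0;
        for (int i = 0; i < data.length; i++) {
            if (data[i] == format.getDelimiter()) n++;
        }
        return n;
    }

    /** Reads the setting once into a final local before the loop starts. */
    static int countHoisted(Format format, char[] data) {
        final char delimiter = format.getDelimiter();
        int n = 0;
        for (char c : data) {
            if (c == delimiter) n++;
        }
        return n;
    }
}
```

Both methods return the same count; only the place where the delimiter is fetched differs, which is the variable being measured in this part of the thread.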
Re: [CSV] Performance
On 15/03/2012 16:45, sebb wrote:
> So I then tried hauling the format method calls out of the loops into final local variables. This improves performance (slightly) in both client and server mode.

Could you show some code please? I'm unable to reproduce this. I used local variables in simpleTokenLexer and the performance degrades by 14% (HotSpot server VM, JDK 6u31, Core 2 Duo).

Emmanuel Bourg
Re: [CSV] Performance
On 15 March 2012 16:10, Emmanuel Bourg ebo...@apache.org wrote:
> Could you show some code please?

Ditto.

> I'm unable to reproduce this. I used local variables in simpleTokenLexer and the performance degrades by 14% (HotSpot server VM, JDK 6u31, Core 2 Duo).

I also used local vars in the other methods. See http://people.apache.org/~sebb/CSV/
Re: [CSV] Performance
On 15/03/2012 17:42, sebb wrote:
> I also used local vars in the other methods. See http://people.apache.org/~sebb/CSV/

Thank you. You also reordered the if in simpleTokenLexer; that may explain the difference. I'll give it a try.

Emmanuel Bourg
Re: [CSV] Performance
On 15 March 2012 16:48, Emmanuel Bourg ebo...@apache.org wrote:
> Thank you. You also reordered the if in simpleTokenLexer, that may explain the difference. I'll give it a try.

I'd forgotten about that; I thought I'd reverted that.

If I revert it, I still get a better time with Lexer2, though not quite as good an improvement.
Re: [CSV] Performance
On 15/03/2012 18:06, sebb wrote:
> If I revert it, I still get a better time with Lexer2, though not quite as good an improvement.

I ran my perf test, Lexer2 is slower on my system :(

The order of the if doesn't change much here.

Emmanuel Bourg
Re: [csv] Performance comparison
After more experiments I'm less enthusiastic about providing an optimized BufferedReader. The result of the performance test differs significantly depending on whether the test is run alone or after all the other unit tests (about 30% slower).

When all the tests are executed, the removal of the synchronized blocks in BufferedReader has no visible effect (maybe less than 1%), and the Harmony implementation becomes slower.

Emmanuel Bourg
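For context, a timing loop with untimed warm-up runs could be sketched as below. TimingSketch and its time method are hypothetical and are not the test code used in this thread. The sketch does not remove the order dependence described above; measuring in a freshly started JVM is the usual way to rule out the influence of previously executed tests.

```java
// Hypothetical timing helper with warm-up iterations.
final class TimingSketch {

    /** Runs the task `warmups` times untimed, then returns the duration of each timed run in milliseconds. */
    static long[] time(Runnable task, int warmups, int runs) {
        for (int i = 0; i < warmups; i++) {
            task.run(); // gives the JIT compiler a chance to compile the hot paths
        }
        long[] millis = new long[runs];
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            task.run();
            millis[i] = (System.nanoTime() - start) / 1_000_000;
        }
        return millis;
    }
}
```

Reporting every timed run rather than a single average makes it easier to notice when results drift between executions.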
Re: [csv] Performance comparison
On Tue, Mar 13, 2012 at 4:33 AM, Ralph Goers ralph.go...@dslextreme.com wrote:
>> I don't think we should be trying to recode JDK classes.
>
> If the implementations suck, why not?

+1

--
http://www.grobmeier.de
https://www.timeandbill.de
Re: [csv] Performance comparison
On 13/03/2012 01:44, sebb wrote:
> I don't think we should be trying to recode JDK classes.

I'd rather not, but we have done that in the past. FastDateFormat and StrBuilder come to mind.

Emmanuel Bourg
Re: [csv] Performance comparison
On 13 March 2012 09:01, Emmanuel Bourg ebo...@apache.org wrote:
> I'd rather not, but we have done that in the past. FastDateFormat and StrBuilder come to mind.

And now Java has StringBuilder, which means StrBuilder is perhaps no longer necessary...
Re: [csv] Performance comparison
On 13/03/2012 02:47, Niall Pemberton wrote:
> IMO performance should be taken out of the equation by using the Readable interface[1]. That way the users can use whatever implementation suits them (for example using an underlying buffered InputStream) to change/improve performance.

If you mean that the performance of BufferedReader should be taken out of the equation then I agree. All CSV parsers should be compared with the same input source, otherwise the comparison isn't fair.

Using Readable would be really nice, but that's very low level. We would have to build line reading and mark/reset on top of that, which is almost equivalent to reimplementing BufferedReader. If [io] could provide a BufferedReader implementation that:

- takes a Readable in the constructor
- does not synchronize reads
- recognizes Unicode line separators (and the classic ones)

then I buy it right away!

Emmanuel Bourg
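To show how low level the Readable interface is, here is a minimal sketch that consumes a Readable through a caller-owned CharBuffer. ReadableSketch and countLines are hypothetical names. Only '\n' is treated as a line terminator, so neither the Unicode separators nor the mark/reset support mentioned above are covered.

```java
import java.io.IOException;
import java.io.StringReader;
import java.nio.CharBuffer;

// Hypothetical sketch: consuming any Readable with our own buffer.
final class ReadableSketch {

    /** Counts '\n' characters supplied by the given Readable. */
    static int countLines(Readable in) throws IOException {
        CharBuffer buf = CharBuffer.allocate(4096);
        int lines = 0;
        while (in.read(buf) != -1) {   // Readable offers only this one method
            buf.flip();                // switch the buffer from filling to draining
            while (buf.hasRemaining()) {
                if (buf.get() == '\n') lines++;
            }
            buf.clear();               // make room for the next read
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        // java.io.Reader implements Readable, so any Reader can be passed in.
        System.out.println(countLines(new StringReader("a,b\nc,d\n")));
    }
}
```

Everything beyond bulk reads, such as readLine() or look-ahead, would have to be built on top of this loop, which is the reimplementation effort the message refers to.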
Re: [csv] Performance comparison
I have identified the performance killer: it's the ExtendedBufferedReader. It implements complex logic to fetch one character ahead, but this extra character is rarely used. I have implemented a simpler look ahead using mark/reset as suggested by Bob Smith in CSV-42 and the performance improved by 30%.

Now the parsing is down to 3406 ms, and that's almost without touching the parser yet.

Emmanuel Bourg

On 11/03/2012 15:05, Emmanuel Bourg wrote:
> Hi,
>
> I compared the performance of Commons CSV with the other CSV parsers available. I took the world cities file from Maxmind as a test file [1], it's a big file of 130M with 2.8 million records.
>
> Here are the results obtained on a Core 2 Duo E8400 after several iterations to let the JIT compiler kick in:
>
> Direct read      750 ms
> Java CSV        3328 ms
> Super CSV       3562 ms  (+7%)
> OpenCSV         3609 ms  (+8.4%)
> GenJava CSV     3844 ms  (+15.5%)
> Commons CSV     4656 ms  (+39.9%)
> Skife CSV       4813 ms  (+44.6%)
>
> I also tried Nuiton CSV and Esperio CSV but I couldn't figure how to use them.
>
> I haven't analyzed why Commons CSV is slower yet, but it seems there is room for improvements. The memory usage will have to be compared too, I'm looking for a way to measure it.
>
> Emmanuel Bourg
>
> [1] http://www.maxmind.com/download/worldcities/worldcitiespop.txt.gz
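A one-character look ahead built on mark/reset can be sketched as follows. This is an illustration of the technique using the standard BufferedReader API, not the code committed for CSV-42; LookAheadSketch and lookAhead are hypothetical names.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

// Hypothetical sketch of a look ahead based on mark/reset.
final class LookAheadSketch {

    /** Returns the next character without consuming it, or -1 at the end of the stream. */
    static int lookAhead(BufferedReader in) throws IOException {
        in.mark(1);          // remember the current position; one char of read-ahead is enough
        int c = in.read();
        in.reset();          // rewind so that the next read() returns the same character
        return c;
    }

    public static void main(String[] args) throws IOException {
        BufferedReader in = new BufferedReader(new StringReader("ab"));
        System.out.println((char) lookAhead(in)); // peeks at the first character
        System.out.println((char) in.read());     // reads that same character
    }
}
```

The look ahead costs something only when it is requested, in contrast to always fetching one character ahead.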
Re: [csv] Performance comparison
On 12 March 2012 10:31, Emmanuel Bourg ebo...@apache.org wrote:
> I have identified the performance killer, it's the ExtendedBufferedReader. It implements a complex logic to fetch one character ahead, but this extra character is rarely used. I have implemented a simpler look ahead using mark/reset as suggested by Bob Smith in CSV-42 and the performance improved by 30%.

Java has a PushbackReader class - could that not be used?
Re: [csv] Performance comparison
On 12/03/2012 16:44, sebb wrote:
> Java has a PushbackReader class - could that not be used?

I considered it, but it doesn't mix well with line reading. The mark/reset solution is really simple and efficient.

Emmanuel Bourg
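For comparison, a look ahead with java.io.PushbackReader could be sketched like this; PushbackSketch and peek are hypothetical names. Note that PushbackReader has no readLine() method, which is one plausible reading of the remark that it doesn't mix well with line reading.

```java
import java.io.IOException;
import java.io.PushbackReader;
import java.io.StringReader;

// Hypothetical sketch of a look ahead based on PushbackReader.
final class PushbackSketch {

    /** Returns the next character without consuming it, or -1 at the end of the stream. */
    static int peek(PushbackReader in) throws IOException {
        int c = in.read();
        if (c != -1) {
            in.unread(c); // push the character back; the default pushback buffer holds one char
        }
        return c;
    }

    public static void main(String[] args) throws IOException {
        PushbackReader in = new PushbackReader(new StringReader("x,y"));
        System.out.println((char) peek(in));   // peeks at the first character
        System.out.println((char) in.read());  // reads that same character
    }
}
```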
Re: [csv] Performance comparison
On 12 March 2012 11:31, Emmanuel Bourg ebo...@apache.org wrote:
> I have identified the performance killer, it's the ExtendedBufferedReader. It implements a complex logic to fetch one character ahead, but this extra character is rarely used. I have implemented a simpler look ahead using mark/reset as suggested by Bob Smith in CSV-42 and the performance improved by 30%.
>
> Now the parsing is down to 3406 ms, and that's almost without touching the parser yet.

Great work Emmanuel!

Looking at my profiler, I can say that 70% of the time is spent in ExtendedBufferedReader.read(). This is no wonder, since read() is the method that does the actual work. However, we should try to minimize calls to read(). For example isEndOfLine() calls read() two times, and isEndOfLine() gets called 5 times by CSVLexer.nextToken() and its submethods.

The whole logic behind CSVLexer.nextToken() is very hard to read (IMHO). Maybe some refactoring would make it easier to identify bottlenecks?

Benedikt
Re: [csv] Performance comparison
On 12/03/2012 17:03, Benedikt Ritter wrote:
> The whole logic behind CSVLexer.nextToken() is very hard to read (IMHO). Maybe some refactoring would make it easier to identify bottlenecks?

Yes, I started investigating in this direction. I filed a few bugs regarding the behavior of the escaping that aim at clarifying the parser. I think the nextToken() method should be broken into smaller methods to help the JIT compiler.

The JIT does some surprising things; I found that even unused code branches can have an impact on the performance. For example if simpleTokenLexer() is changed to not support escaped characters, the performance improves by 10% (the input has no escaped character). And that's not merely because an if statement was removed. If I add a System.out.println() in this if block that is never called, the performance improves as well.

So any change to the parser will have to be carefully tested. Innocent changes can have a significant impact.

Emmanuel Bourg
Re: [csv] Performance comparison
Would one of the parser libraries not work here?

On Mar 12, 2012 12:22 PM, Emmanuel Bourg ebo...@apache.org wrote:
> Yes I started investigating in this direction. I filed a few bugs regarding the behavior of the escaping that aim at clarifying the parser. I think the nextToken() method should be broken into smaller methods to help the JIT compiler.
Re: [csv] Performance comparison
On 12 March 2012 17:22, Emmanuel Bourg ebo...@apache.org wrote:
> Yes I started investigating in this direction. I filed a few bugs regarding the behavior of the escaping that aim at clarifying the parser. I think the nextToken() method should be broken into smaller methods to help the JIT compiler.

I would start by eliminating the Token parameter. You could either create a new token on each method call and return that one instead of reusing the one that gets passed in, or you could use a private field currentToken in CSVLexer. But I think that object creation costs for a data object like Token can be considered irrelevant (so creating one in each method call will not hurt us).
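The first option, returning a fresh token from each call instead of filling one that is passed in, can be sketched as below. Type, Token and Lexer here are simplified, hypothetical stand-ins; the sketch splits a String on a delimiter and ignores quoting, escaping and line handling.

```java
// Hypothetical sketch of a lexer that returns a new token per call.
final class TokenSketch {

    enum Type { TOKEN, EOF }

    /** Immutable data object; a new instance is created for every call to nextToken(). */
    static final class Token {
        final Type type;
        final String content;

        Token(Type type, String content) {
            this.type = type;
            this.content = content;
        }
    }

    static final class Lexer {
        private final String input;
        private final char delimiter;
        private int pos;

        Lexer(String input, char delimiter) {
            this.input = input;
            this.delimiter = delimiter;
        }

        /** Returns the next field as a new Token, or an EOF token once the input is exhausted. */
        Token nextToken() {
            if (pos > input.length()) {
                return new Token(Type.EOF, "");
            }
            int end = input.indexOf(delimiter, pos);
            if (end < 0) {
                end = input.length();
            }
            Token token = new Token(Type.TOKEN, input.substring(pos, end));
            pos = end + 1; // skip the delimiter; moves past the end after the last field
            return token;
        }
    }
}
```

With this shape the caller never has to reset a shared token between calls, at the price of one small allocation per field.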
Re: [csv] Performance comparison
On 12/03/2012 17:28, James Carman wrote:
> Would one of the parser libraries not work here?

Are you thinking of something like JavaCC or ANTLR? Not sure it'll be more efficient than a handcrafted parser. The CSV format is simple enough to do it manually.

Emmanuel Bourg
Re: [csv] Performance comparison
On Mon, Mar 12, 2012 at 5:41 PM, Emmanuel Bourg ebo...@apache.org wrote:
> On 12/03/2012 17:28, James Carman wrote:
>> Would one of the parser libraries not work here?
>
> Are you thinking of something like JavaCC or ANTLR? Not sure it'll be more efficient than a handcrafted parser. The CSV format is simple enough to do it manually.

+1

I did the same for my JSON lib... JavaCC et al. are pretty complex. I still struggle to understand everything around OGNL... If not necessary, my preference is always to leave such tools out.

--
http://www.grobmeier.de
https://www.timeandbill.de
Re: [csv] Performance comparison
Yes, this is what I mean. It might be worth a shot. Folks who specialize in parsing have spent much time on these libraries. It would make sense that they are quite fast. It gets us out of the parsing business.

On Mar 12, 2012 12:41 PM, Emmanuel Bourg ebo...@apache.org wrote:
> Are you thinking of something like JavaCC or ANTLR? Not sure it'll be more efficient than a handcrafted parser. The CSV format is simple enough to do it manually.
Re: [csv] Performance comparison
I kept tickling ExtendedBufferedReader and I have some interesting results.

First I tried to simplify it by extending java.io.LineNumberReader instead of BufferedReader. The performance decreased by 20%, probably because the class is synchronized internally.

But wait, isn't BufferedReader also synchronized? I copied the code of BufferedReader and removed the synchronized blocks. Now the time to parse the file is down to 2652 ms, 28% faster than previously!

Of course the code of BufferedReader can't be copied from the JDK due to the license mismatch, so I took the version from Harmony. On my test it is about 4% faster than the JDK counterpart, and the parsing time is now around 2553 ms.

Now Commons CSV can start claiming being the fastest CSV parser around :)

Emmanuel Bourg
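The idea of a buffered reader without synchronized blocks can be illustrated with a much reduced sketch. UnsyncBufferedReader is a hypothetical class written for this illustration, not the Harmony code; it supports neither mark/reset nor readLine(), and it must only be used from a single thread. The wrapped Reader may still synchronize internally.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// Hypothetical single-threaded buffered reader; not the Harmony implementation.
final class UnsyncBufferedReader extends Reader {
    private final Reader in;
    private final char[] buf = new char[8192];
    private int pos;    // index of the next char to hand out
    private int limit;  // number of valid chars in buf

    UnsyncBufferedReader(Reader in) {
        this.in = in;
    }

    @Override
    public int read() throws IOException {
        if (pos >= limit && !fill()) return -1;
        return buf[pos++];   // no lock is taken on this hot path
    }

    @Override
    public int read(char[] dst, int off, int len) throws IOException {
        if (len == 0) return 0;
        if (pos >= limit && !fill()) return -1;
        int n = Math.min(len, limit - pos);
        System.arraycopy(buf, pos, dst, off, n);
        pos += n;
        return n;
    }

    /** Refills the buffer; returns false at the end of the stream. */
    private boolean fill() throws IOException {
        int n;
        do {
            n = in.read(buf, 0, buf.length);
        } while (n == 0);
        if (n < 0) return false;
        pos = 0;
        limit = n;
        return true;
    }

    @Override
    public void close() throws IOException {
        in.close();
    }

    public static void main(String[] args) throws IOException {
        Reader r = new UnsyncBufferedReader(new StringReader("a,b\n"));
        StringBuilder sb = new StringBuilder();
        for (int c = r.read(); c != -1; c = r.read()) sb.append((char) c);
        System.out.print(sb);
    }
}
```

Whether dropping the locks pays off depends on the JVM and on what else has run in it, so any gain would need to be measured rather than assumed.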
Re: [csv] Performance comparison
On 13 March 2012 00:12, Emmanuel Bourg ebo...@apache.org wrote:

> I kept tickling ExtendedBufferedReader and I have some interesting results.
>
> First I tried to simplify it by extending java.io.LineNumberReader instead of BufferedReader. The performance decreased by 20%, probably because the class is synchronized internally.
>
> But wait, isn't BufferedReader also synchronized? I copied the code of BufferedReader and removed the synchronized blocks. Now the time to parse the file is down to 2652 ms, 28% faster than previously!
>
> Of course the code of BufferedReader can't be copied from the JDK due to the license mismatch, so I took the version from Harmony. On my test it is about 4% faster than the JDK counterpart, and the parsing time is now around 2553 ms.

I'm concerned that the CSV code may grow and grow with private versions of code that could be provided by the JDK.

By all means make sure the code is efficient in the way it uses the JDK classes, but I don't think we should be recoding standard classes.

> Now Commons CSV can start claiming to be the fastest CSV parser around :)
>
> Emmanuel Bourg
>
> On 12/03/2012 11:31, Emmanuel Bourg wrote:
>> I have identified the performance killer, it's the ExtendedBufferedReader. It implements a complex logic to fetch one character ahead, but this extra character is rarely used. I have implemented a simpler look ahead using mark/reset as suggested by Bob Smith in CSV-42 and the performance improved by 30%.
>>
>> Now the parsing is down to 3406 ms, and that's almost without touching the parser yet.
>>
>> Emmanuel Bourg
>>
>> On 11/03/2012 15:05, Emmanuel Bourg wrote:
>>> Hi,
>>>
>>> I compared the performance of Commons CSV with the other CSV parsers available. I took the world cities file from Maxmind as a test file [1], it's a big file of 130M with 2.8 million records.
>>>
>>> Here are the results obtained on a Core 2 Duo E8400 after several iterations to let the JIT compiler kick in:
>>>
>>>   Direct read      750 ms
>>>   Java CSV        3328 ms
>>>   Super CSV       3562 ms  (+7%)
>>>   OpenCSV         3609 ms  (+8.4%)
>>>   GenJava CSV     3844 ms  (+15.5%)
>>>   Commons CSV     4656 ms  (+39.9%)
>>>   Skife CSV       4813 ms  (+44.6%)
>>>
>>> I also tried Nuiton CSV and Esperio CSV but I couldn't figure out how to use them.
>>>
>>> I haven't analyzed why Commons CSV is slower yet, but it seems there is room for improvement. The memory usage will have to be compared too, I'm looking for a way to measure it.
>>>
>>> Emmanuel Bourg
>>>
>>> [1] http://www.maxmind.com/download/worldcities/worldcitiespop.txt.gz

To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
For additional commands, e-mail: dev-h...@commons.apache.org
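The mark/reset look-ahead mentioned in the thread above can be sketched roughly as follows. This is only an illustration, not the ExtendedBufferedReader code from Commons CSV or the patch from CSV-42; the class and method names (LookAheadSketch, lookAhead) are made up for the example. It relies solely on BufferedReader.mark(int) and BufferedReader.reset() from the JDK.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

/** Hypothetical helper (not a Commons CSV class): one-character look-ahead via mark/reset. */
final class LookAheadSketch {

    /** Returns the next character without consuming it, or -1 at end of stream. */
    static int lookAhead(BufferedReader in) throws IOException {
        in.mark(1);         // remember the current position; one char of read-ahead is enough
        int c = in.read();  // read the next character
        in.reset();         // rewind to the marked position so the character is not consumed
        return c;
    }

    public static void main(String[] args) throws IOException {
        BufferedReader in = new BufferedReader(new StringReader("a,b"));
        System.out.println((char) lookAhead(in)); // peek at the first character
        System.out.println((char) in.read());     // the same character is still there to be read
        System.out.println((char) lookAhead(in)); // peek at the separator that follows
    }
}
```

The point of this design is that the reader keeps no extra state for the peeked character: the cost of looking ahead is paid only when the parser actually asks for it.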
Re: [csv] Performance comparison
On 13/03/2012 01:25, sebb wrote:

> I'm concerned that the CSV code may grow and grow with private versions of code that could be provided by the JDK.
>
> By all means make sure the code is efficient in the way it uses the JDK classes, but I don't think we should be recoding standard classes.

I agree such a class should not live in [csv], but maybe in [io]?

Emmanuel Bourg
Re: [csv] Performance comparison
On Mar 12, 2012, at 20:25, sebb seb...@gmail.com wrote:

> On 13 March 2012 00:12, Emmanuel Bourg ebo...@apache.org wrote:
>> I kept tickling ExtendedBufferedReader and I have some interesting results.
>>
>> First I tried to simplify it by extending java.io.LineNumberReader instead of BufferedReader. The performance decreased by 20%, probably because the class is synchronized internally.
>>
>> But wait, isn't BufferedReader also synchronized? I copied the code of BufferedReader and removed the synchronized blocks. Now the time to parse the file is down to 2652 ms, 28% faster than previously!
>>
>> Of course the code of BufferedReader can't be copied from the JDK due to the license mismatch, so I took the version from Harmony. On my test it is about 4% faster than the JDK counterpart, and the parsing time is now around 2553 ms.
>
> I'm concerned that the CSV code may grow and grow with private versions of code that could be provided by the JDK.
>
> By all means make sure the code is efficient in the way it uses the JDK classes, but I don't think we should be recoding standard classes.

+1

Gary
Re: [csv] Performance comparison
On Mar 12, 2012, at 20:30, Emmanuel Bourg ebo...@apache.org wrote:

> On 13/03/2012 01:25, sebb wrote:
>> I'm concerned that the CSV code may grow and grow with private versions of code that could be provided by the JDK.
>>
>> By all means make sure the code is efficient in the way it uses the JDK classes, but I don't think we should be recoding standard classes.
>
> I agree such a class should not live in [csv], but maybe in [io]?

That would be better but we need to think twice before adding code.

Gary

> Emmanuel Bourg
Re: [csv] Performance comparison
On 13 March 2012 00:29, Emmanuel Bourg ebo...@apache.org wrote:

> On 13/03/2012 01:25, sebb wrote:
>> I'm concerned that the CSV code may grow and grow with private versions of code that could be provided by the JDK.
>>
>> By all means make sure the code is efficient in the way it uses the JDK classes, but I don't think we should be recoding standard classes.
>
> I agree such a class should not live in [csv], but maybe in [io]?

I don't think we should be trying to recode JDK classes.

> Emmanuel Bourg
Re: [csv] Performance comparison
On Tue, Mar 13, 2012 at 12:29 AM, Emmanuel Bourg ebo...@apache.org wrote:

> On 13/03/2012 01:25, sebb wrote:
>> I'm concerned that the CSV code may grow and grow with private versions of code that could be provided by the JDK.
>>
>> By all means make sure the code is efficient in the way it uses the JDK classes, but I don't think we should be recoding standard classes.
>
> I agree such a class should not live in [csv], but maybe in [io]?

IMO performance should be taken out of the equation by using the Readable interface [1]. That way the users can use whatever implementation suits them (for example using an underlying buffered InputStream) to change/improve performance.

Niall

[1] http://docs.oracle.com/javase/7/docs/api/java/lang/Readable.html

> Emmanuel Bourg
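As a rough illustration of the Readable suggestion above: an input loop written against java.lang.Readable accepts any Reader (buffered or not), so the caller decides which implementation to pay for. The sketch below is an assumption about how such a loop might look, not Commons CSV code; ReadableSketch and countLines are hypothetical names, and counting line feeds merely stands in for real parsing.

```java
import java.io.IOException;
import java.io.StringReader;
import java.nio.CharBuffer;

/** Hypothetical example (not a Commons CSV class): consuming input through java.lang.Readable. */
final class ReadableSketch {

    /** Counts line-feed characters from any Readable; a stand-in for a parser's input loop. */
    static int countLines(Readable source) throws IOException {
        CharBuffer buffer = CharBuffer.allocate(8192);
        int lines = 0;
        while (source.read(buffer) != -1) {   // fills the buffer, returns -1 at end of input
            buffer.flip();                     // switch the buffer from writing to reading
            while (buffer.hasRemaining()) {
                if (buffer.get() == '\n') {
                    lines++;
                }
            }
            buffer.clear();                    // make the whole buffer available again
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        // Any Reader implements Readable, so callers may pass whatever implementation they prefer.
        System.out.println(countLines(new StringReader("a,b\nc,d\n"))); // prints 2
    }
}
```

A byte source such as a buffered InputStream would still need to be wrapped in an InputStreamReader (or another Readable) before being handed to a loop like this.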
Re: [csv] Performance comparison
On 13 March 2012 01:47, Niall Pemberton niall.pember...@gmail.com wrote:

> On Tue, Mar 13, 2012 at 12:29 AM, Emmanuel Bourg ebo...@apache.org wrote:
>> On 13/03/2012 01:25, sebb wrote:
>>> I'm concerned that the CSV code may grow and grow with private versions of code that could be provided by the JDK.
>>>
>>> By all means make sure the code is efficient in the way it uses the JDK classes, but I don't think we should be recoding standard classes.
>>
>> I agree such a class should not live in [csv], but maybe in [io]?
>
> IMO performance should be taken out of the equation by using the Readable interface [1]. That way the users can use whatever implementation suits them (for example using an underlying buffered InputStream) to change/improve performance.

+1, excellent suggestion.

> Niall
>
> [1] http://docs.oracle.com/javase/7/docs/api/java/lang/Readable.html
>
>> Emmanuel Bourg
Re: [csv] Performance comparison
On Mar 12, 2012, at 5:44 PM, sebb wrote:

> On 13 March 2012 00:29, Emmanuel Bourg ebo...@apache.org wrote:
>> On 13/03/2012 01:25, sebb wrote:
>>> I'm concerned that the CSV code may grow and grow with private versions of code that could be provided by the JDK.
>>>
>>> By all means make sure the code is efficient in the way it uses the JDK classes, but I don't think we should be recoding standard classes.
>>
>> I agree such a class should not live in [csv], but maybe in [io]?
>
> I don't think we should be trying to recode JDK classes.

If the implementations suck, why not?

Ralph
Re: [csv] Performance comparison
On 11 March 2012 15:05, Emmanuel Bourg ebo...@apache.org wrote:

> Hi,
>
> I compared the performance of Commons CSV with the other CSV parsers available. I took the world cities file from Maxmind as a test file [1], it's a big file of 130M with 2.8 million records.
>
> Here are the results obtained on a Core 2 Duo E8400 after several iterations to let the JIT compiler kick in:
>
>   Direct read      750 ms
>   Java CSV        3328 ms
>   Super CSV       3562 ms  (+7%)
>   OpenCSV         3609 ms  (+8.4%)
>   GenJava CSV     3844 ms  (+15.5%)
>   Commons CSV     4656 ms  (+39.9%)
>   Skife CSV       4813 ms  (+44.6%)
>
> I also tried Nuiton CSV and Esperio CSV but I couldn't figure out how to use them.
>
> I haven't analyzed why Commons CSV is slower yet, but it seems there is room for improvement. The memory usage will have to be compared too, I'm looking for a way to measure it.

Hey Emmanuel,

I have some spare time to help you with this. I'll check out the latest source tonight. Any suggestion where to start?

Regards,
Benedikt

> Emmanuel Bourg
>
> [1] http://www.maxmind.com/download/worldcities/worldcitiespop.txt.gz
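The percentages in the benchmark table are consistent with each parser's time being compared to Java CSV, the fastest parser measured. A small sketch of that arithmetic, using only the timings quoted in the thread (the class and method names are hypothetical, and the choice of baseline is inferred from the numbers rather than stated by the author):

```java
import java.util.Locale;

/** Hypothetical helper: recomputes the relative overheads from the timings quoted in the thread. */
final class RelativeOverheadSketch {

    /** Percentage by which millis exceeds baselineMillis. */
    static double overheadPercent(long baselineMillis, long millis) {
        return 100.0 * (millis - baselineMillis) / baselineMillis;
    }

    public static void main(String[] args) {
        long javaCsv = 3328; // assumed baseline: the fastest parser in the table
        String[] names = {"Super CSV", "OpenCSV", "GenJava CSV", "Commons CSV", "Skife CSV"};
        long[] times = {3562, 3609, 3844, 4656, 4813};
        for (int i = 0; i < names.length; i++) {
            // Locale.ROOT keeps the decimal point independent of the platform locale
            System.out.printf(Locale.ROOT, "%-12s %5d ms (+%.1f%%)%n",
                    names[i], times[i], overheadPercent(javaCsv, times[i]));
        }
    }
}
```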
Re: [csv] Performance comparison
On 11/03/2012 16:53, Benedikt Ritter wrote:

> I have some spare time to help you with this. I'll check out the latest source tonight. Any suggestion where to start?

Hi Benedikt, thank you for helping. You can start by looking at the source of CSVParser to see if anything catches your eye, and then run a profiler to try and identify the performance-critical parts that could be improved.

Emmanuel Bourg
Re: [csv] Performance comparison
On 11 March 2012 21:21, Emmanuel Bourg ebo...@apache.org wrote:

> On 11/03/2012 16:53, Benedikt Ritter wrote:
>> I have some spare time to help you with this. I'll check out the latest source tonight. Any suggestion where to start?
>
> Hi Benedikt, thank you for helping. You can start by looking at the source of CSVParser to see if anything catches your eye, and then run a profiler to try and identify the performance-critical parts that could be improved.

Hi Emmanuel,

I've started to dig my way through the source. I've not done too much performance measuring in my career yet. I would use VisualVM for profiling, if you don't know anything better.

And how about some performance JUnit tests? They may not be as accurate as a profiler, but they can give you a feeling for whether you are on the right track.

Benedikt

> Emmanuel Bourg
Re: [csv] Performance comparison
On 12/03/2012 00:02, Benedikt Ritter wrote:

> I've started to dig my way through the source. I've not done too much performance measuring in my career yet. I would use VisualVM for profiling, if you don't know anything better.

Usually I work with JProfiler, it identifies the hotspots pretty well, but I'm not sure if it will produce relevant results on the complex methods of CSVLexer.

> And how about some performance JUnit tests? They may not be as accurate as a profiler, but they can give you a feeling for whether you are on the right track.

I wrote a quick test locally, but that's not clean enough to be committed. It looks like this:

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.io.Reader;

    import junit.framework.TestCase;
    import org.apache.commons.csv.CSVFormat;

    public class PerformanceTest extends TestCase {

        // number of passes over the file, so the JIT compiler has time to kick in
        private int max = 10;

        private BufferedReader getReader() throws IOException {
            return new BufferedReader(new FileReader("worldcitiespop.txt"));
        }

        // baseline: read the file line by line without any CSV parsing
        public void testReadBigFile() throws Exception {
            for (int i = 0; i < max; i++) {
                BufferedReader in = getReader();
                long t0 = System.currentTimeMillis();
                int count = readAll(in);
                in.close();
                System.out.println("File read in " + (System.currentTimeMillis() - t0) + "ms " + count + " lines");
            }
            System.out.println();
        }

        private int readAll(BufferedReader in) throws IOException {
            int count = 0;
            while (in.readLine() != null) {
                count++;
            }
            return count;
        }

        // same file, parsed record by record with Commons CSV
        public void testParseBigFile() throws Exception {
            for (int i = 0; i < max; i++) {
                long t0 = System.currentTimeMillis();
                int count = parseCommonsCSV(getReader());
                System.out.println("File parsed in " + (System.currentTimeMillis() - t0) + "ms with Commons CSV " + count + " lines");
            }
            System.out.println();
        }

        // uses the CSVFormat API exactly as it appears in this message
        private int parseCommonsCSV(Reader in) {
            CSVFormat format = CSVFormat.DEFAULT.withSurroundingSpacesIgnored(false);
            int count = 0;
            for (String[] record : format.parse(in)) {
                count++;
            }
            return count;
        }
    }

Emmanuel Bourg