Re: [CSV] Performance

2012-03-16 Thread Ted Dunning
See the Line and FastLine classes
in org.apache.mahout.classifier.sgd.SimpleCsvExamples in the Mahout
Examples module.

You can see an older version of mahout here.  This class hasn't changed in
forever.

https://github.com/tdunning/mahout/blob/debian-package/examples/src/main/java/org/apache/mahout/classifier/sgd/SimpleCsvExamples.java

On Thu, Mar 15, 2012 at 3:11 PM, Emmanuel Bourg ebo...@apache.org wrote:

 Thank you for sharing your experience, Ted. Do you have a link to the code
 of your parser? I'd like to take a look.

 Currently the data flow in Commons CSV is:

 1. Buffer the data in the BufferedReader
 2. Accumulate data in a reusable buffer for the current token
 3. Turn the token buffer into a String

 I was also thinking of something similar to reduce the string copies. The
 token from the CSVLexer could probably contain a CharSequence instead of a
 String. The CharSequence would be backed by the same array for all the
 fields of the record. Thus if a field isn't read by the user we don't pay
 the cost to convert it into a String. But this prevents the reuse of the
 buffer, and that means more work for the GC.

 Emmanuel Bourg


 On 15/03/2012 15:49, Ted Dunning wrote:

 I built a limited CSV package for parsing data in Mahout at one point.  I
 doubt that it was general enough to be helpful here, but the experience
 might be.

 The thing that *really* made a big difference in speed was to avoid copies
 and conversions to String.  To do that, I built a state machine that
 operated on bytes to do the parsing from byte arrays.  The parser passed
 around offsets only.  Then when converting data, I converted directly from
 the original byte array into the target type.  For the most common case (in
 my data) of converting to Integers, this eliminated masses of cons'ing and
 because the conversion was special purpose (I assumed UTF8 encoding and
 assumed that numbers could only use ASCII range digits), the conversion to
 integers was particularly fast.

 Overall, this made about a 20x difference in speed.  This is not 20%; the
 final time was 5% of the original.





Re: [CSV] Performance

2012-03-16 Thread Ted Dunning
On Thu, Mar 15, 2012 at 3:11 PM, Emmanuel Bourg ebo...@apache.org wrote:

 ...
 1. Buffer the data in the BufferedReader
 2. Accumulate data in a reusable buffer for the current token


Reusable buffers are usually death in terms of subtle bugs and they rarely
actually help that much.  The key is to avoid copying.  Cons'ing up and
then collecting pointerized structures isn't that expensive.

3. Turn the token buffer into a String


Also EVIL.  Mustn't convert to string unless that is really what you want.


 I was also thinking of something similar to reduce the string copies. The
 token from the CSVLexer could probably contain a CharSequence instead of a
 String. The CharSequence would be backed by the same array for all the
 fields of the record. Thus if a field isn't read by the user we don't pay
 the cost to convert it into a String. But this prevents the reuse of the
 buffer, and that means more work for the GC.


Just moving around char costs twice as much as moving around bytes for most
CSV data.  I would avoid that if possible.

I wouldn't worry about the GC.  The experience in Hadoop and Lucene is that
the effort made to avoid allocating lightweight structures was very
misguided.  My own experiments have never shown a big benefit unless you
conflate cons'ing the structures with copying lots of data.  If you avoid
the copy, the construction and collection of ephemeral structures turns out
to be very nearly free.


 Emmanuel Bourg


 On 15/03/2012 15:49, Ted Dunning wrote:

 I built a limited CSV package for parsing data in Mahout at one point.  I
 doubt that it was general enough to be helpful here, but the experience
 might be.

 The thing that *really* made a big difference in speed was to avoid copies
 and conversions to String.  To do that, I built a state machine that
 operated on bytes to do the parsing from byte arrays.  The parser passed
 around offsets only.  Then when converting data, I converted directly from
 the original byte array into the target type.  For the most common case (in
 my data) of converting to Integers, this eliminated masses of cons'ing and
 because the conversion was special purpose (I assumed UTF8 encoding and
 assumed that numbers could only use ASCII range digits), the conversion to
 integers was particularly fast.

 Overall, this made about a 20x difference in speed.  This is not 20%; the
 final time was 5% of the original.





Re: [CSV] Performance

2012-03-15 Thread Emmanuel Bourg

On 15/03/2012 13:34, sebb wrote:

In my testing, using final class variables for delimiter, escape etc
(set in ctor) shaves about 1 sec off the time to read the world town
data file compared with accessing these fields inline through the
format field.

Average time goes from c. 25.5 to c. 24.5 which is a 4% improvement.

I suspect this is partly because the fetches are currently in loops
rather than any getter overhead.


Did you run with the client or the server VM? My tests were all with the 
server VM.



Emmanuel Bourg





Re: [CSV] Performance

2012-03-15 Thread Benedikt Ritter
On 15 March 2012 13:50, Gary Gregory garydgreg...@gmail.com wrote:
 Can you put your perf test code and resources in SVN so I do not have to
 write one, please?


Hi Gary,

have a look at http://markmail.org/message/x73i3hl63rjqdyfa (I agree
with you, that having a clean performance test in SVN would be better)

Regards,
Benedikt

 Gary

 On Thu, Mar 15, 2012 at 8:34 AM, sebb seb...@gmail.com wrote:

 In my testing, using final class variables for delimiter, escape etc
 (set in ctor) shaves about 1 sec off the time to read the world town
 data file compared with accessing these fields inline through the
 format field.

 Average time goes from c. 25.5 to c. 24.5 which is a 4% improvement.

 I suspect this is partly because the fetches are currently in loops
 rather than any getter overhead.

 -
 To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
 For additional commands, e-mail: dev-h...@commons.apache.org




 --
 E-Mail: garydgreg...@gmail.com | ggreg...@apache.org
 JUnit in Action, 2nd Ed: http://bit.ly/ECvg0
 Spring Batch in Action: http://s.apache.org/HOq http://bit.ly/bqpbCK
 Blog: http://garygregory.wordpress.com
 Home: http://garygregory.com/
 Tweet! http://twitter.com/GaryGregory




Re: [CSV] Performance

2012-03-15 Thread sebb
On 15 March 2012 12:43, Emmanuel Bourg ebo...@apache.org wrote:
 On 15/03/2012 13:34, sebb wrote:

 In my testing, using final class variables for delimiter, escape etc
 (set in ctor) shaves about 1 sec off the time to read the world town
 data file compared with accessing these fields inline through the
 format field.

 Average time goes from c. 25.5 to c. 24.5 which is a 4% improvement.

 I suspect this is partly because the fetches are currently in loops
 rather than any getter overhead.


 Did you run with the client or the server VM? My tests were all with the
 server VM.

Eclipse, so probably client VM?


 Emmanuel Bourg





Re: [CSV] Performance

2012-03-15 Thread Emmanuel Bourg

On 15/03/2012 14:13, sebb wrote:


Eclipse, so probably client VM?


Probably. You can print the java.vm.name system property at the 
beginning of the test, that will tell you the VM used.
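
A minimal sketch of that check, for illustration only. The class and method names (VmInfo, describeVm) are made up; the two system properties are standard ones.

```java
/** Hypothetical helper: reports which VM the test is running on. */
final class VmInfo {
    private VmInfo() {}

    /** Returns e.g. the HotSpot client or server VM name followed by its version. */
    static String describeVm() {
        // Both keys are standard JVM system properties.
        return System.getProperty("java.vm.name") + " " + System.getProperty("java.vm.version");
    }
}
```

Calling System.out.println(VmInfo.describeVm()) as the first line of the test would show whether the client or the server VM is in use.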


Emmanuel Bourg





Re: [CSV] Performance

2012-03-15 Thread Ted Dunning
I built a limited CSV package for parsing data in Mahout at one point.  I
doubt that it was general enough to be helpful here, but the experience
might be.

The thing that *really* made a big difference in speed was to avoid copies
and conversions to String.  To do that, I built a state machine that
operated on bytes to do the parsing from byte arrays.  The parser passed
around offsets only.  Then when converting data, I converted directly from
the original byte array into the target type.  For the most common case (in
my data) of converting to Integers, this eliminated masses of cons'ing and
because the conversion was special purpose (I assumed UTF8 encoding and
assumed that numbers could only use ASCII range digits), the conversion to
integers was particularly fast.

Overall, this made about a 20x difference in speed.  This is not 20%; the
final time was 5% of the original.
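
For illustration only, a minimal sketch of the kind of byte-level conversion described above. This is not the Mahout code: the names (ByteFields, parseInt, sumColumn) are made up, and it assumes unquoted ASCII fields, ',' as the delimiter, '\n' as the line terminator and no integer overflow.

```java
/** Hypothetical helper: converts fields straight from a byte array, passing offsets only. */
final class ByteFields {
    private ByteFields() {}

    /** Parses a decimal int from data[start, end) without creating a String. */
    static int parseInt(byte[] data, int start, int end) {
        int i = start;
        boolean negative = i < end && data[i] == '-';
        if (negative) {
            i++;
        }
        if (i >= end) {
            throw new NumberFormatException("no digits in field");
        }
        int value = 0;
        for (; i < end; i++) {
            int digit = data[i] - '0'; // ASCII digits only, as assumed above
            if (digit < 0 || digit > 9) {
                throw new NumberFormatException("not a digit at offset " + i);
            }
            value = value * 10 + digit;
        }
        return negative ? -value : value;
    }

    /** Sums the integers found in one column; empty fields are skipped. */
    static long sumColumn(byte[] data, int column) {
        long sum = 0;
        int fieldStart = 0;
        int fieldIndex = 0;
        for (int pos = 0; pos <= data.length; pos++) {
            // Treat the end of the input like a final line terminator.
            byte b = (pos == data.length) ? (byte) '\n' : data[pos];
            if (b == ',' || b == '\n') {
                if (fieldIndex == column && pos > fieldStart) {
                    sum += parseInt(data, fieldStart, pos);
                }
                fieldStart = pos + 1;
                fieldIndex = (b == '\n') ? 0 : fieldIndex + 1;
            }
        }
        return sum;
    }
}
```

The point of the sketch is that no String and no per-field copy is ever created; only the (start, end) offsets into the original array move around.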

On Thu, Mar 15, 2012 at 8:34 AM, sebb seb...@gmail.com wrote:

 In my testing, using final class variables for delimiter, escape etc
 (set in ctor) shaves about 1 sec off the time to read the world town
 data file compared with accessing these fields inline through the
 format field.

 Average time goes from c. 25.5 to c. 24.5 which is a 4% improvement.

 I suspect this is partly because the fetches are currently in loops
 rather than any getter overhead.





Re: [CSV] Performance

2012-03-15 Thread Emmanuel Bourg
Thank you for sharing your experience, Ted. Do you have a link to the 
code of your parser? I'd like to take a look.


Currently the data flow in Commons CSV is:

1. Buffer the data in the BufferedReader
2. Accumulate data in a reusable buffer for the current token
3. Turn the token buffer into a String

I was also thinking of something similar to reduce the string copies. 
The token from the CSVLexer could probably contain a CharSequence 
instead of a String. The CharSequence would be backed by the same array 
for all the fields of the record. Thus if a field isn't read by the user 
we don't pay the cost to convert it into a String. But this prevents the 
reuse of the buffer, and that means more work for the GC.
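
A rough sketch of what such a field view could look like, under the assumption that all fields of a record live in one char array. FieldView is a made-up name, not an existing Commons CSV class.

```java
/** Hypothetical field view: a CharSequence over a slice of a shared record buffer. */
final class FieldView implements CharSequence {
    private final char[] buffer; // shared by all fields of the record, never copied here
    private final int offset;
    private final int length;

    FieldView(char[] buffer, int offset, int length) {
        if (offset < 0 || length < 0 || offset + length > buffer.length) {
            throw new IndexOutOfBoundsException("offset " + offset + ", length " + length);
        }
        this.buffer = buffer;
        this.offset = offset;
        this.length = length;
    }

    @Override
    public int length() {
        return length;
    }

    @Override
    public char charAt(int index) {
        if (index < 0 || index >= length) {
            throw new IndexOutOfBoundsException("index " + index);
        }
        return buffer[offset + index];
    }

    @Override
    public CharSequence subSequence(int start, int end) {
        if (start < 0 || end > length || start > end) {
            throw new IndexOutOfBoundsException("start " + start + ", end " + end);
        }
        return new FieldView(buffer, offset + start, end - start);
    }

    /** The only place where characters are copied into a String. */
    @Override
    public String toString() {
        return new String(buffer, offset, length);
    }
}
```

As noted above, the array cannot be refilled while any view handed to the user is still in use, so each record would need its own array.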


Emmanuel Bourg


On 15/03/2012 15:49, Ted Dunning wrote:

I built a limited CSV package for parsing data in Mahout at one point.  I
doubt that it was general enough to be helpful here, but the experience
might be.

The thing that *really* made a big difference in speed was to avoid copies
and conversions to String.  To do that, I built a state machine that
operated on bytes to do the parsing from byte arrays.  The parser passed
around offsets only.  Then when converting data, I converted directly from
the original byte array into the target type.  For the most common case (in
my data) of converting to Integers, this eliminated masses of cons'ing and
because the conversion was special purpose (I assumed UTF8 encoding and
assumed that numbers could only use ASCII range digits), the conversion to
integers was particularly fast.

Overall, this made about a 20x difference in speed.  This is not 20%; the
final time was 5% of the original.






Re: [CSV] Performance

2012-03-15 Thread sebb
On 15 March 2012 13:17, Emmanuel Bourg ebo...@apache.org wrote:
 On 15/03/2012 14:13, sebb wrote:


 Eclipse, so probably client VM?


 Probably. You can print the java.vm.name system property at the beginning of
 the test, that will tell you the VM used.

It was client.

I've now tried with server, and it is slower when using class fields
created by the ctor.
Curious - perhaps it is a data localisation issue, though I would have
thought the optimiser could have fetched those into local variables.

So I then tried hauling the format method calls out of the loops into
final local variables.
This improves performance (slightly) in both client and server mode.
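
To make the two variants concrete, a small sketch. SimpleFormat and DelimiterCounter are made-up stand-ins, not the real format or lexer classes, and whether the hoisted form is actually faster depends on the VM, as the numbers in this thread show.

```java
/** Hypothetical stand-in for the format object; not the real Commons CSV API. */
final class SimpleFormat {
    private final char delimiter;

    SimpleFormat(char delimiter) {
        this.delimiter = delimiter;
    }

    char getDelimiter() {
        return delimiter;
    }
}

/** Counts delimiters in two ways that differ only in where the getter is called. */
final class DelimiterCounter {
    private DelimiterCounter() {}

    /** Calls the format getter on every loop iteration. */
    static int countInline(SimpleFormat format, char[] data) {
        int count = 0;
        for (int i = 0; i < data.length; i++) {
            if (data[i] == format.getDelimiter()) {
                count++;
            }
        }
        return count;
    }

    /** Hoists the getter call into a final local variable before the loop. */
    static int countHoisted(SimpleFormat format, char[] data) {
        final char delimiter = format.getDelimiter();
        int count = 0;
        for (int i = 0; i < data.length; i++) {
            if (data[i] == delimiter) {
                count++;
            }
        }
        return count;
    }
}
```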

 Emmanuel Bourg





Re: [CSV] Performance

2012-03-15 Thread Emmanuel Bourg

On 15/03/2012 16:45, sebb wrote:


So I then tried hauling the format method calls out of the loops into
final local variables.
This improves performance (slightly) in both client and server mode.


Could you show some code please? I'm unable to reproduce this. I used 
local variables in simpleTokenLexer and the performance degrades by 14% 
(HotSpot server VM, JDK 6u31, Core 2 Duo).


Emmanuel Bourg





Re: [CSV] Performance

2012-03-15 Thread sebb
On 15 March 2012 16:10, Emmanuel Bourg ebo...@apache.org wrote:
 On 15/03/2012 16:45, sebb wrote:


 So I then tried hauling the format method calls out of the loops into
 final local variables.
 This improves performance (slightly) in both client and server mode.


 Could you show some code please?

Ditto.

 I'm unable to reproduce this. I used local
 variables in simpleTokenLexer and the performance degrades by 14% (HotSpot
 server VM, JDK 6u31, Core 2 Duo).

I also used local vars in the other methods.

See http://people.apache.org/~sebb/CSV/

 Emmanuel Bourg





Re: [CSV] Performance

2012-03-15 Thread Emmanuel Bourg

On 15/03/2012 17:42, sebb wrote:


I also used local vars in the other methods.

See http://people.apache.org/~sebb/CSV/


Thank you. You also reordered the if in simpleTokenLexer, that may 
explain the difference. I'll give it a try.


Emmanuel Bourg






Re: [CSV] Performance

2012-03-15 Thread sebb
On 15 March 2012 16:48, Emmanuel Bourg ebo...@apache.org wrote:
 On 15/03/2012 17:42, sebb wrote:


 I also used local vars in the other methods.

 See http://people.apache.org/~sebb/CSV/


 Thank you. You also reordered the if in simpleTokenLexer, that may explain
 the difference. I'll give it a try.

I'd forgotten about that; I thought I'd reverted that.

If I revert it, I still get a better time with Lexer2, though not
quite as good an improvement.

 Emmanuel Bourg






Re: [CSV] Performance

2012-03-15 Thread Emmanuel Bourg

On 15/03/2012 18:06, sebb wrote:


If I revert it, I still get a better time with Lexer2, though not
quite as good an improvement.


I ran my perf test, Lexer2 is slower on my system :( The order of the if 
doesn't change much here.


Emmanuel Bourg





Re: [csv] Performance comparison

2012-03-14 Thread Emmanuel Bourg
After more experiments I'm less enthusiastic about providing an 
optimized BufferedReader. The result of the performance test is 
significantly different if the test is run alone or after all the other 
unit tests (about 30% slower). When all the tests are executed, the 
removal of the synchronized blocks in BufferedReader has no visible 
effect (maybe less than 1%), and the Harmony implementation becomes slower.


Emmanuel Bourg


On 13/03/2012 10:20, Emmanuel Bourg wrote:

On 13/03/2012 02:47, Niall Pemberton wrote:


IMO performance should be taken out of the equation by using the
Readable interface[1]. That way the users can use whatever
implementation suits them (for example using an underlying buffered
InputStream) to change/improve performance.


If you mean that the performance of BufferedReader should be taken out of
the equation then I agree. All CSV parsers should be compared with the
same input source, otherwise the comparison isn't fair.

Using Readable would be really nice, but that's very low level. We would
have to build line reading and mark/reset on top of that, that's almost
equivalent to reimplementing BufferedReader.

If [io] could provide a BufferedReader implementation that:
- takes a Readable in the constructor
- does not synchronize reads
- recognizes unicode line separators (and the classic ones)

then I buy it right away!

Emmanuel Bourg








Re: [csv] Performance comparison

2012-03-14 Thread Christian Grobmeier
On Tue, Mar 13, 2012 at 4:33 AM, Ralph Goers ralph.go...@dslextreme.com wrote:
 I don't think we should be trying to recode JDK classes.

 If the implementations suck, why not?

+1


-- 
http://www.grobmeier.de
https://www.timeandbill.de




Re: [csv] Performance comparison

2012-03-13 Thread Emmanuel Bourg

On 13/03/2012 01:44, sebb wrote:


I don't think we should be trying to recode JDK classes.


I'd rather not, but we have done that in the past. FastDateFormat and 
StrBuilder come to mind.


Emmanuel Bourg





Re: [csv] Performance comparison

2012-03-13 Thread sebb
On 13 March 2012 09:01, Emmanuel Bourg ebo...@apache.org wrote:
 On 13/03/2012 01:44, sebb wrote:


 I don't think we should be trying to recode JDK classes.


 I'd rather not, but we have done that in the past. FastDateFormat and
 StrBuilder come to mind.

And now Java has StringBuilder, which means StrBuilder is perhaps no
longer necessary...

 Emmanuel Bourg





Re: [csv] Performance comparison

2012-03-13 Thread Emmanuel Bourg

On 13/03/2012 02:47, Niall Pemberton wrote:


IMO performance should be taken out of the equation by using the
Readable interface[1]. That way the users can use whatever
implementation suits them (for example using an underlying buffered
InputStream) to change/improve performance.


If you mean that the performance of BufferedReader should be taken out of 
the equation then I agree. All CSV parsers should be compared with the 
same input source, otherwise the comparison isn't fair.


Using Readable would be really nice, but that's very low level. We would 
have to build line reading and mark/reset on top of that, that's almost 
equivalent to reimplementing BufferedReader.


If [io] could provide a BufferedReader implementation that:
- takes a Readable in the constructor
- does not synchronize reads
- recognizes unicode line separators (and the classic ones)

then I buy it right away!
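
Purely as a sketch of the first two points on that wish list (Readable in the constructor, no synchronization), assuming a one-character peek is enough look-ahead. ReadableSource is a made-up name, not an existing [io] class, and line-separator handling is left out entirely.

```java
import java.io.IOException;
import java.nio.CharBuffer;

/** Hypothetical unsynchronized, buffered character source on top of a Readable. */
final class ReadableSource {
    private final Readable in;
    private final CharBuffer buffer = CharBuffer.allocate(8192);

    ReadableSource(Readable in) {
        this.in = in;
        buffer.flip(); // start with an empty buffer in "read" mode
    }

    /** Returns the next character and consumes it, or -1 at end of input. */
    int read() throws IOException {
        if (!buffer.hasRemaining() && !fill()) {
            return -1;
        }
        return buffer.get();
    }

    /** Returns the next character without consuming it, or -1 at end of input. */
    int peek() throws IOException {
        if (!buffer.hasRemaining() && !fill()) {
            return -1;
        }
        return buffer.get(buffer.position());
    }

    /** Refills the buffer from the Readable; returns false at end of input. */
    private boolean fill() throws IOException {
        buffer.clear();
        int n;
        do {
            n = in.read(buffer);
        } while (n == 0);
        buffer.flip();
        return n > 0;
    }
}
```

Any Reader is a Readable, so new ReadableSource(new FileReader(...)) would work with this sketch as well as with a user-supplied implementation.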

Emmanuel Bourg





Re: [csv] Performance comparison

2012-03-12 Thread Emmanuel Bourg
I have identified the performance killer, it's the 
ExtendedBufferedReader. It implements a complex logic to fetch one 
character ahead, but this extra character is rarely used. I have 
implemented a simpler look ahead using mark/reset as suggested by Bob 
Smith in CSV-42 and the performance improved by 30%.
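
For readers who have not seen the idea, a minimal sketch of a one-character look-ahead built on mark/reset. LookAheadReader is a made-up name and this is not the code committed for CSV-42; it only uses the standard BufferedReader methods.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;

/** Hypothetical reader with a one-character look-ahead based on mark/reset. */
final class LookAheadReader extends BufferedReader {

    LookAheadReader(Reader in) {
        super(in);
    }

    /** Returns the next character without consuming it, or -1 at end of stream. */
    int lookAhead() throws IOException {
        mark(1);        // remember the current position, allow reading 1 char ahead
        int c = read(); // peek at the next character
        reset();        // go back to the remembered position
        return c;
    }
}
```

The extra work is only done when a look-ahead is actually requested, which fits the observation that the extra character is rarely needed.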


Now the parsing is down to 3406 ms, and that's almost without touching 
the parser yet.


Emmanuel Bourg


On 11/03/2012 15:05, Emmanuel Bourg wrote:

Hi,

I compared the performance of Commons CSV with the other CSV parsers
available. I took the world cities file from Maxmind as a test file [1],
it's a big file of 130M with 2.8 million records.

Here are the results obtained on a Core 2 Duo E8400 after several
iterations to let the JIT compiler kick in:

Direct read 750 ms
Java CSV 3328 ms
Super CSV 3562 ms (+7%)
OpenCSV 3609 ms (+8.4%)
GenJava CSV 3844 ms (+15.5%)
Commons CSV 4656 ms (+39.9%)
Skife CSV 4813 ms (+44.6%)
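
The actual test code is not shown in this message. As an assumption about its general shape, a timing loop with untimed warm-up iterations could look like the sketch below; WarmBench and bestMillis are made-up names.

```java
/** Hypothetical timing helper: warm up first, then keep the best timed run. */
final class WarmBench {
    private WarmBench() {}

    /**
     * Runs the task `warmups` times without timing it, so the JIT compiler
     * can kick in, then returns the fastest of `runs` timed runs in milliseconds.
     */
    static long bestMillis(Runnable task, int warmups, int runs) {
        for (int i = 0; i < warmups; i++) {
            task.run();
        }
        long best = Long.MAX_VALUE;
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            task.run();
            long elapsed = (System.nanoTime() - start) / 1_000_000L;
            best = Math.min(best, elapsed);
        }
        return best;
    }
}
```

Each parser would be wrapped in a Runnable that reads the same input file, so all of them are measured against the same source.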

I also tried Nuiton CSV and Esperio CSV but I couldn't figure out how to use
them.

I haven't analyzed why Commons CSV is slower yet, but it seems there is
room for improvements. The memory usage will have to be compared too,
I'm looking for a way to measure it.


Emmanuel Bourg

[1] http://www.maxmind.com/download/worldcities/worldcitiespop.txt.gz








Re: [csv] Performance comparison

2012-03-12 Thread sebb
On 12 March 2012 10:31, Emmanuel Bourg ebo...@apache.org wrote:
 I have identified the performance killer, it's the ExtendedBufferedReader.
 It implements a complex logic to fetch one character ahead, but this extra
 character is rarely used. I have implemented a simpler look ahead using
 mark/reset as suggested by Bob Smith in CSV-42 and the performance improved
 by 30%.

Java has a PushbackReader class - could that not be used?

 Now the parsing is down to 3406 ms, and that's almost without touching the
 parser yet.

 Emmanuel Bourg


 On 11/03/2012 15:05, Emmanuel Bourg wrote:

 Hi,

 I compared the performance of Commons CSV with the other CSV parsers
 available. I took the world cities file from Maxmind as a test file [1],
 it's a big file of 130M with 2.8 million records.

 Here are the results obtained on a Core 2 Duo E8400 after several
 iterations to let the JIT compiler kick in:

 Direct read 750 ms
 Java CSV 3328 ms
 Super CSV 3562 ms (+7%)
 OpenCSV 3609 ms (+8.4%)
 GenJava CSV 3844 ms (+15.5%)
 Commons CSV 4656 ms (+39.9%)
 Skife CSV 4813 ms (+44.6%)

 I also tried Nuiton CSV and Esperio CSV but I couldn't figure out how to use
 them.

 I haven't analyzed why Commons CSV is slower yet, but it seems there is
 room for improvements. The memory usage will have to be compared too,
 I'm looking for a way to measure it.


 Emmanuel Bourg

 [1] http://www.maxmind.com/download/worldcities/worldcitiespop.txt.gz







Re: [csv] Performance comparison

2012-03-12 Thread Emmanuel Bourg

On 12/03/2012 16:44, sebb wrote:


Java has a PushbackReader class - could that not be used?


I considered it, but it doesn't mix well with line reading. The 
mark/reset solution is really simple and efficient.


Emmanuel Bourg





Re: [csv] Performance comparison

2012-03-12 Thread Benedikt Ritter
On 12 March 2012 11:31, Emmanuel Bourg ebo...@apache.org wrote:
 I have identified the performance killer, it's the ExtendedBufferedReader.
 It implements a complex logic to fetch one character ahead, but this extra
 character is rarely used. I have implemented a simpler look ahead using
 mark/reset as suggested by Bob Smith in CSV-42 and the performance improved
 by 30%.

 Now the parsing is down to 3406 ms, and that's almost without touching the
 parser yet.


Great work, Emmanuel!

Looking at my profiler, I can say that 70% of the time is spent in
ExtendedBufferedReader.read(). This is no wonder, since read() is the
method that does the actual work. However, we should try to minimize
calls to read(). For example, isEndOfLine() calls read() two times,
and isEndOfLine() gets called 5 times by CSVLexer.nextToken() and
its submethods.
The whole logic behind CSVLexer.nextToken() is very hard to read
(IMHO). Maybe some refactoring would help to make it easier to
identify bottlenecks?

Benedikt

 Emmanuel Bourg


 On 11/03/2012 15:05, Emmanuel Bourg wrote:

 Hi,

 I compared the performance of Commons CSV with the other CSV parsers
 available. I took the world cities file from Maxmind as a test file [1],
 it's a big file of 130M with 2.8 million records.

 Here are the results obtained on a Core 2 Duo E8400 after several
 iterations to let the JIT compiler kick in:

 Direct read 750 ms
 Java CSV 3328 ms
 Super CSV 3562 ms (+7%)
 OpenCSV 3609 ms (+8.4%)
 GenJava CSV 3844 ms (+15.5%)
 Commons CSV 4656 ms (+39.9%)
 Skife CSV 4813 ms (+44.6%)

 I also tried Nuiton CSV and Esperio CSV but I couldn't figure out how to use
 them.

 I haven't analyzed why Commons CSV is slower yet, but it seems there is
 room for improvements. The memory usage will have to be compared too,
 I'm looking for a way to measure it.


 Emmanuel Bourg

 [1] http://www.maxmind.com/download/worldcities/worldcitiespop.txt.gz







Re: [csv] Performance comparison

2012-03-12 Thread Emmanuel Bourg

On 12/03/2012 17:03, Benedikt Ritter wrote:


The whole logic behind CSVLexer.nextToken() is very hard to read
(IMHO). Maybe some refactoring would help to make it easier to
identify bottlenecks?


Yes I started investigating in this direction. I filed a few bugs 
regarding the behavior of the escaping that aim at clarifying the parser.


I think the nextToken() method should be broken into smaller methods to 
help the JIT compiler.


The JIT does some surprising things, I found that even unused code 
branches can have an impact on the performance. For example if 
simpleTokenLexer() is changed to not support escaped characters, the 
performance improves by 10% (the input has no escaped character). And 
that's not merely because an if statement was removed. If I add a 
System.out.println() in this if block that is never called, the 
performance improves as well.


So any change to the parser will have to be carefully tested. Innocent 
changes can have a significant impact.



Emmanuel Bourg





Re: [csv] Performance comparison

2012-03-12 Thread James Carman
Would one of the parser libraries not work here?
On Mar 12, 2012 12:22 PM, Emmanuel Bourg ebo...@apache.org wrote:

 On 12/03/2012 17:03, Benedikt Ritter wrote:

  The whole logic behind CSVLexer.nextToken() is very hard to read
 (IMHO). Maybe some refactoring would help to make it easier to
 identify bottlenecks?


 Yes I started investigating in this direction. I filed a few bugs
 regarding the behavior of the escaping that aim at clarifying the parser.

 I think the nextToken() method should be broken into smaller methods to
 help the JIT compiler.

 The JIT does some surprising things, I found that even unused code
 branches can have an impact on the performance. For example if
 simpleTokenLexer() is changed to not support escaped characters, the
 performance improves by 10% (the input has no escaped character). And
 that's not merely because an if statement was removed. If I add a
 System.out.println() in this if block that is never called, the performance
 improves as well.

 So any change to the parser will have to be carefully tested. Innocent
 changes can have a significant impact.


 Emmanuel Bourg




Re: [csv] Performance comparison

2012-03-12 Thread Benedikt Ritter
On 12 March 2012 17:22, Emmanuel Bourg ebo...@apache.org wrote:
 On 12/03/2012 17:03, Benedikt Ritter wrote:


 The whole logic behind CSVLexer.nextToken() is very hard to read
 (IMHO). Maybe some refactoring would help to make it easier to
 identify bottlenecks?


 Yes I started investigating in this direction. I filed a few bugs regarding
 the behavior of the escaping that aim at clarifying the parser.

 I think the nextToken() method should be broken into smaller methods to help
 the JIT compiler.


I would start by eliminating the Token parameter. You could either
create a new token on each method call and return that one instead of
reusing the one that gets passed in, or you could use a private field
currentToken in CSVLexer. But I think that object creation costs for a
data object like Token can be considered irrelevant (so creating one
in each method call will not hurt us).
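
A sketch of the first option, returning a fresh token from every call. Token and TinyLexer here are simplified, made-up stand-ins for the real classes; the sketch ignores quoting and escaping and only splits on ',' and '\n'.

```java
/** Hypothetical, simplified token: an immutable data object. */
final class Token {
    enum Type { TOKEN, EORECORD, EOF }

    final Type type;
    final String content;

    Token(Type type, String content) {
        this.type = type;
        this.content = content;
    }
}

/** Hypothetical lexer that creates a new Token per call instead of filling a parameter. */
final class TinyLexer {
    private final String input;
    private int pos;

    TinyLexer(String input) {
        this.input = input;
    }

    Token nextToken() {
        if (pos >= input.length()) {
            return new Token(Token.Type.EOF, "");
        }
        int start = pos;
        while (pos < input.length() && input.charAt(pos) != ',' && input.charAt(pos) != '\n') {
            pos++;
        }
        String content = input.substring(start, pos);
        // A field ended by a newline or by the end of the input closes the record.
        Token.Type type = Token.Type.EORECORD;
        if (pos < input.length()) {
            if (input.charAt(pos) == ',') {
                type = Token.Type.TOKEN;
            }
            pos++; // consume the delimiter or the newline
        }
        return new Token(type, content);
    }
}
```

With no shared mutable token, the caller never has to reset one between calls, which is the simplification being argued for.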

 The JIT does some surprising things, I found that even unused code branches
 can have an impact on the performance. For example if simpleTokenLexer() is
 changed to not support escaped characters, the performance improves by 10%
 (the input has no escaped character). And that's not merely because an if
 statement was removed. If I add a System.out.println() in this if block that
 is never called, the performance improves as well.

 So any change to the parser will have to be carefully tested. Innocent
 changes can have a significant impact.


 Emmanuel Bourg





Re: [csv] Performance comparison

2012-03-12 Thread Emmanuel Bourg

On 12/03/2012 17:28, James Carman wrote:

Would one of the parser libraries not work here?


Are you thinking of something like JavaCC or ANTLR? Not sure it'll be more 
efficient than a handcrafted parser. The CSV format is simple enough to 
do it manually.


Emmanuel Bourg





Re: [csv] Performance comparison

2012-03-12 Thread Christian Grobmeier
On Mon, Mar 12, 2012 at 5:41 PM, Emmanuel Bourg ebo...@apache.org wrote:
 On 12/03/2012 17:28, James Carman wrote:

 Would one of the parser libraries not work here?


 Are you thinking of something like JavaCC or ANTLR? Not sure it'll be more
 efficient than a handcrafted parser. The CSV format is simple enough to do
 it manually.

+1

I did the same for my json lib... javacc et al are pretty complex. I
still struggle to understand everything around ognl...
if not necessary, my preference is always to leave such tools out.


 Emmanuel Bourg




-- 
http://www.grobmeier.de
https://www.timeandbill.de




Re: [csv] Performance comparison

2012-03-12 Thread James Carman
Yes this is what I mean.  It might be worth a shot.  Folks who specialize
in parsing have spent much time on these libraries.  It would make sense
that they are quite fast.  It gets us out of the parsing business.
On Mar 12, 2012 12:41 PM, Emmanuel Bourg ebo...@apache.org wrote:

 On 12/03/2012 17:28, James Carman wrote:

 Would one of the parser libraries not work here?


 Are you thinking of something like JavaCC or ANTLR? Not sure it'll be more
 efficient than a handcrafted parser. The CSV format is simple enough to do
 it manually.

 Emmanuel Bourg




Re: [csv] Performance comparison

2012-03-12 Thread Emmanuel Bourg

I kept tickling ExtendedBufferedReader and I have some interesting results.

First I tried to simplify it by extending java.io.LineNumberReader 
instead of BufferedReader. The performance decreased by 20%, probably 
because the class is synchronized internally.


But wait, isn't BufferedReader also synchronized? I copied the code of 
BufferedReader and removed the synchronized blocks. Now the time to 
parse the file is down to 2652 ms, 28% faster than previously!


Of course the code of BufferedReader can't be copied from the JDK due to 
the license mismatch, so I took the version from Harmony. On my test it 
is about 4% faster than the JDK counterpart, and the parsing time is now 
around 2553 ms.


Now Commons CSV can start claiming to be the fastest CSV parser around :)

Emmanuel Bourg


On 12/03/2012 11:31, Emmanuel Bourg wrote:

I have identified the performance killer, it's the
ExtendedBufferedReader. It implements a complex logic to fetch one
character ahead, but this extra character is rarely used. I have
implemented a simpler look ahead using mark/reset as suggested by Bob
Smith in CSV-42 and the performance improved by 30%.

Now the parsing is down to 3406 ms, and that's almost without touching
the parser yet.

Emmanuel Bourg


On 11/03/2012 15:05, Emmanuel Bourg wrote:

Hi,

I compared the performance of Commons CSV with the other CSV parsers
available. I took the world cities file from Maxmind as a test file [1],
it's a big file of 130M with 2.8 million records.

Here are the results obtained on a Core 2 Duo E8400 after several
iterations to let the JIT compiler kick in:

Direct read 750 ms
Java CSV 3328 ms
Super CSV 3562 ms (+7%)
OpenCSV 3609 ms (+8.4%)
GenJava CSV 3844 ms (+15.5%)
Commons CSV 4656 ms (+39.9%)
Skife CSV 4813 ms (+44.6%)

I also tried Nuiton CSV and Esperio CSV but I couldn't figure out how to use
them.

I haven't analyzed why Commons CSV is slower yet, but it seems there is
room for improvements. The memory usage will have to be compared too,
I'm looking for a way to measure it.


Emmanuel Bourg

[1] http://www.maxmind.com/download/worldcities/worldcitiespop.txt.gz











Re: [csv] Performance comparison

2012-03-12 Thread sebb
On 13 March 2012 00:12, Emmanuel Bourg ebo...@apache.org wrote:
 I kept tickling ExtendedBufferedReader and I have some interesting results.

 First I tried to simplify it by extending java.io.LineNumberReader instead
 of BufferedReader. The performance decreased by 20%, probably because the
 class is synchronized internally.

 But wait, isn't BufferedReader also synchronized? I copied the code of
 BufferedReader and removed the synchronized blocks. Now the time to parse
 the file is down to 2652 ms, 28% faster than previously!

 Of course the code of BufferedReader can't be copied from the JDK due to the
 license mismatch, so I took the version from Harmony. On my test it is about
 4% faster than the JDK counterpart, and the parsing time is now around 2553
 ms.

I'm concerned that the CSV code may grow and grow with private
versions of code that could be provided by the JDK.

By all means make sure the code is efficient in the way it uses the
JDK classes, but I don't think we should be recoding standard classes.

 Now Commons CSV can start claiming being the fastest CSV parser around :)

 Emmanuel Bourg


 On 12/03/2012 11:31, Emmanuel Bourg wrote:

 I have identified the performance killer, it's the
 ExtendedBufferedReader. It implements a complex logic to fetch one
 character ahead, but this extra character is rarely used. I have
 implemented a simpler look ahead using mark/reset as suggested by Bob
 Smith in CSV-42 and the performance improved by 30%.

 Now the parsing is down to 3406 ms, and that's almost without touching
 the parser yet.

 Emmanuel Bourg


 On 11/03/2012 15:05, Emmanuel Bourg wrote:

 Hi,

 I compared the performance of Commons CSV with the other CSV parsers
 available. I took the world cities file from Maxmind as a test file [1],
 it's a big file of 130M with 2.8 million records.

 Here are the results obtained on a Core 2 Duo E8400 after several
 iterations to let the JIT compiler kick in:

 Direct read 750 ms
 Java CSV 3328 ms
 Super CSV 3562 ms (+7%)
 OpenCSV 3609 ms (+8.4%)
 GenJava CSV 3844 ms (+15.5%)
 Commons CSV 4656 ms (+39.9%)
 Skife CSV 4813 ms (+44.6%)

 I also tried Nuiton CSV and Esperio CSV but I couldn't figure how to use
 them.

 I haven't analyzed why Commons CSV is slower yet, but it seems there is
 room for improvements. The memory usage will have to be compared too,
 I'm looking for a way to measure it.


 Emmanuel Bourg

 [1] http://www.maxmind.com/download/worldcities/worldcitiespop.txt.gz






-
To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
For additional commands, e-mail: dev-h...@commons.apache.org



Re: [csv] Performance comparison

2012-03-12 Thread Emmanuel Bourg

On 13/03/2012 01:25, sebb wrote:


I'm concerned that the CSV code may grow and grow with private
versions of code that could be provided by the JDK.

By all means make sure the code is efficient in the way it uses the
JDK classes, but I don't think we should be recoding standard classes.


I agree such a class should not live in [csv], but maybe in [io]?

Emmanuel Bourg





Re: [csv] Performance comparison

2012-03-12 Thread Gary Gregory
On Mar 12, 2012, at 20:25, sebb seb...@gmail.com wrote:

 On 13 March 2012 00:12, Emmanuel Bourg ebo...@apache.org wrote:
 I kept tickling ExtendedBufferedReader and I have some interesting results.

 First I tried to simplify it by extending java.io.LineNumberReader instead
 of BufferedReader. The performance decreased by 20%, probably because the
 class is synchronized internally.

 But wait, isn't BufferedReader also synchronized? I copied the code of
 BufferedReader and removed the synchronized blocks. Now the time to parse
 the file is down to 2652 ms, 28% faster than previously!

 Of course the code of BufferedReader can't be copied from the JDK due to the
 license mismatch, so I took the version from Harmony. On my test it is about
 4% faster than the JDK counterpart, and the parsing time is now around 2553
 ms.

 I'm concerned that the CSV code may grow and grow with private
 versions of code that could be provided by the JDK.

 By all means make sure the code is efficient in the way it uses the
 JDK classes, but I don't think we should be recoding standard classes.

+1

Gary

 Now Commons CSV can start claiming being the fastest CSV parser around :)

 Emmanuel Bourg


 On 12/03/2012 11:31, Emmanuel Bourg wrote:

 I have identified the performance killer, it's the
 ExtendedBufferedReader. It implements a complex logic to fetch one
 character ahead, but this extra character is rarely used. I have
 implemented a simpler look ahead using mark/reset as suggested by Bob
 Smith in CSV-42 and the performance improved by 30%.

 Now the parsing is down to 3406 ms, and that's almost without touching
 the parser yet.

 Emmanuel Bourg


 On 11/03/2012 15:05, Emmanuel Bourg wrote:

 Hi,

 I compared the performance of Commons CSV with the other CSV parsers
 available. I took the world cities file from Maxmind as a test file [1],
 it's a big file of 130M with 2.8 million records.

 Here are the results obtained on a Core 2 Duo E8400 after several
 iterations to let the JIT compiler kick in:

 Direct read 750 ms
 Java CSV 3328 ms
 Super CSV 3562 ms (+7%)
 OpenCSV 3609 ms (+8.4%)
 GenJava CSV 3844 ms (+15.5%)
 Commons CSV 4656 ms (+39.9%)
 Skife CSV 4813 ms (+44.6%)

 I also tried Nuiton CSV and Esperio CSV but I couldn't figure how to use
 them.

 I haven't analyzed why Commons CSV is slower yet, but it seems there is
 room for improvements. The memory usage will have to be compared too,
 I'm looking for a way to measure it.


 Emmanuel Bourg

 [1] http://www.maxmind.com/download/worldcities/worldcitiespop.txt.gz









Re: [csv] Performance comparison

2012-03-12 Thread Gary Gregory
On Mar 12, 2012, at 20:30, Emmanuel Bourg ebo...@apache.org wrote:

 On 13/03/2012 01:25, sebb wrote:

 I'm concerned that the CSV code may grow and grow with private
 versions of code that could be provided by the JDK.

 By all means make sure the code is efficient in the way it uses the
 JDK classes, but I don't think we should be recoding standard classes.

 I agree such a class should not live in [csv], but maybe in [io]?

That would be better but we need to think twice before adding code.

Gary


 Emmanuel Bourg





Re: [csv] Performance comparison

2012-03-12 Thread sebb
On 13 March 2012 00:29, Emmanuel Bourg ebo...@apache.org wrote:
 On 13/03/2012 01:25, sebb wrote:


 I'm concerned that the CSV code may grow and grow with private
 versions of code that could be provided by the JDK.

 By all means make sure the code is efficient in the way it uses the
 JDK classes, but I don't think we should be recoding standard classes.


 I agree such a class should not live in [csv], but maybe in [io]?

I don't think we should be trying to recode JDK classes.

 Emmanuel Bourg





Re: [csv] Performance comparison

2012-03-12 Thread Niall Pemberton
On Tue, Mar 13, 2012 at 12:29 AM, Emmanuel Bourg ebo...@apache.org wrote:
 On 13/03/2012 01:25, sebb wrote:


 I'm concerned that the CSV code may grow and grow with private
 versions of code that could be provided by the JDK.

 By all means make sure the code is efficient in the way it uses the
 JDK classes, but I don't think we should be recoding standard classes.


 I agree such a class should not live in [csv], but maybe in [io]?

IMO performance should be taken out of the equation by using the
Readable interface[1]. That way the users can use whatever
implementation suits them (for example using an underlying buffered
InputStream) to change/improve performance.

Niall

[1] http://docs.oracle.com/javase/7/docs/api/java/lang/Readable.html
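
As a rough illustration of this suggestion, code written against Readable only needs a CharBuffer and works with any Reader the caller passes in. The class and method names below (ReadableDemo, countLines) are hypothetical, and this is not the Commons CSV API.

```java
import java.io.IOException;
import java.nio.CharBuffer;

class ReadableDemo {
    /** Counts line-feed characters pulled from any Readable (every Reader is one). */
    static int countLines(Readable source) throws IOException {
        CharBuffer buffer = CharBuffer.allocate(8192);
        int lines = 0;
        while (source.read(buffer) != -1) {
            buffer.flip(); // switch from filling the buffer to draining it
            while (buffer.hasRemaining()) {
                if (buffer.get() == '\n') {
                    lines++;
                }
            }
            buffer.clear(); // make the whole buffer available for the next read
        }
        return lines;
    }
}
```

For example, countLines(new StringReader("a\nb\n")) and countLines(new BufferedReader(new FileReader(file))) go through the same code path; the caller decides how the input is buffered.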

 Emmanuel Bourg





Re: [csv] Performance comparison

2012-03-12 Thread sebb
On 13 March 2012 01:47, Niall Pemberton niall.pember...@gmail.com wrote:
 On Tue, Mar 13, 2012 at 12:29 AM, Emmanuel Bourg ebo...@apache.org wrote:
 On 13/03/2012 01:25, sebb wrote:


 I'm concerned that the CSV code may grow and grow with private
 versions of code that could be provided by the JDK.

 By all means make sure the code is efficient in the way it uses the
 JDK classes, but I don't think we should be recoding standard classes.


 I agree such a class should not live in [csv], but maybe in [io]?

 IMO performance should be taken out of the equation by using the
 Readable interface[1]. That way the users can use whatever
 implementation suits them (for example using an underlying buffered
 InputStream) to change/improve performance.

+1, excellent suggestion.

 Niall

 [1] http://docs.oracle.com/javase/7/docs/api/java/lang/Readable.html

 Emmanuel Bourg





Re: [csv] Performance comparison

2012-03-12 Thread Ralph Goers

On Mar 12, 2012, at 5:44 PM, sebb wrote:

 On 13 March 2012 00:29, Emmanuel Bourg ebo...@apache.org wrote:
 On 13/03/2012 01:25, sebb wrote:
 
 
 I'm concerned that the CSV code may grow and grow with private
 versions of code that could be provided by the JDK.
 
 By all means make sure the code is efficient in the way it uses the
 JDK classes, but I don't think we should be recoding standard classes.
 
 
 I agree such a class should not live in [csv], but maybe in [io]?
 
 I don't think we should be trying to recode JDK classes.

If the implementations suck, why not?

Ralph



Re: [csv] Performance comparison

2012-03-11 Thread Benedikt Ritter
On 11 March 2012 15:05, Emmanuel Bourg ebo...@apache.org wrote:
 Hi,

 I compared the performance of Commons CSV with the other CSV parsers
 available. I took the world cities file from Maxmind as a test file [1],
 it's a big file of 130M with 2.8 million records.

 Here are the results obtained on a Core 2 Duo E8400 after several iterations
 to let the JIT compiler kick in:

 Direct read      750 ms
 Java CSV        3328 ms
 Super CSV       3562 ms  (+7%)
 OpenCSV         3609 ms  (+8.4%)
 GenJava CSV     3844 ms  (+15.5%)
 Commons CSV     4656 ms  (+39.9%)
 Skife CSV       4813 ms  (+44.6%)

 I also tried Nuiton CSV and Esperio CSV but I couldn't figure how to use
 them.

 I haven't analyzed why Commons CSV is slower yet, but it seems there is room
 for improvements. The memory usage will have to be compared too, I'm looking
 for a way to measure it.


Hey Emmanuel,

I have some spare time to help you with this. I'll check out the
latest source tonight. Any suggestion where to start?

Regards,
Benedikt


 Emmanuel Bourg

 [1] http://www.maxmind.com/download/worldcities/worldcitiespop.txt.gz





Re: [csv] Performance comparison

2012-03-11 Thread Emmanuel Bourg

On 11/03/2012 16:53, Benedikt Ritter wrote:


I have some spare time to help you with this. I'll check out the
latest source tonight. Any suggestion where to start?


Hi Benedikt, thank you for helping. You can start by looking at the source
of CSVParser to see if anything catches your eye, and then run a profiler to try
and identify the performance-critical parts that could be improved.


Emmanuel Bourg





Re: [csv] Performance comparison

2012-03-11 Thread Benedikt Ritter
On 11 March 2012 21:21, Emmanuel Bourg ebo...@apache.org wrote:
 On 11/03/2012 16:53, Benedikt Ritter wrote:


 I have some spare time to help you with this. I'll check out the
 latest source tonight. Any suggestion where to start?


 Hi Benedikt, thank you for helping. You can start looking at the source of
 CSVParser if anything catch your eyes, and then run a profiler to try and
 identify the performance critical parts that could be improved.


Hi Emmanuel,

I've started to dig my way through the source. I haven't done much
performance measuring in my career yet. I would use VisualVM for
profiling, unless you know of something better.
And how about some JUnit performance tests? They may not be as
accurate as a profiler, but they can give you a feeling for whether you
are on the right track.

Benedikt

 Emmanuel Bourg





Re: [csv] Performance comparison

2012-03-11 Thread Emmanuel Bourg

On 12/03/2012 00:02, Benedikt Ritter wrote:


I've started to dig my way through the source. I've not done too much
performance measuring in my career yet. I would use VisualVM for
profiling, if you don't know anything better.


Usually I work with JProfiler, it identifies the hotspots pretty well, 
but I'm not sure if it will produce relevant results on the complex 
methods of CSVLexer.




And how about some performance junit tests? They may not be as
accurate as a profiler, but they can give you a feeling, whether you
are on the right way.


I wrote a quick test locally, but it's not clean enough to be
committed. It looks like this:



import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.io.Reader;

import junit.framework.TestCase;

import org.apache.commons.csv.CSVFormat;

public class PerformanceTest extends TestCase {

    // Number of iterations, so that the JIT compiler has time to kick in
    private int max = 10;

    private BufferedReader getReader() throws IOException {
        return new BufferedReader(new FileReader("worldcitiespop.txt"));
    }

    // Baseline: read the file line by line without parsing it
    public void testReadBigFile() throws Exception {
        for (int i = 0; i < max; i++) {
            BufferedReader in = getReader();
            long t0 = System.currentTimeMillis();
            int count = readAll(in);
            in.close();
            System.out.println("File read in " + (System.currentTimeMillis() - t0) + "ms " + count + " lines");
        }
        System.out.println();
    }

    private int readAll(BufferedReader in) throws IOException {
        int count = 0;
        while (in.readLine() != null) {
            count++;
        }

        return count;
    }

    // Parse the same file with Commons CSV and count the records
    public void testParseBigFile() throws Exception {
        for (int i = 0; i < max; i++) {
            long t0 = System.currentTimeMillis();
            int count = parseCommonsCSV(getReader());
            System.out.println("File parsed in " + (System.currentTimeMillis() - t0) + "ms with Commons CSV " + count + " lines");
        }
        System.out.println();
    }

    private int parseCommonsCSV(Reader in) {
        CSVFormat format = CSVFormat.DEFAULT.withSurroundingSpacesIgnored(false);

        int count = 0;
        for (String[] record : format.parse(in)) {
            count++;
        }

        return count;
    }
}
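
The timing loops in this test could be factored into a small helper that discards warm-up runs and uses System.nanoTime(). This is only a sketch under that assumption; TimingHarness and bestMillis are hypothetical names and not part of Commons CSV.

```java
import java.util.concurrent.Callable;

class TimingHarness {
    /**
     * Runs the task 'warmups' times without timing it, then 'runs' more times,
     * and returns the best measured time in milliseconds.
     * With runs == 0 nothing is measured and the result is meaningless.
     */
    static long bestMillis(Callable<?> task, int warmups, int runs) throws Exception {
        for (int i = 0; i < warmups; i++) {
            task.call(); // let the JIT compiler kick in
        }
        long best = Long.MAX_VALUE;
        for (int i = 0; i < runs; i++) {
            long t0 = System.nanoTime();
            task.call();
            best = Math.min(best, System.nanoTime() - t0);
        }
        return best / 1_000_000;
    }
}
```

Taking the best of several runs reduces noise from garbage collection and other processes, but it remains a rough measurement compared with a dedicated benchmarking tool.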


Emmanuel Bourg


