
Re: [CSV] Performance

2012-03-16 Thread Ted Dunning

See the Line and FastLine classes
in org.apache.mahout.classifier.sgd.SimpleCsvExamples in the Mahout
Examples module.

You can see an older version of Mahout here. This class hasn't changed in
forever.

https://github.com/tdunning/mahout/blob/debian-package/examples/src/main/java/org/apache/mahout/classifier/sgd/SimpleCsvExamples.java

On Thu, Mar 15, 2012 at 3:11 PM, Emmanuel Bourg ebo...@apache.org wrote:

Thank you for sharing your experience, Ted. Do you have a link to the code
of your parser? I'd like to take a look.

Currently the data flow in Commons CSV is:

1. Buffer the data in the BufferedReader
2. Accumulate data in a reusable buffer for the current token
3. Turn the token buffer into a String

I was also thinking of something similar to reduce the string copies. The
token from the CSVLexer could probably contain a CharSequence instead of a
String. The CharSequence would be backed by the same array for all the
fields of the record. Thus if a field isn't read by the user we don't pay
the cost to convert it into a String. But this prevents the reuse of the
buffer, and that means more work for the GC.

Emmanuel Bourg

On 15/03/2012 15:49, Ted Dunning wrote:

I built a limited CSV package for parsing data in Mahout at one point. I
doubt that it was general enough to be helpful here, but the experience
might be.

The thing that *really* made a big difference in speed was to avoid copies
and conversions to String. To do that, I built a state machine that
operated on bytes to do the parsing from byte arrays. The parser passed
around offsets only. Then when converting data, I converted directly from
the original byte array into the target type. For the most common case (in
my data) of converting to Integers, this eliminated masses of cons'ing and
because the conversion was special purpose (I assumed UTF8 encoding and
assumed that numbers could only use ASCII range digits), the conversion to
integers was particularly fast.

Overall, this made about a 20x difference in speed. This is not 20%; the
final time was 5% of the original.

Re: [CSV] Performance

2012-03-16 Thread Ted Dunning

On Thu, Mar 15, 2012 at 3:11 PM, Emmanuel Bourg ebo...@apache.org wrote:

 ...
 1. Buffer the data in the BufferedReader
 2. Accumulate data in a reusable buffer for the current token


Reusable buffers are usually death in terms of subtle bugs and they rarely
actually help that much.  The key is to avoid copying.  Cons'ing up and
then collecting pointerized structures isn't that expensive.

3. Turn the token buffer into a String


Also EVIL.  Mustn't convert to a String unless that is really what you want.


 I was also thinking of something similar to reduce the string copies. The
 token from the CSVLexer could probably contain a CharSequence instead of a
 String. The CharSequence would be backed by the same array for all the
 fields of the record. Thus if a field isn't read by the user we don't pay
 the cost to convert it into a String. But this prevents the reuse of the
 buffer, and that means more work for the GC.


Just moving around char costs twice as much as moving around bytes for most
CSV data.  I would avoid that if possible.

I wouldn't worry about the GC.  The experience in Hadoop and Lucene is that
the effort made to avoid allocating lightweight structures was very
misguided.  My own experiments have never shown a big benefit unless you
conflate cons'ing the structures with copying lots of data.  If you avoid
the copy, the construction and collection of ephemeral structures turns out
to be very nearly free.
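
As an illustration of that trade-off, the sketch below allocates one small
object per field that only points into the record's byte array, so no field
data is copied until a String is actually requested. FieldSlice and its
methods are hypothetical names (not Mahout or Commons CSV code), and quoting
and escaping are ignored.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical per-field view: a small object that points into the record's bytes. */
final class FieldSlice {
    final byte[] buf;
    final int start; // inclusive
    final int end;   // exclusive

    FieldSlice(byte[] buf, int start, int end) {
        this.buf = buf;
        this.start = start;
        this.end = end;
    }

    int length() {
        return end - start;
    }

    /** Copies and decodes the bytes only when a String is really wanted. */
    String asString() {
        return new String(buf, start, end - start, StandardCharsets.UTF_8);
    }

    /** Splits on commas, allocating one slice per field and copying no field data. */
    static List<FieldSlice> split(byte[] record) {
        List<FieldSlice> fields = new ArrayList<>();
        int fieldStart = 0;
        for (int i = 0; i <= record.length; i++) {
            if (i == record.length || record[i] == ',') {
                fields.add(new FieldSlice(record, fieldStart, i));
                fieldStart = i + 1;
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        byte[] record = "a,bc,,d".getBytes(StandardCharsets.UTF_8);
        for (FieldSlice field : split(record)) {
            System.out.println(field.length() + " byte(s): '" + field.asString() + "'");
        }
    }
}
```

The slices are short-lived and hold three fields each; the only bulk copy
happens inside asString(), and only for fields the caller asks for.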


 Emmanuel Bourg


 On 15/03/2012 15:49, Ted Dunning wrote:

 I built a limited CSV package for parsing data in Mahout at one point.  I
 doubt that it was general enough to be helpful here, but the experience
 might be.

 The thing that *really* made a big difference in speed was to avoid copies
 and conversions to String.  To do that, I built a state machine that
 operated on bytes to do the parsing from byte arrays.  The parser passed
 around offsets only.  Then when converting data, I converted directly from
 the original byte array into the target type.  For the most common case (in
 my data) of converting to Integers, this eliminated masses of cons'ing and
 because the conversion was special purpose (I assumed UTF8 encoding and
 assumed that numbers could only use ASCII range digits), the conversion to
 integers was particularly fast.

 Overall, this made about a 20x difference in speed.  This is not 20%; the
 final time was 5% of the original.

Re: [CSV] Performance

2012-03-15 Thread Emmanuel Bourg


On 15/03/2012 13:34, sebb wrote:

In my testing, using final class variables for delimiter, escape etc
(set in ctor) shaves about 1 sec off the time to read the world town
data file compared with accessing these fields inline through the
format field.

Average time goes from c. 25.5 to c. 24.5 which is a 4% improvement.

I suspect this is partly because the fetches are currently in loops
rather than any getter overhead.


Did you run with the client or the server VM? My tests were all with the 
server VM.



Emmanuel Bourg




Re: [CSV] Performance

2012-03-15 Thread Benedikt Ritter

On 15 March 2012 at 13:50, Gary Gregory garydgreg...@gmail.com wrote:
 Can you put your perf test code and resources in SVN so I do not have to
 write one, please?


Hi Gary,

Have a look at http://markmail.org/message/x73i3hl63rjqdyfa (I agree
with you that having a clean performance test in SVN would be better).

Regards,
Benedikt

 Gary

 On Thu, Mar 15, 2012 at 8:34 AM, sebb seb...@gmail.com wrote:

 In my testing, using final class variables for delimiter, escape etc
 (set in ctor) shaves about 1 sec off the time to read the world town
 data file compared with accessing these fields inline through the
 format field.

 Average time goes from c. 25.5 to c. 24.5 which is a 4% improvement.

 I suspect this is partly because the fetches are currently in loops
 rather than any getter overhead.

 -
 To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
 For additional commands, e-mail: dev-h...@commons.apache.org




 --
 E-Mail: garydgreg...@gmail.com | ggreg...@apache.org
 JUnit in Action, 2nd Ed: http://bit.ly/ECvg0
 Spring Batch in Action: http://s.apache.org/HOq http://bit.ly/bqpbCK
 Blog: http://garygregory.wordpress.com
 Home: http://garygregory.com/
 Tweet! http://twitter.com/GaryGregory


Re: [CSV] Performance

2012-03-15 Thread sebb

On 15 March 2012 12:43, Emmanuel Bourg ebo...@apache.org wrote:
 On 15/03/2012 13:34, sebb wrote:

 In my testing, using final class variables for delimiter, escape etc
 (set in ctor) shaves about 1 sec off the time to read the world town
 data file compared with accessing these fields inline through the
 format field.

 Average time goes from c. 25.5 to c. 24.5 which is a 4% improvement.

 I suspect this is partly because the fetches are currently in loops
 rather than any getter overhead.


 Did you run with the client or the server VM? My tests were all with the
 server VM.

Eclipse, so probably client VM?


 Emmanuel Bourg



Re: [CSV] Performance

2012-03-15 Thread Emmanuel Bourg


On 15/03/2012 14:13, sebb wrote:


Eclipse, so probably client VM?


Probably. You can print the java.vm.name system property at the 
beginning of the test, that will tell you the VM used.
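
For example, a check along these lines could run at the start of the test.
VmInfo is a hypothetical class name; java.vm.name is a standard system
property, and on HotSpot its value normally says whether the client or the
server VM is running.

```java
/** Hypothetical helper that reports which VM is running the benchmark. */
final class VmInfo {
    /** Returns the value of the standard java.vm.name system property. */
    static String vmName() {
        return System.getProperty("java.vm.name");
    }

    public static void main(String[] args) {
        // Print the VM name once, before any timing starts.
        System.out.println("java.vm.name = " + vmName());
    }
}
```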


Emmanuel Bourg




Re: [CSV] Performance

2012-03-15 Thread Ted Dunning

I built a limited CSV package for parsing data in Mahout at one point.  I
doubt that it was general enough to be helpful here, but the experience
might be.

The thing that *really* made a big difference in speed was to avoid copies
and conversions to String.  To do that, I built a state machine that
operated on bytes to do the parsing from byte arrays.  The parser passed
around offsets only.  Then when converting data, I converted directly from
the original byte array into the target type.  For the most common case (in
my data) of converting to Integers, this eliminated masses of cons'ing and
because the conversion was special purpose (I assumed UTF8 encoding and
assumed that numbers could only use ASCII range digits), the conversion to
integers was particularly fast.

Overall, this made about a 20x difference in speed.  This is not 20%; the
final time was 5% of the original.
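
To make the idea concrete, here is a minimal sketch of converting fields to
int straight from the byte array while passing only offsets around. It is
not the Mahout code: ByteIntParser and its methods are hypothetical names,
and the sketch assumes comma-separated ASCII digits with no quoting, no
escaping and no overflow checks.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper: parses ASCII decimal integers in place, without Strings. */
final class ByteIntParser {
    /** Parses buf[start, end) as a signed decimal int. */
    static int parseInt(byte[] buf, int start, int end) {
        if (start >= end) {
            throw new NumberFormatException("empty field");
        }
        int i = start;
        boolean negative = buf[i] == '-';
        if (negative) {
            i++;
        }
        if (i == end) {
            throw new NumberFormatException("no digits");
        }
        int value = 0;
        for (; i < end; i++) {
            int digit = buf[i] - '0';
            if (digit < 0 || digit > 9) {
                throw new NumberFormatException("non-digit byte at offset " + i);
            }
            value = value * 10 + digit; // overflow is not checked in this sketch
        }
        return negative ? -value : value;
    }

    /** Sums the comma-separated integers of one line; only offsets are passed on. */
    static long sumFields(byte[] line) {
        long sum = 0;
        int fieldStart = 0;
        for (int i = 0; i <= line.length; i++) {
            if (i == line.length || line[i] == ',') {
                sum += parseInt(line, fieldStart, i);
                fieldStart = i + 1;
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        byte[] line = "12,-7,300".getBytes(StandardCharsets.US_ASCII);
        System.out.println(sumFields(line)); // prints 305
    }
}
```

No String, char[] or boxed Integer is created per field; the input bytes
are read once and the result goes straight into a primitive.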

On Thu, Mar 15, 2012 at 8:34 AM, sebb seb...@gmail.com wrote:

 In my testing, using final class variables for delimiter, escape etc
 (set in ctor) shaves about 1 sec off the time to read the world town
 data file compared with accessing these fields inline through the
 format field.

 Average time goes from c. 25.5 to c. 24.5 which is a 4% improvement.

 I suspect this is partly because the fetches are currently in loops
 rather than any getter overhead.


Re: [CSV] Performance

2012-03-15 Thread Emmanuel Bourg

Thank you for sharing your experience, Ted. Do you have a link to the
code of your parser? I'd like to take a look.


Currently the data flow in Commons CSV is:

1. Buffer the data in the BufferedReader
2. Accumulate data in a reusable buffer for the current token
3. Turn the token buffer into a String

I was also thinking of something similar to reduce the string copies.
The token from the CSVLexer could probably contain a CharSequence 
instead of a String. The CharSequence would be backed by the same array 
for all the fields of the record. Thus if a field isn't read by the user 
we don't pay the cost to convert it into a String. But this prevents the 
reuse of the buffer, and that means more work for the GC.
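
A minimal sketch of such a view, assuming a hypothetical CharSlice class
(this is not part of the Commons CSV API): every field of a record shares
one char[], and a String is only built when toString() is called.

```java
/** Hypothetical CharSequence view over a char[] shared by all fields of a record. */
final class CharSlice implements CharSequence {
    private final char[] buf;
    private final int start; // inclusive
    private final int end;   // exclusive

    CharSlice(char[] buf, int start, int end) {
        this.buf = buf;
        this.start = start;
        this.end = end;
    }

    @Override
    public int length() {
        return end - start;
    }

    @Override
    public char charAt(int index) {
        if (index < 0 || index >= length()) {
            throw new IndexOutOfBoundsException("index " + index);
        }
        return buf[start + index];
    }

    @Override
    public CharSequence subSequence(int from, int to) {
        if (from < 0 || to > length() || from > to) {
            throw new IndexOutOfBoundsException(from + ".." + to);
        }
        return new CharSlice(buf, start + from, start + to); // still no copy
    }

    /** The only place where the characters are copied. */
    @Override
    public String toString() {
        return new String(buf, start, end - start);
    }

    public static void main(String[] args) {
        char[] record = "alpha,beta".toCharArray();
        CharSequence second = new CharSlice(record, 6, 10);
        System.out.println(second.length() + " chars: " + second);
    }
}
```

The cost mentioned above is visible here: the array cannot be refilled
with the next record while any slice pointing into it is still in use.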


Emmanuel Bourg


On 15/03/2012 15:49, Ted Dunning wrote:

I built a limited CSV package for parsing data in Mahout at one point.  I
doubt that it was general enough to be helpful here, but the experience
might be.

The thing that *really* made a big difference in speed was to avoid copies
and conversions to String.  To do that, I built a state machine that
operated on bytes to do the parsing from byte arrays.  The parser passed
around offsets only.  Then when converting data, I converted directly from
the original byte array into the target type.  For the most common case (in
my data) of converting to Integers, this eliminated masses of cons'ing and
because the conversion was special purpose (I assumed UTF8 encoding and
assumed that numbers could only use ASCII range digits), the conversion to
integers was particularly fast.

Overall, this made about a 20x difference in speed.  This is not 20%; the
final time was 5% of the original.





Re: [CSV] Performance

2012-03-15 Thread sebb

On 15 March 2012 13:17, Emmanuel Bourg ebo...@apache.org wrote:
 On 15/03/2012 14:13, sebb wrote:


 Eclipse, so probably client VM?


 Probably. You can print the java.vm.name system property at the beginning of
 the test, that will tell you the VM used.

It was client.

I've now tried with server, and it is slower when using class fields
created by the ctor.
Curious - perhaps it is a data localisation issue, though I would have
thought the optimiser could have fetched those into local variables.

So I then tried hauling the format method calls out of the loops into
final local variables.
This improves performance (slightly) in both client and server mode.
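
The variants being compared look roughly like this. SketchFormat and
DelimiterCounter are hypothetical stand-ins, not the Commons CSV classes;
the sketch only shows where the delimiter is read and makes no claim about
which variant is faster on a given VM.

```java
/** Hypothetical stand-in for the CSV format object. */
final class SketchFormat {
    private final char delimiter;

    SketchFormat(char delimiter) {
        this.delimiter = delimiter;
    }

    char getDelimiter() {
        return delimiter;
    }
}

/** Hypothetical lexer-like class with three ways of reading the delimiter. */
final class DelimiterCounter {
    private final SketchFormat format;
    private final char delimiterField; // copied once in the constructor

    DelimiterCounter(SketchFormat format) {
        this.format = format;
        this.delimiterField = format.getDelimiter();
    }

    /** Variant 1: goes through the format object on every iteration. */
    int countViaFormat(char[] data) {
        int count = 0;
        for (char c : data) {
            if (c == format.getDelimiter()) {
                count++;
            }
        }
        return count;
    }

    /** Variant 2: reads a final field set in the constructor. */
    int countViaField(char[] data) {
        int count = 0;
        for (char c : data) {
            if (c == delimiterField) {
                count++;
            }
        }
        return count;
    }

    /** Variant 3: hoists the getter call into a final local before the loop. */
    int countViaLocal(char[] data) {
        final char delimiter = format.getDelimiter();
        int count = 0;
        for (char c : data) {
            if (c == delimiter) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        DelimiterCounter counter = new DelimiterCounter(new SketchFormat(','));
        char[] data = "a,b,,c".toCharArray();
        System.out.println(counter.countViaFormat(data));
        System.out.println(counter.countViaField(data));
        System.out.println(counter.countViaLocal(data));
    }
}
```

All three variants return the same count; only the timing can differ, and
that has to be measured on the VM in question.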

 Emmanuel Bourg



Re: [CSV] Performance

2012-03-15 Thread Emmanuel Bourg


On 15/03/2012 16:45, sebb wrote:


So I then tried hauling the format method calls out of the loops into
final local variables.
This improves performance (slightly) in both client and server mode.


Could you show some code please? I'm unable to reproduce this. I used 
local variables in simpleTokenLexer and the performance degrades by 14% 
(HotSpot server VM, JDK 6u31, Core 2 Duo).


Emmanuel Bourg




Re: [CSV] Performance

2012-03-15 Thread sebb

On 15 March 2012 16:10, Emmanuel Bourg ebo...@apache.org wrote:
 On 15/03/2012 16:45, sebb wrote:


 So I then tried hauling the format method calls out of the loops into
 final local variables.
 This improves performance (slightly) in both client and server mode.


 Could you show some code please?

Ditto.

 I'm unable to reproduce this. I used local
 variables in simpleTokenLexer and the performance degrades by 14% (HotSpot
 server VM, JDK 6u31, Core 2 Duo).

I also used local vars in the other methods.

See http://people.apache.org/~sebb/CSV/

 Emmanuel Bourg



Re: [CSV] Performance

2012-03-15 Thread Emmanuel Bourg


On 15/03/2012 17:42, sebb wrote:


I also used local vars in the other methods.

See http://people.apache.org/~sebb/CSV/


Thank you. You also reordered the if in simpleTokenLexer, that may 
explain the difference. I'll give it a try.


Emmanuel Bourg





Re: [CSV] Performance

2012-03-15 Thread sebb

On 15 March 2012 16:48, Emmanuel Bourg ebo...@apache.org wrote:
 On 15/03/2012 17:42, sebb wrote:


 I also used local vars in the other methods.

 See http://people.apache.org/~sebb/CSV/


 Thank you. You also reordered the if in simpleTokenLexer, that may explain
 the difference. I'll give it a try.

I'd forgotten about that; I thought I'd reverted that.

If I revert it, I still get a better time with Lexer2, though not
quite as good an improvement.

 Emmanuel Bourg




Re: [CSV] Performance

2012-03-15 Thread Emmanuel Bourg


On 15/03/2012 18:06, sebb wrote:


If I revert it, I still get a better time with Lexer2, though not
quite as good an improvement.


I ran my perf test, Lexer2 is slower on my system :( The order of the if 
doesn't change much here.


Emmanuel Bourg



