0.9.0 ... I will investigate .... 

--
Kevin Burton


On Aug 20, 2011, at 8:31 AM, Alan Gates <[email protected]> wrote:

> What version of Pig are you using?  I believe we moved off of BinStorage in 
> 0.8 and started using varint around the same time.
> 
> Alan.
> 
> On Aug 20, 2011, at 4:03 AM, Kevin Burton wrote:
> 
>> I ran the benchmark now and results are pretty interesting.  Varint is about
>> 2x performance and 1/2 the size of using integer.toString…
>> 
>> 
>> 
>> varint encoding
>> duration: 37,355 ms
>> string encoding
>> duration: 74,178 ms
>> 
>> varint total bytes: 2,229,450,884
>> string total bytes: 4,388,888,890
>> 
>> 
>> 
>>       long before,after;
>>       int max = 500000000;
>> 
>>       System.gc();
>> 
>>       before = System.currentTimeMillis();
>> 
>>       System.out.printf( "varint encoding\n" );
>> 
>>       VarintReader reader = new VarintReader();
>>       VarintWriter writer = new VarintWriter();
>> 
>>       for ( int i = 0; i < max; ++i ) {
>>           reader.read( writer.write( i ) );
>>       }
>> 
>>       after = System.currentTimeMillis();
>> 
>>       System.gc();
>> 
>>       System.out.printf( "duration: %,d ms\n", (after-before) );
>> 
>>       before = System.currentTimeMillis();
>> 
>>       System.out.printf( "string encoding\n" );
>> 
>>       for ( int i = 0; i < max; ++i ) {
>>           Integer.parseInt( Integer.toString( i ) );
>>       }
>> 
>>       after = System.currentTimeMillis();
>> 
>>       System.gc();
>> 
>>       System.out.printf( "duration: %,d ms\n", (after-before) );
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> On Sat, Aug 20, 2011 at 3:43 AM, Kevin Burton <[email protected]> wrote:
>> 
>>> I'm looking at BinStorage which I believe if I've read correct is used for
>>> all Pig intermediate files.
>>> 
>>> … so any optimizations here would be transparent to the user.
>>> 
>>> I just did a simple STORE using BinStorage and the format doesn't appear
>>> amazingly concise.
>>> 
>>> for example:
>>> 
>>> 0 {(1),(2),(3),(4),(1000000000)}
>>> 
>>> is the following in BinStorage:
>>> 
>>> n20xn21n22n23n24n2
>>> 1000000000
>>> 
>>> or
>>> 
>>> 00000000  01 02 03 6e 00 00 00 02  32 00 00 00 01 30 78 00
>>> |...n....2....0x.|
>>> 00000010  00 00 00 00 00 00 05 6e  00 00 00 01 32 00 00 00
>>> |.......n....2...|
>>> 00000020  01 31 6e 00 00 00 01 32  00 00 00 01 32 6e 00 00
>>> |.1n....2....2n..|
>>> 00000030  00 01 32 00 00 00 01 33  6e 00 00 00 01 32 00 00
>>> |..2....3n....2..|
>>> 00000040  00 01 34 6e 00 00 00 01  32 00 00 00 0a 31 30 30
>>> |..4n....2....100|
>>> 00000050  30 30 30 30 30 30 30                              |0000000|
>>> 00000057
>>> 
>>> … now, efficient integer storage is a controversial topic.
>>> 
>>> if you have short integers representing them as four bytes will waste a ton
>>> of space.
>>> 
>>> implementing them as varints is usually a good compromise:
>>> 
>>> http://code.google.com/apis/protocolbuffers/docs/encoding.html
>>> 
>>> 7 bits of each byte are used to represent the int.  One of the bits is used
>>> to signal whether there is a next bit which needs to be read.
>>> 
>>> In my job , most of my ints will be stored in 4-8 bytes….. however, in
>>> varint encoding they would only be 2 bytes.
>>> 
>>> A 4x savings in disk space could significantly improve performance.
>>> 
>>> I haven't benchmarked CPU of variants vs Integer.toString() though …. which
>>> I might do now.
>>> 
>>> Still…. even if varint encoding is slower, using 4 bytes for some uses
>>> could be a win.
>>> 
>>> --
>>> 
>>> Founder/CEO Spinn3r.com
>>> 
>>> Location: *San Francisco, CA*
>>> Skype: *burtonator*
>>> 
>>> Skype-in: *(415) 871-0687*
>>> 
>>> 
>> 
>> 
>> -- 
>> 
>> Founder/CEO Spinn3r.com
>> 
>> Location: *San Francisco, CA*
>> Skype: *burtonator*
>> 
>> Skype-in: *(415) 871-0687*
> 

Reply via email to