What version of Pig are you using? I believe we moved off of BinStorage in 0.8
and started using varint around the same time.
Alan.
On Aug 20, 2011, at 4:03 AM, Kevin Burton wrote:
> I ran the benchmark now and results are pretty interesting. Varint is about
> 2x performance and 1/2 the size of using integer.toString…
>
>
>
> varint encoding
> duration: 37,355 ms
> string encoding
> duration: 74,178 ms
>
> varint total bytes: 2,229,450,884
> string total bytes: 4,388,888,890
>
>
>
> long before,after;
> int max = 500000000;
>
> System.gc();
>
> before = System.currentTimeMillis();
>
> System.out.printf( "varint encoding\n" );
>
> VarintReader reader = new VarintReader();
> VarintWriter writer = new VarintWriter();
>
> for ( int i = 0; i < max; ++i ) {
> reader.read( writer.write( i ) );
> }
>
> after = System.currentTimeMillis();
>
> System.gc();
>
> System.out.printf( "duration: %,d ms\n", (after-before) );
>
> before = System.currentTimeMillis();
>
> System.out.printf( "string encoding\n" );
>
> for ( int i = 0; i < max; ++i ) {
> Integer.parseInt( Integer.toString( i ) );
> }
>
> after = System.currentTimeMillis();
>
> System.gc();
>
> System.out.printf( "duration: %,d ms\n", (after-before) );
>
>
>
>
>
>
>
> On Sat, Aug 20, 2011 at 3:43 AM, Kevin Burton <[email protected]> wrote:
>
>> I'm looking at BinStorage which I believe if I've read correct is used for
>> all Pig intermediate files.
>>
>> … so any optimizations here would be transparent to the user.
>>
>> I just did a simple STORE using BinStorage and the format doesn't appear
>> amazingly concise.
>>
>> for example:
>>
>> 0 {(1),(2),(3),(4),(1000000000)}
>>
>> is the following in BinStorage:
>>
>> n20xn21n22n23n24n2
>> 1000000000
>>
>> or
>>
>> 00000000 01 02 03 6e 00 00 00 02 32 00 00 00 01 30 78 00
>> |...n....2....0x.|
>> 00000010 00 00 00 00 00 00 05 6e 00 00 00 01 32 00 00 00
>> |.......n....2...|
>> 00000020 01 31 6e 00 00 00 01 32 00 00 00 01 32 6e 00 00
>> |.1n....2....2n..|
>> 00000030 00 01 32 00 00 00 01 33 6e 00 00 00 01 32 00 00
>> |..2....3n....2..|
>> 00000040 00 01 34 6e 00 00 00 01 32 00 00 00 0a 31 30 30
>> |..4n....2....100|
>> 00000050 30 30 30 30 30 30 30 |0000000|
>> 00000057
>>
>> … now, efficient integer storage is a controversial topic.
>>
>> if you have short integers representing them as four bytes will waste a ton
>> of space.
>>
>> implementing them as varints is usually a good compromise:
>>
>> http://code.google.com/apis/protocolbuffers/docs/encoding.html
>>
>> 7 bits of each byte are used to represent the int. One of the bits is used
>> to signal whether there is a next bit which needs to be read.
>>
>> In my job , most of my ints will be stored in 4-8 bytes….. however, in
>> varint encoding they would only be 2 bytes.
>>
>> A 4x savings in disk space could significantly improve performance.
>>
>> I haven't benchmarked CPU of variants vs Integer.toString() though …. which
>> I might do now.
>>
>> Still…. even if varint encoding is slower, using 4 bytes for some uses
>> could be a win.
>>
>> --
>>
>> Founder/CEO Spinn3r.com
>>
>> Location: *San Francisco, CA*
>> Skype: *burtonator*
>>
>> Skype-in: *(415) 871-0687*
>>
>>
>
>
> --
>
> Founder/CEO Spinn3r.com
>
> Location: *San Francisco, CA*
> Skype: *burtonator*
>
> Skype-in: *(415) 871-0687*