The intermediate serialization format was improved in as part of - https://issues.apache.org/jira/browse/PIG-1472, it includes the move to varint like format.

BinStorage is also used by users as regular load/store func, so its binary format could not be changed, and a new internal load/store func was introduced.

Thanks,
Thejas


On 8/20/11 10:41 AM, Kevin Burton wrote:
0.9.0 ... I will investigate ....

--
Kevin Burton


On Aug 20, 2011, at 8:31 AM, Alan Gates<[email protected]>  wrote:

What version of Pig are you using?  I believe we moved off of BinStorage in 0.8 
and started using varint around the same time.

Alan.

On Aug 20, 2011, at 4:03 AM, Kevin Burton wrote:

I ran the benchmark now and results are pretty interesting.  Varint is about
2x performance and 1/2 the size of using integer.toString…



varint encoding
duration: 37,355 ms
string encoding
duration: 74,178 ms

varint total bytes: 2,229,450,884
string total bytes: 4,388,888,890



       long before,after;
       int max = 500000000;

       System.gc();

       before = System.currentTimeMillis();

       System.out.printf( "varint encoding\n" );

       VarintReader reader = new VarintReader();
       VarintWriter writer = new VarintWriter();

       for ( int i = 0; i<  max; ++i ) {
           reader.read( writer.write( i ) );
       }

       after = System.currentTimeMillis();

       System.gc();

       System.out.printf( "duration: %,d ms\n", (after-before) );

       before = System.currentTimeMillis();

       System.out.printf( "string encoding\n" );

       for ( int i = 0; i<  max; ++i ) {
           Integer.parseInt( Integer.toString( i ) );
       }

       after = System.currentTimeMillis();

       System.gc();

       System.out.printf( "duration: %,d ms\n", (after-before) );







On Sat, Aug 20, 2011 at 3:43 AM, Kevin Burton<[email protected]>  wrote:

I'm looking at BinStorage which I believe if I've read correct is used for
all Pig intermediate files.

… so any optimizations here would be transparent to the user.

I just did a simple STORE using BinStorage and the format doesn't appear
amazingly concise.

for example:

0 {(1),(2),(3),(4),(1000000000)}

is the following in BinStorage:

n20xn21n22n23n24n2
1000000000

or

00000000  01 02 03 6e 00 00 00 02  32 00 00 00 01 30 78 00
|...n....2....0x.|
00000010  00 00 00 00 00 00 05 6e  00 00 00 01 32 00 00 00
|.......n....2...|
00000020  01 31 6e 00 00 00 01 32  00 00 00 01 32 6e 00 00
|.1n....2....2n..|
00000030  00 01 32 00 00 00 01 33  6e 00 00 00 01 32 00 00
|..2....3n....2..|
00000040  00 01 34 6e 00 00 00 01  32 00 00 00 0a 31 30 30
|..4n....2....100|
00000050  30 30 30 30 30 30 30                              |0000000|
00000057

… now, efficient integer storage is a controversial topic.

if you have short integers representing them as four bytes will waste a ton
of space.

implementing them as varints is usually a good compromise:

http://code.google.com/apis/protocolbuffers/docs/encoding.html

7 bits of each byte are used to represent the int.  One of the bits is used
to signal whether there is a next bit which needs to be read.

In my job , most of my ints will be stored in 4-8 bytes….. however, in
varint encoding they would only be 2 bytes.

A 4x savings in disk space could significantly improve performance.

I haven't benchmarked CPU of variants vs Integer.toString() though …. which
I might do now.

Still…. even if varint encoding is slower, using 4 bytes for some uses
could be a win.

--

Founder/CEO Spinn3r.com

Location: *San Francisco, CA*
Skype: *burtonator*

Skype-in: *(415) 871-0687*




--

Founder/CEO Spinn3r.com

Location: *San Francisco, CA*
Skype: *burtonator*

Skype-in: *(415) 871-0687*


Reply via email to