0.9.0 ... I will investigate .... -- Kevin Burton
On Aug 20, 2011, at 8:31 AM, Alan Gates <[email protected]> wrote: > What version of Pig are you using? I believe we moved off of BinStorage in > 0.8 and started using varint around the same time. > > Alan. > > On Aug 20, 2011, at 4:03 AM, Kevin Burton wrote: > >> I ran the benchmark now and results are pretty interesting. Varint is about >> 2x performance and 1/2 the size of using integer.toString… >> >> >> >> varint encoding >> duration: 37,355 ms >> string encoding >> duration: 74,178 ms >> >> varint total bytes: 2,229,450,884 >> string total bytes: 4,388,888,890 >> >> >> >> long before,after; >> int max = 500000000; >> >> System.gc(); >> >> before = System.currentTimeMillis(); >> >> System.out.printf( "varint encoding\n" ); >> >> VarintReader reader = new VarintReader(); >> VarintWriter writer = new VarintWriter(); >> >> for ( int i = 0; i < max; ++i ) { >> reader.read( writer.write( i ) ); >> } >> >> after = System.currentTimeMillis(); >> >> System.gc(); >> >> System.out.printf( "duration: %,d ms\n", (after-before) ); >> >> before = System.currentTimeMillis(); >> >> System.out.printf( "string encoding\n" ); >> >> for ( int i = 0; i < max; ++i ) { >> Integer.parseInt( Integer.toString( i ) ); >> } >> >> after = System.currentTimeMillis(); >> >> System.gc(); >> >> System.out.printf( "duration: %,d ms\n", (after-before) ); >> >> >> >> >> >> >> >> On Sat, Aug 20, 2011 at 3:43 AM, Kevin Burton <[email protected]> wrote: >> >>> I'm looking at BinStorage which I believe if I've read correct is used for >>> all Pig intermediate files. >>> >>> … so any optimizations here would be transparent to the user. >>> >>> I just did a simple STORE using BinStorage and the format doesn't appear >>> amazingly concise. >>> >>> for example: >>> >>> 0 {(1),(2),(3),(4),(1000000000)} >>> >>> is the following in BinStorage: >>> >>> n20xn21n22n23n24n2 >>> 1000000000 >>> >>> or >>> >>> 00000000 01 02 03 6e 00 00 00 02 32 00 00 00 01 30 78 00 >>> |...n....2....0x.| >>> 00000010 00 00 00 00 00 00 05 6e 00 00 00 01 32 00 00 00 >>> |.......n....2...| >>> 00000020 01 31 6e 00 00 00 01 32 00 00 00 01 32 6e 00 00 >>> |.1n....2....2n..| >>> 00000030 00 01 32 00 00 00 01 33 6e 00 00 00 01 32 00 00 >>> |..2....3n....2..| >>> 00000040 00 01 34 6e 00 00 00 01 32 00 00 00 0a 31 30 30 >>> |..4n....2....100| >>> 00000050 30 30 30 30 30 30 30 |0000000| >>> 00000057 >>> >>> … now, efficient integer storage is a controversial topic. >>> >>> if you have short integers representing them as four bytes will waste a ton >>> of space. >>> >>> implementing them as varints is usually a good compromise: >>> >>> http://code.google.com/apis/protocolbuffers/docs/encoding.html >>> >>> 7 bits of each byte are used to represent the int. One of the bits is used >>> to signal whether there is a next bit which needs to be read. >>> >>> In my job , most of my ints will be stored in 4-8 bytes….. however, in >>> varint encoding they would only be 2 bytes. >>> >>> A 4x savings in disk space could significantly improve performance. >>> >>> I haven't benchmarked CPU of variants vs Integer.toString() though …. which >>> I might do now. >>> >>> Still…. even if varint encoding is slower, using 4 bytes for some uses >>> could be a win. >>> >>> -- >>> >>> Founder/CEO Spinn3r.com >>> >>> Location: *San Francisco, CA* >>> Skype: *burtonator* >>> >>> Skype-in: *(415) 871-0687* >>> >>> >> >> >> -- >> >> Founder/CEO Spinn3r.com >> >> Location: *San Francisco, CA* >> Skype: *burtonator* >> >> Skype-in: *(415) 871-0687* >
