I ran the benchmark now and results are pretty interesting.  Varint is about
2x performance and 1/2 the size of using integer.toString…



varint encoding
duration: 37,355 ms
string encoding
duration: 74,178 ms

varint total bytes: 2,229,450,884
string total bytes: 4,388,888,890



        long before,after;
        int max = 500000000;

        System.gc();

        before = System.currentTimeMillis();

        System.out.printf( "varint encoding\n" );

        VarintReader reader = new VarintReader();
        VarintWriter writer = new VarintWriter();

        for ( int i = 0; i < max; ++i ) {
            reader.read( writer.write( i ) );
        }

        after = System.currentTimeMillis();

        System.gc();

        System.out.printf( "duration: %,d ms\n", (after-before) );

        before = System.currentTimeMillis();

        System.out.printf( "string encoding\n" );

        for ( int i = 0; i < max; ++i ) {
            Integer.parseInt( Integer.toString( i ) );
        }

        after = System.currentTimeMillis();

        System.gc();

        System.out.printf( "duration: %,d ms\n", (after-before) );







On Sat, Aug 20, 2011 at 3:43 AM, Kevin Burton <[email protected]> wrote:

> I'm looking at BinStorage which I believe if I've read correct is used for
> all Pig intermediate files.
>
> … so any optimizations here would be transparent to the user.
>
> I just did a simple STORE using BinStorage and the format doesn't appear
> amazingly concise.
>
> for example:
>
> 0 {(1),(2),(3),(4),(1000000000)}
>
> is the following in BinStorage:
>
> n20xn21n22n23n24n2
> 1000000000
>
> or
>
> 00000000  01 02 03 6e 00 00 00 02  32 00 00 00 01 30 78 00
>  |...n....2....0x.|
> 00000010  00 00 00 00 00 00 05 6e  00 00 00 01 32 00 00 00
>  |.......n....2...|
> 00000020  01 31 6e 00 00 00 01 32  00 00 00 01 32 6e 00 00
>  |.1n....2....2n..|
> 00000030  00 01 32 00 00 00 01 33  6e 00 00 00 01 32 00 00
>  |..2....3n....2..|
> 00000040  00 01 34 6e 00 00 00 01  32 00 00 00 0a 31 30 30
>  |..4n....2....100|
> 00000050  30 30 30 30 30 30 30                              |0000000|
> 00000057
>
> … now, efficient integer storage is a controversial topic.
>
> if you have short integers representing them as four bytes will waste a ton
> of space.
>
> implementing them as varints is usually a good compromise:
>
> http://code.google.com/apis/protocolbuffers/docs/encoding.html
>
> 7 bits of each byte are used to represent the int.  One of the bits is used
> to signal whether there is a next bit which needs to be read.
>
> In my job , most of my ints will be stored in 4-8 bytes….. however, in
> varint encoding they would only be 2 bytes.
>
> A 4x savings in disk space could significantly improve performance.
>
> I haven't benchmarked CPU of variants vs Integer.toString() though …. which
> I might do now.
>
> Still…. even if varint encoding is slower, using 4 bytes for some uses
> could be a win.
>
> --
>
> Founder/CEO Spinn3r.com
>
> Location: *San Francisco, CA*
> Skype: *burtonator*
>
> Skype-in: *(415) 871-0687*
>
>


-- 

Founder/CEO Spinn3r.com

Location: *San Francisco, CA*
Skype: *burtonator*

Skype-in: *(415) 871-0687*

Reply via email to