I'm looking at BinStorage which I believe if I've read correct is used for
all Pig intermediate files.
… so any optimizations here would be transparent to the user.
I just did a simple STORE using BinStorage and the format doesn't appear
amazingly concise.
for example:
0 {(1),(2),(3),(4),(1000000000)}
is the following in BinStorage:
n20xn21n22n23n24n2
1000000000
or
00000000 01 02 03 6e 00 00 00 02 32 00 00 00 01 30 78 00
|...n....2....0x.|
00000010 00 00 00 00 00 00 05 6e 00 00 00 01 32 00 00 00
|.......n....2...|
00000020 01 31 6e 00 00 00 01 32 00 00 00 01 32 6e 00 00
|.1n....2....2n..|
00000030 00 01 32 00 00 00 01 33 6e 00 00 00 01 32 00 00
|..2....3n....2..|
00000040 00 01 34 6e 00 00 00 01 32 00 00 00 0a 31 30 30
|..4n....2....100|
00000050 30 30 30 30 30 30 30 |0000000|
00000057
… now, efficient integer storage is a controversial topic.
if you have short integers representing them as four bytes will waste a ton
of space.
implementing them as varints is usually a good compromise:
http://code.google.com/apis/protocolbuffers/docs/encoding.html
7 bits of each byte are used to represent the int. One of the bits is used
to signal whether there is a next bit which needs to be read.
In my job , most of my ints will be stored in 4-8 bytes….. however, in
varint encoding they would only be 2 bytes.
A 4x savings in disk space could significantly improve performance.
I haven't benchmarked CPU of variants vs Integer.toString() though …. which
I might do now.
Still…. even if varint encoding is slower, using 4 bytes for some uses could
be a win.
--
Founder/CEO Spinn3r.com
Location: *San Francisco, CA*
Skype: *burtonator*
Skype-in: *(415) 871-0687*