If keys are distributed evenly across the keyspace, then yes, variable-length encoding is a net loss. However, my impression is that in many contexts they aren't. (I admittedly haven't thought hard about this one.)
But in recommender-land, for example, where keys are product IDs, it's more common to have millions of keys whose values range up to, well, a few million than to have them spread across the whole keyspace.

On Wed, May 25, 2011 at 9:37 PM, Jake Mannix <[email protected]> wrote:
> On Wed, May 25, 2011 at 1:33 PM, Sean Owen <[email protected]> wrote:
>
> > (I suggest we not use IntWritable or LongWritable, but favor VarIntWritable
> > and VarLongWritable, which are variable length encoding versions, where
> > possible. Saving a couple bytes per key adds up.)
> >
>
> If you have millions to hundreds of millions of keys, how many of them are
> going to be low enough to fit in less than 4 bytes? As soon as you have
> more than 16 million, "most" numbers take up the full 4 bytes, right?
>
> -jake
>
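
For a rough sense of the byte counts in question, here is a minimal sketch. It assumes a generic unsigned varint with 7 payload bits per byte (LEB128-style); the actual wire format of VarIntWritable may differ in detail, for instance in how it handles the sign. The function names are hypothetical and exist only for illustration.

```python
def varint_size(n: int) -> int:
    """Bytes needed to store non-negative n at 7 payload bits per byte."""
    if n < 0:
        raise ValueError("sketch covers non-negative keys only")
    size = 1
    while n >= 0x80:
        n >>= 7
        size += 1
    return size


def average_size(max_key: int) -> float:
    """Mean encoded size if keys are uniform over 0..max_key inclusive."""
    total = 0
    low = 0
    size = 1
    while low <= max_key:
        # Largest value that still fits in `size` bytes, capped at max_key.
        high = min(max_key, (1 << (7 * size)) - 1)
        total += (high - low + 1) * size
        low = high + 1
        size += 1
    return total / (max_key + 1)


if __name__ == "__main__":
    for key in (127, 2_000_000, 16_000_000, 2**31 - 1):
        print(key, varint_size(key))
    # Keys confined to a few million versus spread over the positive int range.
    print(average_size(3_000_000), average_size(2**31 - 1))
```

Under this assumption, IDs below 2^21 (about 2.1 million) take at most 3 bytes, IDs below 2^28 take at most 4, and anything larger takes 5. So keys confined to a few million average under 4 bytes, while keys spread over the full positive int range average over 4, which is the net-loss case.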
