On Wed, May 25, 2011 at 2:23 PM, Sean Owen <[email protected]> wrote:

> Variable-length saves space for values under about 2^21 ~= 2M. It's a wash
> for values up to about 2^28 ~= 268M. It costs an extra byte for larger
> values. I'm thinking unsigned values here at the moment, and ignoring the
> CPU costs of encoding/decoding, which is tiny.
>
> Yes it's a loss for 15/16ths of the key space. My big assumption is that in
> many cases that first 1/16th is heavily used. It's certainly true when
> values are counts, and true when they're product IDs. When they're hashes,
> nope.
>
Ok, good point: when the values are counts, variable-length encoding is probably a huge savings, since counts are often very small. For IDs, DistributedRowMatrix probably shouldn't even be used when the number of rows is under the tens of millions; below that, you can probably fit everything in memory.

-jake
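
For reference, here is a minimal sketch of the size arithmetic quoted above. It assumes a plain 7-bits-per-byte unsigned varint compared against a fixed 4-byte int, which may not be the exact encoding Mahout uses. The class and method names (VarintSize, encodedSize, bytesSaved) are hypothetical, for illustration only.

```java
// Sketch only: a generic 7-payload-bits-per-byte unsigned varint,
// not necessarily the encoding used by Mahout or Hadoop.
final class VarintSize {

    // Size of a fixed-width int, the baseline being compared against.
    static final int FIXED_INT_BYTES = 4;

    /** Bytes needed to encode value (treated as unsigned 32-bit) at 7 bits per byte. */
    static int encodedSize(int value) {
        int size = 1;
        // Unsigned shift, so values with the high bit set still terminate.
        while ((value >>>= 7) != 0) {
            size++;
        }
        return size;
    }

    /** Positive: varint is smaller. Zero: a wash. Negative: varint costs extra. */
    static int bytesSaved(int value) {
        return FIXED_INT_BYTES - encodedSize(value);
    }
}
```

Under these assumptions, values below 2^21 take at most 3 bytes, values from 2^21 up to 2^28 - 1 take 4 bytes (the wash), and anything larger takes 5. The range below 2^28 is 1/16th of the 32-bit key space, which is where the 15/16ths figure comes from.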
