Re: How to convert SequenceFile to SequenceFile?

Jake Mannix Wed, 25 May 2011 14:33:58 -0700

On Wed, May 25, 2011 at 2:23 PM, Sean Owen <[email protected]> wrote:

> Variable-length saves space for values under about 2^21 ~= 2M. It's a wash
> for values up to about 2^28 ~= 268M. It costs an extra byte for larger
> values. I'm thinking unsigned values here at the moment, and ignoring the
> CPU cost of encoding/decoding, which is tiny.
>
> Yes it's a loss for 15/16ths of the key space. My big assumption is that in
> many cases that first 1/16th is heavily used. It's certainly true when
> values are counts, and true when they're product IDs. When they're hashes,
> nope.
>


OK, good point: when the values are counts, variable-length encoding is
probably a huge savings, since counts are often very small.
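
To make the byte counts in the quoted message concrete, here is a minimal
sketch. It assumes a base-128 encoding with 7 payload bits per byte and
unsigned values, compared against a fixed 4-byte int; the thread does not
say which encoding is meant, and the names varint_size and FIXED_SIZE are
made up for illustration.

```python
FIXED_SIZE = 4  # bytes in a fixed-width 32-bit int (assumed baseline)


def varint_size(value: int) -> int:
    """Bytes needed for a non-negative int at 7 payload bits per byte."""
    if value < 0:
        raise ValueError("this sketch covers unsigned values only")
    size = 1
    while value >= 0x80:  # more than 7 bits remain, so another byte is needed
        value >>= 7
        size += 1
    return size


if __name__ == "__main__":
    # Values on either side of the 2^21 and 2^28 boundaries from the quote.
    for v in (100, 2**21 - 1, 2**21, 2**28 - 1, 2**28, 2**32 - 1):
        delta = varint_size(v) - FIXED_SIZE
        print(f"{v:>12}  varint={varint_size(v)} bytes  vs fixed: {delta:+d}")

    # Share of the unsigned 32-bit space where the varint costs a 5th byte.
    print("loss fraction:", 1 - 2**28 / 2**32)
```

Under these assumptions, values below 2^21 take 3 bytes or fewer, values
from 2^21 up to 2^28 - 1 take 4, and anything larger takes 5, which is
where the 15/16ths figure comes from.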

For IDs, the DistributedRowMatrix probably shouldn't even be used when the
number of rows is below the tens of millions; a matrix that small can
probably fit in memory.

  -jake
