Re: How to convert SequenceFile to SequenceFile?

Sean Owen Wed, 25 May 2011 14:24:09 -0700

Variable-length encoding saves space for values under about 2^21 ~= 2M. It's
a wash for values up to about 2^28 ~= 268M. It costs an extra byte for larger
values. I'm thinking unsigned values here at the moment, and ignoring the
CPU costs of encoding/decoding, which are tiny.
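
A rough sketch of where those thresholds come from, assuming a base-128
encoding with 7 payload bits per byte compared against a fixed 4-byte int.
The exact encoding isn't spelled out in this thread, and `varint_size` is
just an illustrative name:

```python
FIXED_SIZE = 4  # bytes for a plain 32-bit int


def varint_size(value: int) -> int:
    """Bytes needed for an unsigned value at 7 payload bits per byte (assumed scheme)."""
    if value < 0:
        raise ValueError("unsigned values only")
    size = 1
    while value >= 0x80:  # more than 7 bits left, so another byte is needed
        value >>= 7
        size += 1
    return size


# Largest value that fits in each bit width, variable-length vs. fixed size.
for bits in (7, 14, 21, 28, 32):
    largest = (1 << bits) - 1
    print(f"up to 2^{bits}: {varint_size(largest)} bytes vs {FIXED_SIZE} fixed")
```

Under that assumption the break-even points land at 2^21 (3 bytes vs. 4) and
2^28 (4 bytes vs. 4), with a fifth byte needed beyond that.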

Yes, it's a loss for 15/16ths of the 32-bit key space (everything from 2^28
up). My big assumption is that in many cases that first 1/16th is heavily
used. It's certainly true when values are counts, and true when they're
product IDs. When they're hashes, nope.
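
To put rough numbers on the sequential-ID case, here is a small sketch under
the same assumption of 7 payload bits per byte. `varint_total_bytes` is a
hypothetical helper, and the 32M figure is only an example size:

```python
def varint_total_bytes(n: int) -> int:
    """Total bytes to encode every ID in range(n) at 7 payload bits per byte."""
    total = 0
    size = 1  # encoded size, in bytes, of the IDs in the current band
    low = 0   # first ID not yet counted
    while low < n:
        # IDs below 2^(7*size) fit in `size` bytes; clamp the band to n.
        high = min(n, 1 << (7 * size))
        total += (high - low) * size
        low = high
        size += 1
    return total


n = 1 << 25  # about 32M sequential IDs starting at 0
fixed = 4 * n
variable = varint_total_bytes(n)
print(f"fixed: {fixed} bytes, variable: {variable} bytes, "
      f"ratio: {variable / fixed:.3f}")

# Share of the 32-bit key space where variable-length is no worse than fixed.
print((1 << 28) / (1 << 32))
```

For uniformly used sequential IDs the saving is small, since most IDs sit in
the 4-byte band. The case for variable-length rests on the small values being
used far more often than the large ones.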

I actually don't know the situation here -- if it's a bad idea we don't do
it. But there are surely places where it's a good idea.

On Wed, May 25, 2011 at 10:14 PM, Jake Mannix <[email protected]> wrote:
>
> If you have more than 32M IDs *total*, even if they are sequential
> starting at 0, "most" of them will take up the full 4 bytes, and only a
> tiny fraction will take up less than 3.
>
