Re: How to convert SequenceFile to SequenceFile?

Ted Dunning Wed, 25 May 2011 14:50:41 -0700

The limiting case here is binary matrices where it really would be nice to
have vectors be 20-30x smaller.  These are a common case.


On Wed, May 25, 2011 at 2:28 PM, Jake Mannix wrote:

> On Wed, May 25, 2011 at 2:23 PM, Sean Owen wrote:
>
> > Variable-length saves space for values under about 2^21 ~= 2M. It's a wash
> > for values up to about 2^28 ~= 268M. It costs an extra byte for larger
> > values. I'm thinking unsigned values here at the moment, and ignoring the
> > CPU cost of encoding/decoding, which is tiny.
> >
> > Yes it's a loss for 15/16ths of the key space. My big assumption is that in
> > many cases that first 1/16th is heavily used. It's certainly true when
> > values are counts, and true when they're product IDs. When they're hashes,
> > nope.
> >
>
> Ok, good point: when they're counts, it's probably a huge savings. Those
> are often very small.
>
> For IDs, the DistributedRowMatrix probably shouldn't even be used for row
> counts under the tens of millions, since at that size you can probably fit
> in memory.
>
>  -jake
>
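
For reference, the byte counts Sean quotes can be checked with a small
sketch. This assumes the common 7-payload-bits-per-byte variable-length
scheme with a continuation bit (the thread does not say which encoding is
meant, and Hadoop's own VInt format differs), and it compares against a
fixed 4-byte int. The function names are made up for illustration.

```python
FIXED_INT_BYTES = 4  # size of a plain 32-bit int, the baseline being compared


def encode_varint(value: int) -> bytes:
    """Encode an unsigned integer with 7 payload bits per byte, low bits first."""
    if value < 0:
        raise ValueError("this sketch handles unsigned values only")
    out = bytearray()
    while True:
        low7 = value & 0x7F
        value >>= 7
        if value:
            out.append(low7 | 0x80)  # continuation bit set: more bytes follow
        else:
            out.append(low7)         # last byte: continuation bit clear
            return bytes(out)


def decode_varint(data: bytes) -> int:
    """Decode a single varint produced by encode_varint."""
    result = 0
    shift = 0
    for byte in data:
        result |= (byte & 0x7F) << shift
        if not (byte & 0x80):
            return result
        shift += 7
    raise ValueError("truncated varint")


if __name__ == "__main__":
    # Values on either side of the 2^21 and 2^28 thresholds from the thread.
    for n in (100, 2**21 - 1, 2**21, 2**28 - 1, 2**28, 2**32 - 1):
        size = len(encode_varint(n))
        print(f"{n:>12}  varint={size} bytes  fixed={FIXED_INT_BYTES} bytes")
```

Under that assumption, values below 2^21 take at most 3 bytes (a saving),
values from 2^21 up to just under 2^28 take 4 bytes (a wash), and anything
larger takes 5 bytes (one byte worse than a fixed int).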
