Ahh... sorry. That was my first impression, but then somebody said values (meaning *KEY* values, of course) and I jumped the tracks.
On Wed, May 25, 2011 at 2:58 PM, Jake Mannix <[email protected]> wrote:
> We're dealing with the keys, not the values. But yes, in the binary case, you don't even need matrix entries. Just the keys.
>
> On Wed, May 25, 2011 at 2:49 PM, Ted Dunning <[email protected]> wrote:
> > The limiting case here is binary matrices, where it really would be nice to have vectors be 20-30x smaller. These are a common case.
> >
> > On Wed, May 25, 2011 at 2:28 PM, Jake Mannix <[email protected]> wrote:
> > > On Wed, May 25, 2011 at 2:23 PM, Sean Owen <[email protected]> wrote:
> > > > Variable-length saves space for values under about 2^21 ~= 2M. It's a wash for values up to about 2^28 ~= 268M. It costs an extra byte for larger values. I'm thinking unsigned values here at the moment, and ignoring the CPU costs of encoding/decoding, which are tiny.
> > > >
> > > > Yes, it's a loss for 15/16ths of the key space. My big assumption is that in many cases that first 1/16th is heavily used. It's certainly true when values are counts, and true when they're product IDs. When they're hashes, nope.
> > >
> > > Ok, good point: when they're counts, they're probably a huge savings. Those are often very small.
> > >
> > > For IDs, the DistributedRowMatrix probably shouldn't even be used for numbers of rows under the tens of millions - otherwise you can probably fit in memory.
> > >
> > > -jake
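Sean's byte thresholds above follow from standard 7-bits-per-byte varint encoding (as used by Hadoop's VInt and Protocol Buffers). A minimal Python sketch, not from the thread, showing where the savings, wash, and loss regions fall relative to a fixed 4-byte unsigned int:

```python
def varint_len(v: int) -> int:
    """Bytes needed to varint-encode unsigned v at 7 payload bits per byte."""
    n = 1
    while v >= 0x80:  # values >= 128 need a continuation byte
        v >>= 7
        n += 1
    return n

# Sean's thresholds vs. a fixed 4-byte key:
assert varint_len(2**21 - 1) == 3   # under ~2M: saves at least a byte
assert varint_len(2**28 - 1) == 4   # up to ~268M: a wash
assert varint_len(2**28) == 5       # beyond that: costs an extra byte

# The loss region [2^28, 2^32) is 15/16ths of the 32-bit key space:
assert (2**32 - 2**28) / 2**32 == 15 / 16
```

The argument in the thread is that real key distributions (counts, dense product IDs) live almost entirely in the small-value region, while uniformly distributed hashes do not.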
