Ahh... sorry. That was my first impression, but then somebody said values (meaning *KEY* values, of course) and I jumped the tracks.
On Wed, May 25, 2011 at 2:58 PM, Jake Mannix <[email protected]> wrote:
> We're dealing with the keys, not the values. But yes, in the binary case, you don't even need matrix entries. Just the keys.
>
> On Wed, May 25, 2011 at 2:49 PM, Ted Dunning <[email protected]> wrote:
> > The limiting case here is binary matrices, where it really would be nice to have vectors be 20-30x smaller. These are a common case.
> >
> > On Wed, May 25, 2011 at 2:28 PM, Jake Mannix <[email protected]> wrote:
> > > On Wed, May 25, 2011 at 2:23 PM, Sean Owen <[email protected]> wrote:
> > > > Variable-length saves space for values under about 2^21 ~= 2M. It's a wash for values up to about 2^28 ~= 268M. It costs an extra byte for larger values. I'm thinking unsigned values here at the moment, and ignoring the CPU costs of encoding/decoding, which are tiny.
> > > >
> > > > Yes, it's a loss for 15/16ths of the key space. My big assumption is that in many cases that first 1/16th is heavily used. It's certainly true when values are counts, and true when they're product IDs. When they're hashes, nope.
> > >
> > > Ok, good point: when they're counts, they're probably a huge savings. Those are often very small.
> > >
> > > For IDs, the DistributedRowMatrix probably shouldn't even be used for numbers of rows under the tens of millions - otherwise you can probably fit in memory.
> > >
> > > -jake
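Sean's byte thresholds above follow from standard 7-bits-per-byte varint encoding (as used by Hadoop's VInt and Protocol Buffers). A minimal Python sketch, not from the thread, showing where the savings, wash, and loss regions fall relative to a fixed 4-byte unsigned int:

```python
def varint_len(v: int) -> int:
    """Bytes needed to varint-encode unsigned v at 7 payload bits per byte."""
    n = 1
    while v >= 0x80:  # values >= 128 need a continuation byte
        v >>= 7
        n += 1
    return n

# Sean's thresholds vs. a fixed 4-byte key:
assert varint_len(2**21 - 1) == 3   # under ~2M: saves at least a byte
assert varint_len(2**28 - 1) == 4   # up to ~268M: a wash
assert varint_len(2**28) == 5       # beyond that: costs an extra byte

# The loss region [2^28, 2^32) is 15/16ths of the 32-bit key space:
assert (2**32 - 2**28) / 2**32 == 15 / 16
```

The argument in the thread is that real key distributions (counts, dense product IDs) live almost entirely in the small-value region, while uniformly distributed hashes do not.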
