On Wed, Jun 9, 2010 at 7:14 PM, Jake Mannix <[email protected]> wrote:
> The ItemSimilarityJob actually uses implementations of the Vector
> class hierarchy? I think that's the issue - if the on-disk and in-mapper
> representations are never Vectors, then they won't interoperate with
> any of the matrix operations...

Yes, they are Vectors.

> And yeah, keying on ints is necessary for now, unless we want to
> make a new matrix type (at least for distributed matrices) which
> keys on longs (which actually might be a good idea: now that
> we're using VInt and VLong, the disk space and network usage
> should not be adversely affected - just the in-memory
> representation).

Oh I see. Well, that's not a problem. Already, IDs have to be mapped to ints to be used as dimensions in a Vector. So in most cases things are keyed by these int pseudo-IDs. That's OK too.

A matrix is a bunch of vectors -- at least, that's a nice structure for a SequenceFile: row (or column) ID mapped to row (or column) vector. Is that not what other jobs are using? What's the better alternative we could think about converging on?
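
To make the structure concrete, here is a minimal sketch in plain Java. It does not use Mahout or Hadoop classes; the class and method names are hypothetical, and it assumes rows and columns share one ID space (as in an item-item matrix). Long IDs are mapped to int pseudo-IDs, and the matrix is held as row pseudo-ID mapped to a sparse row vector:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch only; these are not Mahout's actual classes. */
class IntKeyedMatrixSketch {

    // long ID -> int pseudo-ID, assigned in order of first appearance
    private final Map<Long, Integer> idToIndex = new HashMap<>();

    // row pseudo-ID -> sparse row vector (column pseudo-ID -> value)
    private final Map<Integer, Map<Integer, Double>> rows = new TreeMap<>();

    /** Returns the int pseudo-ID for a long ID, allocating one if needed. */
    int indexOf(long id) {
        Integer index = idToIndex.get(id);
        if (index == null) {
            index = idToIndex.size();
            idToIndex.put(id, index);
        }
        return index;
    }

    /** Stores a value at (rowId, colId), keyed internally by int pseudo-IDs. */
    void set(long rowId, long colId, double value) {
        int row = indexOf(rowId);
        int col = indexOf(colId);
        rows.computeIfAbsent(row, k -> new TreeMap<>()).put(col, value);
    }

    /** Reads a value; unknown IDs and unset cells read as 0.0. */
    double get(long rowId, long colId) {
        Integer row = idToIndex.get(rowId);
        Integer col = idToIndex.get(colId);
        if (row == null || col == null) {
            return 0.0;
        }
        Map<Integer, Double> vector = rows.get(row);
        if (vector == null) {
            return 0.0;
        }
        return vector.getOrDefault(col, 0.0);
    }
}
```

An in-memory dictionary like this would not carry across mappers. A distributed job would need a deterministic mapping (for example, a hash of the long), at the cost of possible collisions.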
