On Wed, Jun 9, 2010 at 11:25 AM, Sean Owen <[email protected]> wrote:

> On Wed, Jun 9, 2010 at 7:14 PM, Jake Mannix <[email protected]> wrote:
> > The ItemSimilarityJob actually uses implementations of the Vector
> > class hierarchy? I think that's the issue - if the on-disk and in-mapper
> > representations are never Vectors, then they won't interoperate with
> > any of the matrix operations...
>
> Yes they are Vectors.
>
Oh, I guess I missed that. Which step/phase of the ItemSimilarity job
uses these, on trunk currently? I don't see any mappers which take in
int, vector pairs...

> Oh I see. Well that's not a problem. Already, IDs have to be mapped to
> ints to be used as dimensions in a Vector. So in most cases things are
> keyed by these int pseudo-IDs. That's OK too.
>
> A matrix is a bunch of vectors -- at least, that's a nice structure
> for a SequenceFile. Row (or col) ID mapped to row (column) vector.
>
> is that not what other jobs are using?
> what's the better alternative we could think about converging on.
>

Yes, as long as the *on HDFS* representation is a
SequenceFile<IntWritable,VectorWritable>, we can interoperate. Or, now
that you've moved on to VIntWritable, I should migrate the distributed
matrix stuff to do the same.

And any Mapper<IntWritable,VectorWritable, KOUT, VOUT> subclasses are
reusable and would reduce replicated work as well...

  -jake
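
As a rough illustration of the layout discussed above (row ID mapped to a
sparse row vector, with IDs folded into int pseudo-IDs), here is a minimal
plain-Java sketch. It deliberately uses no Hadoop or Mahout classes:
`RowKeyedMatrixSketch`, `idToIndex`, and the hash-based folding of long IDs
are hypothetical stand-ins chosen for illustration, not the project's
actual code, and the thread does not say how the ID mapping is really done.

```java
import java.util.Collections;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in, NOT Mahout code: models "int row ID -> sparse row
// vector" with plain java.util collections instead of
// SequenceFile<IntWritable,VectorWritable>.
final class RowKeyedMatrixSketch {

    // Rows keyed by int pseudo-ID; each row maps column index -> value.
    private final Map<Integer, Map<Integer, Double>> rows = new TreeMap<>();

    // Assumed mapping: fold a long ID into a non-negative int so it can
    // serve as a vector dimension. Distinct long IDs may collide.
    static int idToIndex(long id) {
        return 0x7FFFFFFF & Long.hashCode(id);
    }

    // Stores one matrix entry under the int pseudo-IDs of its row and column.
    void set(long rowId, long colId, double value) {
        rows.computeIfAbsent(idToIndex(rowId), k -> new TreeMap<>())
            .put(idToIndex(colId), value);
    }

    // Returns the sparse row for the given ID, or an empty map if absent.
    Map<Integer, Double> row(long rowId) {
        Map<Integer, Double> r = rows.get(idToIndex(rowId));
        return r == null ? Collections.<Integer, Double>emptyMap() : r;
    }

    int numRows() {
        return rows.size();
    }

    public static void main(String[] args) {
        RowKeyedMatrixSketch m = new RowKeyedMatrixSketch();
        m.set(1L, 10L, 0.5);
        m.set(1L, 11L, 2.0);
        m.set(2L, 10L, 1.0);
        System.out.println(m.row(1L));
    }
}
```

The point of the sketch is only that every job agreeing on the same
(int key, vector value) shape is what makes the pieces reusable; the
in-memory maps here take the place of the on-HDFS records.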
