Re: Generating a Document Similarity Matrix

Jake Mannix Wed, 09 Jun 2010 11:33:57 -0700

On Wed, Jun 9, 2010 at 11:25 AM, Sean Owen <[email protected]> wrote:

> On Wed, Jun 9, 2010 at 7:14 PM, Jake Mannix <[email protected]> wrote:
> > The ItemSimilarityJob actually uses implementations of the Vector
> > class hierarchy?  I think that's the issue - if the on-disk and in-mapper
> > representations are never Vectors, then they won't interoperate with
> > any of the matrix operations...
>
> Yes they are Vectors.
>


Oh, I guess I missed that. Which step/phase of the ItemSimilarityJob uses
these on trunk currently?  I don't see any mappers which take in
(int, vector) pairs...


> Oh I see. Well that's not a problem. Already, IDs have to be mapped to
> ints to be used as dimensions in a Vector. So in most cases things are
> keyed by these int pseudo-IDs. That's OK too.
>
> A matrix is a bunch of vectors -- at least, that's a nice structure
> for a SequenceFile. Row (or col) ID mapped to row (column) vector.
>
> Is that not what other jobs are using?
> What's the better alternative we could think about converging on?
>

Yes, as long as the *on HDFS* representation is a
SequenceFile<IntWritable,VectorWritable>, we can interoperate.  Or
now that you've moved on to VIntWritable, I should migrate the distributed
matrix stuff to do the same.
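
As a rough illustration of the layout discussed above, here is a minimal
plain-Java sketch: external IDs are mapped to int pseudo-IDs, and the matrix
is kept as "row ID mapped to sparse row vector". It deliberately uses only the
JDK and is not the Hadoop or Mahout API. IdIndex and RowMatrix are
hypothetical names, and the nested maps only stand in for what a
SequenceFile<IntWritable,VectorWritable> would hold on HDFS.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper: assigns dense int pseudo-IDs to arbitrary external IDs. */
final class IdIndex {
    private final Map<String, Integer> toInt = new HashMap<>();

    /** Returns the existing pseudo-ID for id, or assigns the next free one. */
    int indexOf(String id) {
        Integer existing = toInt.get(id);
        if (existing != null) {
            return existing;
        }
        int next = toInt.size();
        toInt.put(id, next);
        return next;
    }

    int size() {
        return toInt.size();
    }
}

/** Hypothetical in-memory stand-in for "int row ID -> sparse row vector". */
final class RowMatrix {
    // row pseudo-ID -> (column pseudo-ID -> value); absent entries are zero
    private final Map<Integer, Map<Integer, Double>> rows = new TreeMap<>();

    void set(int row, int col, double value) {
        rows.computeIfAbsent(row, r -> new TreeMap<>()).put(col, value);
    }

    double get(int row, int col) {
        Map<Integer, Double> r = rows.get(row);
        return r == null ? 0.0 : r.getOrDefault(col, 0.0);
    }

    /** Number of rows that have at least one stored entry. */
    int numRows() {
        return rows.size();
    }
}
```

In this sketch a caller would do something like
m.set(docs.indexOf("doc-a"), terms.indexOf("hadoop"), 2.0), so that everything
downstream is keyed by ints only and never sees the original IDs.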

And any Mapper<IntWritable,VectorWritable, KOUT, VOUT> subclasses
are reusable and would reduce replicated work as well...
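
To make the reuse argument concrete, here is a small plain-Java sketch of the
idea: once every job agrees on (int row ID, row vector) input, one mapper can
run over any job's output. It does not use the Hadoop Mapper class. RowMapper,
SquaredEntryMapper and RowJobs are hypothetical names, and the in-memory
summing loop only imitates what a reducer would do.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.BiConsumer;

/** Hypothetical stand-in for a Mapper<IntWritable, VectorWritable, KOUT, VOUT>. */
interface RowMapper<KOUT, VOUT> {
    void map(int rowId, Map<Integer, Double> row, BiConsumer<KOUT, VOUT> out);
}

/** Emits (column, value * value): partial sums toward squared column norms. */
final class SquaredEntryMapper implements RowMapper<Integer, Double> {
    @Override
    public void map(int rowId, Map<Integer, Double> row, BiConsumer<Integer, Double> out) {
        for (Map.Entry<Integer, Double> e : row.entrySet()) {
            out.accept(e.getKey(), e.getValue() * e.getValue());
        }
    }
}

final class RowJobs {
    /** Runs any RowMapper over an int-keyed matrix and sums values per output key. */
    static <K> Map<K, Double> runAndSum(Map<Integer, Map<Integer, Double>> matrix,
                                        RowMapper<K, Double> mapper) {
        Map<K, Double> sums = new HashMap<>();
        matrix.forEach((rowId, row) ->
            mapper.map(rowId, row, (k, v) -> sums.merge(k, v, Double::sum)));
        return sums;
    }
}
```

The point of the sketch is that SquaredEntryMapper knows nothing about which
job produced the matrix; it only depends on the shared int-keyed row format.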

  -jake
