Nope, I'm dreaming. These jobs do use custom output formats. I hadn't really looked closely either. (Everything else uses Vectors.) Now I imagine there is some reason, but yeah, it would be much better to operate in terms of Vectors if possible.
Sebastian, is there a reason Vectors couldn't be used?

On Wed, Jun 9, 2010 at 7:33 PM, Jake Mannix <[email protected]> wrote:
> On Wed, Jun 9, 2010 at 11:25 AM, Sean Owen <[email protected]> wrote:
>
>> On Wed, Jun 9, 2010 at 7:14 PM, Jake Mannix <[email protected]> wrote:
>> > The ItemSimilarityJob actually uses implementations of the Vector
>> > class hierarchy? I think that's the issue - if the on-disk and in-mapper
>> > representations are never Vectors, then they won't interoperate with
>> > any of the matrix operations...
>>
>> Yes they are Vectors.
>>
>
> Oh, I guess I missed that, which step/phase of the ItemSimilarity job uses
> these, on trunk currently? I don't see any mappers which take in
> int, vector pairs...
>
>
>> Oh I see. Well that's not a problem. Already, IDs have to be mapped to
>> ints to be used as dimensions in a Vector. So in most cases things are
>> keyed by these int pseudo-IDs. That's OK too.
>>
>> A matrix is a bunch of vectors -- at least, that's a nice structure
>> for a SequenceFile. Row (or col) ID mapped to row (column) vector.
>>
>> Is that not what other jobs are using?
>> What's the better alternative we could think about converging on?
>>
>
> Yes, as long as the *on HDFS* representation is a
> SequenceFile<IntWritable,VectorWritable>, we can interoperate. Or
> now that you've moved on to VIntWritable, I should migrate the distributed
> matrix stuff to do the same.
>
> And any Mapper<IntWritable,VectorWritable, KOUT, VOUT> subclasses
> are reusable and would reduce replicated work as well...
>
> -jake
>
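For anyone following along, here is a rough sketch of the convention being discussed: a "matrix" on HDFS is just a SequenceFile whose key is a row ID and whose value is that row as a sparse vector, which is why any job emitting (int, Vector) pairs can feed the matrix operations directly. This sketch is plain Java with no Hadoop or Mahout dependency — `RowVectorSketch` and its maps are illustrative stand-ins for SequenceFile<IntWritable,VectorWritable>, not actual Mahout classes.

```java
import java.util.HashMap;
import java.util.Map;

public class RowVectorSketch {

    // A "matrix" is just row ID -> sparse row vector (index -> value),
    // modeling what a SequenceFile<IntWritable, VectorWritable> holds.
    public static Map<Integer, Map<Integer, Double>> matrix() {
        Map<Integer, Map<Integer, Double>> m = new HashMap<>();
        Map<Integer, Double> row0 = new HashMap<>();
        row0.put(0, 1.0);  // sparse entries: only nonzero dimensions stored
        row0.put(2, 3.0);
        Map<Integer, Double> row1 = new HashMap<>();
        row1.put(1, 2.0);
        m.put(0, row0);
        m.put(1, row1);
        return m;
    }

    // Matrix-vector product over the row-vector layout: each output
    // entry is (row ID, dot product of that row with v). Any job that
    // stores its output this way interoperates with ops like this one.
    public static Map<Integer, Double> times(Map<Integer, Map<Integer, Double>> m,
                                             double[] v) {
        Map<Integer, Double> result = new HashMap<>();
        for (Map.Entry<Integer, Map<Integer, Double>> row : m.entrySet()) {
            double dot = 0.0;
            for (Map.Entry<Integer, Double> e : row.getValue().entrySet()) {
                dot += e.getValue() * v[e.getKey()];
            }
            result.put(row.getKey(), dot);
        }
        return result;
    }

    public static void main(String[] args) {
        Map<Integer, Double> y = times(matrix(), new double[] {1.0, 1.0, 1.0});
        // row 0: 1.0 + 3.0 = 4.0; row 1: 2.0
        System.out.println(y.get(0) + " " + y.get(1));
    }
}
```

The point of the layout is exactly what the thread says: once every job keys its output by int pseudo-ID and writes rows as Vectors, a mapper over (int, Vector) pairs is reusable across the recommender and the distributed-matrix code.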
