Nope, I'm dreaming. These jobs do use custom output formats. I hadn't really looked closely either. (Everything else uses Vectors.) Now I imagine there is some reason, but yeah, it would be much better to operate in terms of Vectors if possible.
Sebastian, is there a reason Vectors couldn't be used?

On Wed, Jun 9, 2010 at 7:33 PM, Jake Mannix <[email protected]> wrote:
> On Wed, Jun 9, 2010 at 11:25 AM, Sean Owen <[email protected]> wrote:
>
>> On Wed, Jun 9, 2010 at 7:14 PM, Jake Mannix <[email protected]> wrote:
>> > The ItemSimilarityJob actually uses implementations of the Vector
>> > class hierarchy? I think that's the issue - if the on-disk and in-mapper
>> > representations are never Vectors, then they won't interoperate with
>> > any of the matrix operations...
>>
>> Yes they are Vectors.
>>
>
> Oh, I guess I missed that, which step/phase of the ItemSimilarity job uses
> these, on trunk currently? I don't see any mappers which take in
> int, vector pairs...
>
>
>> Oh I see. Well that's not a problem. Already, IDs have to be mapped to
>> ints to be used as dimensions in a Vector. So in most cases things are
>> keyed by these int pseudo-IDs. That's OK too.
>>
>> A matrix is a bunch of vectors -- at least, that's a nice structure
>> for a SequenceFile. Row (or col) ID mapped to row (column) vector.
>>
>> Is that not what other jobs are using?
>> What's the better alternative we could think about converging on?
>>
>
> Yes, as long as the *on HDFS* representation is a
> SequenceFile<IntWritable,VectorWritable>, we can interoperate. Or
> now that you've moved on to VIntWritable, I should migrate the distributed
> matrix stuff to do the same.
>
> And any Mapper<IntWritable,VectorWritable, KOUT, VOUT> subclasses
> are reusable and would reduce replicated work as well...
>
> -jake
>
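For anyone following along, here is a rough sketch of the convention being discussed: a "matrix" on HDFS is just a SequenceFile whose key is a row ID and whose value is that row as a sparse vector, which is why any job emitting (int, Vector) pairs can feed the matrix operations directly. This sketch is plain Java with no Hadoop or Mahout dependency — `RowVectorSketch` and its maps are illustrative stand-ins for SequenceFile<IntWritable,VectorWritable>, not actual Mahout classes.

```java
import java.util.HashMap;
import java.util.Map;

public class RowVectorSketch {

    // A "matrix" is just row ID -> sparse row vector (index -> value),
    // modeling what a SequenceFile<IntWritable, VectorWritable> holds.
    public static Map<Integer, Map<Integer, Double>> matrix() {
        Map<Integer, Map<Integer, Double>> m = new HashMap<>();
        Map<Integer, Double> row0 = new HashMap<>();
        row0.put(0, 1.0);  // sparse entries: only nonzero dimensions stored
        row0.put(2, 3.0);
        Map<Integer, Double> row1 = new HashMap<>();
        row1.put(1, 2.0);
        m.put(0, row0);
        m.put(1, row1);
        return m;
    }

    // Matrix-vector product over the row-vector layout: each output
    // entry is (row ID, dot product of that row with v). Any job that
    // stores its output this way interoperates with ops like this one.
    public static Map<Integer, Double> times(Map<Integer, Map<Integer, Double>> m,
                                             double[] v) {
        Map<Integer, Double> result = new HashMap<>();
        for (Map.Entry<Integer, Map<Integer, Double>> row : m.entrySet()) {
            double dot = 0.0;
            for (Map.Entry<Integer, Double> e : row.getValue().entrySet()) {
                dot += e.getValue() * v[e.getKey()];
            }
            result.put(row.getKey(), dot);
        }
        return result;
    }

    public static void main(String[] args) {
        Map<Integer, Double> y = times(matrix(), new double[] {1.0, 1.0, 1.0});
        // row 0: 1.0 + 3.0 = 4.0; row 1: 2.0
        System.out.println(y.get(0) + " " + y.get(1));
    }
}
```

The point of the layout is exactly what the thread says: once every job keys its output by int pseudo-ID and writes rows as Vectors, a mapper over (int, Vector) pairs is reusable across the recommender and the distributed-matrix code.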
