Oh you guys are sneaky. You thought of everything. Do you guys have future refactoring plans to standardize on vector id data types?

On Tue, Jan 24, 2012 at 5:48 PM, Jake Mannix wrote:

> On Tue, Jan 24, 2012 at 5:31 PM, John Conwell wrote:
>
> > Just ran into a problem trying to use IntWritable as my key when
> > creating vectors so I can use CVB0Driver. I'm using the helper class
> > SparseVectorsFromSequenceFiles to create my document vectors, and I
> > create my sequence file with IntWritable as the key.
> > SparseVectorsFromSequenceFiles calls DocumentProcessor to tokenize the
> > documents, but DocumentProcessor's output is key: Text, value:
> > StringTuple. This in turn causes an exception.
> >
> > So it looks like these helper classes that create sequence files of
> > VectorWritable, which are the input to a lot of these algorithms, are
> > not compatible with some of the newer algorithms, like CVB0Driver. Is
> > that correct?
> >
>
> $MAHOUT_HOME/bin/mahout rowid --help
>
> to the rescue. :)
>
> >
> > And coming back to CVB0Driver, if someone wants to use it, they'll have
> > to hand code the creation of VectorWritables, or possibly run the ones
> > that are created by SparseVectorsFromSequenceFiles through a transform,
> > to output IntWritable keys. Correct?
> >
> > BTW, not trying to sound critical, I'm just trying to understand the
> > architecture. Is this an issue that you guys want to get
> > fixed/consistent at some point? Where all vector keys are IntWritables,
> > and all helper classes consume and output pairs that have IntWritable
> > keys? I might be interested in helping with that effort.
> >
> > Thanks,
> > JohnC
> >
> >
> > On Tue, Jan 24, 2012 at 4:35 PM, Jake Mannix wrote:
> >
> > > On Tue, Jan 24, 2012 at 2:27 PM, John Conwell wrote:
> > >
> > > > Hi Jake,
> > > > Thanks for the explanation. I actually prefer using ints as key
> > > > identifiers globally, vs a string. It can help keep memory and gc
> > > > utilization way down, especially in algorithms that have high
> > > > iteration counts.
> > > >
> > > > I had gone through an example that used the original LDA algorithm,
> > > > and the samples used the filename as the document key, vs some kind
> > > > of integer identifier, so I just went with that. It does make
> > > > things easier when looking at your output results, since you don't
> > > > have to keep some separate store that maps integer doc ids against
> > > > friendly string names, but I don't think that is really all that
> > > > important. For the long run, in my opinion I would definitely
> > > > standardize on IntWritable for vector keys.
> > > >
> > >
> > > Yeah, avoiding having a separate store / mapping for
> > > "docId -> documentName" or whatnot is a good reason to not normalize
> > > this field, but since we already have to do this for the terms, for
> > > efficiency's sake, keeping an extra mapping for docs is not so much
> > > of a big deal, IMO. The only part in which this becomes annoying is
> > > that there aren't very many ints. Longs might be better, sometimes.
> > > Then again, *forcing* everyone to use big 8-byte longs for stuff
> > > which easily fits in ints can be silly, and doing this for *both* row
> > > keys and column keys is wasting lots of space, but necessary for the
> > > idea of "transpose" or matrix multiplication to make sense.
> > >
> > > >
> > > > Thanks for the great explanation!
> > > >
> > >
> > > No problem.
> > >
> > > -jake
> > >
> > > >
> > > > JohnC
> > > >
> > > > On Tue, Jan 24, 2012 at 1:48 PM, Jake Mannix wrote:
> > > >
> > > > > In general, workflows with matrices in Mahout handle
> > > > > SequenceFile<IntWritable, VectorWritable>, as this is the on-disk
> > > > > format of the class DistributedRowMatrix. The original Mahout LDA
> > > > > pre-dated this move to standardize closer to that format, and so
> > > > > it didn't have that requirement.
> > > > >
> > > > > Now, as you say, it's true that in the new implementation, the
> > > > > keys aren't actually used, so in principle we could just go with
> > > > > WritableComparable<?> in CVB0Driver's mappers/reducers keys. In
> > > > > fact, it would make certain integrations a little nicer, at the
> > > > > cost of pushing incompatibility somewhere else. For example, the
> > > > > output p(document | topic) distributions go into a SequenceFile
> > > > > whose keys are the same as the input corpus keys (i.e. the doc_id
> > > > > values), and there may be workflows which take this matrix and
> > > > > transpose it to multiply it by another matrix or something of
> > > > > that nature. If the keys are IntWritable, this all works just
> > > > > fine. If not, then transpose will fail horribly, as will matrix
> > > > > multiplication.
> > > > >
> > > > > Standardizing on a common fixed format internally avoids some of
> > > > > these problems, while at the same time being a bit inflexible.
> > > > >
> > > > > It's possible we could add a command-line option + some internal
> > > > > switches to allow the user to explicitly force untyped keys, or
> > > > > just warn on non-integer keys or something...
> > > > >
> > > > > I can just imagine seeing the questions on this very list when
> > > > > someone takes the output of their Long-keyed corpus and tries to
> > > > > matrix multiply it by some other matrix...
> > > > >
> > > > > -jake
> > > > >
> > > > > On Tue, Jan 24, 2012 at 1:27 PM, John Conwell wrote:
> > > > >
> > > > > > I wanted to compare the two LDA implementations, and I noticed
> > > > > > that for the input corpus sequence file (key: doc_id, value:
> > > > > > vector), the Key for the input file for LDADriver takes any
> > > > > > WritableComparable<?> key, but the Key for the input file for
> > > > > > CVB0Driver requires IntWritable explicitly. Is there some
> > > > > > reason these two LDA implementations can't both use
> > > > > > WritableComparable<?> for the key of the input sequence file?
> > > > > > It would make integrating them into application workflows much
> > > > > > easier and consistent.
> > > > > >
> > > > > > --
> > > > > >
> > > > > > Thanks,
> > > > > > John C
> > > > > >
> > > > >
> > > >
> > > > --
> > > >
> > > > Thanks,
> > > > John C
> > > >
> > >
> >
> > --
> >
> > Thanks,
> > John C
> >
>

--

Thanks,
John C
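
For anyone following along, the key remapping that `rowid` was suggested for above can be pictured roughly as follows. This is only a plain-Java sketch of the idea (give each string document key a sequential int row id and keep a separate index back to the name), not Mahout's actual code. The class `DocIdRemapper` and its methods are hypothetical names made up for illustration, and it deliberately avoids the Hadoop/Mahout types so that it stands alone.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/**
 * Hypothetical helper (NOT part of Mahout): maps string document keys to
 * sequential int row ids, and keeps the reverse "docId -> documentName"
 * index so results can still be read by name afterwards.
 */
class DocIdRemapper {
    // Reverse index: position in the list is the int row id.
    private final List<String> docIndex = new ArrayList<>();
    // Forward lookup: document name -> int row id already assigned.
    private final Map<String, Integer> rowIds = new HashMap<>();

    /** Returns the row id for docName, assigning the next free id if it is new. */
    int rowIdFor(String docName) {
        Integer existing = rowIds.get(docName);
        if (existing != null) {
            return existing;
        }
        int id = docIndex.size();
        rowIds.put(docName, id);
        docIndex.add(docName);
        return id;
    }

    /** Returns the original document name for a previously assigned row id. */
    String docNameFor(int rowId) {
        return docIndex.get(rowId);
    }

    /** Number of distinct documents seen so far. */
    int size() {
        return docIndex.size();
    }
}
```

Calling `rowIdFor("a.txt")` and then `rowIdFor("b.txt")` on a fresh instance hands back 0 and 1, and `docNameFor(1)` recovers `"b.txt"`. That second lookup is the extra "docId -> documentName" store discussed in the thread: the cost of int keys is carrying that index around next to the vectors.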
