On Tue, Jan 24, 2012 at 5:31 PM, John Conwell <[email protected]> wrote:
> Just ran into a problem trying to use IntWritable as my key when creating
> vectors so I can use CVB0Driver. I'm using the helper class
> SparseVectorsFromSequenceFiles to create my document vectors, and I
> create my sequence file with IntWritable as the key.
> SparseVectorsFromSequenceFiles calls DocumentProcessor to tokenize the
> documents, but DocumentProcessor's output is key: Text, value:
> StringTuple. This in turn causes an exception.
>
> So it looks like these helper classes that create sequence files of
> VectorWritable, which are the input to a lot of these algorithms, are not
> compatible with some of the newer algorithms, like CVB0Driver. Is that
> correct?

$MAHOUT_HOME/bin/mahout rowid --help to the rescue. :)

> And coming back to CVB0Driver, if someone wants to use it, they'll have to
> hand code the creation of VectorWritables, or possibly run the ones that
> are created by SparseVectorsFromSequenceFiles through a transform, to
> output IntWritable keys. Correct?
>
> BTW, not trying to sound critical, I'm just trying to understand the
> architecture. Is this an issue that you guys want to get
> fixed/consistent at some point? Where all vector keys are IntWritables,
> and all helper classes consume and output pairs that have IntWritable keys?
> I might be interested in helping with that effort.
>
> Thanks,
> JohnC
>
> On Tue, Jan 24, 2012 at 4:35 PM, Jake Mannix <[email protected]> wrote:
>
> > On Tue, Jan 24, 2012 at 2:27 PM, John Conwell <[email protected]> wrote:
> >
> > > Hi Jake,
> > > Thanks for the explanation. I actually prefer using ints as key
> > > identifiers globally, vs a string. It can help keep memory and GC
> > > utilization way down, especially in algorithms that have high iteration
> > > counts.
> > >
> > > I had gone through an example that used the original LDA algorithm, and
> > > the samples used the filename as the document key, vs some kind of
> > > integer identifier, so I just went with that.
> > > It does make things easier when looking at your output results, since
> > > you don't have to keep some separate store that maps integer doc ids
> > > against friendly string names, but I don't think that is really all
> > > that important. For the long run, in my opinion I would definitely
> > > standardize on IntWritable for vector keys.
> >
> > Yeah, avoiding having a separate store / mapping for "docId ->
> > documentName" or whatnot is a good reason to not normalize this field,
> > but since we already have to do this for the terms, for efficiency's
> > sake, keeping an extra mapping for docs is not so much of a big deal,
> > IMO. The only part in which this becomes annoying is that there aren't
> > very many ints. Longs might be better, sometimes. Then again, *forcing*
> > everyone to use big 8-byte longs for stuff which easily fits in ints can
> > be silly, and doing this for *both* row keys and column keys is wasting
> > lots of space, but necessary for the idea of "transpose" or matrix
> > multiplication to make sense.
> >
> > > Thanks for the great explanation!
> >
> > No problem.
> >
> > -jake
> >
> > > JohnC
> > >
> > > On Tue, Jan 24, 2012 at 1:48 PM, Jake Mannix <[email protected]> wrote:
> > >
> > > > In general, workflows with matrices in Mahout handle
> > > > SequenceFile<IntWritable, VectorWritable>, as this is the on-disk
> > > > format of the class DistributedRowMatrix. The original Mahout LDA
> > > > pre-dated this move to standardize closer to that format, and so it
> > > > didn't have that requirement.
> > > >
> > > > Now, as you say, it's true that in the new implementation, the keys
> > > > aren't actually used, so in principle we could just go with
> > > > WritableComparable<?> in CVB0Driver's mappers/reducers keys.
> > > > In fact, it would make certain integrations a little nicer, at the
> > > > cost of pushing incompatibility somewhere else. For example, the
> > > > output p(document | topic) distributions go into a SequenceFile whose
> > > > keys are the same as the input corpus keys (i.e. the doc_id values),
> > > > and there may be workflows which take this matrix and transpose it to
> > > > multiply it by another matrix or something of that nature. If the
> > > > keys are IntWritable, this all works just fine. If not, then
> > > > transpose will fail horribly, as will matrix multiplication.
> > > >
> > > > Standardizing on a common fixed format internally avoids some of
> > > > these problems, while at the same time being a bit inflexible.
> > > >
> > > > It's possible we could add a command-line option + some internal
> > > > switches to allow the user to explicitly force untyped keys, or just
> > > > warn on non-integer keys or something...
> > > >
> > > > I can just imagine seeing the questions on this very list when
> > > > someone takes the output of their Long-keyed corpus and tries to
> > > > matrix multiply it by some other matrix...
> > > >
> > > > -jake
> > > >
> > > > On Tue, Jan 24, 2012 at 1:27 PM, John Conwell <[email protected]> wrote:
> > > >
> > > > > I wanted to compare the two LDA implementations, and I noticed that
> > > > > for the input corpus sequence file (key: doc_id, value: vector),
> > > > > the Key for the input file for LDADriver takes any
> > > > > WritableComparable<?> key, but the Key for the input file for
> > > > > CVB0Driver requires IntWritable explicitly. Is there some reason
> > > > > these two LDA implementations can't both use WritableComparable<?>
> > > > > for the key of the input sequence file?
> > > > > It would make integrating them into application workflows much
> > > > > easier and more consistent.
> > > > >
> > > > > --
> > > > >
> > > > > Thanks,
> > > > > John C
> > >
> > > --
> > >
> > > Thanks,
> > > John C
>
> --
>
> Thanks,
> John C
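
For anyone who wants to see the re-keying idea discussed above without Hadoop on the classpath, here is a minimal plain-Java sketch. It is not Mahout code: RowIdSketch, assignRowIds, and transpose are hypothetical names, and double[] stands in for VectorWritable. It assigns sequential int ids to string-keyed rows, keeps a side index so the original names can be recovered, and then transposes, which only works because each row id can be used directly as a column index. My understanding is that the rowid job does the analogous thing for sequence files, but treat that as an assumption and check `$MAHOUT_HOME/bin/mahout rowid --help` for what it actually reads and writes.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Plain-Java illustration (not Mahout code) of re-keying string-keyed rows with ints. */
public class RowIdSketch {

    /** Int-keyed rows plus the side index needed to recover the original names. */
    public static final class Result {
        public final Map<Integer, double[]> matrix = new LinkedHashMap<>();
        public final Map<Integer, String> docIndex = new LinkedHashMap<>();
    }

    /** Assigns row ids 0, 1, 2, ... in the iteration order of the input map. */
    public static Result assignRowIds(Map<String, double[]> rowsByName) {
        Result result = new Result();
        int nextId = 0;
        for (Map.Entry<String, double[]> e : rowsByName.entrySet()) {
            result.matrix.put(nextId, e.getValue());
            result.docIndex.put(nextId, e.getKey());
            nextId++;
        }
        return result;
    }

    /**
     * Dense transpose. Each row id becomes a column index in the output,
     * so the ids must be ints in the range 0..matrix.size()-1.
     */
    public static double[][] transpose(Map<Integer, double[]> matrix, int numCols) {
        double[][] t = new double[numCols][matrix.size()];
        for (Map.Entry<Integer, double[]> row : matrix.entrySet()) {
            int rowId = row.getKey();
            double[] values = row.getValue();
            for (int col = 0; col < numCols; col++) {
                t[col][rowId] = values[col];
            }
        }
        return t;
    }

    public static void main(String[] args) {
        // Hypothetical two-document corpus keyed by file name.
        Map<String, double[]> corpus = new LinkedHashMap<>();
        corpus.put("reuters/doc-a.txt", new double[] {1.0, 0.0, 2.0});
        corpus.put("reuters/doc-b.txt", new double[] {0.0, 3.0, 0.0});

        Result r = assignRowIds(corpus);
        for (Map.Entry<Integer, String> e : r.docIndex.entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
        double[][] t = transpose(r.matrix, 3);
        System.out.println("transposed shape: " + t.length + " x " + t[0].length);
    }
}
```

The side index is the price of int keys: results come back keyed by 0, 1, 2, ... and have to be joined against docIndex to get the file names back.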
