Oh you guys are sneaky.  You thought of everything.

Do you guys have future refactoring plans to standardize on vector ID
data types?
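
For readers following along, here is a minimal sketch of the idea behind the rowid step quoted below: assign each Text-keyed vector a sequential int row id, and keep a side index from the int id back to the original key. This is plain Java with no Hadoop or Mahout dependency. The class name RowIdSketch, the method assignRowIds, and the use of double[] in place of VectorWritable are all hypothetical names chosen for illustration, not Mahout's actual API.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only: not Mahout's RowIdJob, just the same idea in memory.
class RowIdSketch {

    /** The two outputs: an int-keyed matrix and an index from row id to original key. */
    static final class Result {
        final Map<Integer, double[]> matrix = new LinkedHashMap<>();
        final List<String> docIndex = new ArrayList<>();
    }

    /** Assigns row ids 0, 1, 2, ... in iteration order of the input map. */
    static Result assignRowIds(Map<String, double[]> textKeyedVectors) {
        Result result = new Result();
        for (Map.Entry<String, double[]> entry : textKeyedVectors.entrySet()) {
            int rowId = result.docIndex.size();      // next unused int id
            result.docIndex.add(entry.getKey());     // rowId -> original string key
            result.matrix.put(rowId, entry.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps insertion order, so the ids are deterministic here.
        Map<String, double[]> corpus = new LinkedHashMap<>();
        corpus.put("doc-a.txt", new double[] {1.0, 0.0, 2.0});
        corpus.put("doc-b.txt", new double[] {0.0, 3.0, 0.0});

        Result result = assignRowIds(corpus);
        for (int rowId = 0; rowId < result.docIndex.size(); rowId++) {
            System.out.println(rowId + " -> " + result.docIndex.get(rowId));
        }
    }
}
```

The side index is what lets you translate int-keyed output rows back into friendly document names afterwards, which is the trade-off discussed further down the thread.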

On Tue, Jan 24, 2012 at 5:48 PM, Jake Mannix <[email protected]> wrote:

> On Tue, Jan 24, 2012 at 5:31 PM, John Conwell <[email protected]> wrote:
>
> > Just ran into a problem trying to use IntWritable as my key when creating
> > vectors so I can use CVB0Driver.  I'm using the helper
> > class SparseVectorsFromSequenceFiles to create my document vectors, and I
> > create my sequence file with IntWritable as the key.
> > SparseVectorsFromSequenceFiles calls DocumentProcessor to tokenize the
> > documents, but DocumentProcessor's output is key: Text, value:
> > StringTuple.  This in turn causes an exception.
> >
> > So it looks like these helper classes that create sequence files of
> > VectorWritable, which are the input to a lot of these algorithms, are not
> > compatible with some of the newer algorithms, like CVB0Driver.  Is that
> > correct?
> >
>
> $MAHOUT_HOME/bin/mahout rowid --help
>
> to the rescue. :)
>
>
> >
> > And coming back to CVB0Driver, if someone wants to use it, they'll have to
> > hand code the creation of VectorWritables, or possibly run the ones that
> > are created by SparseVectorsFromSequenceFiles through a transform, to
> > output IntWritable keys.  Correct?
> >
> > BTW, not trying to sound critical, I'm just trying to understand the
> > architecture.  Is this an issue that you guys want to get
> > fixed/consistent at some point?  Where all vector keys are IntWritables,
> > and all helper classes consume and output pairs that have IntWritable keys?
> > I might be interested in helping with that effort.
> >
> > Thanks,
> > JohnC
> >
> >
> > On Tue, Jan 24, 2012 at 4:35 PM, Jake Mannix <[email protected]>
> > wrote:
> >
> > > On Tue, Jan 24, 2012 at 2:27 PM, John Conwell <[email protected]> wrote:
> > >
> > > > Hi Jake,
> > > > Thanks for the explanation.  I actually prefer using ints as key
> > > > identifiers globally, vs a string.  It can help keep memory and GC
> > > > utilization way down, especially in algorithms that have high
> > > > iteration counts.
> > > >
> > > > I had gone through an example that used the original LDA algorithm, and
> > > > the samples used the filename as the document key, vs some kind of
> > > > integer identifier, so I just went with that.  It does make things easier
> > > > when looking at your output results, since you don't have to keep
> > > > some separate store that maps integer doc ids against friendly string
> > > > names, but I don't think that is really all that important.  For the
> > > > long run, in my opinion I would definitely standardize on IntWritable for
> > > > vector keys.
> > > >
> > >
> > > Yeah, avoiding having a separate store / mapping for "docId ->
> > > documentName" or whatnot is a good reason to not normalize this field,
> > > but since we already have to do this for the terms, for efficiency's
> > > sake, keeping an extra mapping for docs is not so much of a big deal,
> > > IMO.  The only part in which this becomes annoying is that there aren't
> > > very many ints.  Longs might be better, sometimes.  Then again,
> > > *forcing* everyone to use big 8-byte longs for stuff which easily fits
> > > in ints can be silly, and doing this for *both* row keys and column keys
> > > is wasting lots of space, but necessary for the idea of "transpose" or
> > > matrix multiplication to make sense.
> > >
> > >
> > > >
> > > > Thanks for the great explanation!
> > > >
> > > >
> > > No problem.
> > >
> > >  -jake
> > >
> > >
> > > > JohnC
> > > >
> > > > On Tue, Jan 24, 2012 at 1:48 PM, Jake Mannix <[email protected]>
> > > > wrote:
> > > >
> > > > > In general, workflows with matrices in Mahout handle
> > > > > SequenceFile<IntWritable, VectorWritable>, as this is the on-disk
> > > > > format of the class DistributedRowMatrix.  The original Mahout LDA
> > > > > pre-dated this move to standardize closer to that format, and so it
> > > > > didn't have that requirement.
> > > > >
> > > > > Now, as you say, it's true that in the new implementation, the keys
> > > > > aren't actually used, so in principle we could just go with
> > > > > WritableComparable<?> in CVB0Driver's mappers/reducers keys.  In fact,
> > > > > it would make certain integrations a little nicer, at the cost of
> > > > > pushing incompatibility somewhere else.  For example, the output
> > > > > p(document | topic) distributions go into a SequenceFile whose keys
> > > > > are the same as the input corpus keys (i.e. the doc_id values), and
> > > > > there may be workflows which take this matrix and transpose it to
> > > > > multiply it by another matrix or something of that nature.  If the
> > > > > keys are IntWritable, this all works just fine.  If not, then
> > > > > transpose will fail horribly, as will matrix multiplication.
> > > > >
> > > > > Standardizing on a common fixed format internally avoids some of
> > > > > these problems, while at the same time being a bit inflexible.
> > > > >
> > > > > It's possible we could add a command-line option + some internal
> > > > > switches to allow the user to explicitly force untyped keys, or just
> > > > > warn on non-integer keys or something...
> > > > >
> > > > > I can just imagine seeing the questions on this very list when
> > > > > someone takes the output of their Long-keyed corpus and tries to
> > > > > matrix multiply it by some other matrix...
> > > > >
> > > > >  -jake
> > > > >
> > > > > On Tue, Jan 24, 2012 at 1:27 PM, John Conwell <[email protected]> wrote:
> > > > >
> > > > > > I wanted to compare the two LDA implementations, and I noticed that
> > > > > > for the input corpus sequence file (key: doc_id, value: vector), the
> > > > > > key for the input file for LDADriver takes any WritableComparable<?>
> > > > > > key, but the key for the input file for CVB0Driver requires
> > > > > > IntWritable explicitly.  Is there some reason these two LDA
> > > > > > implementations can't both use WritableComparable<?> for the key of
> > > > > > the input sequence file?  It would make integrating them into
> > > > > > application workflows much easier and more consistent.
> > > > > >
> > > > > > --
> > > > > >
> > > > > > Thanks,
> > > > > > John C
> > > > > >
> > > > >
> > > >
> > > >
> > > >
> > > > --
> > > >
> > > > Thanks,
> > > > John C
> > > >
> > >
> >
> >
> >
> > --
> >
> > Thanks,
> > John C
> >
>



-- 

Thanks,
John C
