On Tue, Jan 24, 2012 at 5:31 PM, John Conwell <[email protected]> wrote:

> Just ran into a problem trying to use IntWritable as my key when creating
> vectors so I can use CVB0Driver.  I'm using the helper
> class SparseVectorsFromSequenceFiles to create my document vectors, and I
> create my sequence file with IntWritable as the key.
> SparseVectorsFromSequenceFiles calls DocumentProcessor to tokenize the
> documents, but DocumentProcessor's output is key: Text, value:
> StringTuple.  This in turn causes an exception.
>
> So it looks like these helper classes that create sequence files of
> VectorWritable, which are the input to a lot of these algorithms, are not
> compatible with some of the newer algorithms, like CVB0Driver.  Is that
> correct?
>

$MAHOUT_HOME/bin/mahout rowid --help

to the rescue. :)
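
For anyone reading along: as I understand it, the rowid job reads a
SequenceFile<Text, VectorWritable> and writes the same vectors under
sequential IntWritable keys, plus a docIndex that maps each new int back to
the original Text key. The sketch below shows only that re-keying idea in
plain Java. RowIdSketch, Rekeyed and rekey are hypothetical names, and no
Hadoop or Mahout classes are used, so treat it as an illustration and not
the job's actual code.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Plain-Java illustration of re-keying; hypothetical names, no Hadoop/Mahout types.
final class RowIdSketch {

    /** Int-keyed rows plus an index from each int id back to the original key. */
    static final class Rekeyed {
        final Map<Integer, double[]> matrix = new LinkedHashMap<>();  // rowId -> vector
        final Map<Integer, String> docIndex = new LinkedHashMap<>();  // rowId -> original key
    }

    /** Assigns sequential ids (0, 1, 2, ...) in the iteration order of the input. */
    static Rekeyed rekey(Map<String, double[]> textKeyed) {
        Rekeyed out = new Rekeyed();
        int nextId = 0;
        for (Map.Entry<String, double[]> e : textKeyed.entrySet()) {
            out.matrix.put(nextId, e.getValue());
            out.docIndex.put(nextId, e.getKey());
            nextId++;
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, double[]> docs = new LinkedHashMap<>();
        docs.put("/docs/a.txt", new double[] {1.0, 0.0, 2.0});
        docs.put("/docs/b.txt", new double[] {0.0, 3.0, 0.0});

        Rekeyed r = rekey(docs);
        for (Map.Entry<Integer, String> e : r.docIndex.entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
    }
}
```

The int-keyed half is what goes on to CVB0Driver; the index half is what
you keep around to translate doc ids back to file names when reading the
results.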


>
> And coming back to CVB0Driver, if someone wants to use it, they'll have to
> hand code the creation of VectorWritables, or possibly run the ones that
> are created by SparseVectorsFromSequenceFiles through a transform, to
> output IntWritable keys.  Correct?
>
> BTW, not trying to sound critical, I'm just trying to understand the
> architecture.  Is this an issue that you guys want to get
> fixed/made consistent at some point?  Where all vector keys are IntWritables,
> and all helper classes consume and output pairs that have IntWritable keys?
> I might be interested in helping with that effort.
>
> Thanks,
> JohnC
>
>
> On Tue, Jan 24, 2012 at 4:35 PM, Jake Mannix <[email protected]>
> wrote:
>
> > On Tue, Jan 24, 2012 at 2:27 PM, John Conwell <[email protected]> wrote:
> >
> > > Hi Jake,
> > > Thanks for the explanation.  I actually prefer using ints as key
> > > identifiers globally, vs a string.  It can help keep memory and gc
> > > utilization way down, especially in algorithms that have high iteration
> > > counts.
> > >
> > > I had gone through an example that used the original LDA algorithm, and
> > > the samples used the filename as the document key, vs some kind of
> > > integer identifier, so I just went with that.  It does make things
> > > easier when looking at your output results, since you don't have to keep
> > > some separate store that maps integer doc ids against friendly string
> > > names, but I don't think that is really all that important.  For the
> > > long run, in my opinion I would definitely standardize on IntWritable
> > > for vector keys.
> > >
> >
> > Yeah, avoiding having a separate store / mapping for "docId ->
> > documentName" or whatnot is a good reason to not normalize this field,
> > but since we already have to do this for the terms, for efficiency's
> > sake, keeping an extra mapping for docs is not so much of a big deal,
> > IMO.  The only part in which this becomes annoying is that there aren't
> > very many ints.  Longs might be better, sometimes.  Then again,
> > *forcing* everyone to use big 8-byte longs for stuff which easily fits
> > in ints can be silly, and doing this for *both* row keys and column keys
> > is wasting lots of space, but necessary for the idea of "transpose" or
> > matrix multiplication to make sense.
> >
> >
> > >
> > > Thanks for the great explanation!
> > >
> > >
> > No problem.
> >
> >  -jake
> >
> >
> > > JohnC
> > >
> > > On Tue, Jan 24, 2012 at 1:48 PM, Jake Mannix <[email protected]>
> > > wrote:
> > >
> > > > In general, workflows with matrices in Mahout handle
> > > > SequenceFile<IntWritable, VectorWritable>, as this is the on-disk
> > > > format of the class DistributedRowMatrix.  The original Mahout LDA
> > > > pre-dated this move to standardize closer to that format, and so it
> > > > didn't have that requirement.
> > > >
> > > > Now, as you say, it's true that in the new implementation, the keys
> > > > aren't actually used, so in principle we could just go with
> > > > WritableComparable<?> in CVB0Driver's mappers/reducers keys.  In fact,
> > > > it would make certain integrations a little nicer, at the cost of
> > > > pushing incompatibility somewhere else.  For example, the output
> > > > p(document | topic) distributions go into a SequenceFile whose keys
> > > > are the same as the input corpus keys (i.e. the doc_id values), and
> > > > there may be workflows which take this matrix and transpose it to
> > > > multiply it by another matrix or something of that nature.  If the
> > > > keys are IntWritable, this all works just fine.  If not, then
> > > > transpose will fail horribly, as will matrix multiplication.
> > > >
> > > > Standardizing on a common fixed format internally avoids some of
> > > > these problems, while at the same time being a bit inflexible.
> > > >
> > > > It's possible we could add a command-line option + some internal
> > > > switches to allow the user to explicitly force untyped keys, or just
> > > > warn on non-integer keys or something...
> > > >
> > > > I can just imagine seeing the questions on this very list when
> > > > someone takes the output of their Long-keyed corpus and tries to
> > > > matrix multiply it by some other matrix...
> > > >
> > > >  -jake
> > > >
> > > > On Tue, Jan 24, 2012 at 1:27 PM, John Conwell <[email protected]>
> wrote:
> > > >
> > > > > I wanted to compare the two LDA implementations, and I noticed that
> > > > > for the input corpus sequence file (key: doc_id, value: vector), the
> > > > > key for the input file for LDADriver takes any WritableComparable<?>
> > > > > key, but the key for the input file for CVB0Driver requires
> > > > > IntWritable explicitly.  Is there some reason these two LDA
> > > > > implementations can't both use WritableComparable<?> for the key of
> > > > > the input sequence file?  It would make integrating them into
> > > > > application workflows much easier and more consistent.
> > > > >
> > > > > --
> > > > >
> > > > > Thanks,
> > > > > John C
> > > > >
> > > >
> > >
> > >
> > >
> > > --
> > >
> > > Thanks,
> > > John C
> > >
> >
>
>
>
> --
>
> Thanks,
> John C
>
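
On the transpose point quoted above: when a row-keyed sparse matrix is
transposed, each row key becomes a column index inside the output vectors,
and vector indices are ints. That is why non-integer row keys have nowhere
to go. A minimal plain-Java sketch of that step, using nested maps in place
of Mahout's vector classes (TransposeSketch and transpose are hypothetical
names, not Mahout API):

```java
import java.util.Map;
import java.util.TreeMap;

// Sparse matrix as rowId -> (columnId -> value); hypothetical stand-in for Mahout types.
final class TransposeSketch {

    /** Returns the transpose: every (row, col, value) cell becomes (col, row, value). */
    static Map<Integer, Map<Integer, Double>> transpose(
            Map<Integer, Map<Integer, Double>> rows) {
        Map<Integer, Map<Integer, Double>> out = new TreeMap<>();
        for (Map.Entry<Integer, Map<Integer, Double>> row : rows.entrySet()) {
            for (Map.Entry<Integer, Double> cell : row.getValue().entrySet()) {
                // The old row key is used as a column index here, so it must be an int.
                out.computeIfAbsent(cell.getKey(), k -> new TreeMap<>())
                   .put(row.getKey(), cell.getValue());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<Integer, Map<Integer, Double>> m = new TreeMap<>();
        m.computeIfAbsent(0, k -> new TreeMap<>()).put(1, 5.0);
        m.computeIfAbsent(2, k -> new TreeMap<>()).put(1, 7.0);
        m.computeIfAbsent(2, k -> new TreeMap<>()).put(3, 1.0);
        System.out.println(transpose(m));
    }
}
```

With Text or Long row keys the inner map's key type would no longer match
the index type of a vector, which is the failure described above.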
