Hi Jake,
Thanks for the explanation.  I actually prefer using ints as key
identifiers globally, rather than strings.  It can help keep memory and GC
utilization way down, especially in algorithms that have high iteration
counts.

I had gone through an example that used the original LDA algorithm, and the
samples used the filename as the document key, rather than some kind of
integer identifier, so I just went with that.  It does make things easier
when looking at your output results, since you don't have to keep
some separate store that maps integer doc ids to friendly string
names, but I don't think that is really all that important.  For the long
run, I would definitely standardize on IntWritable for vector keys.
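
For what it's worth, that separate store can be pretty small.  Here is a
minimal sketch of the idea, assuming plain Java collections.  The
DocIdDictionary class and its method names are hypothetical (nothing like
this is taken from Mahout), and persisting the mapping alongside the
corpus is left out:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Mahout: assigns dense int ids to
// document names (e.g. filenames) and remembers the reverse mapping.
final class DocIdDictionary {
    private final Map<String, Integer> idsByName = new HashMap<>();
    private final List<String> namesById = new ArrayList<>();

    // Returns the existing id for a name, or assigns the next free one.
    int idFor(String name) {
        Integer existing = idsByName.get(name);
        if (existing != null) {
            return existing;
        }
        int id = namesById.size();
        idsByName.put(name, id);
        namesById.add(name);
        return id;
    }

    // Translates an int key from the output back to its friendly name.
    // Throws IndexOutOfBoundsException for an id that was never assigned.
    String nameFor(int id) {
        return namesById.get(id);
    }

    // Number of distinct documents seen so far.
    int size() {
        return namesById.size();
    }
}
```

You would call idFor(filename) while writing the corpus, so the value can
be wrapped in an IntWritable key, and nameFor(id) when reading the output
back.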

Thanks for the great explanation!

JohnC

On Tue, Jan 24, 2012 at 1:48 PM, Jake Mannix <[email protected]> wrote:

> In general, workflows with matrices in Mahout handle
> SequenceFile<IntWritable, VectorWritable>, as this is the on-disk format of
> the class DistributedRowMatrix.  The original Mahout LDA pre-dated this
> move to standardize closer to that format, and so it didn't have that
> requirement.
>
> Now, as you say, it's true that in the new implementation, the keys aren't
> actually used, so in principle we could just go with WritableComparable<?>
> for the keys of CVB0Driver's mappers/reducers.  In fact, it would make
> certain integrations a little nicer, at the cost of pushing incompatibility
> somewhere else.  For example, the output p(document | topic) distributions
> go into a SequenceFile whose keys are the same as the input corpus keys
> (i.e. the doc_id values), and there may be workflows which take this matrix
> and transpose it to multiply it by another matrix or something of that
> nature.  If the keys are IntWritable, this all works just fine.  If not,
> then transpose will fail horribly, as will matrix multiplication.
>
> Standardizing on a common fixed format internally avoids some of these
> problems, while at the same time being a bit inflexible.
>
> It's possible we could add a command-line option + some internal switches
> to allow the user to explicitly force untyped keys, or just warn on
> non-integer keys or something...
>
> I can just imagine seeing the questions on this very list when someone
> takes the output of their Long-keyed corpus and tries to matrix multiply
> it by some other matrix...
>
>  -jake
>
> On Tue, Jan 24, 2012 at 1:27 PM, John Conwell <[email protected]> wrote:
>
> > I wanted to compare the two LDA implementations, and I noticed that for
> > the input corpus sequence file (key: doc_id, value: vector), the key for
> > the input file for LDADriver takes any WritableComparable<?> key, but the
> > key for the input file for CVB0Driver requires IntWritable explicitly.
> > Is there some reason these two LDA implementations can't both use
> > WritableComparable<?> for the key of the input sequence file?  It would
> > make integrating them into application workflows much easier and more
> > consistent.
> >
> > --
> >
> > Thanks,
> > John C
> >
>



-- 

Thanks,
John C
