Hi Caroline,

Jake Mannix and I wrote the LDA CVB implementation. Apologies for the light
documentation.

When you invoked Mahout, did you supply the "--doc_topic_output <path>"
parameter? If this is present, after training a model the driver app will
apply the model to the input term-vectors, storing inference results in the
specified path. If the parameter isn't specified, this final inference run
is skipped:

https://github.com/apache/mahout/blob/trunk/core/src/main/java/org/apache/mahout/clustering/lda/cvb/CVB0Driver.java#L74
https://github.com/apache/mahout/blob/trunk/core/src/main/java/org/apache/mahout/clustering/lda/cvb/CVB0Driver.java#L331

So, assuming you did generate inference output, note that the model and the
inference output have the *same* format: both the topic-term matrix and the
doc-topic inference output are stored as SequenceFile<IntWritable,
VectorWritable> data. If you point the vectordump util at either data set
and supply a dictionary, it'll happily map the vector indices to term
strings using that dictionary, even when those indices are really topic
ids. Quite confusing. So when you run vectordump against the doc-topic
data, don't supply the dictionary (in your command, drop the -d and -dt
options). That way you'll see the raw topic ids (zero-based indices) in the
output, instead of whatever terms those indices happen to correspond to in
your dictionary.
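
To make the mix-up concrete, here is a toy Python sketch of the effect. It
is not Mahout code: dump_vector, the dictionary contents, and the doc-topic
values are all made up for illustration, and the output format only loosely
resembles a key:value dump. It just shows how applying a term dictionary to
a vector indexed by topic id relabels the topics as terms.

```python
# Toy illustration only: not Mahout code; all names and values are made up.

def dump_vector(vector, dictionary=None):
    """Format a sparse vector {index: weight} as a key:value string.

    If a dictionary {index: term} is given, each index is replaced by the
    term stored under it, whether or not the index is really a term id.
    """
    parts = []
    for index, weight in sorted(vector.items()):
        label = dictionary.get(index, str(index)) if dictionary else str(index)
        parts.append(f"{label}:{weight}")
    return "{" + ",".join(parts) + "}"


# Hypothetical term dictionary: term id -> term string.
dictionary = {0: "apple", 1: "banana", 2: "cherry"}

# Hypothetical doc-topic vector for one document: topic id -> p(topic | doc).
doc_topics = {0: 0.7, 2: 0.3}

# With the dictionary, topic ids 0 and 2 are mislabeled as terms.
print(dump_vector(doc_topics, dictionary))  # prints {apple:0.7,cherry:0.3}

# Without it, the raw zero-based topic ids are shown.
print(dump_vector(doc_topics))  # prints {0:0.7,2:0.3}
```

The first line of output looks like a term:probability vector even though
the keys are topic ids, which matches the symptom you describe.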

Best,
Andy
@sagemintblue


On Wed, Jul 4, 2012 at 2:30 AM, Caroline Meyer <[email protected]> wrote:

> Hey Guys,
>
> I have been able to successfully execute the new lda algorithm as well as
> extract the topic/term inference with vectordump. What I was not able to do
> was get the document/topic inference. When I run the same vectordump
> command I get the same kinds of vectors (term:probability) as before.
> Should the vectors not be (topic:probability)?
>
> The command I run is:
>
> vectordump -s temp/lda-cvb-doc/part-m-00000 -d
> temp/vectors/dictionary.file-* -dt sequencefile -o temp/lda-cvb-topics.txt
>
> I have not been able to find any documentation except what's in the code.
> Thanks for the help.
>
> Cheers,
> Caroline
>
