On Tue, Jul 26, 2011 at 4:27 AM, Benjamin Heilbrunn <[email protected]> wrote:
>
> 1) How can I display the topic distribution for an existing document
> from the Reuters corpus?
>

There is a sequence file called docTopics in the output directory.  Keys are
docIds, values are VectorWritable.  Use
"./bin/mahout vectordump -s <path to docTopics>" to print them out.
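
Once a vector is printed, reading it usually means ranking the topics by weight. A minimal sketch in plain Java (no Mahout classes), assuming each docTopics value holds one weight per topic; the class name TopicRanking and the sample weights are made up for illustration:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Mahout.
final class TopicRanking {

    // Rescales the weights so they sum to 1. If the sum is not positive,
    // an unchanged copy is returned.
    static double[] normalize(double[] weights) {
        double sum = 0.0;
        for (double w : weights) {
            sum += w;
        }
        double[] result = weights.clone();
        if (sum > 0.0) {
            for (int i = 0; i < result.length; i++) {
                result[i] /= sum;
            }
        }
        return result;
    }

    // Returns topic indices ordered from highest to lowest weight.
    static List<Integer> rankTopics(final double[] weights) {
        List<Integer> order = new ArrayList<>();
        for (int i = 0; i < weights.length; i++) {
            order.add(i);
        }
        order.sort((a, b) -> Double.compare(weights[b], weights[a]));
        return order;
    }

    public static void main(String[] args) {
        // Made-up weights for a 3-topic model.
        double[] docTopics = normalize(new double[] {2.0, 6.0, 2.0});
        for (int topic : rankTopics(docTopics)) {
            System.out.println("topic " + topic + ": " + docTopics[topic]);
        }
    }
}
```

Normalizing first is only a convenience: it does nothing harmful if the weights already sum to 1, and it makes them comparable across documents if they do not.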


> 2) How can I compute the topic distribution for a new and unknown document?
>

This isn't hooked into the bin/mahout shell script, but there is an existing
Java method:

LDADriver.computeDocumentTopicProbabilities(Configuration conf,
                                            Path input,
                                            Path stateIn,
                                            Path outputPath,
                                            int numTopics,
                                            int numWords,
                                            double topicSmoothing)

The input path should be a SequenceFile whose values are VectorWritable
document instances, and stateIn should be the path to the topic model state
written by the final LDA iteration.  Make sure you used the same dictionary
to create both the input and the topic model, or else you'll get nonsense.
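
To make the dictionary point concrete, here is a rough sketch of turning a new document's tokens into term counts indexed by the training dictionary. It is plain Java with no Mahout classes; the class name NewDocVectorizer, the method toTermCounts, and the tiny dictionary are assumptions for illustration. Wrapping the counts in a VectorWritable and writing them to a SequenceFile is left out.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Mahout.
final class NewDocVectorizer {

    // Counts tokens at the term indices of the dictionary used for training.
    // Tokens that are not in the dictionary are skipped, since the trained
    // topic model knows nothing about them.
    static double[] toTermCounts(List<String> tokens,
                                 Map<String, Integer> dictionary,
                                 int numWords) {
        double[] counts = new double[numWords];
        for (String token : tokens) {
            Integer index = dictionary.get(token);
            if (index != null && index >= 0 && index < numWords) {
                counts[index] += 1.0;
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Made-up dictionary; in practice, load the one produced when the
        // training vectors were created.
        Map<String, Integer> dictionary = new HashMap<>();
        dictionary.put("oil", 0);
        dictionary.put("price", 1);
        dictionary.put("bank", 2);

        List<String> tokens = Arrays.asList("oil", "price", "oil", "unseen");
        double[] counts = toTermCounts(tokens, dictionary, 3);
        System.out.println(Arrays.toString(counts)); // prints [2.0, 1.0, 0.0]
    }
}
```

If the new document were indexed with a different dictionary, index 0 could mean some other word entirely, which is exactly how you end up with nonsense topic probabilities.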

  -jake