Hi Andy,

If I only use the -s and -o options I get this null pointer exception:

Exception in thread "main" java.lang.NullPointerException
    at org.apache.mahout.utils.vectors.VectorHelper$1.apply(VectorHelper.java:118)
    at org.apache.mahout.utils.vectors.VectorHelper$1.apply(VectorHelper.java:115)
    at com.google.common.collect.Iterators$8.next(Iterators.java:765)
    at java.util.AbstractCollection.toArray(AbstractCollection.java:124)
    at java.util.ArrayList.<init>(ArrayList.java:131)
    at com.google.common.collect.Lists.newArrayList(Lists.java:119)
    at org.apache.mahout.utils.vectors.VectorHelper.toWeightedTerms(VectorHelper.java:114)
    at org.apache.mahout.utils.vectors.VectorHelper.vectorToJson(VectorHelper.java:124)
    at org.apache.mahout.utils.vectors.VectorDumper.main(VectorDumper.java:241)

In the code it looks like it is looking for a dictionary that is not specified. Is there another option I am missing?

Cheers,
Caroline

On Wed, Jul 4, 2012 at 6:10 PM, Andy Schlaikjer <[email protected]> wrote:

> Hi Caroline,
>
> Jake Mannix and I wrote the LDA CVB implementation. Apologies for the light
> documentation.
>
> When you invoked Mahout, did you supply the "--doc_topic_output <path>"
> parameter? If this is present, after training a model the driver app will
> apply the model to the input term-vectors, storing inference results in the
> specified path. If the parameter isn't specified, this final inference run
> is skipped:
>
> https://github.com/apache/mahout/blob/trunk/core/src/main/java/org/apache/mahout/clustering/lda/cvb/CVB0Driver.java#L74
> https://github.com/apache/mahout/blob/trunk/core/src/main/java/org/apache/mahout/clustering/lda/cvb/CVB0Driver.java#L331
>
> So, assuming you did generate inference output, I should note that both the
> model and inference output have the *same* format: Both the topic-term
> matrix and doc-topic inference output are stored as
> SequenceFile<IntWritable, VectorWritable> data. If you point the vectordump
> util at either data set and supply a dictionary, it'll happily map term ids
> or topic ids into term strings using that dictionary... Quite confusing.
> Just make sure that when you run vectordump against the doc-topic data that
> you don't supply the dictionary. This way, you'll see the raw topic ids
> (zero-based indices) in output, instead of whatever terms those indices
> might correspond to in your dictionary.
>
> Best,
> Andy
> @sagemintblue
>
> On Wed, Jul 4, 2012 at 2:30 AM, Caroline Meyer <[email protected]> wrote:
>
> > Hey Guys,
> >
> > I have been able to successfully execute the new lda algorithm as well as
> > extract the topic/term inference with vectordump. What I was not able to do
> > was get the document/topic inference. When I run the same vectordump
> > command I get the same kinds of vectors (term:probability) as before.
> > Should the vectors not be (topic:probability)?
> >
> > The command I run is:
> >
> > vectordump -s temp/lda-cvb-doc/part-m-00000 -d
> > temp/vectors/dictionary.file-* -dt sequencefile -o temp/lda-cvb-topics.txt
> >
> > I have not been able to find any documentation except what's in the code.
> > Thanks for the help.
> >
> > Cheers,
> > Caroline
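
To make Andy's point about the dictionary concrete, here is a minimal, self-contained Java sketch. It is not Mahout's VectorHelper code: the class TopicVectorLabels, the method label, and the sample data are all hypothetical, chosen only to show why a doc-topic vector printed through a term dictionary looks like term:probability pairs, and how falling back to the raw index yields topic:probability pairs instead. Whether a fallback like this relates to the actual cause of the NullPointerException above is an assumption and is not confirmed here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class TopicVectorLabels {

    // Hypothetical helper (not part of Mahout): labels each vector entry with
    // dictionary[index] when a dictionary is supplied, otherwise with the raw
    // zero-based index.
    static List<String> label(double[] vector, String[] dictionary) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < vector.length; i++) {
            // Fall back to the index when there is no dictionary entry for it.
            String key = (dictionary != null && i < dictionary.length)
                    ? dictionary[i]
                    : Integer.toString(i);
            out.add(key + ":" + String.format(Locale.ROOT, "%.2f", vector[i]));
        }
        return out;
    }

    public static void main(String[] args) {
        // Made-up doc-topic distribution over three topics.
        double[] docTopics = {0.70, 0.20, 0.10};
        // Made-up term dictionary; its indices are term ids, not topic ids.
        String[] termDictionary = {"apple", "banana", "cherry", "date"};

        // With a dictionary, topic ids are silently rendered as terms.
        System.out.println(label(docTopics, termDictionary));
        // Without one, the raw topic ids appear.
        System.out.println(label(docTopics, null));
    }
}
```

With the sample data, the first call prints [apple:0.70, banana:0.20, cherry:0.10] and the second prints [0:0.70, 1:0.20, 2:0.10], which mirrors the term:probability versus topic:probability confusion described in the thread.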
