So the bug I found results in the (document|topic) model being trained against a random matrix rather than the final (term|topic) probability distributions. The result is a random (document|topic) model with more or less uniform distributions (a quick check is sketched below); the (term|topic) model itself works fine. As far as I can see this happens in all cases, at least for me, and should affect everyone using the Hadoop distributed version unless a bug fix has been released.
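If you want to check this on your own output, here is a minimal sketch (plain Python, not Mahout code) that measures how close one dumped (document|topic) row is to uniform; the weights are truncated from the first row quoted further down.

    import math

    # Weights truncated from the first (document|topic) row in the dump below;
    # in practice, parse each dumped {key:weight,...} line into a list like this.
    probs = [0.1249, 0.0388, 0.1229, 0.1507, 0.1051, 0.1013,
             0.0612, 0.1450, 0.0787, 0.0714]

    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    max_entropy = math.log(len(probs))  # entropy of a perfectly uniform distribution

    # A ratio near 1.0 means the row is close to uniform, which is the symptom
    # of the bug described above.
    print(f"entropy / max entropy = {entropy / max_entropy:.3f}")

For the rows quoted below that ratio comes out close to 1.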
This looks like the output from the (topic|document) distribution (the vectors are of size 10 and there are 10 topics) with the dictionary applied (which you should not do), not the (term|topic) distribution. It will therefore be uniform because of the bug. I will hopefully have posted a patch by the end of today, as I am working on it now.

Jack

On 31 Jan 2013, at 14:37, Jake Mannix wrote:

> Hi Thilina,
>
> The flag you missed on your vectordump command line is the "--sort"
> option, which sorts the results before taking the top k. Try that and send
> us what that looks like? It should be much easier to interpret.
>
>
> On Mon, Jan 7, 2013 at 7:19 AM, Thilina Gunarathne <[email protected]> wrote:
>
>> Dear All,
>> I'm trying to run the Mahout LDA (cvb version) on a subset of the 20news
>> data set, as a sample for a Hadoop publication we are working on. I need
>> some help in understanding the Mahout output to figure out the topics.
>>
>> I ran the following commands on the TF vectors generated using the
>> seq2sparse command.
>>
>>> bin/mahout rowid -i 20news-tf/tf-vectors -o 20news-tf-int
>>> bin/mahout cvb -i 20news-tf-int/matrix -o lda-out -k 10 -x 20 -dict 20news-tf/dictionary.file-0 -dt lda-topics -mt lda-topic-model
>>
>> After that I dumped the results using vectordump as follows.
>>
>>> bin/mahout vectordump -i lda-topics/part-m-00000 --dictionary 20news-tf/dictionary.file-0 --vectorSize 10 -dt sequencefile
>> ......
>>
>> {"Fluxgate:0.12492744375758073,&:0.03875953927132082,(140.220.1.1):0.1228639250669511,(Babak:0.15074522974495433,(Bill:0.10512715697420276,(Gerrit:0.10130565323653766,(Michael:0.061169131590630275,(Scott:0.14501579630233746,(Usenet:0.07872957132697946,(continued):0.07135655272850545}
>>
>> {"Fluxgate:0.13130952097888746,&:0.05207587369196414,(140.220.1.1):0.12533225607394424,(Babak:0.08607740024552457,(Bill:0.20218284543514245,(Gerrit:0.07318295757631627,(Michael:0.08766888242201039,(Scott:0.08858421220476514,(Usenet:0.09201906604666685,(continued):0.06156698532477829}
>> .......
>>
>> It would be great if someone could help me interpret the above results.
>> The probability values seem to be more or less similar in all the cases.
>> Is it due to the smaller size of the dataset?
>>
>> thanks,
>> Thilina
>>
>> --
>> https://www.cs.indiana.edu/~tgunarat/
>> http://www.linkedin.com/in/thilina
>> http://thilina.gunarathne.org
>
>
> --
>
> -jake
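For reference, here is a minimal illustration (plain Python, not Mahout code) of what the --sort option Jake mentions buys you when reading these dumps: ordering one row's entries by weight before taking the top few. The keys and truncated weights are copied from the first row above; as noted, for the (topic|document) output the dictionary labels are meaningless, so this only shows the mechanics of reading a row.

    # One row from the dump above, as key -> weight (weights truncated).
    row = {
        '"Fluxgate': 0.1249, '&': 0.0388, '(140.220.1.1)': 0.1229,
        '(Babak': 0.1507, '(Bill': 0.1051, '(Gerrit': 0.1013,
        '(Michael': 0.0612, '(Scott': 0.1450, '(Usenet': 0.0787,
        '(continued)': 0.0714,
    }

    # Sort the entries by weight, largest first, and keep the top 5: the same
    # idea as --sort, which sorts the results before taking the top k.
    for key, weight in sorted(row.items(), key=lambda kv: kv[1], reverse=True)[:5]:
        print(f"{key}\t{weight:.4f}")

Applied to the (term|topic) model output, where the dictionary labels do apply, that ordering is what makes the topics readable.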
