OK, this output looks more like what you should be getting, in the sense that there are some common words with similar global frequencies toward the beginning of the lists. But why you only got 2 topics is very odd.
Can you list the contents of the cvb-output directories, and the commands you gave to vectordump them? Running vectordump over the doc-topic output will be strange unless you leave off the dictionary part, so that you can just list the topicIds with weights for each document (but even then, you'll want to be able to know what each of those topics is really about, which is what we're working on getting for you now).

On Fri, Apr 19, 2013 at 6:58 AM, Chris Harrington <[email protected]> wrote:

> That output was from the cvb-topic-doc.
>
> Just ran it over the cvb-output and got
>
> {how:0.04017873894220033,you:0.02145053718356672,your:0.015258172557645477,open:0.013308164155725721,use:0.012769472622541729,search:0.011104235407141033,web:0.011054257715475113,up:0.008681792828858947,do:0.007301515628267762,install:0.00724558241579199}
>
> {ruby:0.013622365167924151,search:0.009151768347011088,ucd:0.009024998353413487,street:0.008153757865114717,information:0.008010912906951214,symfony2:0.007610946012031929,college:0.007488257104426453,get:0.006831742925644331,form:0.006798612982669133,us:0.006420604369821055}
>
> There were only the 2 rows as above, so going by your last mail it only found 2 topics?
>
> If so, how come it only output 2 topics when I gave it -k 20 as a parameter?
>
> On 19 Apr 2013, at 13:55, David LaBarbera wrote:
>
>> Did you run vectordump with the lda output directory (cvb-output in your case) or the document topic output (cvb-topic-doc)? Depending on which you're looking at, you'll have:
>>
>> lda output: each row corresponds to a topic and the elements are (term index:probability). The terms correspond to what's in the dictionary (contentDataDir/sparseVectors/dictionary.file-0). You can add the dictionary to the command line, so the output will be (term:probability).
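For reference, the two vectordump invocations being discussed here can be sketched roughly as below. This is only a sketch: the `bin/mahout` path and the output locations are taken from the commands quoted later in this thread, and the dictionary flags are the ones David spells out.

```shell
# Topic-term model (cvb-output): each row is a topic. Attaching the
# dictionary makes vectordump print (term:probability) instead of
# (termIndex:probability).
bin/mahout vectordump -i cvb-output \
  --dictionary ./contentDataDir/sparseVectors/dictionary.file-0 \
  --dictionaryType sequencefile

# Doc-topic output (cvb-topic-doc): each row is a document and the
# elements are (topicId:weight). Leave the dictionary flags off here,
# since the keys are topic ids, not dictionary term indexes.
bin/mahout vectordump -i cvb-topic-doc
```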
>> The flag should be
>>
>> --dictionary ./contentDataDir/sparseVectors/dictionary.file-0 --dictionaryType sequencefile
>>
>> dt output: each row is a document and the elements are (topic:probability).
>>
>> David
>>
>> On Apr 19, 2013, at 8:30 AM, Chris Harrington <[email protected]> wrote:
>>
>>> Just ran vectordump over the output from cvb but I have no idea what I'm looking at:
>>>
>>> {1.0:0.0689751034234147,0hu:0.052798138507741114,06:0.046108327846619585,091:0.04079964524901706,1:0.03488226667358313,10g:0.03471651100042406,07:0.03051583712303273,10.30am:0.029957963431693112,1171:0.028424194208528646,10.4.10:0.028173810240271588}
>>>
>>> Can someone give me an explanation of the above?
>>>
>>> In the Mahout in Action book there was a table which displayed topics with their top terms. How would I go from the above to something like that? i.e.
>>> topic 0 -> term1, term2, term3 ... termN
>>> topic 1 -> term1, term2, term3 ... termN
>>> etc.
>>>
>>> On 19 Apr 2013, at 10:19, Chris Harrington wrote:
>>>
>>>> Found the issue: it was the folder I gave for outputting the matrix in the rowid command. For cvb I gave it ./contentDataDir/matrix as the matrix location; instead I should have supplied ./contentDataDir/matrix/matrix.
>>>>
>>>> On 17 Apr 2013, at 12:46, Chris Harrington wrote:
>>>>
>>>>> So I've got 0.8 now but I'm running into an error:
>>>>>
>>>>> ../../workspace2/trunk/bin/mahout seqdirectory -i ./contentDataDir/output-content-segment -o ./contentDataDir/sequenced
>>>>>
>>>>> ../../workspace2/trunk/bin/mahout seq2sparse -i ./contentDataDir/sequenced -o ./contentDataDir/sparseVectors --namedVector -wt tf
>>>>>
>>>>> ../../workspace2/trunk/bin/mahout rowid -i ./contentDataDir/sparseVectors/tf-vectors/ -o ./contentDataDir/matrix
>>>>>
>>>>> ../../workspace2/trunk/bin/mahout cvb -i ./contentDataDir/matrix -o cvb-output -k 100 -x 1 -dict ./contentDataDir/sparseVectors/dictionary.file-0 -dt
>>>>> cvb-topic-doc -mt cvb-topic-model
>>>>>
>>>>> but the cvb command hits a class cast exception:
>>>>>
>>>>> java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to org.apache.mahout.math.VectorWritable
>>>>>     at org.apache.mahout.clustering.lda.cvb.CachingCVB0Mapper.map(CachingCVB0Mapper.java:55)
>>>>>     at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
>>>>>     at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
>>>>>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
>>>>>     at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
>>>>>     at java.security.AccessController.doPrivileged(Native Method)
>>>>>     at javax.security.auth.Subject.doAs(Subject.java:396)
>>>>>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1136)
>>>>>     at org.apache.hadoop.mapred.Child.main(Child.java:249)
>>>>>
>>>>> I thought seq2sparse took care of turning Hadoop Text into Mahout's VectorWritable. Where have I gone wrong?
>>>>>
>>>>> On 16 Apr 2013, at 14:45, Jake Mannix wrote:
>>>>>
>>>>>> You should just be building off of trunk (0.8-snapshot), in which case you should be working just fine.
>>>>>>
>>>>>> On Tue, Apr 16, 2013 at 6:43 AM, Chris Harrington <[email protected]> wrote:
>>>>>>
>>>>>>> Hi all,
>>>>>>>
>>>>>>> I've been trying to get the vector dumper to work on the output from cvb but it's throwing lots of errors.
>>>>>>>
>>>>>>> I found several old mails on the mailing list regarding this issue, specifically this:
>>>>>>>
>>>>>>> http://mail-archives.apache.org/mod_mbox/mahout-user/201211.mbox/%3CCAHSfFsy2oWRuzwVzGW57LRYaJ+LuudNu-W5EO0wnV_ff=uy...@mail.gmail.com%3E
>>>>>>>
>>>>>>> That thread is a bit old, so I was wondering: was there a patch or anything to fix it, or do I need to use the 0.8-snapshot?
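Chris's "topic N -> top terms" question can be answered with a little post-processing of the dumped topic rows. Below is a minimal sketch, assuming each vectordump row is a single `{term:weight,term:weight,...}` line like the ones pasted in this thread; the abbreviated sample rows are illustrative, not the real output.

```python
def top_terms(vectordump_row, n=5):
    """Parse one '{term:weight,...}' row printed by vectordump and
    return the n highest-weighted terms, best first."""
    pairs = []
    for item in vectordump_row.strip().strip("{}").split(","):
        # rpartition on the last ':' so terms never split incorrectly
        term, _, weight = item.rpartition(":")
        pairs.append((term, float(weight)))
    pairs.sort(key=lambda tw: tw[1], reverse=True)
    return [term for term, _ in pairs[:n]]

# Abbreviated versions of the two topic rows pasted earlier in the thread:
rows = [
    "{how:0.0402,you:0.0215,your:0.0153,open:0.0133,use:0.0128}",
    "{ruby:0.0136,search:0.0092,ucd:0.0090,street:0.0082,information:0.0080}",
]
for topic_id, row in enumerate(rows):
    print("topic %d -> %s" % (topic_id, ", ".join(top_terms(row, 3))))
# prints:
# topic 0 -> how, you, your
# topic 1 -> ruby, search, ucd
```

In practice you would read one row per line from the file vectordump wrote (via its -o flag) instead of the inline `rows` list.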
>>>>>>
>>>>>> --
>>>>>>   -jake

--
  -jake
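For anyone landing on this thread later, the fix Chris found can be collected in one place. The sketch below replays the pipeline from the Apr 17 mail with the corrected cvb input path (and -k 20 as in Chris's later run); the comment about docIndex is my reading of the stack trace, not something verified here.

```shell
bin/mahout seqdirectory -i ./contentDataDir/output-content-segment -o ./contentDataDir/sequenced
bin/mahout seq2sparse -i ./contentDataDir/sequenced -o ./contentDataDir/sparseVectors --namedVector -wt tf

# rowid writes two outputs under -o: 'matrix' (the VectorWritable rows that
# cvb expects) and 'docIndex' (an id -> Text mapping). Handing cvb the whole
# directory makes it read docIndex too, which would explain the
# Text -> VectorWritable ClassCastException.
bin/mahout rowid -i ./contentDataDir/sparseVectors/tf-vectors/ -o ./contentDataDir/matrix

# So point cvb at the matrix file itself, not its parent directory:
bin/mahout cvb -i ./contentDataDir/matrix/matrix -o cvb-output -k 20 -x 1 \
  -dict ./contentDataDir/sparseVectors/dictionary.file-0 \
  -dt cvb-topic-doc -mt cvb-topic-model
```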
