Did you run vectordump on the LDA output directory (cvb-output in your case) or on the document-topic output (cvb-topic-doc)? Depending on which one you're looking at, you'll have:

lda output: each row corresponds to a topic and the elements are (term index:probability). The term indices correspond to the entries in the dictionary (contentDataDir/sparseVectors/dictionary.file-0). You can add the dictionary to the command line so that the output becomes (term:probability). The flags should be
--dictionary ./contentDataDir/sparseVectors/dictionary.file-0 --dictionaryType sequencefile

dt output: each row is a document and the elements are (topic:probability).

David

On Apr 19, 2013, at 8:30 AM, Chris Harrington wrote:

> Just ran vectordump over the output from cvb but I have no idea what I'm
> looking at
>
> {1.0:0.0689751034234147,0hu:0.052798138507741114,06:0.046108327846619585,091:0.04079964524901706,1:0.03488226667358313,10g:0.03471651100042406,07:0.03051583712303273,10.30am:0.029957963431693112,1171:0.028424194208528646,10.4.10:0.028173810240271588}
>
> Can someone give me an explanation of the above
>
> In the Mahout in Action book there was a table which displayed topic with top
> terms, how would I go from the above to something like that. i.e.
> topic 0 -> term1, term2 term3….termN
> topic 1 -> term1, term2 term3….termN
> etc.
>
>
> On 19 Apr 2013, at 10:19, Chris Harrington wrote:
>
>> Found the issue: it was the folder I gave it for outputting the matrix in the
>> rowid command. For cvb I gave it ./contentDataDir/matrix as the matrix
>> location; instead I should have supplied ./contentDataDir/matrix/matrix
>>
>> On 17 Apr 2013, at 12:46, Chris Harrington wrote:
>>
>>> So I've got 0.8 now but I'm running into an error,
>>>
>>> ../../workspace2/trunk/bin/mahout seqdirectory -i
>>> ./contentDataDir/output-content-segment -o ./contentDataDir/sequenced
>>>
>>> ../../workspace2/trunk/bin/mahout seq2sparse -i ./contentDataDir/sequenced
>>> -o ./contentDataDir/sparseVectors --namedVector -wt tf
>>>
>>> ../../workspace2/trunk/bin/mahout rowid -i
>>> ./contentDataDir/sparseVectors/tf-vectors/ -o ./contentDataDir/matrix
>>>
>>> ../../workspace2/trunk/bin/mahout cvb -i ./contentDataDir/matrix -o
>>> cvb-output -k 100 -x 1 -dict
>>> ./contentDataDir/sparseVectors/dictionary.file-0 -dt cvb-topic-doc -mt
>>> cvb-topic-model
>>>
>>> but the cvb command hits a class cast exception
>>>
>>> java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to
>>> org.apache.mahout.math.VectorWritable
>>> at
>>> org.apache.mahout.clustering.lda.cvb.CachingCVB0Mapper.map(CachingCVB0Mapper.java:55)
>>> at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
>>> at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
>>> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
>>> at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
>>> at java.security.AccessController.doPrivileged(Native Method)
>>> at javax.security.auth.Subject.doAs(Subject.java:396)
>>> at
>>> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1136)
>>> at org.apache.hadoop.mapred.Child.main(Child.java:249)
>>>
>>> I thought seq2sparse took care of turning Hadoop Text into Mahout's
>>> VectorWritable. Where have I gone wrong?
>>>
>>>
>>>
>>> On 16 Apr 2013, at 14:45, Jake Mannix wrote:
>>>
>>>> You should just be building off of trunk (0.8-snapshot) in which case you
>>>> should be working just fine.
>>>>
>>>>
>>>> On Tue, Apr 16, 2013 at 6:43 AM, Chris Harrington wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> I've been trying to get the vector dumper to work on the output from cvb
>>>>> but it's throwing lots of errors.
>>>>>
>>>>> I found several old mails on the mailing list regarding this issue,
>>>>> specifically this
>>>>>
>>>>>
>>>>> http://mail-archives.apache.org/mod_mbox/mahout-user/201211.mbox/%3CCAHSfFsy2oWRuzwVzGW57LRYaJ+LuudNu-W5EO0wnV_ff=uy...@mail.gmail.com%3E
>>>>>
>>>>> That thread is a bit old so I was wondering was there a patch or anything
>>>>> to fix it or do I need to use the 0.8-snapshot?
>>>>
>>>>
>>>>
>>>>
>>>> --
>>>>
>>>> -jake
>>>
>>
>
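
On the "topic 0 -> term1, term2 ..." table asked about in the thread: one way to get there from the text that vectordump writes is a small post-processing script. The following is only a rough Python sketch, not part of Mahout. It assumes each row looks like the {term:probability,...} line quoted above, that no term contains a comma, and that rows appear in topic order (none of which is confirmed in this thread). The names parse_vector_line and top_terms are made up for illustration.

```python
def parse_vector_line(line):
    """Parse one dumped row of the assumed form {term:weight,term:weight,...}.

    Anything before the first '{' (for example a printed key) is ignored.
    """
    start, end = line.find("{"), line.rfind("}")
    if start == -1 or end == -1:
        raise ValueError("no {...} vector found in line")
    pairs = []
    for entry in line[start + 1:end].split(","):
        if not entry:
            continue  # empty vector or stray comma
        # Split on the LAST colon so that a term containing ':' stays intact.
        term, weight = entry.rsplit(":", 1)
        pairs.append((term, float(weight)))
    return pairs


def top_terms(line, n=10):
    """Return the n highest-weighted terms of one row, best first."""
    pairs = sorted(parse_vector_line(line), key=lambda p: p[1], reverse=True)
    return [term for term, _ in pairs[:n]]


if __name__ == "__main__":
    # First three entries of the row quoted in the thread.
    rows = ["{1.0:0.0689751034234147,0hu:0.052798138507741114,06:0.046108327846619585}"]
    # Assumption: the row position is used as the topic id.
    for topic_id, row in enumerate(rows):
        print("topic %d -> %s" % (topic_id, ", ".join(top_terms(row, n=3))))
    # prints: topic 0 -> 1.0, 0hu, 06
```

The same parsing would apply to the dt output, where the left-hand side of each pair is a topic id instead of a term, so the highest-weighted entry of a row would be that document's dominant topic.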
