Folcon, I'm still using Mahout 0.6, so I don't know much about the changes in
0.7. Your output folder for "-dt" looks correct. The relevant data would be in
/user/sgeadmin/text_cvb_document/part-m-00000, which is what I would pass to
the "-s" option. But I see its size is only 97 bytes, which looks suspicious.

So for starters you can just view the file as:

  mahout seqdumper -s /user/sgeadmin/text_cvb_document/part-m-00000

The vectordump command (as Jake pointed out) has a lot more options to
post-process the data, but you may want to first just see what is in that
file.

Dan
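[Editor's note] If seqdumper runs successfully, its text output for the -dt folder should have lines of the shape `Key: <docid>: Value: {topic:score,...}` (one per document). A quick way to pick each document's highest-scoring topic is sketched below. This is a hedged sketch: the sample `doctopics.txt` lines are fabricated stand-ins, not Folcon's data, and the exact seqdumper line layout is assumed from Mahout 0.6-era output. (Also, if 0.7 rejects `-s` with "Unexpected -s", the input flag may have been renamed to `-i` in that release; worth trying.)

```shell
# Fabricated stand-in for seqdumper text output; the real file would come from
# something like:
#   mahout seqdumper -s /user/sgeadmin/text_cvb_document/part-m-00000 -o doctopics.txt
cat > doctopics.txt <<'EOF'
Key: 0: Value: {0:0.91,1:0.05,2:0.04}
Key: 1: Value: {0:0.02,1:0.88,2:0.10}
EOF

# Print the highest-probability topic for each document.
awk '/^Key:/ {
  docid = $2; sub(/:$/, "", docid)          # field 2 is "0:" -> "0"
  split($0, parts, "Value: ")
  vec = parts[2]; gsub(/[{}]/, "", vec)     # strip braces around the vector
  n = split(vec, pairs, ",")
  best = ""; bestp = -1
  for (i = 1; i <= n; i++) {                # scan topic:score pairs
    split(pairs[i], kv, ":")
    if (kv[2] + 0 > bestp) { bestp = kv[2] + 0; best = kv[1] }
  }
  print "doc " docid " -> topic " best
}' doctopics.txt > top_topics.txt
cat top_topics.txt
```

For the sample above this prints `doc 0 -> topic 0` and `doc 1 -> topic 1`; the same one-liner would work on a real dump once the file format is confirmed.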
________________________________
From: Folcon Red <[email protected]>
To: Jake Mannix <[email protected]>
Cc: [email protected]; DAN HELM <[email protected]>
Sent: Sunday, July 29, 2012 1:08 PM
Subject: Re: Using Mahout to train a CVB and retrieve its topics

Hi Guys,

Thanks for replying. The problem is that whenever I use any -s flag I get the
error "Unexpected -s while processing Job-Specific Options:".

Also, I'm not sure if this is supposed to be the output of -dt:

  sgeadmin@master:~$ dumbo ls /user/sgeadmin/text_cvb_document -hadoop starcluster
  Found 3 items
  -rw-r--r--   3 sgeadmin supergroup    0 2012-07-29 16:51 /user/sgeadmin/text_cvb_document/_SUCCESS
  drwxr-xr-x   - sgeadmin supergroup    0 2012-07-29 16:50 /user/sgeadmin/text_cvb_document/_logs
  -rw-r--r--   3 sgeadmin supergroup   97 2012-07-29 16:51 /user/sgeadmin/text_cvb_document/part-m-00000

Should I be using a newer version of Mahout? I've just been using the 0.7
distribution so far, as apparently the self-compiled versions are missing
parts that the distributed ones have.

Kind Regards,
Folcon

PS: Thanks for the help so far!

On 29 July 2012 04:52, Jake Mannix <[email protected]> wrote:
>
> On Sat, Jul 28, 2012 at 6:40 PM, DAN HELM <[email protected]> wrote:
>
>> Hi Folcon,
>>
>> In the folder you specified for the -dt option of the cvb command there
>> should be sequence files with the document-to-topic associations
>> (Key: IntWritable, Value: VectorWritable).
>
> Yeah, this is correct, although this:
>
>> You can dump in text format as: mahout seqdumper -s <sequence file>
>
> is not as good as using vectordump:
>
>   mahout vectordump -s <sequence file> --dictionary <path to dictionary.file-0> \
>     --dictionaryType seqfile --vectorSize <num entries per topic you want to see> -sort
>
> This joins your topic vectors with the dictionary, then picks out the top k
> terms (with their probabilities) for each topic and prints them to the
> console (or to the file you specify with an --output option).
> *although* I notice now that in trunk, when I just checked, VectorDumper.java
> had a bug in it for "vectorSize" - line 175 asks for the cmdline option
> "numIndexesPerVector", not vectorSize, ack! So I took the liberty of fixing
> that, but you'll need to "svn up" and rebuild your jar before using
> vectordump like this.
>
>> So in the text output from seqdumper, the key is a document id and the
>> vector contains the topics and scores associated with that document. I
>> think all topics are listed for each document, but many with near-zero
>> score.
>>
>> In my case I used rowid to convert the keys of the original sparse document
>> vectors from Text to Integer before running cvb. This generates a mapping
>> file, so I know the textual keys that correspond to the numeric document
>> ids (since my original document ids were file names and I created named
>> vectors).
>>
>> Hope this helps.
>> Dan
>>
>> ________________________________
>> From: Folcon <[email protected]>
>> To: [email protected]
>> Sent: Saturday, July 28, 2012 8:28 PM
>> Subject: Using Mahout to train a CVB and retrieve its topics
>>
>> Hi Everyone,
>>
>> I'm posting this as my original message did not seem to appear on the
>> mailing list; I'm very sorry if I have done this in error.
>>
>> I'm doing this to then use the topics to train a maxent algorithm to
>> predict the classes of documents given their topic mixtures. Any further
>> aid in this direction would be appreciated!
>>
>> I've been trying to extract the topics out of my run of cvb. Here's what I
>> have done so far.
>>
>> Ok, so I still don't know how to output the topics, but I have worked out
>> how to get the cvb output and what I think are the document vectors.
>> However, I'm not having any luck dumping them, so help here would still be
>> appreciated!
>> I set the values of:
>>   export MAHOUT_HOME=/home/sgeadmin/mahout
>>   export HADOOP_HOME=/usr/lib/hadoop
>>   export JAVA_HOME=/usr/lib/jvm/java-6-openjdk
>>   export HADOOP_CONF_DIR=$HADOOP_HOME/conf
>> on the master, otherwise none of this works.
>>
>> So first I uploaded the documents using starcluster's put:
>>   starcluster put mycluster text_train /home/sgeadmin/
>>   starcluster put mycluster text_test /home/sgeadmin/
>>
>> Then I added them to Hadoop's HDFS filesystem:
>>   dumbo put /home/sgeadmin/text_train /user/sgeadmin/ -hadoop starcluster
>>
>> Then I called Mahout's seqdirectory to turn the text into sequence files:
>>   $MAHOUT_HOME/bin/mahout seqdirectory --input /user/sgeadmin/text_train \
>>     --output /user/sgeadmin/text_seq -c UTF-8 -ow
>>
>> Then I called Mahout's seq2sparse to turn them into vectors:
>>   $MAHOUT_HOME/bin/mahout seq2sparse -i text_seq -o /user/sgeadmin/text_vec \
>>     -wt tf -a org.apache.lucene.analysis.WhitespaceAnalyzer -ow
>>
>> Finally I called cvb. I believe the -dt flag states where the inferred
>> topics should go, but because I haven't yet been able to dump them I can't
>> confirm this.
>>   $MAHOUT_HOME/bin/mahout cvb -i /user/sgeadmin/text_vec/tf-vectors \
>>     -o /user/sgeadmin/text_lda -k 100 -nt 29536 -x 20 \
>>     -dict /user/sgeadmin/text_vec/dictionary.file-0 \
>>     -dt /user/sgeadmin/text_cvb_document -mt /user/sgeadmin/text_states
>>
>> The -k flag is the number of topics, the -nt flag is the size of the
>> dictionary (I computed this by counting the number of entries in
>> dictionary.file-0 inside the vectors folder, in this case under
>> /user/sgeadmin/text_vec), and -x is the number of iterations.
>>
>> If you know how to get the document topic probabilities from here, help
>> would be most appreciated!
>>
>> Kind Regards,
>> Folcon
>
>
> --
>   -jake
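[Editor's note] On computing the value for -nt: rather than counting dictionary entries by hand, the size can be read off a seqdumper dump of dictionary.file-0, since each entry appears as one `Key: <term>: Value: <index>` line. The sketch below is hedged: the `dict.txt` contents are fabricated stand-ins for such a dump, and the key/value layout is assumed from Mahout's seqdumper text output.

```shell
# In practice you would first dump the dictionary to local text, e.g.:
#   mahout seqdumper -s /user/sgeadmin/text_vec/dictionary.file-0 -o dict.txt
# Fabricated stand-in for such a dump (term -> integer index):
cat > dict.txt <<'EOF'
Key: hello: Value: 0
Key: world: Value: 1
Key: topic: Value: 2
EOF

# Each "Key:" line is one dictionary entry; the count is what gets passed
# as -nt to cvb.
grep -c '^Key:' dict.txt > nt.txt
cat nt.txt
```

For this sample the count is 3; on a real dump the number should match the largest index plus one, which is a useful sanity check before running cvb.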
