Thanks Dan and Jake,

The output I got from $MAHOUT_HOME/bin/mahout seqdumper -i /user/sgeadmin/text_cvb_document/part-m-00000 is:
Input Path: /user/sgeadmin/text_cvb_document/part-m-00000
Key class: class org.apache.hadoop.io.IntWritable Value Class: class org.apache.mahout.math.VectorWritable
Count: 0

I'm not certain what went wrong.

Kind Regards,
Folcon

On 29 July 2012 18:49, DAN HELM <[email protected]> wrote:

> Folcon,
>
> I'm still using Mahout 0.6 so don't know much about changes in 0.7.
>
> Your output folder for "dt" looks correct. The relevant data would be
> in /user/sgeadmin/text_cvb_document/part-m-00000 which is what I would
> be passing to a "-s" option. But I see it says size is only 97 so that
> looks suspicious. So you can just view file (for starters) as:
>
>     mahout seqdumper -s /user/sgeadmin/text_cvb_document/part-m-00000
>
> And the vector dumper command (as Jake pointed out) has a lot more
> options to post-process the data but you may want to first just see
> what is in that file.
>
> Dan
>
> *From:* Folcon Red <[email protected]>
> *To:* Jake Mannix <[email protected]>
> *Cc:* [email protected]; DAN HELM <[email protected]>
> *Sent:* Sunday, July 29, 2012 1:08 PM
> *Subject:* Re: Using Mahout to train an CVB and retrieve it's topics
>
> Hi Guys,
>
> Thanks for replying, the problem is whenever I use any -s flag I get the
> error "Unexpected -s while processing Job-Specific Options:"
>
> Also I'm not sure if this is supposed to be the output of -dt
>
> sgeadmin@master:~$ dumbo ls /user/sgeadmin/text_cvb_document -hadoop starcluster
> Found 3 items
> -rw-r--r--   3 sgeadmin supergroup   0 2012-07-29 16:51 /user/sgeadmin/text_cvb_document/_SUCCESS
> drwxr-xr-x   - sgeadmin supergroup   0 2012-07-29 16:50 /user/sgeadmin/text_cvb_document/_logs
> -rw-r--r--   3 sgeadmin supergroup  97 2012-07-29 16:51 /user/sgeadmin/text_cvb_document/part-m-00000
>
> Should I be using a newer version of mahout? I've just been using the 0.7
> distribution so far as apparently the compiled versions are missing parts
> that the distributed ones have.
>
> Kind Regards,
> Folcon
>
> PS: Thanks for the help so far!
>
> On 29 July 2012 04:52, Jake Mannix <[email protected]> wrote:
>
> > On Sat, Jul 28, 2012 at 6:40 PM, DAN HELM <[email protected]> wrote:
> >
> > > Hi Folcon,
> > >
> > > In the folder you specified for the -dt option for cvb command
> > > there should be sequence files with the document to topic
> > > associations (Key: IntWritable, Value: VectorWritable).
> >
> > Yeah, this is correct, although this:
> >
> > > You can dump in text format as: mahout seqdumper -s <sequence file>
> >
> > is not as good as using vectordumper:
> >
> >     mahout vectordump -s <sequence file> --dictionary <path to dictionary.file-0> \
> >         --dictionaryType seqfile --vectorSize <num entries per topic you want to see> -sort
> >
> > This joins your topic vectors with the dictionary, then picks out the
> > top k terms (with their probabilities) for each topic and prints them
> > to the console (or to the file you specify with an --output option).
> >
> > *although* I notice now that in trunk when I just checked,
> > VectorDumper.java had a bug in it for "vectorSize" - line 175 asks for
> > cmdline option "numIndexesPerVector" not vectorSize, ack! So I took the
> > liberty of fixing that, but you'll need to "svn up" and rebuild your
> > jar before using vectordump like this.
> >
> > > So in text output from seqdumper, the key is a document id and the
> > > vector contains the topics and associated scores associated with the
> > > document. I think all topics are listed for each document but many
> > > with near zero score.
> > >
> > > In my case I used rowid to convert keys of original sparse document
> > > vectors from Text to Integer before running cvb and this generates a
> > > mapping file so I know the textual keys that correspond to the
> > > numeric document ids (since my original document ids were file names
> > > and I created named vectors).
> > >
> > > Hope this helps.
> > > Dan
> > >
> > > ________________________________
> > >
> > > From: Folcon <[email protected]>
> > > To: [email protected]
> > > Sent: Saturday, July 28, 2012 8:28 PM
> > > Subject: Using Mahout to train an CVB and retrieve it's topics
> > >
> > > Hi Everyone,
> > >
> > > I'm posting this as my original message did not seem to appear on the
> > > mailing list, I'm very sorry if I have done this in error.
> > >
> > > I'm doing this to then use the topics to train a maxent algorithm to
> > > predict the classes of documents given their topic mixtures. Any
> > > further aid in this direction would be appreciated!
> > >
> > > I've been trying to extract the topics out of my run of cvb. Here's
> > > what I did so far.
> > >
> > > Ok, so I still don't know how to output the topics, but I have worked
> > > out how to get the cvb and what I think are the document vectors,
> > > however I'm not having any luck dumping them, so help here would
> > > still be appreciated!
> > >
> > > I set the values of:
> > >
> > >     export MAHOUT_HOME=/home/sgeadmin/mahout
> > >     export HADOOP_HOME=/usr/lib/hadoop
> > >     export JAVA_HOME=/usr/lib/jvm/java-6-openjdk
> > >     export HADOOP_CONF_DIR=$HADOOP_HOME/conf
> > >
> > > on the master otherwise none of this works.
> > >
> > > So first I uploaded the documents using starcluster's put:
> > >
> > >     starcluster put mycluster text_train /home/sgeadmin/
> > >     starcluster put mycluster text_test /home/sgeadmin/
> > >
> > > Then I added them to Hadoop's filesystem (HDFS):
> > >
> > >     dumbo put /home/sgeadmin/text_train /user/sgeadmin/ -hadoop starcluster
> > >
> > > Then I called Mahout's seqdirectory to turn the text into sequence
> > > files:
> > >
> > >     $MAHOUT_HOME/bin/mahout seqdirectory --input /user/sgeadmin/text_train \
> > >         --output /user/sgeadmin/text_seq -c UTF-8 -ow
> > >
> > > Then I called Mahout's seq2sparse to turn them into vectors:
> > >
> > >     $MAHOUT_HOME/bin/mahout seq2sparse -i text_seq -o /user/sgeadmin/text_vec \
> > >         -wt tf -a org.apache.lucene.analysis.WhitespaceAnalyzer -ow
> > >
> > > Finally I called cvb. I believe that the -dt flag states where the
> > > inferred topics should go, but because I haven't yet been able to
> > > dump them I can't confirm this.
> > >
> > >     $MAHOUT_HOME/bin/mahout cvb -i /user/sgeadmin/text_vec/tf-vectors \
> > >         -o /user/sgeadmin/text_lda -k 100 -nt 29536 -x 20 \
> > >         -dict /user/sgeadmin/text_vec/dictionary.file-0 \
> > >         -dt /user/sgeadmin/text_cvb_document \
> > >         -mt /user/sgeadmin/text_states
> > >
> > > The -k flag is the number of topics and the -nt flag is the size of
> > > the dictionary. I computed the latter by counting the number of
> > > entries in the dictionary.file-0 inside the vectors directory (in
> > > this case under /user/sgeadmin/text_vec). -x is the number of
> > > iterations.
> > >
> > > If you know how to get what the document topic probabilities are from
> > > here, help would be most appreciated!
> > >
> > > Kind Regards,
> > > Folcon
> >
> > --
> >
> >   -jake
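
A minimal sketch for checking whether a seqdumper dump is empty, assuming the text output has the same shape as the dump at the top of this message, with the record total on its own line beginning "Count:". The helper name seqdump_count is hypothetical and is not part of Mahout. The sample text is copied from this thread so the sketch runs without a cluster.

```shell
#!/bin/sh
# Hypothetical helper (not part of Mahout): read seqdumper text output on
# stdin and print the number that follows "Count:". Assumes the total is on
# its own line starting with "Count:"; exits non-zero if no such line exists.
seqdump_count() {
  awk '/^Count:/ { print $2; found = 1 }
       END { if (!found) exit 1 }'
}

# Sample input copied from the dump quoted in this thread.
sample='Input Path: /user/sgeadmin/text_cvb_document/part-m-00000
Key class: class org.apache.hadoop.io.IntWritable Value Class: class org.apache.mahout.math.VectorWritable
Count: 0'

count=$(printf '%s\n' "$sample" | seqdump_count)

if [ "$count" -eq 0 ]; then
  echo "no records in the dump"
else
  echo "$count records in the dump"
fi
```

Against a real file the same helper could be fed from the command already used above, for example: $MAHOUT_HOME/bin/mahout seqdumper -i /user/sgeadmin/text_cvb_document/part-m-00000 | seqdump_count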
