On Sun, Jul 29, 2012 at 10:08 AM, Folcon Red <[email protected]> wrote:
> Hi Guys,
>
> Thanks for replying, the problem is whenever I use any -s flag I get the
> error "Unexpected -s while processing Job-Specific Options:"

-s is the old way of specifying input (short for "sequencefile"); it's now
--input or -i.

> Also I'm not sure if this is supposed to be the output of -dt:
>
> sgeadmin@master:~$ dumbo ls /user/sgeadmin/text_cvb_document -hadoop starcluster
> Found 3 items
> -rw-r--r--   3 sgeadmin supergroup   0 2012-07-29 16:51 /user/sgeadmin/text_cvb_document/_SUCCESS
> drwxr-xr-x   - sgeadmin supergroup   0 2012-07-29 16:50 /user/sgeadmin/text_cvb_document/_logs
> -rw-r--r--   3 sgeadmin supergroup  97 2012-07-29 16:51 /user/sgeadmin/text_cvb_document/part-m-00000
>
> Should I be using a newer version of Mahout? I've just been using the 0.7
> distribution so far, as apparently the compiled versions are missing parts
> that the distributed ones have.
>
> Kind Regards,
> Folcon
>
> PS: Thanks for the help so far!
>
> On 29 July 2012 04:52, Jake Mannix <[email protected]> wrote:
>
>> On Sat, Jul 28, 2012 at 6:40 PM, DAN HELM <[email protected]> wrote:
>>
>>> Hi Folcon,
>>>
>>> In the folder you specified for the -dt option of the cvb command,
>>> there should be sequence files with the document-to-topic associations
>>> (Key: IntWritable, Value: VectorWritable).
>>
>> Yeah, this is correct, although this:
>>
>>> You can dump in text format as: mahout seqdumper -s <sequence file>
>>
>> is not as good as using vectordump:
>>
>>   mahout vectordump -s <sequence file> \
>>     --dictionary <path to dictionary.file-0> \
>>     --dictionaryType seqfile \
>>     --vectorSize <num entries per topic you want to see> -sort
>>
>> This joins your topic vectors with the dictionary, then picks out the top
>> k terms (with their probabilities) for each topic and prints them to the
>> console (or to the file you specify with an --output option).
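To make the join-and-top-k step concrete, here is a toy shell sketch of what
that amounts to. The plain-text file formats below are invented purely for
illustration -- vectordump itself reads binary SequenceFiles, not files like
these:

```shell
# Toy stand-ins for one topic's term distribution and the dictionary.
# These plain-text formats are invented for illustration; real Mahout
# topic models and dictionaries are binary SequenceFiles.
cat > topic0.txt <<'EOF'
0 0.01
1 0.62
2 0.05
3 0.30
EOF
cat > dict.txt <<'EOF'
0 apple
1 banana
2 cherry
3 damson
EOF

# Join term ids with their words, sort by probability descending, and
# keep the top 2 -- roughly the effect of --dictionary, -sort, and
# --vectorSize applied per topic.
join topic0.txt dict.txt | sort -k2 -rn | head -n 2
```

The last line prints the two most probable terms for the toy topic, each with
its probability, which is the shape of output you should expect from
vectordump per topic.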
>>
>> *although* I notice now that in trunk, when I just checked,
>> VectorDumper.java had a bug in it for "vectorSize": line 175 asks for the
>> cmdline option "numIndexesPerVector", not vectorSize, ack! So I took the
>> liberty of fixing that, but you'll need to "svn up" and rebuild your jar
>> before using vectordump like this.
>>
>>> So in the text output from seqdumper, the key is a document id and the
>>> vector contains the topics and the scores associated with that
>>> document. I think all topics are listed for each document, but many
>>> with a near-zero score.
>>>
>>> In my case I used rowid to convert the keys of the original sparse
>>> document vectors from Text to Integer before running cvb. This
>>> generates a mapping file, so I know the textual keys that correspond to
>>> the numeric document ids (since my original document ids were file
>>> names and I created named vectors).
>>>
>>> Hope this helps.
>>>
>>> Dan
>>>
>>> ________________________________
>>>
>>> From: Folcon <[email protected]>
>>> To: [email protected]
>>> Sent: Saturday, July 28, 2012 8:28 PM
>>> Subject: Using Mahout to train a CVB and retrieve its topics
>>>
>>> Hi Everyone,
>>>
>>> I'm posting this as my original message did not seem to appear on the
>>> mailing list; I'm very sorry if I have done this in error.
>>>
>>> I'm doing this to then use the topics to train a maxent algorithm to
>>> predict the classes of documents given their topic mixtures. Any
>>> further aid in this direction would be appreciated!
>>>
>>> I've been trying to extract the topics out of my run of cvb. Here's
>>> what I've done so far.
>>>
>>> Ok, so I still don't know how to output the topics, but I have worked
>>> out how to run cvb and get what I think are the document vectors.
>>> However, I'm not having any luck dumping them, so help here would still
>>> be appreciated!
>>>
>>> I set the values of:
>>>   export MAHOUT_HOME=/home/sgeadmin/mahout
>>>   export HADOOP_HOME=/usr/lib/hadoop
>>>   export JAVA_HOME=/usr/lib/jvm/java-6-openjdk
>>>   export HADOOP_CONF_DIR=$HADOOP_HOME/conf
>>> on the master, otherwise none of this works.
>>>
>>> So first I uploaded the documents using StarCluster's put:
>>>   starcluster put mycluster text_train /home/sgeadmin/
>>>   starcluster put mycluster text_test /home/sgeadmin/
>>>
>>> Then I added them to Hadoop's filesystem (HDFS):
>>>   dumbo put /home/sgeadmin/text_train /user/sgeadmin/ -hadoop starcluster
>>>
>>> Then I called Mahout's seqdirectory to turn the text into sequence
>>> files:
>>>   $MAHOUT_HOME/bin/mahout seqdirectory --input /user/sgeadmin/text_train \
>>>     --output /user/sgeadmin/text_seq -c UTF-8 -ow
>>>
>>> Then I called Mahout's seq2sparse to turn them into vectors:
>>>   $MAHOUT_HOME/bin/mahout seq2sparse -i text_seq -o /user/sgeadmin/text_vec \
>>>     -wt tf -a org.apache.lucene.analysis.WhitespaceAnalyzer -ow
>>>
>>> Finally I called cvb. I believe the -dt flag states where the inferred
>>> document/topic distributions should go, but because I haven't yet been
>>> able to dump them I can't confirm this.
>>>   $MAHOUT_HOME/bin/mahout cvb -i /user/sgeadmin/text_vec/tf-vectors \
>>>     -o /user/sgeadmin/text_lda -k 100 -nt 29536 -x 20 \
>>>     -dict /user/sgeadmin/text_vec/dictionary.file-0 \
>>>     -dt /user/sgeadmin/text_cvb_document -mt /user/sgeadmin/text_states
>>>
>>> The -k flag is the number of topics and -nt is the size of the
>>> dictionary; I computed -nt by counting the number of entries in
>>> dictionary.file-0 inside the vectors directory (in this case
>>> /user/sgeadmin/text_vec). -x is the number of iterations.
>>>
>>> If you know how to get the document topic probabilities from here,
>>> help would be most appreciated!
>>>
>>> Kind Regards,
>>> Folcon
>>
>>
>> --
>>   -jake

--
  -jake
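A small follow-up on computing -nt: rather than counting dictionary entries
by hand, you can count them from seqdumper's text output. The sketch below
uses a fake two-entry dump file in place of a real seqdumper run, and the
one-entry-per-"Key:"-line format is an assumption worth checking against your
Mahout version:

```shell
# Stand-in for the text that `mahout seqdumper` prints for a dictionary
# file (format assumed here; verify against your Mahout version).
cat > dict_dump.txt <<'EOF'
Key: 0: Value: apple
Key: 1: Value: banana
EOF

# Each dictionary entry occupies one "Key:" line, so the entry count
# (the value to pass to cvb as -nt) is:
grep -c '^Key:' dict_dump.txt
```

With real data you would pipe the dump straight through instead of using a
file, along the lines of `$MAHOUT_HOME/bin/mahout seqdumper -i
/user/sgeadmin/text_vec/dictionary.file-0 | grep -c '^Key:'` (using -i for
input, per Jake's note above).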
