So part-r-00000 inside text_vec still reads as just "SEQ org.apache.hadoop.io.Text org.apache.mahout.math.VectorWritable" even after moving all the training files into a single folder.
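If the nested class folders really are the problem, one workaround is to flatten them before running seqdirectory, prefixing each file with its class folder name so names from different classes cannot collide. A minimal sketch; the two demo files and the text_train_flat name are invented for illustration:

```shell
# Hypothetical demo layout standing in for the real text_train tree.
mkdir -p text_train/A text_train/B text_train_flat
echo "doc one" > text_train/A/100
echo "doc two" > text_train/B/102

# Copy every file up one level, prefixed with its class folder name.
for class_dir in text_train/*/; do
  class=$(basename "$class_dir")
  for f in "$class_dir"*; do
    cp "$f" "text_train_flat/${class}_$(basename "$f")"
  done
done

ls text_train_flat    # A_100  B_102
```

The flattened folder can then be handed to seqdirectory as a single-level input, and the class label is still recoverable from each document id.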
Regards,
Folcon

On 31 July 2012 18:18, Folcon Red <[email protected]> wrote:

> Hey Everyone,
>
> OK, I'm not certain why
>
> $MAHOUT_HOME/bin/mahout seqdirectory --input /user/sgeadmin/text_train --output /user/sgeadmin/text_seq -c UTF-8 -ow
>
> didn't produce sequence files; looking inside text_seq only gives me:
>
> SEQ org.apache.hadoop.io.Text org.apache.hadoop.io.Text
>
> and that's it. Any ideas what I've been doing wrong? Maybe it's because I
> have the files nested in the folder by class. For example, a tree view of
> the directory would look like:
>
> text_train -+
>             | A -+
>             |     100
>             |     101
>             |     103
>             | B -+
>             |     102
>             |     105
>             |     106
>
> So it's not picking them up? Or perhaps something else? I'm going to try
> some variations to see what happens.
>
> Thanks for the help so far!
>
> Regards,
> Folcon
>
> On 29 July 2012 22:10, Folcon Red <[email protected]> wrote:
>
>> Right, well here's something promising: running
>>
>> $MAHOUT_HOME/bin/mahout seqdumper -i /user/sgeadmin/text_lda/part-m-00000
>>
>> produced:
>>
>> 7:NaN,29478:NaN,29479:NaN,29480:NaN,29481:NaN,29482:NaN,29483:NaN,29484:NaN,29485:NaN,29486:NaN,29487:NaN,29488:NaN,29489:NaN,29490:NaN,29491:NaN,29492:NaN,29493:NaN,29494:NaN,29495:NaN,29496:NaN,29497:NaN,29498:NaN,29499:NaN,29500:NaN,29501:NaN,29502:NaN,29503:NaN,29504:NaN,29505:NaN,29506:NaN,29507:NaN,29508:NaN,29509:NaN,29510:NaN,29511:NaN,29512:NaN,29513:NaN,29514:NaN,29515:NaN,29516:NaN,29517:NaN,29518:NaN,29519:NaN,29520:NaN,29521:NaN,29522:NaN,29523:NaN,29524:NaN,29525:NaN,29526:NaN,29527:NaN,29528:NaN,29529:NaN,29530:NaN,29531:NaN,29532:NaN,29533:NaN,29534:NaN,29535:NaN}
>>
>> And
>>
>> $MAHOUT_HOME/bin/mahout seqdumper -i /user/sgeadmin/text_vec/tf-vectors/part-r-00000
>>
>> produced:
>>
>> MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
>> Running on hadoop, using /usr/lib/hadoop/bin/hadoop and HADOOP_CONF_DIR=/usr/lib/hadoop/conf
>> MAHOUT-JOB: /home/sgeadmin/mahout/mahout-examples-0.7-job.jar
>> 12/07/29 21:09:17 INFO common.AbstractJob: Command line arguments: {--endPhase=[2147483647], --input=[/user/sgeadmin/text_vec/tf-vectors/part-r-00000], --startPhase=[0], --tempDir=[temp]}
>> Input Path: /user/sgeadmin/text_vec/tf-vectors/part-r-00000
>> Key class: class org.apache.hadoop.io.Text Value Class: class org.apache.mahout.math.VectorWritable
>> Count: 0
>>
>> Kind Regards,
>> Folcon
>>
>> On 29 July 2012 21:29, DAN HELM <[email protected]> wrote:
>>
>>> Yep, something went wrong, most likely with the clustering; the part file is
>>> empty. It should look something like this:
>>>
>>> Key class: class org.apache.hadoop.io.IntWritable Value Class: class org.apache.mahout.math.VectorWritable
>>> Key: 0: Value: {0:0.06475650422868284,1:0.010728747158503565,2:0.005463535698651016,3:0.023451709705466457}
>>> Key: 1: Value: {0:0.01838885430227092,1:0.05068404879399544,2:0.002110418548647133,3:0.005566514441743756}
>>> Key: 2: Value: {0:0.018575587065216153,1:1.236602313900785E-5,2:8.654629660837919E-6,3:5.820637306957196E-6}
>>> ...
>>>
>>> The Key refers to a document id, and the Value holds the topic id:weight
>>> pairs assigned to that document.
>>>
>>> So you need to figure out where things went wrong. I assume folder
>>> /user/sgeadmin/text_lda also has empty part files? Assuming the part
>>> files are there, run seqdumper on one. It should have data like the
>>> above, except in this case the key will be a topic id and the vector
>>> will hold term id:weight pairs.
>>>
>>> You can also check folder /user/sgeadmin/text_vec/tf-vectors to make
>>> sure sparse vectors were generated for your input to cvb.
>>> Dan
>>>
>>> *From:* Folcon Red <[email protected]>
>>> *To:* DAN HELM <[email protected]>
>>> *Cc:* Jake Mannix <[email protected]>; "[email protected]" <[email protected]>
>>> *Sent:* Sunday, July 29, 2012 3:35 PM
>>> *Subject:* Re: Using Mahout to train a CVB and retrieve its topics
>>>
>>> Thanks Dan and Jake,
>>>
>>> The output I got from
>>>
>>> $MAHOUT_HOME/bin/mahout seqdumper -i /user/sgeadmin/text_cvb_document/part-m-00000
>>>
>>> is:
>>>
>>> Input Path: /user/sgeadmin/text_cvb_document/part-m-00000
>>> Key class: class org.apache.hadoop.io.IntWritable Value Class: class org.apache.mahout.math.VectorWritable
>>> Count: 0
>>>
>>> I'm not certain what went wrong.
>>>
>>> Kind Regards,
>>> Folcon
>>>
>>> On 29 July 2012 18:49, DAN HELM <[email protected]> wrote:
>>>
>>> Folcon,
>>>
>>> I'm still using Mahout 0.6, so I don't know much about the changes in 0.7.
>>>
>>> Your output folder for "-dt" looks correct. The relevant data would be
>>> in /user/sgeadmin/text_cvb_document/part-m-00000, which is what I would
>>> be passing to a "-s" option. But I see it says the size is only 97, so
>>> that looks suspicious. You can just view the file (for starters) as:
>>>
>>> mahout seqdumper -s /user/sgeadmin/text_cvb_document/part-m-00000
>>>
>>> And the vector dumper command (as Jake pointed out) has a lot more
>>> options to post-process the data, but you may want to first just see
>>> what is in that file.
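The "Count: 0" symptom that keeps coming up in this thread can be checked for mechanically once seqdumper's output is saved to a file. A minimal sketch; dump.txt here is a hand-made stand-in mimicking the seqdumper output quoted above, not real cluster output:

```shell
# Stand-in for "mahout seqdumper -i <part file>" output saved to disk;
# the lines copy the shape of the dumps quoted in this thread.
cat > dump.txt <<'EOF'
Input Path: /user/sgeadmin/text_cvb_document/part-m-00000
Key class: class org.apache.hadoop.io.IntWritable Value Class: class org.apache.mahout.math.VectorWritable
Count: 0
EOF

# Pull the record count from the trailing "Count:" line; zero means the
# upstream job wrote a header-only (empty) sequence file.
count=$(awk '/^Count:/ {print $2}' dump.txt)
if [ "$count" -eq 0 ]; then
  echo "empty sequence file: the job that wrote it produced no records"
fi
```

A guard like this after each pipeline stage (seqdirectory, seq2sparse, cvb) would have localized which step first produced an empty part file.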
>>> Dan
>>>
>>> *From:* Folcon Red <[email protected]>
>>> *To:* Jake Mannix <[email protected]>
>>> *Cc:* [email protected]; DAN HELM <[email protected]>
>>> *Sent:* Sunday, July 29, 2012 1:08 PM
>>> *Subject:* Re: Using Mahout to train a CVB and retrieve its topics
>>>
>>> Hi Guys,
>>>
>>> Thanks for replying. The problem is that whenever I use any -s flag I
>>> get the error "Unexpected -s while processing Job-Specific Options:".
>>>
>>> Also, I'm not sure if this is supposed to be the output of -dt:
>>>
>>> sgeadmin@master:~$ dumbo ls /user/sgeadmin/text_cvb_document -hadoop starcluster
>>> Found 3 items
>>> -rw-r--r--  3 sgeadmin supergroup   0 2012-07-29 16:51 /user/sgeadmin/text_cvb_document/_SUCCESS
>>> drwxr-xr-x  - sgeadmin supergroup   0 2012-07-29 16:50 /user/sgeadmin/text_cvb_document/_logs
>>> -rw-r--r--  3 sgeadmin supergroup  97 2012-07-29 16:51 /user/sgeadmin/text_cvb_document/part-m-00000
>>>
>>> Should I be using a newer version of Mahout? I've just been using the
>>> 0.7 distribution so far, as apparently the compiled versions are missing
>>> parts that the distributed ones have.
>>>
>>> Kind Regards,
>>> Folcon
>>>
>>> PS: Thanks for the help so far!
>>>
>>> On 29 July 2012 04:52, Jake Mannix <[email protected]> wrote:
>>>
>>> On Sat, Jul 28, 2012 at 6:40 PM, DAN HELM <[email protected]> wrote:
>>>
>>> Hi Folcon,
>>>
>>> In the folder you specified for the -dt option of the cvb command,
>>> there should be sequence files with the document-to-topic associations
>>> (Key: IntWritable, Value: VectorWritable).
>>> Yeah, this is correct, although this:
>>>
>>> You can dump in text format as: mahout seqdumper -s <sequence file>
>>>
>>> is not as good as using vectordump:
>>>
>>> mahout vectordump -s <sequence file> --dictionary <path to dictionary.file-0> \
>>>   --dictionaryType seqfile --vectorSize <num entries per topic you want to see> -sort
>>>
>>> This joins your topic vectors with the dictionary, then picks out the
>>> top k terms (with their probabilities) for each topic and prints them
>>> to the console (or to the file you specify with an --output option).
>>>
>>> *Although* I notice now, when I just checked trunk, that VectorDumper.java
>>> had a bug in it for "vectorSize": line 175 asks for the cmdline option
>>> "numIndexesPerVector", not vectorSize, ack! So I took the liberty of
>>> fixing that, but you'll need to "svn up" and rebuild your jar before
>>> using vectordump like this.
>>>
>>> So in the text output from seqdumper, the key is a document id and the
>>> vector contains the topics and their associated scores for that
>>> document. I think all topics are listed for each document, but many
>>> with a near-zero score.
>>>
>>> In my case I used rowid to convert the keys of the original sparse
>>> document vectors from Text to Integer before running cvb. This
>>> generates a mapping file, so I know the textual keys that correspond to
>>> the numeric document ids (since my original document ids were file
>>> names and I created named vectors).
>>>
>>> Hope this helps.
>>> Dan
>>>
>>> ________________________________
>>>
>>> From: Folcon <[email protected]>
>>> To: [email protected]
>>> Sent: Saturday, July 28, 2012 8:28 PM
>>> Subject: Using Mahout to train a CVB and retrieve its topics
>>>
>>> Hi Everyone,
>>>
>>> I'm posting this as my original message did not seem to appear on the
>>> mailing list. I'm very sorry if I have done this in error.
>>> I'm doing this so I can then use the topics to train a maxent algorithm
>>> to predict the classes of documents given their topic mixtures. Any
>>> further aid in this direction would be appreciated!
>>>
>>> I've been trying to extract the topics out of my run of cvb. Here's
>>> what I have done so far.
>>>
>>> OK, so I still don't know how to output the topics, but I have worked
>>> out how to get the cvb output and what I think are the document
>>> vectors. However, I'm not having any luck dumping them, so help here
>>> would still be appreciated!
>>>
>>> I set the values of:
>>>
>>> export MAHOUT_HOME=/home/sgeadmin/mahout
>>> export HADOOP_HOME=/usr/lib/hadoop
>>> export JAVA_HOME=/usr/lib/jvm/java-6-openjdk
>>> export HADOOP_CONF_DIR=$HADOOP_HOME/conf
>>>
>>> on the master; otherwise none of this works.
>>>
>>> So first I uploaded the documents using StarCluster's put:
>>>
>>> starcluster put mycluster text_train /home/sgeadmin/
>>> starcluster put mycluster text_test /home/sgeadmin/
>>>
>>> Then I added them to Hadoop's HDFS filesystem:
>>>
>>> dumbo put /home/sgeadmin/text_train /user/sgeadmin/ -hadoop starcluster
>>>
>>> Then I called Mahout's seqdirectory to turn the text into sequence files:
>>>
>>> $MAHOUT_HOME/bin/mahout seqdirectory --input /user/sgeadmin/text_train --output /user/sgeadmin/text_seq -c UTF-8 -ow
>>>
>>> Then I called Mahout's seq2sparse to turn them into vectors:
>>>
>>> $MAHOUT_HOME/bin/mahout seq2sparse -i text_seq -o /user/sgeadmin/text_vec -wt tf -a org.apache.lucene.analysis.WhitespaceAnalyzer -ow
>>>
>>> Finally I called cvb. I believe that the -dt flag states where the
>>> inferred topics should go, but because I haven't yet been able to dump
>>> them I can't confirm this.
>>> $MAHOUT_HOME/bin/mahout cvb -i /user/sgeadmin/text_vec/tf-vectors -o /user/sgeadmin/text_lda -k 100 -nt 29536 -x 20 -dict /user/sgeadmin/text_vec/dictionary.file-0 -dt /user/sgeadmin/text_cvb_document -mt /user/sgeadmin/text_states
>>>
>>> The -k flag is the number of topics, and the -nt flag is the size of
>>> the dictionary. I computed this by counting the number of entries in
>>> dictionary.file-0 inside the vectors folder (in this case under
>>> /user/sgeadmin/text_vec), and -x is the number of iterations.
>>>
>>> If you know how to get the document topic probabilities from here,
>>> help would be most appreciated!
>>>
>>> Kind Regards,
>>> Folcon
>>>
>>> --
>>> -jake
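A note on computing -nt: counting dictionary entries by hand is unnecessary, since the trailing "Count:" line that seqdumper prints (visible in the outputs quoted above) already gives the number of terms, and the "Key: <term>: Value: <id>" entry lines can be counted directly. A sketch; dict_dump.txt and its three terms are a hand-made stand-in for a real dumped dictionary.file-0:

```shell
# Stand-in for "mahout seqdumper -i .../dictionary.file-0" output saved
# to a file; the terms are invented, but the line shapes match the
# seqdumper output quoted in this thread.
cat > dict_dump.txt <<'EOF'
Input Path: /user/sgeadmin/text_vec/dictionary.file-0
Key class: class org.apache.hadoop.io.Text Value Class: class org.apache.hadoop.io.IntWritable
Key: apple: Value: 0
Key: banana: Value: 1
Key: cherry: Value: 2
Count: 3
EOF

# Either read the trailing Count: line...
awk '/^Count:/ {print $2}' dict_dump.txt    # 3
# ...or count the "Key: <term>: Value: <id>" entry lines directly.
grep -c '^Key: ' dict_dump.txt              # 3
```

Note that the pattern '^Key: ' (with the colon) deliberately skips the "Key class:" header line, so both methods agree.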
