Thanks Dan. OK, for some strange reason it now appears to be working (text_seq and text_vec have values; I'll test the complete cvb run later, as I should head to bed...). The only things I think I changed: I stopped using absolute paths (referring to text_seq as opposed to /user/root/text_seq), and I'm running as root now instead of sgeadmin.
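For anyone replaying this, the quick smoke test Dan suggested earlier in the thread can be scripted roughly as below. This is a hedged sketch: the /tmp paths and the text_smoke names are made up for illustration, and the hadoop/mahout steps are commented out because they need a live cluster.

```shell
# Build a trivial one-document corpus locally (paths are illustrative).
mkdir -p /tmp/seq_smoke/input
printf 'the quick brown fox jumps over the lazy dog\n' > /tmp/seq_smoke/input/doc1

# On a real cluster you would then copy it to HDFS and convert it
# (uncomment; names like text_smoke are hypothetical):
# hadoop fs -put /tmp/seq_smoke/input text_smoke
# $MAHOUT_HOME/bin/mahout seqdirectory --input text_smoke --output text_smoke_seq -c UTF-8 -ow
# $MAHOUT_HOME/bin/mahout seqdumper -i text_smoke_seq/part-m-00000
# A healthy run prints a Key/Value pair whose Value contains the file's text.
```

If seqdumper then shows a non-empty Value for doc1, the problem lies with the original corpus or its paths rather than with seqdirectory itself.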
Regards,
Folcon

On 1 August 2012 03:00, DAN HELM <[email protected]> wrote:
> Hi Folcon,
>
> There is no reason to rerun seq2sparse, as it is clear something is wrong with
> the text files being processed by the seqdirectory command.
>
> Based on the keys, I'm assuming the full paths of the input files are
> names like /high/59734, etc. Did you look inside the files to make
> sure there is text in them?
>
> As a test, just create a folder with a simple text file and run that
> through seqdirectory, and I'll bet you will then see output from the
> seqdumper command (on the seqdirectory output).
>
> Thanks, Dan
>
> *From:* Folcon Red <[email protected]>
> *To:* DAN HELM <[email protected]>
> *Cc:* "[email protected]" <[email protected]>
> *Sent:* Tuesday, July 31, 2012 7:28 PM
>
> *Subject:* Re: Using Mahout to train an CVB and retrieve it's topics
>
> Hi Dan,
>
> It's good to know that seqdirectory reads files in subfolders. I've
> dumped out some of the values in the hope that they will be
> enlightening. The values seem to be missing for both text_seq and
> tokenized-documents.
>
> So, rerunning some of the commands:
> $MAHOUT_HOME/bin/mahout seqdirectory --input /user/sgeadmin/text_train
> --output /user/sgeadmin/text_seq -c UTF-8 -ow
> $MAHOUT_HOME/bin/mahout seq2sparse -i /user/sgeadmin/text_seq -o
> /user/sgeadmin/text_vec -wt tf -a
> org.apache.lucene.analysis.WhitespaceAnalyzer -ow
>
> And then doing a seqdumper of text_seq:
> SEQ org.apache.hadoop.io.Text org.apache.hadoop.io.Text
> [...]
> Key: /high/59734: Value:
> Key: /high/264596: Value:
> Key: /high/341699: Value:
> Key: /high/260770: Value:
> Key: /high/222320: Value:
> Key: /high/198156: Value:
> Key: /high/326011: Value:
> Key: /high/112050: Value:
> Key: /high/306887: Value:
> Key: /high/208169: Value:
> Key: /high/283464: Value:
> Key: /high/168905: Value:
> Count: 2548
>
> root@master:/home/sgeadmin/corpora# $MAHOUT_HOME/bin/mahout seqdumper -i
> /user/sgeadmin/text_vec/tf-vectors/part-r-00000
> MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
> Running on hadoop, using /usr/lib/hadoop/bin/hadoop and
> HADOOP_CONF_DIR=/conf
> MAHOUT-JOB: /home/sgeadmin/mahout/mahout-examples-0.7-job.jar
> 12/07/31 23:23:34 INFO common.AbstractJob: Command line arguments:
> {--endPhase=[2147483647],
> --input=[/user/sgeadmin/text_vec/tf-vectors/part-r-00000],
> --startPhase=[0], --tempDir=[temp]}
> Input Path: /user/sgeadmin/text_vec/tf-vectors/part-r-00000
> Key class: class org.apache.hadoop.io.Text Value Class: class
> org.apache.mahout.math.VectorWritable
> Count: 0
>
> $MAHOUT_HOME/bin/mahout seqdumper -i
> /user/sgeadmin/text_vec/tokenized-documents/part-m-00000
> [...]
> Key: /high/396063: Value: []
> Key: /high/230246: Value: []
> Key: /high/136284: Value: []
> Key: /high/59734: Value: []
> Key: /high/264596: Value: []
> Key: /high/341699: Value: []
> Key: /high/260770: Value: []
> Key: /high/222320: Value: []
> Key: /high/198156: Value: []
> Key: /high/326011: Value: []
> Key: /high/112050: Value: []
> Key: /high/306887: Value: []
> Key: /high/208169: Value: []
> Key: /high/283464: Value: []
> Key: /high/168905: Value: []
> Count: 2548
>
> Running vectordump on the text_vec folder like so:
> root@master:/home/sgeadmin/corpora# $MAHOUT_HOME/bin/mahout vectordump -i
> /user/sgeadmin/text_vec
> MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
> Running on hadoop, using /usr/lib/hadoop/bin/hadoop and
> HADOOP_CONF_DIR=/conf
> MAHOUT-JOB: /home/sgeadmin/mahout/mahout-examples-0.7-job.jar
> 12/07/31 23:21:08 INFO common.AbstractJob: Command line arguments:
> {--endPhase=[2147483647], --input=[/user/sgeadmin/text_vec],
> --startPhase=[0], --tempDir=[temp]}
> 12/07/31 23:21:08 INFO vectors.VectorDumper: Sort? false
> Exception in thread "main" java.lang.IllegalStateException:
> file:/user/sgeadmin/text_vec/tf-vectors
> at org.apache.mahout.common.iterator.sequencefile.SequenceFileIterable.iterator(SequenceFileIterable.java:63)
> at org.apache.mahout.utils.vectors.VectorDumper.run(VectorDumper.java:194)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> at org.apache.mahout.utils.vectors.VectorDumper.main(VectorDumper.java:266)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:616)
> at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
> at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
> at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:616)
> at org.apache.hadoop.util.RunJar.main(RunJar.java:186)
> Caused by: java.io.FileNotFoundException:
> /user/sgeadmin/harry_old_mallet_vec/tf-vectors (Is a directory)
> at java.io.FileInputStream.open(Native Method)
> at java.io.FileInputStream.<init>(FileInputStream.java:137)
> at org.apache.hadoop.fs.RawLocalFileSystem$TrackingFileInputStream.<init>(RawLocalFileSystem.java:72)
> at org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileInputStream.<init>(RawLocalFileSystem.java:108)
> at org.apache.hadoop.fs.RawLocalFileSystem.open(RawLocalFileSystem.java:178)
> at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:127)
> at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:284)
> at org.apache.hadoop.io.SequenceFile$Reader.openFile(SequenceFile.java:1452)
> at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1431)
> at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1424)
> at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1419)
> at org.apache.mahout.common.iterator.sequencefile.SequenceFileIterator.<init>(SequenceFileIterator.java:58)
> at org.apache.mahout.common.iterator.sequencefile.SequenceFileIterable.iterator(SequenceFileIterable.java:61)
> ... 15 more
>
> Kind Regards,
> Nilu
>
> On 31 July 2012 23:59, DAN HELM <[email protected]> wrote:
>
> > Folcon,
> >
> > seqdirectory should also read files in subfolders.
> >
> > Did you verify that the recent seqdirectory command did in fact generate
> > non-empty sequence files? I believe the seqdirectory command just assumes
> > each file contains a single document (no concatenated documents per
> > file), and that each file contains basic text.
> >
> > If it did generate sequence files this time, I assume your folder
> > "/user/sgeadmin/text_seq" was copied to hdfs (if not already there) before
> > you ran seq2sparse on it?
> >
> > Dan
> >
> > *From:* Folcon Red <[email protected]>
> > *To:* DAN HELM <[email protected]>
> > *Cc:* Jake Mannix <[email protected]>; "[email protected]" <[email protected]>
> > *Sent:* Tuesday, July 31, 2012 1:34 PM
> >
> > *Subject:* Re: Using Mahout to train an CVB and retrieve it's topics
> >
> > So part-r-00000 inside text_vec is
> > still SEQorg.apache.hadoop.io.Text%org.apache.mahout.math.VectorWritable
> > even after moving all the training files into a single folder.
> >
> > Regards,
> > Folcon
> >
> > On 31 July 2012 18:18, Folcon Red <[email protected]> wrote:
> >
> > > Hey Everyone,
> > >
> > > OK, I'm not certain why $MAHOUT_HOME/bin/mahout seqdirectory --input
> > > /user/sgeadmin/text_train --output /user/sgeadmin/text_seq -c UTF-8 -ow
> > > didn't produce sequence files; looking inside text_seq only gives me:
> > >
> > > SEQ org.apache.hadoop.io.Text org.apache.hadoop.io.Text
> > >
> > > and that's it. Any ideas what I've been doing wrong? Maybe it's because I
> > > have the files nested in the folder by class; for example, a tree view of
> > > the directory would look like:
> > >
> > > text_train -+
> > >             | A -+
> > >             |     100
> > >             |     101
> > >             |     103
> > >             | B -+
> > >             |     102
> > >             |     105
> > >             |     106
> > >
> > > So it's not picking them up? Or perhaps something else? I'm going to try
> > > some variations to see what happens.
> > >
> > > Thanks for the help so far!
> > >
> > > Regards,
> > > Folcon
> > >
> > > On 29 July 2012 22:10, Folcon Red <[email protected]> wrote:
> > >
> > >> Right, well here's something promising: running $MAHOUT_HOME/bin/mahout
> > >> seqdumper -i /user/sgeadmin/text_lda/part-m-00000 produced:
> > >>
> > >> 7:NaN,29478:NaN,29479:NaN,29480:NaN,29481:NaN,29482:NaN,29483:NaN,29484:NaN,29485:NaN,29486:NaN,29487:NaN,29488:NaN,29489:NaN,29490:NaN,29491:NaN,29492:NaN,29493:NaN,29494:NaN,29495:NaN,29496:NaN,29497:NaN,29498:NaN,29499:NaN,29500:NaN,29501:NaN,29502:NaN,29503:NaN,29504:NaN,29505:NaN,29506:NaN,29507:NaN,29508:NaN,29509:NaN,29510:NaN,29511:NaN,29512:NaN,29513:NaN,29514:NaN,29515:NaN,29516:NaN,29517:NaN,29518:NaN,29519:NaN,29520:NaN,29521:NaN,29522:NaN,29523:NaN,29524:NaN,29525:NaN,29526:NaN,29527:NaN,29528:NaN,29529:NaN,29530:NaN,29531:NaN,29532:NaN,29533:NaN,29534:NaN,29535:NaN}
> > >>
> > >> And $MAHOUT_HOME/bin/mahout seqdumper -i
> > >> /user/sgeadmin/text_vec/tf-vectors/part-r-00000 produced:
> > >>
> > >> MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
> > >> Running on hadoop, using /usr/lib/hadoop/bin/hadoop and
> > >> HADOOP_CONF_DIR=/usr/lib/hadoop/conf
> > >> MAHOUT-JOB: /home/sgeadmin/mahout/mahout-examples-0.7-job.jar
> > >> 12/07/29 21:09:17 INFO common.AbstractJob: Command line arguments:
> > >> {--endPhase=[2147483647],
> > >> --input=[/user/sgeadmin/text_vec/tf-vectors/part-r-00000],
> > >> --startPhase=[0], --tempDir=[temp]}
> > >> Input Path: /user/sgeadmin/text_vec/tf-vectors/part-r-00000
> > >> Key class: class org.apache.hadoop.io.Text Value Class: class
> > >> org.apache.mahout.math.VectorWritable
> > >> Count: 0
> > >>
> > >> Kind Regards,
> > >> Folcon
> > >>
> > >> On 29 July 2012 21:29, DAN HELM <[email protected]> wrote:
> > >>
> > >>> Yep, something went wrong, most likely with the clustering. The part
> > >>> file is empty.
> > >>> It should look something like this:
> > >>>
> > >>> Key class: class org.apache.hadoop.io.IntWritable Value Class: class
> > >>> org.apache.mahout.math.VectorWritable
> > >>> Key: 0: Value:
> > >>> {0:0.06475650422868284,1:0.010728747158503565,2:0.005463535698651016,3:0.023451709705466457}
> > >>> Key: 1: Value:
> > >>> {0:0.01838885430227092,1:0.05068404879399544,2:0.002110418548647133,3:0.005566514441743756}
> > >>> Key: 2: Value:
> > >>> {0:0.018575587065216153,1:1.236602313900785E-5,2:8.654629660837919E-6,3:5.820637306957196E-6}
> > >>> ...
> > >>> ...
> > >>>
> > >>> Key refers to a document id, and the Value holds the topic ids:weights
> > >>> assigned to that document id.
> > >>>
> > >>> So you need to figure out where things went wrong. I assume folder
> > >>> /user/sgeadmin/text_lda also has empty part files? Assuming part
> > >>> files are there, run seqdumper on one. It should have data like the
> > >>> above, except in this case the key will be a topic id and the vector
> > >>> will be term ids:weights.
> > >>>
> > >>> You can also check folder /user/sgeadmin/text_vec/tf-vectors to make
> > >>> sure sparse vectors were generated for your input to cvb.
> > >>>
> > >>> Dan
> > >>>
> > >>> *From:* Folcon Red <[email protected]>
> > >>> *To:* DAN HELM <[email protected]>
> > >>> *Cc:* Jake Mannix <[email protected]>; "[email protected]" <[email protected]>
> > >>> *Sent:* Sunday, July 29, 2012 3:35 PM
> > >>>
> > >>> *Subject:* Re: Using Mahout to train an CVB and retrieve it's topics
> > >>>
> > >>> Thanks Dan and Jake,
> > >>>
> > >>> The output I got from $MAHOUT_HOME/bin/mahout seqdumper -i
> > >>> /user/sgeadmin/text_cvb_document/part-m-00000 is:
> > >>>
> > >>> Input Path: /user/sgeadmin/text_cvb_document/part-m-00000
> > >>> Key class: class org.apache.hadoop.io.IntWritable Value Class: class
> > >>> org.apache.mahout.math.VectorWritable
> > >>> Count: 0
> > >>>
> > >>> I'm not certain what went wrong.
> > >>>
> > >>> Kind Regards,
> > >>> Folcon
> > >>>
> > >>> On 29 July 2012 18:49, DAN HELM <[email protected]> wrote:
> > >>>
> > >>> Folcon,
> > >>>
> > >>> I'm still using Mahout 0.6, so I don't know much about the changes in 0.7.
> > >>>
> > >>> Your output folder for "dt" looks correct. The relevant data would be
> > >>> in /user/sgeadmin/text_cvb_document/part-m-00000, which is what I would
> > >>> be passing to a "-s" option. But I see it says the size is only 97, so
> > >>> that looks suspicious. So you can just view the file (for starters) as:
> > >>> mahout seqdumper -s /user/sgeadmin/text_cvb_document/part-m-00000. And
> > >>> the vector dumper command (as Jake pointed out) has a lot more options
> > >>> to post-process the data, but you may want to first just see what is in
> > >>> that file.
> > >>>
> > >>> Dan
> > >>>
> > >>> *From:* Folcon Red <[email protected]>
> > >>> *To:* Jake Mannix <[email protected]>
> > >>> *Cc:* [email protected]; DAN HELM <[email protected]>
> > >>> *Sent:* Sunday, July 29, 2012 1:08 PM
> > >>> *Subject:* Re: Using Mahout to train an CVB and retrieve it's topics
> > >>>
> > >>> Hi Guys,
> > >>>
> > >>> Thanks for replying. The problem is whenever I use any -s flag I get the
> > >>> error "Unexpected -s while processing Job-Specific Options:"
> > >>>
> > >>> Also, I'm not sure if this is supposed to be the output of -dt:
> > >>>
> > >>> sgeadmin@master:~$ dumbo ls /user/sgeadmin/text_cvb_document -hadoop starcluster
> > >>> Found 3 items
> > >>> -rw-r--r-- 3 sgeadmin supergroup  0 2012-07-29 16:51 /user/sgeadmin/text_cvb_document/_SUCCESS
> > >>> drwxr-xr-x - sgeadmin supergroup  0 2012-07-29 16:50 /user/sgeadmin/text_cvb_document/_logs
> > >>> -rw-r--r-- 3 sgeadmin supergroup 97 2012-07-29 16:51 /user/sgeadmin/text_cvb_document/part-m-00000
> > >>>
> > >>> Should I be using a newer version of Mahout? I've just been using the
> > >>> 0.7 distribution so far, as apparently the compiled versions are missing
> > >>> parts that the distributed ones have.
> > >>>
> > >>> Kind Regards,
> > >>> Folcon
> > >>>
> > >>> PS: Thanks for the help so far!
> > >>>
> > >>> On 29 July 2012 04:52, Jake Mannix <[email protected]> wrote:
> > >>>
> > >>> On Sat, Jul 28, 2012 at 6:40 PM, DAN HELM <[email protected]> wrote:
> > >>>
> > >>> Hi Folcon,
> > >>>
> > >>> In the folder you specified for the -dt option of the cvb command
> > >>> there should be sequence files with the document-to-topic associations
> > >>> (Key: IntWritable, Value: VectorWritable).
> > >>>
> > >>> Yeah, this is correct, although this:
> > >>>
> > >>> You can dump in text format as: mahout seqdumper -s <sequence file>
> > >>>
> > >>> is not as good as using vectordump:
> > >>>
> > >>> mahout vectordump -s <sequence file> --dictionary <path to dictionary.file-0> \
> > >>>     --dictionaryType seqfile --vectorSize <num entries per topic you want to see> -sort
> > >>>
> > >>> This joins your topic vectors with the dictionary, then picks out the
> > >>> top k terms (with their probabilities) for each topic and prints them
> > >>> to the console (or to the file you specify with an --output option).
> > >>>
> > >>> *Although* I notice now, when I just checked, that in trunk
> > >>> VectorDumper.java had a bug in it for "vectorSize": line 175 asks for
> > >>> the cmdline option "numIndexesPerVector", not vectorSize, ack! So I
> > >>> took the liberty of fixing that, but you'll need to "svn up" and
> > >>> rebuild your jar before using vectordump like this.
> > >>>
> > >>> So in the text output from seqdumper, the key is a document id and the
> > >>> vector contains the topics and associated scores for that document. I
> > >>> think all topics are listed for each document, but many with near-zero
> > >>> score.
> > >>> In my case I used rowid to convert the keys of the original sparse
> > >>> document vectors from Text to Integer before running cvb; this
> > >>> generates a mapping file, so I know the textual keys that correspond
> > >>> to the numeric document ids (since my original document ids were file
> > >>> names and I created named vectors).
> > >>> Hope this helps.
> > >>> Dan
> > >>>
> > >>> ________________________________
> > >>>
> > >>> From: Folcon <[email protected]>
> > >>> To: [email protected]
> > >>> Sent: Saturday, July 28, 2012 8:28 PM
> > >>> Subject: Using Mahout to train an CVB and retrieve it's topics
> > >>>
> > >>> Hi Everyone,
> > >>>
> > >>> I'm posting this as my original message did not seem to appear on the
> > >>> mailing list; I'm very sorry if I have done this in error.
> > >>>
> > >>> I'm doing this to then use the topics to train a maxent algorithm to
> > >>> predict the classes of documents given their topic mixtures. Any
> > >>> further aid in this direction would be appreciated!
> > >>>
> > >>> I've been trying to extract the topics out of my run of cvb. Here's
> > >>> what I've done so far.
> > >>>
> > >>> OK, so I still don't know how to output the topics, but I have worked
> > >>> out how to get the cvb output and what I think are the document
> > >>> vectors. However, I'm not having any luck dumping them, so help here
> > >>> would still be appreciated!
> > >>>
> > >>> I set the values of:
> > >>> export MAHOUT_HOME=/home/sgeadmin/mahout
> > >>> export HADOOP_HOME=/usr/lib/hadoop
> > >>> export JAVA_HOME=/usr/lib/jvm/java-6-openjdk
> > >>> export HADOOP_CONF_DIR=$HADOOP_HOME/conf
> > >>> on the master; otherwise none of this works.
> > >>>
> > >>> So first I uploaded the documents using starcluster's put:
> > >>> starcluster put mycluster text_train /home/sgeadmin/
> > >>> starcluster put mycluster text_test /home/sgeadmin/
> > >>>
> > >>> Then I added them to hadoop's filesystem:
> > >>> dumbo put /home/sgeadmin/text_train /user/sgeadmin/ -hadoop starcluster
> > >>>
> > >>> Then I called Mahout's seqdirectory to turn the text into sequence files:
> > >>> $MAHOUT_HOME/bin/mahout seqdirectory --input /user/sgeadmin/text_train --output /user/sgeadmin/text_seq -c UTF-8 -ow
> > >>>
> > >>> Then I called Mahout's seq2sparse to turn them into vectors:
> > >>> $MAHOUT_HOME/bin/mahout seq2sparse -i text_seq -o /user/sgeadmin/text_vec -wt tf -a org.apache.lucene.analysis.WhitespaceAnalyzer -ow
> > >>>
> > >>> Finally I called cvb. I believe that the -dt flag states where the
> > >>> inferred topics should go, but because I haven't yet been able to dump
> > >>> them I can't confirm this.
> > >>> $MAHOUT_HOME/bin/mahout cvb -i /user/sgeadmin/text_vec/tf-vectors -o /user/sgeadmin/text_lda -k 100 -nt 29536 -x 20 -dict /user/sgeadmin/text_vec/dictionary.file-0 -dt /user/sgeadmin/text_cvb_document -mt /user/sgeadmin/text_states
> > >>>
> > >>> The -k flag is the number of topics, the -nt flag is the size of the
> > >>> dictionary (I computed this by counting the number of entries in
> > >>> dictionary.file-0 inside the vectors folder, in this case under
> > >>> /user/sgeadmin/text_vec), and -x is the number of iterations.
> > >>>
> > >>> If you know how to get the document topic probabilities from here,
> > >>> help would be most appreciated!
> > >>>
> > >>> Kind Regards,
> > >>> Folcon
> > >>>
> > >>> --
> > >>>   -jake
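A footnote on the -nt computation discussed in the original post above: the dictionary size can be read off seqdumper output by counting its "Key:" lines. A minimal sketch, using a fabricated three-entry sample in place of the real dump (on the cluster you would instead pipe `$MAHOUT_HOME/bin/mahout seqdumper -i /user/sgeadmin/text_vec/dictionary.file-0` into grep):

```shell
# Derive cvb's -nt value by counting "Key:" lines in seqdumper-style output.
# The here-doc below is a fabricated sample for illustration only.
nt=$(grep -c '^Key:' <<'EOF'
Key: alpha: Value: 0
Key: beta: Value: 1
Key: gamma: Value: 2
EOF
)
echo "nt=$nt"
```

With the fabricated sample this yields nt=3; with a real dictionary dump it would yield the value to pass as -nt.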
