Thanks Dan,

Ok, now for some strange reason it appears to be working (the seq and vec
files have values now; I'll test the complete cvb later, I should head to
bed). The only things I think I changed: I stopped using absolute paths
(referring to text_seq as opposed to /user/root/text_seq), and I'm using
root now instead of sgeadmin.
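
For what it's worth, that path change is probably the whole story: Hadoop
resolves a relative path against the current user's HDFS home directory,
/user/<username>. A rough sketch of that resolution rule (a simplification,
not Hadoop's actual code):

```python
def resolve_hdfs_path(path, user):
    """Sketch of HDFS path resolution: absolute paths are used as-is,
    relative paths are taken against the user's home dir /user/<user>."""
    if path.startswith("/"):
        return path
    return "/user/{}/{}".format(user, path)

# Running as sgeadmin, "text_seq" and "/user/sgeadmin/text_seq" agree:
print(resolve_hdfs_path("text_seq", "sgeadmin"))  # /user/sgeadmin/text_seq
# Running as root, the same relative name points at a different folder:
print(resolve_hdfs_path("text_seq", "root"))      # /user/root/text_seq
```

So jobs run as one user against another user's relative paths can silently
read or write folders the earlier jobs never touched.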

Regards,
Folcon

On 1 August 2012 03:00, DAN HELM <[email protected]> wrote:

> Hi Folcon,
>
> There is no reason to rerun seq2sparse, as it is clear something is wrong
> with the text files being processed by the seqdirectory command.
>
> Based on the keys, I'm assuming the full paths to the input files have
> names like /high/59734, etc.  Did you look inside the files to make
> sure there is text in them?
>
> As a test, just create a folder with a simple text file and run that
> through seqdirectory, and I'll bet you will then see output from the
> seqdumper command (run on the seqdirectory output).
>
> Thanks, Dan
>
>    *From:* Folcon Red <[email protected]>
> *To:* DAN HELM <[email protected]>
> *Cc:* "[email protected]" <[email protected]>
> *Sent:* Tuesday, July 31, 2012 7:28 PM
>
> *Subject:* Re: Using Mahout to train a CVB and retrieve its topics
>
> Hi Dan,
>
> It's good to know that seqdirectory reads files in subfolders, and I've
> dumped out some of the values in the hopes that they will be
> enlightening.  The values seem to be missing for both text_seq and
> the tokenized-documents.
>
> So rerunning some of the commands:
> $MAHOUT_HOME/bin/mahout seqdirectory --input /user/sgeadmin/text_train
> --output /user/sgeadmin/text_seq -c UTF-8 -ow
> $MAHOUT_HOME/bin/mahout seq2sparse -i /user/sgeadmin/text_seq -o
> /user/sgeadmin/text_vec -wt tf -a
> org.apache.lucene.analysis.WhitespaceAnalyzer -ow
>
> And then doing a seqdumper of text_seq:
> SEQ org.apache.hadoop.io.Text org.apache.hadoop.io.Text
> [...]
> Key: /high/59734: Value:
> Key: /high/264596: Value:
> Key: /high/341699: Value:
> Key: /high/260770: Value:
> Key: /high/222320: Value:
> Key: /high/198156: Value:
> Key: /high/326011: Value:
> Key: /high/112050: Value:
> Key: /high/306887: Value:
> Key: /high/208169: Value:
> Key: /high/283464: Value:
> Key: /high/168905: Value:
> Count: 2548
>
> root@master:/home/sgeadmin/corpora# $MAHOUT_HOME/bin/mahout seqdumper -i
> /user/sgeadmin/text_vec/tf-vectors/part-r-00000
> MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
> Running on hadoop, using /usr/lib/hadoop/bin/hadoop and
> HADOOP_CONF_DIR=/conf
> MAHOUT-JOB: /home/sgeadmin/mahout/mahout-examples-0.7-job.jar
> 12/07/31 23:23:34 INFO common.AbstractJob: Command line arguments:
> {--endPhase=[2147483647],
> --input=[/user/sgeadmin/text_vec/tf-vectors/part-r-00000],
> --startPhase=[0], --tempDir=[temp]}
> Input Path: /user/sgeadmin/text_vec/tf-vectors/part-r-00000
> Key class: class org.apache.hadoop.io.Text Value Class: class
> org.apache.mahout.math.VectorWritable
> Count: 0
>
> $MAHOUT_HOME/bin/mahout seqdumper -i
> /user/sgeadmin/text_vec/tokenized-documents/part-m-00000
> [...]
> Key: /high/396063: Value: []
> Key: /high/230246: Value: []
> Key: /high/136284: Value: []
> Key: /high/59734: Value: []
> Key: /high/264596: Value: []
> Key: /high/341699: Value: []
> Key: /high/260770: Value: []
> Key: /high/222320: Value: []
> Key: /high/198156: Value: []
> Key: /high/326011: Value: []
> Key: /high/112050: Value: []
> Key: /high/306887: Value: []
> Key: /high/208169: Value: []
> Key: /high/283464: Value: []
> Key: /high/168905: Value: []
> Count: 2548
>
>
> Running vectordump on the text_vec folder like so:
> root@master:/home/sgeadmin/corpora# $MAHOUT_HOME/bin/mahout vectordump -i
> /user/sgeadmin/text_vec
> MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
> Running on hadoop, using /usr/lib/hadoop/bin/hadoop and
> HADOOP_CONF_DIR=/conf
> MAHOUT-JOB: /home/sgeadmin/mahout/mahout-examples-0.7-job.jar
> 12/07/31 23:21:08 INFO common.AbstractJob: Command line arguments:
> {--endPhase=[2147483647], --input=[/user/sgeadmin/text_vec],
> --startPhase=[0], --tempDir=[temp]}
> 12/07/31 23:21:08 INFO vectors.VectorDumper: Sort? false
> Exception in thread "main" java.lang.IllegalStateException:
> file:/user/sgeadmin/text_vec/tf-vectors
> at org.apache.mahout.common.iterator.sequencefile.SequenceFileIterable.iterator(SequenceFileIterable.java:63)
> at org.apache.mahout.utils.vectors.VectorDumper.run(VectorDumper.java:194)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> at org.apache.mahout.utils.vectors.VectorDumper.main(VectorDumper.java:266)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:616)
> at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
> at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
> at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:616)
> at org.apache.hadoop.util.RunJar.main(RunJar.java:186)
> Caused by: java.io.FileNotFoundException:
> /user/sgeadmin/harry_old_mallet_vec/tf-vectors (Is a directory)
> at java.io.FileInputStream.open(Native Method)
> at java.io.FileInputStream.<init>(FileInputStream.java:137)
> at org.apache.hadoop.fs.RawLocalFileSystem$TrackingFileInputStream.<init>(RawLocalFileSystem.java:72)
> at org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileInputStream.<init>(RawLocalFileSystem.java:108)
> at org.apache.hadoop.fs.RawLocalFileSystem.open(RawLocalFileSystem.java:178)
> at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:127)
> at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:284)
> at org.apache.hadoop.io.SequenceFile$Reader.openFile(SequenceFile.java:1452)
> at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1431)
> at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1424)
> at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1419)
> at org.apache.mahout.common.iterator.sequencefile.SequenceFileIterator.<init>(SequenceFileIterator.java:58)
> at org.apache.mahout.common.iterator.sequencefile.SequenceFileIterable.iterator(SequenceFileIterable.java:61)
> ... 15 more
>
> Kind Regards,
> Nilu
>
> On 31 July 2012 23:59, DAN HELM <[email protected]> wrote:
>
> > Folcon,
> >
> > seqdirectory should also read files in subfolders.
> >
> > Did you verify that the recent seqdirectory command did in fact generate
> > non-empty sequence files?  I believe the seqdirectory command just assumes
> > each file contains a single document (no concatenated documents per
> > file), and that each file contains basic text.
> >
> > If it did generate sequence files this time, I assume your folder
> > "/user/sgeadmin/text_seq" was copied to HDFS (if not already there)
> > before you ran seq2sparse on it?
> >
> > Dan
> >
> >    *From:* Folcon Red <[email protected]>
> > *To:* DAN HELM <[email protected]>
> > *Cc:* Jake Mannix <[email protected]>; "[email protected]" <
> > [email protected]>
> > *Sent:* Tuesday, July 31, 2012 1:34 PM
>
> >
> > *Subject:* Re: Using Mahout to train a CVB and retrieve its topics
> >
> > So part-r-00000 inside text_vec is
> > still SEQorg.apache.hadoop.io.Text%org.apache.mahout.math.VectorWritable
> > even after moving all the training files into a single folder.
> >
> > Regards,
> > Folcon
> >
> > On 31 July 2012 18:18, Folcon Red <[email protected]> wrote:
> >
> > > Hey Everyone,
> > >
> > > Ok, not certain why $MAHOUT_HOME/bin/mahout seqdirectory --input
> > > /user/sgeadmin/text_train --output /user/sgeadmin/text_seq -c UTF-8 -ow
> > > didn't produce sequence files; just looking inside text_seq gives me only:
> > >
> > > SEQ org.apache.hadoop.io.Text org.apache.hadoop.io.Text
> > >
> > > and that's it. Any ideas what I've been doing wrong? Maybe it's because
> > > I have the files nested in the folder by class; for example, a tree view
> > > of the directory would look like:
> > >
> > > text_train -+
> > >                | A -+
> > >                        | 100
> > >                        | 101
> > >                        | 103
> > >                | B -+
> > >                        | 102
> > >                        | 105
> > >                        | 106
> > >
> > > So it's not picking them up? Or perhaps something else? I'm going to try
> > > some variations to see what happens.
> > >
> > > Thanks for the help so far!
> > >
> > > Regards,
> > > Folcon
> > >
> > >
> > > On 29 July 2012 22:10, Folcon Red <[email protected]> wrote:
> > >
> > >> Right, well here's something promising: running $MAHOUT_HOME/bin/mahout
> > >> seqdumper -i /user/sgeadmin/text_lda/part-m-00000 produced:
> > >>
> > >>
> > >>
> > >> 7:NaN,29478:NaN,29479:NaN,29480:NaN,29481:NaN,29482:NaN,29483:NaN,29484:NaN,29485:NaN,29486:NaN,29487:NaN,29488:NaN,29489:NaN,29490:NaN,29491:NaN,29492:NaN,29493:NaN,29494:NaN,29495:NaN,29496:NaN,29497:NaN,29498:NaN,29499:NaN,29500:NaN,29501:NaN,29502:NaN,29503:NaN,29504:NaN,29505:NaN,29506:NaN,29507:NaN,29508:NaN,29509:NaN,29510:NaN,29511:NaN,29512:NaN,29513:NaN,29514:NaN,29515:NaN,29516:NaN,29517:NaN,29518:NaN,29519:NaN,29520:NaN,29521:NaN,29522:NaN,29523:NaN,29524:NaN,29525:NaN,29526:NaN,29527:NaN,29528:NaN,29529:NaN,29530:NaN,29531:NaN,29532:NaN,29533:NaN,29534:NaN,29535:NaN}
> > >>
> > >> And $MAHOUT_HOME/bin/mahout seqdumper -i
> > >> /user/sgeadmin/text_vec/tf-vectors/part-r-00000 produced:
> > >>
> > >> MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
> > >> Running on hadoop, using /usr/lib/hadoop/bin/hadoop and
> > >> HADOOP_CONF_DIR=/usr/lib/hadoop/conf
> > >> MAHOUT-JOB: /home/sgeadmin/mahout/mahout-examples-0.7-job.jar
> > >> 12/07/29 21:09:17 INFO common.AbstractJob: Command line arguments:
> > >> {--endPhase=[2147483647],
> > >> --input=[/user/sgeadmin/text_vec/tf-vectors/part-r-00000],
> > >> --startPhase=[0], --tempDir=[temp]}
> > >> Input Path: /user/sgeadmin/text_vec/tf-vectors/part-r-00000
> > >> Key class: class org.apache.hadoop.io.Text Value Class: class
> > >> org.apache.mahout.math.VectorWritable
> > >> Count: 0
> > >>
> > >> Kind Regards,
> > >> Folcon
> > >>
> > >> On 29 July 2012 21:29, DAN HELM <[email protected]> wrote:
> > >>
> > >>> Yep, something went wrong, most likely with the clustering.  The part
> > >>> file is empty.  It should look something like this:
> > >>>
> > >>> Key class: class org.apache.hadoop.io.IntWritable Value Class: class
> > >>> org.apache.mahout.math.VectorWritable
> > >>> Key: 0: Value: {0:0.06475650422868284,1:0.010728747158503565,2:0.005463535698651016,3:0.023451709705466457}
> > >>> Key: 1: Value: {0:0.01838885430227092,1:0.05068404879399544,2:0.002110418548647133,3:0.005566514441743756}
> > >>> Key: 2: Value: {0:0.018575587065216153,1:1.236602313900785E-5,2:8.654629660837919E-6,3:5.820637306957196E-6}
> > >>> ...
> > >>> ...
> > >>>
> > >>> The Key refers to a document id and the Value holds the topic
> > >>> id:weight pairs assigned to that document.
> > >>>
> > >>> So you need to figure out where things went wrong.  I assume folder
> > >>> /user/sgeadmin/text_lda also has empty part files?  Assuming the part
> > >>> files are there, run seqdumper on one.  It should have data like the
> > >>> above, except in this case the key will be a topic id and the vector
> > >>> will be term id:weight pairs.
> > >>>
> > >>> You can also check folder /user/sgeadmin/text_vec/tf-vectors to make
> > >>> sure sparse vectors were generated for your input to cvb.
> > >>>
> > >>> Dan
> > >>>
> > >>>    *From:* Folcon Red <[email protected]>
> > >>> *To:* DAN HELM <[email protected]>
> > >>> *Cc:* Jake Mannix <[email protected]>; "[email protected]"
> <
> > >>> [email protected]>
> > >>> *Sent:* Sunday, July 29, 2012 3:35 PM
> > >>>
> > >>> *Subject:* Re: Using Mahout to train a CVB and retrieve its topics
> >
> > >>>
> > >>> Thanks Dan and Jake,
> > >>>
> > >>> The output I got from $MAHOUT_HOME/bin/mahout seqdumper -i
> > >>> /user/sgeadmin/text_cvb_document/part-m-00000 is:
> > >>>
> > >>> Input Path: /user/sgeadmin/text_cvb_document/part-m-00000
> > >>> Key class: class org.apache.hadoop.io.IntWritable Value Class: class
> > >>> org.apache.mahout.math.VectorWritable
> > >>> Count: 0
> > >>>
> > >>> I'm not certain what went wrong.
> > >>>
> > >>> Kind Regards,
> > >>> Folcon
> > >>>
> > >>> On 29 July 2012 18:49, DAN HELM <[email protected]> wrote:
> > >>>
> > >>> Folcon,
> > >>>
> > >>> I'm still using Mahout 0.6 so don't know much about changes in 0.7.
> > >>>
> > >>> Your output folder for "dt" looks correct.  The relevant data would be
> > >>> in /user/sgeadmin/text_cvb_document/part-m-00000, which is what I would
> > >>> be passing to a "-s" option.  But I see it says the size is only 97, so
> > >>> that looks suspicious.  You can just view the file (for starters) as:
> > >>> mahout seqdumper -s /user/sgeadmin/text_cvb_document/part-m-00000.  And
> > >>> the vectordump command (as Jake pointed out) has a lot more options to
> > >>> post-process the data, but you may want to first just see what is in
> > >>> that file.
> > >>>
> > >>> Dan
> > >>>
> > >>>    *From:* Folcon Red <[email protected]>
> > >>> *To:* Jake Mannix <[email protected]>
> > >>> *Cc:* [email protected]; DAN HELM <[email protected]>
> > >>> *Sent:* Sunday, July 29, 2012 1:08 PM
> > >>> *Subject:* Re: Using Mahout to train a CVB and retrieve its topics
> >
> > >>>
> > >>> Hi Guys,
> > >>>
> > >>> Thanks for replying; the problem is whenever I use any -s flag I get
> > >>> the error "Unexpected -s while processing Job-Specific Options:"
> > >>>
> > >>> Also I'm not sure if this is supposed to be the output of -dt
> > >>>
> > >>> sgeadmin@master:~$ dumbo ls /user/sgeadmin/text_cvb_document -hadoop
> > >>> starcluster
> > >>> Found 3 items
> > >>> -rw-r--r--  3 sgeadmin supergroup          0 2012-07-29 16:51 /user/sgeadmin/text_cvb_document/_SUCCESS
> > >>> drwxr-xr-x  - sgeadmin supergroup          0 2012-07-29 16:50 /user/sgeadmin/text_cvb_document/_logs
> > >>> -rw-r--r--  3 sgeadmin supergroup         97 2012-07-29 16:51 /user/sgeadmin/text_cvb_document/part-m-00000
> > >>>
> > >>> Should I be using a newer version of Mahout? I've just been using the
> > >>> 0.7 distribution so far, as apparently the compiled versions are
> > >>> missing parts that the distributed ones have.
> > >>>
> > >>> Kind Regards,
> > >>> Folcon
> > >>>
> > >>> PS: Thanks for the help so far!
> > >>>
> > >>> On 29 July 2012 04:52, Jake Mannix <[email protected]> wrote:
> > >>>
> > >>>
> > >>>
> > >>> On Sat, Jul 28, 2012 at 6:40 PM, DAN HELM <[email protected]> wrote:
> > >>>
> > >>> Hi Folcon,
> > >>>
> > >>> In the folder you specified for the -dt option of the cvb command,
> > >>> there should be sequence files with the document-to-topic associations
> > >>> (Key: IntWritable, Value: VectorWritable).
> > >>>
> > >>>
> > >>> Yeah, this is correct, although this:
> > >>>
> > >>>
> > >>> You can dump in text format as: mahout seqdumper -s <sequence file>
> > >>>
> > >>>
> > >>> is not as good as using vectordumper:
> > >>>
> > >>>    mahout vectordump -s <sequence file> \
> > >>>        --dictionary <path to dictionary.file-0> --dictionaryType seqfile \
> > >>>        --vectorSize <num entries per topic you want to see> -sort
> > >>>
> > >>> This joins your topic vectors with the dictionary, then picks out the
> > >>> top k terms (with their probabilities) for each topic and prints them
> > >>> to the console (or to the file you specify with an --output option).
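
The join vectordump performs can be pictured in a few lines: given one
topic's term weights and the dictionary mapping term ids to words, sort by
weight and keep the top k. An illustrative Python sketch with made-up data
(not Mahout's code):

```python
def top_terms(topic_vector, dictionary, k):
    """Pick the k highest-weighted terms of one topic and map their
    ids back to words via the dictionary."""
    ranked = sorted(topic_vector.items(), key=lambda kv: kv[1], reverse=True)
    return [(dictionary[term_id], weight) for term_id, weight in ranked[:k]]

# Hypothetical topic and dictionary, just to show the shape of the output.
topic = {0: 0.02, 1: 0.41, 2: 0.07}
dictionary = {0: "wand", 1: "potion", 2: "broom"}
print(top_terms(topic, dictionary, 2))  # [('potion', 0.41), ('broom', 0.07)]
```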
> > >>>
> > >>> *although* I notice now that in trunk, when I just checked,
> > >>> VectorDumper.java had a bug in it for "vectorSize": line 175 asks for
> > >>> the cmdline option "numIndexesPerVector", not vectorSize, ack!  So I
> > >>> took the liberty of fixing that, but you'll need to "svn up" and
> > >>> rebuild your jar before using vectordump like this.
> > >>>
> > >>>
> > >>> So in the text output from seqdumper, the key is a document id and
> > >>> the vector contains the topics and scores associated with that
> > >>> document.  I think all topics are listed for each document, but many
> > >>> with a near-zero score.
> > >>> In my case I used rowid to convert the keys of the original sparse
> > >>> document vectors from Text to Integer before running cvb; this
> > >>> generates a mapping file so I know the textual keys that correspond to
> > >>> the numeric document ids (since my original document ids were file
> > >>> names and I created named vectors).
> > >>> Hope this helps.
> > >>> Dan
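
The rowid step Dan describes can be pictured as assigning each textual key a
sequential integer and keeping the mapping around for later lookup. This is
an illustrative sketch only (the real rowid job writes sequence files, a
matrix plus a document index, rather than Python dicts):

```python
def assign_row_ids(text_keys):
    """Map textual document keys (e.g. file names) to sequential
    integer ids, returning the mapping in both directions."""
    id_for_key = {key: i for i, key in enumerate(text_keys)}
    key_for_id = {i: key for key, i in id_for_key.items()}
    return id_for_key, key_for_id

id_for_key, key_for_id = assign_row_ids(["/high/59734", "/high/264596"])
# cvb output keys are the integers; key_for_id recovers the file names.
print(id_for_key["/high/264596"])  # 1
print(key_for_id[1])               # /high/264596
```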
> > >>>
> > >>> ________________________________
> > >>>
> > >>>  From: Folcon <[email protected]>
> > >>> To: [email protected]
> > >>> Sent: Saturday, July 28, 2012 8:28 PM
> > >>> Subject: Using Mahout to train a CVB and retrieve its topics
> > >>>
> > >>> Hi Everyone,
> > >>>
> > >>> I'm posting this as my original message did not seem to appear on the
> > >>> mailing list; I'm very sorry if I have done this in error.
> > >>>
> > >>> I'm doing this to then use the topics to train a maxent algorithm to
> > >>> predict the
> > >>> classes of documents given their topic mixtures. Any further aid in
> > this
> > >>> direction would be appreciated!
> > >>>
> > >>> I've been trying to extract the topics out of my run of cvb.  Here's
> > >>> what I've done so far.
> > >>>
> > >>> Ok, so I still don't know how to output the topics, but I have worked
> > >>> out how to run cvb and get what I think are the document vectors.
> > >>> However, I'm not having any luck dumping them, so help here would
> > >>> still be appreciated!
> > >>>
> > >>> I set the values of:
> > >>>    export MAHOUT_HOME=/home/sgeadmin/mahout
> > >>>    export HADOOP_HOME=/usr/lib/hadoop
> > >>>    export JAVA_HOME=/usr/lib/jvm/java-6-openjdk
> > >>>    export HADOOP_CONF_DIR=$HADOOP_HOME/conf
> > >>> on the master otherwise none of this works.
> > >>>
> > >>> So first I uploaded the documents using starclusters put:
> > >>>    starcluster put mycluster text_train /home/sgeadmin/
> > >>>    starcluster put mycluster text_test /home/sgeadmin/
> > >>>
> > >>> Then I added them to Hadoop's HDFS filesystem:
> > >>>    dumbo put /home/sgeadmin/text_train /user/sgeadmin/ -hadoop
> > >>> starcluster
> > >>>
> > >>> Then I called Mahout's seqdirectory to turn the text into sequence
> > >>> files:
> > >>>    $MAHOUT_HOME/bin/mahout seqdirectory --input /user/sgeadmin/text_train \
> > >>>        --output /user/sgeadmin/text_seq -c UTF-8 -ow
> > >>>
> > >>> Then I called Mahout's seq2sparse to turn them into vectors:
> > >>>    $MAHOUT_HOME/bin/mahout seq2sparse -i text_seq -o /user/sgeadmin/text_vec \
> > >>>        -wt tf -a org.apache.lucene.analysis.WhitespaceAnalyzer -ow
> > >>>
> > >>> Finally I called cvb.  I believe that the -dt flag states where the
> > >>> inferred topics should go, but because I haven't yet been able to
> > >>> dump them I can't confirm this.
> > >>>    $MAHOUT_HOME/bin/mahout cvb -i /user/sgeadmin/text_vec/tf-vectors \
> > >>>        -o /user/sgeadmin/text_lda -k 100 -nt 29536 -x 20 \
> > >>>        -dict /user/sgeadmin/text_vec/dictionary.file-0 \
> > >>>        -dt /user/sgeadmin/text_cvb_document \
> > >>>        -mt /user/sgeadmin/text_states
> > >>>
> > >>> The -k flag is the number of topics and the -nt flag is the size of
> > >>> the dictionary; I computed the latter by counting the number of
> > >>> entries in the dictionary.file-0 inside the vectors folder (in this
> > >>> case under /user/sgeadmin/text_vec).  The -x flag is the number of
> > >>> iterations.
> > >>>
> > >>> If you know how to get the document-topic probabilities from here,
> > >>> help would be most appreciated!
> > >>>
> > >>> Kind Regards,
> > >>> Folcon
> > >>>
> > >>>
> > >>>
> > >>>
> > >>> --
> > >>>
> > >>>  -jake
> > >>>
> > >>
> > >>
> > >
> >
> >
> >
> >
>
>
>
>
