Right, well, here's something promising. Running $MAHOUT_HOME/bin/mahout seqdumper -i /user/sgeadmin/text_lda/part-m-00000 produced:
7:NaN,29478:NaN,29479:NaN,29480:NaN,29481:NaN,29482:NaN,29483:NaN,29484:NaN,29485:NaN,29486:NaN,29487:NaN,29488:NaN,29489:NaN,29490:NaN,29491:NaN,29492:NaN,29493:NaN,29494:NaN,29495:NaN,29496:NaN,29497:NaN,29498:NaN,29499:NaN,29500:NaN,29501:NaN,29502:NaN,29503:NaN,29504:NaN,29505:NaN,29506:NaN,29507:NaN,29508:NaN,29509:NaN,29510:NaN,29511:NaN,29512:NaN,29513:NaN,29514:NaN,29515:NaN,29516:NaN,29517:NaN,29518:NaN,29519:NaN,29520:NaN,29521:NaN,29522:NaN,29523:NaN,29524:NaN,29525:NaN,29526:NaN,29527:NaN,29528:NaN,29529:NaN,29530:NaN,29531:NaN,29532:NaN,29533:NaN,29534:NaN,29535:NaN}
And $MAHOUT_HOME/bin/mahout seqdumper -i /user/sgeadmin/text_vec/tf-vectors/part-r-00000 produced:
MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
Running on hadoop, using /usr/lib/hadoop/bin/hadoop and
HADOOP_CONF_DIR=/usr/lib/hadoop/conf
MAHOUT-JOB: /home/sgeadmin/mahout/mahout-examples-0.7-job.jar
12/07/29 21:09:17 INFO common.AbstractJob: Command line arguments:
{--endPhase=[2147483647],
--input=[/user/sgeadmin/text_vec/tf-vectors/part-r-00000],
--startPhase=[0], --tempDir=[temp]}
Input Path: /user/sgeadmin/text_vec/tf-vectors/part-r-00000
Key class: class org.apache.hadoop.io.Text Value Class: class
org.apache.mahout.math.VectorWritable
Count: 0
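
So it looks like the tf-vectors themselves are empty. Next I'll check the seq2sparse output directly, something along these lines:

hadoop fs -ls /user/sgeadmin/text_vec/tf-vectors
hadoop fs -ls /user/sgeadmin/text_seq
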
Kind Regards,
Folcon
On 29 July 2012 21:29, DAN HELM <[email protected]> wrote:
> Yep, something went wrong, most likely with the clustering. The part file is
> empty. It should look something like this:
>
> Key class: class org.apache.hadoop.io.IntWritable Value Class: class
> org.apache.mahout.math.VectorWritable
> Key: 0: Value:
> {0:0.06475650422868284,1:0.010728747158503565,2:0.005463535698651016,3:0.023451709705466457}
> Key: 1: Value:
> {0:0.01838885430227092,1:0.05068404879399544,2:0.002110418548647133,3:0.005566514441743756}
> Key: 2: Value:
> {0:0.018575587065216153,1:1.236602313900785E-5,2:8.654629660837919E-6,3:5.820637306957196E-6}
> ...
> ...
>
> The Key refers to a document id and the Value contains topic id:weight pairs
> assigned to that document.
>
> So you need to figure out where things went wrong. I assume folder
> /user/sgeadmin/text_lda also has empty part files? Assuming the part files
> are there, run seqdumper on one. It should have data like the above, except
> in this case the key will be a topic id and the vector will be term
> id:weight pairs.
>
> You can also check folder /user/sgeadmin/text_vec/tf-vectors to make sure
> sparse vectors were generated for your input to cvb.
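>
> For example (just a sketch, using the paths from the commands in your earlier mail):
>
> hadoop fs -ls /user/sgeadmin/text_vec/tf-vectors
> $MAHOUT_HOME/bin/mahout seqdumper -i /user/sgeadmin/text_lda/part-m-00000 | head
>
> A near-zero part file size, or another "Count: 0" from seqdumper, tells you which stage produced nothing.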
>
> Dan
>
> *From:* Folcon Red <[email protected]>
> *To:* DAN HELM <[email protected]>
> *Cc:* Jake Mannix <[email protected]>; "[email protected]" <
> [email protected]>
> *Sent:* Sunday, July 29, 2012 3:35 PM
>
> *Subject:* Re: Using Mahout to train a CVB and retrieve its topics
>
> Thanks Dan and Jake,
>
> The output I got from $MAHOUT_HOME/bin/mahout seqdumper -i /user/sgeadmin/text_cvb_document/part-m-00000 is:
>
> Input Path: /user/sgeadmin/text_cvb_document/part-m-00000
> Key class: class org.apache.hadoop.io.IntWritable Value Class: class
> org.apache.mahout.math.VectorWritable
> Count: 0
>
> I'm not certain what went wrong.
>
> Kind Regards,
> Folcon
>
> On 29 July 2012 18:49, DAN HELM <[email protected]> wrote:
>
> Folcon,
>
> I'm still using Mahout 0.6, so I don't know much about the changes in 0.7.
>
> Your output folder for "-dt" looks correct. The relevant data would be
> in /user/sgeadmin/text_cvb_document/part-m-00000, which is what I would
> pass to the "-s" option. But I see it says the size is only 97 bytes, so
> that looks suspicious. You can just view the file (for starters) as: mahout
> seqdumper -s /user/sgeadmin/text_cvb_document/part-m-00000. The vectordump
> command (as Jake pointed out) has a lot more options to post-process
> the data, but you may want to first just see what is in that file.
>
> Dan
>
> *From:* Folcon Red <[email protected]>
> *To:* Jake Mannix <[email protected]>
> *Cc:* [email protected]; DAN HELM <[email protected]>
> *Sent:* Sunday, July 29, 2012 1:08 PM
> *Subject:* Re: Using Mahout to train a CVB and retrieve its topics
>
> Hi Guys,
>
> Thanks for replying. The problem is that whenever I use the -s flag I get the
> error "Unexpected -s while processing Job-Specific Options:"
>
> Also, I'm not sure if this is supposed to be the output of -dt:
>
> sgeadmin@master:~$ dumbo ls /user/sgeadmin/text_cvb_document -hadoop starcluster
> Found 3 items
> -rw-r--r--   3 sgeadmin supergroup    0 2012-07-29 16:51 /user/sgeadmin/text_cvb_document/_SUCCESS
> drwxr-xr-x   - sgeadmin supergroup    0 2012-07-29 16:50 /user/sgeadmin/text_cvb_document/_logs
> -rw-r--r--   3 sgeadmin supergroup   97 2012-07-29 16:51 /user/sgeadmin/text_cvb_document/part-m-00000
>
> Should I be using a newer version of Mahout? I've just been using the 0.7
> distribution so far, as apparently the compiled versions are missing parts
> that the distributed ones have.
>
> Kind Regards,
> Folcon
>
> PS: Thanks for the help so far!
>
> On 29 July 2012 04:52, Jake Mannix <[email protected]> wrote:
>
>
>
> On Sat, Jul 28, 2012 at 6:40 PM, DAN HELM <[email protected]> wrote:
>
> Hi Folcon,
>
> In the folder you specified for the -dt option of the cvb command,
> there should be sequence files with the document-to-topic associations (Key:
> IntWritable, Value: VectorWritable).
>
>
> Yeah, this is correct, although this:
>
>
> You can dump in text format as: mahout seqdumper -s <sequence file>
>
>
> is not as good as using vectordump:
>
> mahout vectordump -s <sequence file> --dictionary <path to dictionary.file-0> \
>   --dictionaryType seqfile --vectorSize <num entries per topic you want to see> -sort
>
> This joins your topic vectors with the dictionary, then picks out the top
> k terms (with their
> probabilities) for each topic and prints them to the console (or to the
> file you specify with
> an --output option).
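>
> For your run that would be roughly (just a sketch, filled in with the paths from your commands; the top-20 cutoff is arbitrary):
>
> mahout vectordump -s /user/sgeadmin/text_lda/part-m-00000 \
>   --dictionary /user/sgeadmin/text_vec/dictionary.file-0 \
>   --dictionaryType seqfile --vectorSize 20 -sort --output topics.txt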
>
> *although* I notice now, when I just checked trunk, that VectorDumper.java
> had a bug in it for "vectorSize": line 175 asks for the cmdline option
> "numIndexesPerVector", not vectorSize, ack! So I took the liberty of fixing
> that, but you'll need to "svn up" and rebuild your jar before using
> vectordump like this.
>
>
> So in the text output from seqdumper, the key is a document id and the
> vector contains the topics and their associated scores for that document.
> I think all topics are listed for each document, but many with a near-zero
> score.
>
> In my case I used rowid to convert the keys of the original sparse document
> vectors from Text to Integer before running cvb. This generates a mapping
> file, so I know the textual keys that correspond to the numeric document ids
> (since my original document ids were file names and I created named
> vectors).
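>
> Roughly (a sketch; the output folder name is just an example):
>
> mahout rowid -i /user/sgeadmin/text_vec/tf-vectors -o /user/sgeadmin/text_vec_int
>
> which, if I remember right, writes a "matrix" of IntWritable-keyed vectors plus a "docIndex" mapping under that output folder.
>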
> Hope this helps.
> Dan
>
>
> From: Folcon <[email protected]>
> To: [email protected]
> Sent: Saturday, July 28, 2012 8:28 PM
> Subject: Using Mahout to train a CVB and retrieve its topics
>
> Hi Everyone,
>
> I'm posting this as my original message did not seem to appear on the
> mailing list; I'm very sorry if I have done this in error.
>
> I'm doing this to then use the topics to train a maxent algorithm to
> predict the
> classes of documents given their topic mixtures. Any further aid in this
> direction would be appreciated!
>
> I've been trying to extract the topics from my run of cvb. Here's what I've
> done so far.
>
> OK, so I still don't know how to output the topics, but I have worked out how
> to run cvb and get what I think are the document vectors. However, I'm not
> having any luck dumping them, so help here would still be appreciated!
>
> I set the values of:
> export MAHOUT_HOME=/home/sgeadmin/mahout
> export HADOOP_HOME=/usr/lib/hadoop
> export JAVA_HOME=/usr/lib/jvm/java-6-openjdk
> export HADOOP_CONF_DIR=$HADOOP_HOME/conf
> on the master, otherwise none of this works.
>
> So first I uploaded the documents using StarCluster's put:
> starcluster put mycluster text_train /home/sgeadmin/
> starcluster put mycluster text_test /home/sgeadmin/
>
> Then I added them to Hadoop's HDFS:
> dumbo put /home/sgeadmin/text_train /user/sgeadmin/ -hadoop starcluster
>
> Then I called Mahout's seqdirectory to turn the text into sequence files:
> $MAHOUT_HOME/bin/mahout seqdirectory --input /user/sgeadmin/text_train --output /user/sgeadmin/text_seq -c UTF-8 -ow
>
> Then I called Mahout's seq2sparse to turn them into vectors:
> $MAHOUT_HOME/bin/mahout seq2sparse -i text_seq -o /user/sgeadmin/text_vec -wt tf -a org.apache.lucene.analysis.WhitespaceAnalyzer -ow
>
> Finally I called cvb. I believe that the -dt flag states where the inferred
> topics should go, but because I haven't yet been able to dump them I can't
> confirm this.
> $MAHOUT_HOME/bin/mahout cvb -i /user/sgeadmin/text_vec/tf-vectors -o /user/sgeadmin/text_lda -k 100 -nt 29536 -x 20 -dict /user/sgeadmin/text_vec/dictionary.file-0 -dt /user/sgeadmin/text_cvb_document -mt /user/sgeadmin/text_states
>
> The -k flag is the number of topics, the -nt flag is the size of the
> dictionary (I computed this by counting the number of entries in
> dictionary.file-0 inside the vectors folder, in this case under
> /user/sgeadmin/text_vec), and -x is the number of iterations.
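>
> (For reference, one rough way to get that count, since seqdumper prints a Count: line at the end of its output:)
>
> $MAHOUT_HOME/bin/mahout seqdumper -i /user/sgeadmin/text_vec/dictionary.file-0 | tail -1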
>
> If you know how to get the document topic probabilities from here, help
> would be most appreciated!
>
> Kind Regards,
> Folcon
>
> --
>
> -jake