Hey Everyone,

Ok, I'm not certain why $MAHOUT_HOME/bin/mahout seqdirectory --input
/user/sgeadmin/text_train --output /user/sgeadmin/text_seq -c UTF-8 -ow
didn't produce sequence files; looking inside text_seq only gives me:

SEQorg.apache.hadoop.io.Textorg.apache.hadoop.io.Text

and that's it, which looks like just the SequenceFile header with no actual
records. Any ideas what I've been doing wrong? Maybe it's because I have the
files nested in the folder by class; for example, a tree view of the
directory would look like:

text_train -+
                | A -+
                       | 100
                       | 101
                       | 103
                | B -+
                       | 102
                       | 105
                       | 106

So it's not picking them up? Or perhaps something else? I'm going to try
some variations to see what happens.
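In case the nesting is the problem, one workaround I might try is flattening the per-class subdirectories into a single folder before running seqdirectory. Just a sketch against the tree above; the flatten_by_class helper and the text_train_flat directory are made-up names:

```shell
# Sketch: flatten text_train/<class>/<doc> into a single directory, prefixing
# each file with its class name so the class label survives and the document
# ids stay unique. "flatten_by_class" is a hypothetical helper name.
flatten_by_class() {
  src=$1; dst=$2
  mkdir -p "$dst"
  for class_dir in "$src"/*/; do
    class=$(basename "$class_dir")
    for f in "$class_dir"*; do
      if [ -f "$f" ]; then
        cp "$f" "$dst/${class}_$(basename "$f")"
      fi
    done
  done
}

# e.g. flatten_by_class text_train text_train_flat, then point seqdirectory
# at text_train_flat instead of text_train.
```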

Thanks for the help so far!

Regards,
Folcon

On 29 July 2012 22:10, Folcon Red <[email protected]> wrote:

> Right, well, here's something promising: running $MAHOUT_HOME/bin/mahout
> seqdumper -i /user/sgeadmin/text_lda/part-m-00000 produced:
>
>
> 7:NaN,29478:NaN,29479:NaN,29480:NaN,29481:NaN,29482:NaN,29483:NaN,29484:NaN,29485:NaN,29486:NaN,29487:NaN,29488:NaN,29489:NaN,29490:NaN,29491:NaN,29492:NaN,29493:NaN,29494:NaN,29495:NaN,29496:NaN,29497:NaN,29498:NaN,29499:NaN,29500:NaN,29501:NaN,29502:NaN,29503:NaN,29504:NaN,29505:NaN,29506:NaN,29507:NaN,29508:NaN,29509:NaN,29510:NaN,29511:NaN,29512:NaN,29513:NaN,29514:NaN,29515:NaN,29516:NaN,29517:NaN,29518:NaN,29519:NaN,29520:NaN,29521:NaN,29522:NaN,29523:NaN,29524:NaN,29525:NaN,29526:NaN,29527:NaN,29528:NaN,29529:NaN,29530:NaN,29531:NaN,29532:NaN,29533:NaN,29534:NaN,29535:NaN}
>
> And $MAHOUT_HOME/bin/mahout seqdumper -i
> /user/sgeadmin/text_vec/tf-vectors/part-r-00000 produced:
>
> MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
> Running on hadoop, using /usr/lib/hadoop/bin/hadoop and
> HADOOP_CONF_DIR=/usr/lib/hadoop/conf
> MAHOUT-JOB: /home/sgeadmin/mahout/mahout-examples-0.7-job.jar
> 12/07/29 21:09:17 INFO common.AbstractJob: Command line arguments:
> {--endPhase=[2147483647],
> --input=[/user/sgeadmin/text_vec/tf-vectors/part-r-00000],
> --startPhase=[0], --tempDir=[temp]}
> Input Path: /user/sgeadmin/text_vec/tf-vectors/part-r-00000
> Key class: class org.apache.hadoop.io.Text Value Class: class
> org.apache.mahout.math.VectorWritable
> Count: 0
>
> Kind Regards,
> Folcon
>
> On 29 July 2012 21:29, DAN HELM <[email protected]> wrote:
>
>> Yep, something went wrong, most likely with the clustering. The part file
>> is empty. It should look something like this:
>>
>> Key class: class org.apache.hadoop.io.IntWritable Value Class: class
>> org.apache.mahout.math.VectorWritable
>> Key: 0: Value:
>> {0:0.06475650422868284,1:0.010728747158503565,2:0.005463535698651016,3:0.023451709705466457}
>> Key: 1: Value:
>> {0:0.01838885430227092,1:0.05068404879399544,2:0.002110418548647133,3:0.005566514441743756}
>> Key: 2: Value:
>> {0:0.018575587065216153,1:1.236602313900785E-5,2:8.654629660837919E-6,3:5.820637306957196E-6}
>> ...
>> ...
>>
>> The Key refers to a document id and the Value holds the topic id:weight
>> pairs assigned to that document.
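If it helps once the part files are non-empty, here is a rough way to pull the top topic per document out of the seqdumper text dump. Just a sketch that assumes each record prints on one line in the "Key: <doc>: Value: {topic:weight,...}" form shown above; top_topic is a made-up name:

```shell
# Sketch: for each "Key: <doc>: Value: {topic:weight,...}" line from
# seqdumper, print the document id, its highest-weight topic, and that weight.
top_topic() {
  awk -F'Value: ' '/^Key: / {
    doc = $1
    sub(/^Key: /, "", doc); sub(/: *$/, "", doc)
    vec = $2
    gsub(/[{}]/, "", vec)            # strip the braces around the vector
    n = split(vec, pairs, ",")
    best = ""; bestw = -1
    for (i = 1; i <= n; i++) {
      split(pairs[i], kv, ":")       # kv[1] = topic id, kv[2] = weight
      if (kv[2] + 0 > bestw) { bestw = kv[2] + 0; best = kv[1] }
    }
    print doc, best, bestw
  }'
}

# e.g. mahout seqdumper -i /user/sgeadmin/text_cvb_document/part-m-00000 | top_topic
```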
>>
>> So you need to figure out where things went wrong. I assume folder
>> /user/sgeadmin/text_lda also has empty part files? Assuming part files
>> are there, run seqdumper on one. It should have data like the above,
>> except in this case the key will be a topic id and the vector will be
>> term id:weight pairs.
>>
>> You can also check folder /user/sgeadmin/text_vec/tf-vectors to make
>> sure sparse vectors were generated for your input to cvb.
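If you want to script those checks, one option is to key off the "Count: N" summary line that seqdumper prints, as in the outputs earlier in this thread. Just a sketch; seqdump_count is a made-up name, and it assumes that summary line is present:

```shell
# Sketch: extract the record count from seqdumper output so empty part files
# can be detected in a script. Relies on the "Count: N" summary line; prints
# 0 if no such line appears.
seqdump_count() {
  awk '/^Count: / { n = $2 } END { print n + 0 }'
}

# e.g.
#   if [ "$($MAHOUT_HOME/bin/mahout seqdumper \
#       -i /user/sgeadmin/text_vec/tf-vectors/part-r-00000 | seqdump_count)" -eq 0 ]
#   then echo "tf-vectors part file is empty"; fi
```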
>>
>> Dan
>>
>>    *From:* Folcon Red <[email protected]>
>> *To:* DAN HELM <[email protected]>
>> *Cc:* Jake Mannix <[email protected]>; "[email protected]" <
>> [email protected]>
>> *Sent:* Sunday, July 29, 2012 3:35 PM
>>
>> *Subject:* Re: Using Mahout to train a CVB and retrieve its topics
>>
>> Thanks Dan and Jake,
>>
>> The output I got from $MAHOUT_HOME/bin/mahout seqdumper -i
>> /user/sgeadmin/text_cvb_document/part-m-00000 is:
>>
>> Input Path: /user/sgeadmin/text_cvb_document/part-m-00000
>> Key class: class org.apache.hadoop.io.IntWritable Value Class: class
>> org.apache.mahout.math.VectorWritable
>> Count: 0
>>
>> I'm not certain what went wrong.
>>
>> Kind Regards,
>> Folcon
>>
>> On 29 July 2012 18:49, DAN HELM <[email protected]> wrote:
>>
>> Folcon,
>>
>> I'm still using Mahout 0.6, so I don't know much about the changes in 0.7.
>>
>> Your output folder for "-dt" looks correct. The relevant data would be
>> in /user/sgeadmin/text_cvb_document/part-m-00000, which is what I would
>> be passing to a "-s" option. But I see it says the size is only 97 bytes,
>> so that looks suspicious. You can just view the file (for starters) as:
>> mahout seqdumper -s /user/sgeadmin/text_cvb_document/part-m-00000. And
>> the vector dumper command (as Jake pointed out) has a lot more options
>> to post-process the data, but you may want to first just see what is in
>> that file.
>>
>> Dan
>>
>>    *From:* Folcon Red <[email protected]>
>> *To:* Jake Mannix <[email protected]>
>> *Cc:* [email protected]; DAN HELM <[email protected]>
>> *Sent:* Sunday, July 29, 2012 1:08 PM
>> *Subject:* Re: Using Mahout to train a CVB and retrieve its topics
>>
>> Hi Guys,
>>
>> Thanks for replying. The problem is that whenever I use any -s flag I get
>> the error "Unexpected -s while processing Job-Specific Options:"
>>
>> Also, I'm not sure if this is supposed to be the output of -dt:
>>
>> sgeadmin@master:~$ dumbo ls /user/sgeadmin/text_cvb_document -hadoop starcluster
>> Found 3 items
>> -rw-r--r--   3 sgeadmin supergroup          0 2012-07-29 16:51 /user/sgeadmin/text_cvb_document/_SUCCESS
>> drwxr-xr-x   - sgeadmin supergroup          0 2012-07-29 16:50 /user/sgeadmin/text_cvb_document/_logs
>> -rw-r--r--   3 sgeadmin supergroup         97 2012-07-29 16:51 /user/sgeadmin/text_cvb_document/part-m-00000
>>
>> Should I be using a newer version of Mahout? I've just been using the 0.7
>> distribution so far, as apparently the self-compiled versions are missing
>> parts that the distributed ones have.
>>
>> Kind Regards,
>> Folcon
>>
>> PS: Thanks for the help so far!
>>
>> On 29 July 2012 04:52, Jake Mannix <[email protected]> wrote:
>>
>>
>>
>> On Sat, Jul 28, 2012 at 6:40 PM, DAN HELM <[email protected]> wrote:
>>
>> Hi Folcon,
>>
>> In the folder you specified for the -dt option of the cvb command there
>> should be sequence files with the document-to-topic associations (Key:
>> IntWritable, Value: VectorWritable).
>>
>>
>> Yeah, this is correct, although this:
>>
>>
>> You can dump in text format as: mahout seqdumper -s <sequence file>
>>
>>
>> is not as good as using vectordumper:
>>
>>    mahout vectordump -s <sequence file> \
>>        --dictionary <path to dictionary.file-0> \
>>        --dictionaryType seqfile \
>>        --vectorSize <num entries per topic you want to see> -sort
>>
>> This joins your topic vectors with the dictionary, then picks out the top
>> k terms (with their
>> probabilities) for each topic and prints them to the console (or to the
>> file you specify with
>> an --output option).
>>
>> *although* I notice now that in trunk when I just checked, VectorDumper.java
>> had a bug
>> in it for "vectorSize" - line 175 asks for cmdline option "
>> numIndexesPerVector" not
>> vectorSize, ack!  So I took the liberty of fixing that, but you'll need
>> to "svn up" and rebuild
>> your jar before using vectordump like this.
>>
>>
>> So in the text output from seqdumper, the key is a document id and the
>> vector contains the topics and their associated scores for that document.
>> I think all topics are listed for each document, but many with a near-zero
>> score.
>>
>> In my case I used rowid to convert the keys of the original sparse
>> document vectors from Text to Integer before running cvb. This generates
>> a mapping file, so I know the textual keys that correspond to the numeric
>> document ids (since my original document ids were file names and I
>> created named vectors).
>>
>> Hope this helps.
>> Dan
>>
>> ________________________________
>>
>>  From: Folcon <[email protected]>
>> To: [email protected]
>> Sent: Saturday, July 28, 2012 8:28 PM
>> Subject: Using Mahout to train a CVB and retrieve its topics
>>
>> Hi Everyone,
>>
>> I'm posting this as my original message did not seem to appear on the
>> mailing list; I'm very sorry if I have done this in error.
>>
>> I'm doing this so I can then use the topics to train a maxent algorithm
>> to predict the classes of documents given their topic mixtures. Any
>> further aid in this direction would be appreciated!
>>
>> I've been trying to extract the topics from my run of cvb. Here's what
>> I've done so far.
>>
>> Ok, so I still don't know how to output the topics, but I have worked out
>> how to run cvb and get what I think are the document vectors. However,
>> I'm not having any luck dumping them, so help here would still be
>> appreciated!
>>
>> I set the values of:
>>     export MAHOUT_HOME=/home/sgeadmin/mahout
>>     export HADOOP_HOME=/usr/lib/hadoop
>>     export JAVA_HOME=/usr/lib/jvm/java-6-openjdk
>>     export HADOOP_CONF_DIR=$HADOOP_HOME/conf
>> on the master; otherwise none of this works.
>>
>> So first I uploaded the documents using starclusters put:
>>     starcluster put mycluster text_train /home/sgeadmin/
>>     starcluster put mycluster text_test /home/sgeadmin/
>>
>> Then I added them to Hadoop's HDFS filesystem:
>>     dumbo put /home/sgeadmin/text_train /user/sgeadmin/ -hadoop starcluster
>>
>> Then I called Mahout's seqdirectory to turn the text into sequence files:
>>     $MAHOUT_HOME/bin/mahout seqdirectory --input /user/sgeadmin/text_train \
>>         --output /user/sgeadmin/text_seq -c UTF-8 -ow
>>
>> Then I called Mahout's seq2sparse to turn them into vectors:
>>     $MAHOUT_HOME/bin/mahout seq2sparse -i text_seq -o /user/sgeadmin/text_vec \
>>         -wt tf -a org.apache.lucene.analysis.WhitespaceAnalyzer -ow
>>
>> Finally I called cvb. I believe that the -dt flag states where the
>> inferred topics should go, but because I haven't yet been able to dump
>> them I can't confirm this.
>>     $MAHOUT_HOME/bin/mahout cvb -i /user/sgeadmin/text_vec/tf-vectors \
>>         -o /user/sgeadmin/text_lda -k 100 -nt 29536 -x 20 \
>>         -dict /user/sgeadmin/text_vec/dictionary.file-0 \
>>         -dt /user/sgeadmin/text_cvb_document \
>>         -mt /user/sgeadmin/text_states
>>
>> The -k flag is the number of topics, the -nt flag is the size of the
>> dictionary, and -x is the number of iterations. I computed the dictionary
>> size by counting the number of entries in dictionary.file-0 inside the
>> vectors folder (in this case under /user/sgeadmin/text_vec).
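For reference, that entry count could be scripted rather than counted by hand. Just a sketch that assumes seqdumper prints one "Key: ..." line per dictionary entry in its text dump; dict_size is a made-up name:

```shell
# Sketch: compute the -nt value by counting dictionary entries in the
# seqdumper text dump (one "Key: ..." line per term).
dict_size() {
  grep -c '^Key:'
}

# e.g. $MAHOUT_HOME/bin/mahout seqdumper \
#   -i /user/sgeadmin/text_vec/dictionary.file-0 | dict_size
```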
>>
>> If you know how to get what the document topic probabilities are from
>> here, help
>> would be most appreciated!
>>
>> Kind Regards,
>> Folcon
>>
>>
>>
>>
>> --
>>
>>   -jake
