So part-r-00000 inside text_vec still reads as just "SEQ org.apache.hadoop.io.Text org.apache.mahout.math.VectorWritable" even after moving all the training files into a single folder.
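If the nested class folders really are the problem, one workaround is to flatten them before running seqdirectory, prefixing each file with its class folder name so names from different classes cannot collide. A minimal sketch; the two demo files and the text_train_flat name are invented for illustration:

```shell
# Hypothetical demo layout standing in for the real text_train tree.
mkdir -p text_train/A text_train/B text_train_flat
echo "doc one" > text_train/A/100
echo "doc two" > text_train/B/102

# Copy every file up one level, prefixed with its class folder name.
for class_dir in text_train/*/; do
  class=$(basename "$class_dir")
  for f in "$class_dir"*; do
    cp "$f" "text_train_flat/${class}_$(basename "$f")"
  done
done

ls text_train_flat    # A_100  B_102
```

The flattened folder can then be handed to seqdirectory as a single-level input, and the class label is still recoverable from each document id.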
Regards,
Folcon

On 31 July 2012 18:18, Folcon Red <[email protected]> wrote:

> Hey Everyone,
>
> OK, I'm not certain why
>
> $MAHOUT_HOME/bin/mahout seqdirectory --input /user/sgeadmin/text_train --output /user/sgeadmin/text_seq -c UTF-8 -ow
>
> didn't produce sequence files; looking inside text_seq only gives me:
>
> SEQ org.apache.hadoop.io.Text org.apache.hadoop.io.Text
>
> and that's it. Any ideas what I've been doing wrong? Maybe it's because I
> have the files nested in the folder by class. For example, a tree view of
> the directory would look like:
>
> text_train -+
>             | A -+
>             |     100
>             |     101
>             |     103
>             | B -+
>             |     102
>             |     105
>             |     106
>
> So it's not picking them up? Or perhaps something else? I'm going to try
> some variations to see what happens.
>
> Thanks for the help so far!
>
> Regards,
> Folcon
>
> On 29 July 2012 22:10, Folcon Red <[email protected]> wrote:
>
>> Right, well here's something promising: running
>>
>> $MAHOUT_HOME/bin/mahout seqdumper -i /user/sgeadmin/text_lda/part-m-00000
>>
>> produced:
>>
>> 7:NaN,29478:NaN,29479:NaN,29480:NaN,29481:NaN,29482:NaN,29483:NaN,29484:NaN,29485:NaN,29486:NaN,29487:NaN,29488:NaN,29489:NaN,29490:NaN,29491:NaN,29492:NaN,29493:NaN,29494:NaN,29495:NaN,29496:NaN,29497:NaN,29498:NaN,29499:NaN,29500:NaN,29501:NaN,29502:NaN,29503:NaN,29504:NaN,29505:NaN,29506:NaN,29507:NaN,29508:NaN,29509:NaN,29510:NaN,29511:NaN,29512:NaN,29513:NaN,29514:NaN,29515:NaN,29516:NaN,29517:NaN,29518:NaN,29519:NaN,29520:NaN,29521:NaN,29522:NaN,29523:NaN,29524:NaN,29525:NaN,29526:NaN,29527:NaN,29528:NaN,29529:NaN,29530:NaN,29531:NaN,29532:NaN,29533:NaN,29534:NaN,29535:NaN}
>>
>> And
>>
>> $MAHOUT_HOME/bin/mahout seqdumper -i /user/sgeadmin/text_vec/tf-vectors/part-r-00000
>>
>> produced:
>>
>> MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
>> Running on hadoop, using /usr/lib/hadoop/bin/hadoop and HADOOP_CONF_DIR=/usr/lib/hadoop/conf
>> MAHOUT-JOB: /home/sgeadmin/mahout/mahout-examples-0.7-job.jar
>> 12/07/29 21:09:17 INFO common.AbstractJob: Command line arguments: {--endPhase=[2147483647], --input=[/user/sgeadmin/text_vec/tf-vectors/part-r-00000], --startPhase=[0], --tempDir=[temp]}
>> Input Path: /user/sgeadmin/text_vec/tf-vectors/part-r-00000
>> Key class: class org.apache.hadoop.io.Text Value Class: class org.apache.mahout.math.VectorWritable
>> Count: 0
>>
>> Kind Regards,
>> Folcon
>>
>> On 29 July 2012 21:29, DAN HELM <[email protected]> wrote:
>>
>>> Yep, something went wrong, most likely with the clustering; the part file is
>>> empty. It should look something like this:
>>>
>>> Key class: class org.apache.hadoop.io.IntWritable Value Class: class org.apache.mahout.math.VectorWritable
>>> Key: 0: Value: {0:0.06475650422868284,1:0.010728747158503565,2:0.005463535698651016,3:0.023451709705466457}
>>> Key: 1: Value: {0:0.01838885430227092,1:0.05068404879399544,2:0.002110418548647133,3:0.005566514441743756}
>>> Key: 2: Value: {0:0.018575587065216153,1:1.236602313900785E-5,2:8.654629660837919E-6,3:5.820637306957196E-6}
>>> ...
>>>
>>> The Key refers to a document id, and the Value holds the topic id:weight
>>> pairs assigned to that document.
>>>
>>> So you need to figure out where things went wrong. I assume folder
>>> /user/sgeadmin/text_lda also has empty part files? Assuming the part
>>> files are there, run seqdumper on one. It should have data like the
>>> above, except in this case the key will be a topic id and the vector
>>> will hold term id:weight pairs.
>>>
>>> You can also check folder /user/sgeadmin/text_vec/tf-vectors to make
>>> sure sparse vectors were generated for your input to cvb.
>>> Dan
>>>
>>> *From:* Folcon Red <[email protected]>
>>> *To:* DAN HELM <[email protected]>
>>> *Cc:* Jake Mannix <[email protected]>; "[email protected]" <[email protected]>
>>> *Sent:* Sunday, July 29, 2012 3:35 PM
>>> *Subject:* Re: Using Mahout to train a CVB and retrieve its topics
>>>
>>> Thanks Dan and Jake,
>>>
>>> The output I got from
>>>
>>> $MAHOUT_HOME/bin/mahout seqdumper -i /user/sgeadmin/text_cvb_document/part-m-00000
>>>
>>> is:
>>>
>>> Input Path: /user/sgeadmin/text_cvb_document/part-m-00000
>>> Key class: class org.apache.hadoop.io.IntWritable Value Class: class org.apache.mahout.math.VectorWritable
>>> Count: 0
>>>
>>> I'm not certain what went wrong.
>>>
>>> Kind Regards,
>>> Folcon
>>>
>>> On 29 July 2012 18:49, DAN HELM <[email protected]> wrote:
>>>
>>> Folcon,
>>>
>>> I'm still using Mahout 0.6, so I don't know much about the changes in 0.7.
>>>
>>> Your output folder for "-dt" looks correct. The relevant data would be
>>> in /user/sgeadmin/text_cvb_document/part-m-00000, which is what I would
>>> be passing to a "-s" option. But I see it says the size is only 97, so
>>> that looks suspicious. You can just view the file (for starters) as:
>>>
>>> mahout seqdumper -s /user/sgeadmin/text_cvb_document/part-m-00000
>>>
>>> And the vector dumper command (as Jake pointed out) has a lot more
>>> options to post-process the data, but you may want to first just see
>>> what is in that file.
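The "Count: 0" symptom that keeps coming up in this thread can be checked for mechanically once seqdumper's output is saved to a file. A minimal sketch; dump.txt here is a hand-made stand-in mimicking the seqdumper output quoted above, not real cluster output:

```shell
# Stand-in for "mahout seqdumper -i <part file>" output saved to disk;
# the lines copy the shape of the dumps quoted in this thread.
cat > dump.txt <<'EOF'
Input Path: /user/sgeadmin/text_cvb_document/part-m-00000
Key class: class org.apache.hadoop.io.IntWritable Value Class: class org.apache.mahout.math.VectorWritable
Count: 0
EOF

# Pull the record count from the trailing "Count:" line; zero means the
# upstream job wrote a header-only (empty) sequence file.
count=$(awk '/^Count:/ {print $2}' dump.txt)
if [ "$count" -eq 0 ]; then
  echo "empty sequence file: the job that wrote it produced no records"
fi
```

A guard like this after each pipeline stage (seqdirectory, seq2sparse, cvb) would have localized which step first produced an empty part file.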
>>> Dan
>>>
>>> *From:* Folcon Red <[email protected]>
>>> *To:* Jake Mannix <[email protected]>
>>> *Cc:* [email protected]; DAN HELM <[email protected]>
>>> *Sent:* Sunday, July 29, 2012 1:08 PM
>>> *Subject:* Re: Using Mahout to train a CVB and retrieve its topics
>>>
>>> Hi Guys,
>>>
>>> Thanks for replying. The problem is that whenever I use any -s flag I
>>> get the error "Unexpected -s while processing Job-Specific Options:".
>>>
>>> Also, I'm not sure if this is supposed to be the output of -dt:
>>>
>>> sgeadmin@master:~$ dumbo ls /user/sgeadmin/text_cvb_document -hadoop starcluster
>>> Found 3 items
>>> -rw-r--r--  3 sgeadmin supergroup   0 2012-07-29 16:51 /user/sgeadmin/text_cvb_document/_SUCCESS
>>> drwxr-xr-x  - sgeadmin supergroup   0 2012-07-29 16:50 /user/sgeadmin/text_cvb_document/_logs
>>> -rw-r--r--  3 sgeadmin supergroup  97 2012-07-29 16:51 /user/sgeadmin/text_cvb_document/part-m-00000
>>>
>>> Should I be using a newer version of Mahout? I've just been using the
>>> 0.7 distribution so far, as apparently the compiled versions are missing
>>> parts that the distributed ones have.
>>>
>>> Kind Regards,
>>> Folcon
>>>
>>> PS: Thanks for the help so far!
>>>
>>> On 29 July 2012 04:52, Jake Mannix <[email protected]> wrote:
>>>
>>> On Sat, Jul 28, 2012 at 6:40 PM, DAN HELM <[email protected]> wrote:
>>>
>>> Hi Folcon,
>>>
>>> In the folder you specified for the -dt option of the cvb command,
>>> there should be sequence files with the document-to-topic associations
>>> (Key: IntWritable, Value: VectorWritable).
>>> Yeah, this is correct, although this:
>>>
>>> You can dump in text format as: mahout seqdumper -s <sequence file>
>>>
>>> is not as good as using vectordump:
>>>
>>> mahout vectordump -s <sequence file> --dictionary <path to dictionary.file-0> \
>>>   --dictionaryType seqfile --vectorSize <num entries per topic you want to see> -sort
>>>
>>> This joins your topic vectors with the dictionary, then picks out the
>>> top k terms (with their probabilities) for each topic and prints them
>>> to the console (or to the file you specify with an --output option).
>>>
>>> *Although* I notice now, when I just checked trunk, that VectorDumper.java
>>> had a bug in it for "vectorSize": line 175 asks for the cmdline option
>>> "numIndexesPerVector", not vectorSize, ack! So I took the liberty of
>>> fixing that, but you'll need to "svn up" and rebuild your jar before
>>> using vectordump like this.
>>>
>>> So in the text output from seqdumper, the key is a document id and the
>>> vector contains the topics and their associated scores for that
>>> document. I think all topics are listed for each document, but many
>>> with a near-zero score.
>>>
>>> In my case I used rowid to convert the keys of the original sparse
>>> document vectors from Text to Integer before running cvb. This
>>> generates a mapping file, so I know the textual keys that correspond to
>>> the numeric document ids (since my original document ids were file
>>> names and I created named vectors).
>>>
>>> Hope this helps.
>>> Dan
>>>
>>> ________________________________
>>>
>>> From: Folcon <[email protected]>
>>> To: [email protected]
>>> Sent: Saturday, July 28, 2012 8:28 PM
>>> Subject: Using Mahout to train a CVB and retrieve its topics
>>>
>>> Hi Everyone,
>>>
>>> I'm posting this as my original message did not seem to appear on the
>>> mailing list. I'm very sorry if I have done this in error.
>>> I'm doing this so I can then use the topics to train a maxent algorithm
>>> to predict the classes of documents given their topic mixtures. Any
>>> further aid in this direction would be appreciated!
>>>
>>> I've been trying to extract the topics out of my run of cvb. Here's
>>> what I have done so far.
>>>
>>> OK, so I still don't know how to output the topics, but I have worked
>>> out how to get the cvb output and what I think are the document
>>> vectors. However, I'm not having any luck dumping them, so help here
>>> would still be appreciated!
>>>
>>> I set the values of:
>>>
>>> export MAHOUT_HOME=/home/sgeadmin/mahout
>>> export HADOOP_HOME=/usr/lib/hadoop
>>> export JAVA_HOME=/usr/lib/jvm/java-6-openjdk
>>> export HADOOP_CONF_DIR=$HADOOP_HOME/conf
>>>
>>> on the master; otherwise none of this works.
>>>
>>> So first I uploaded the documents using StarCluster's put:
>>>
>>> starcluster put mycluster text_train /home/sgeadmin/
>>> starcluster put mycluster text_test /home/sgeadmin/
>>>
>>> Then I added them to Hadoop's HDFS filesystem:
>>>
>>> dumbo put /home/sgeadmin/text_train /user/sgeadmin/ -hadoop starcluster
>>>
>>> Then I called Mahout's seqdirectory to turn the text into sequence files:
>>>
>>> $MAHOUT_HOME/bin/mahout seqdirectory --input /user/sgeadmin/text_train --output /user/sgeadmin/text_seq -c UTF-8 -ow
>>>
>>> Then I called Mahout's seq2sparse to turn them into vectors:
>>>
>>> $MAHOUT_HOME/bin/mahout seq2sparse -i text_seq -o /user/sgeadmin/text_vec -wt tf -a org.apache.lucene.analysis.WhitespaceAnalyzer -ow
>>>
>>> Finally I called cvb. I believe that the -dt flag states where the
>>> inferred topics should go, but because I haven't yet been able to dump
>>> them I can't confirm this.
>>> $MAHOUT_HOME/bin/mahout cvb -i /user/sgeadmin/text_vec/tf-vectors -o /user/sgeadmin/text_lda -k 100 -nt 29536 -x 20 -dict /user/sgeadmin/text_vec/dictionary.file-0 -dt /user/sgeadmin/text_cvb_document -mt /user/sgeadmin/text_states
>>>
>>> The -k flag is the number of topics, and the -nt flag is the size of
>>> the dictionary. I computed this by counting the number of entries in
>>> dictionary.file-0 inside the vectors folder (in this case under
>>> /user/sgeadmin/text_vec), and -x is the number of iterations.
>>>
>>> If you know how to get the document topic probabilities from here,
>>> help would be most appreciated!
>>>
>>> Kind Regards,
>>> Folcon
>>>
>>> --
>>> -jake
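A note on computing -nt: counting dictionary entries by hand is unnecessary, since the trailing "Count:" line that seqdumper prints (visible in the outputs quoted above) already gives the number of terms, and the "Key: <term>: Value: <id>" entry lines can be counted directly. A sketch; dict_dump.txt and its three terms are a hand-made stand-in for a real dumped dictionary.file-0:

```shell
# Stand-in for "mahout seqdumper -i .../dictionary.file-0" output saved
# to a file; the terms are invented, but the line shapes match the
# seqdumper output quoted in this thread.
cat > dict_dump.txt <<'EOF'
Input Path: /user/sgeadmin/text_vec/dictionary.file-0
Key class: class org.apache.hadoop.io.Text Value Class: class org.apache.hadoop.io.IntWritable
Key: apple: Value: 0
Key: banana: Value: 1
Key: cherry: Value: 2
Count: 3
EOF

# Either read the trailing Count: line...
awk '/^Count:/ {print $2}' dict_dump.txt    # 3
# ...or count the "Key: <term>: Value: <id>" entry lines directly.
grep -c '^Key: ' dict_dump.txt              # 3
```

Note that the pattern '^Key: ' (with the colon) deliberately skips the "Key class:" header line, so both methods agree.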
