Re: Using Mahout to train an CVB and retrieve it's topics

DAN HELM Sun, 29 Jul 2012 13:30:15 -0700

Yep, something went wrong, most likely with the clustering.  The part file is 
empty.  It should look something like this:
 
Key class: class org.apache.hadoop.io.IntWritable Value Class: class org.apache.mahout.math.VectorWritable
Key: 0: Value: {0:0.06475650422868284,1:0.010728747158503565,2:0.005463535698651016,3:0.023451709705466457}
Key: 1: Value: {0:0.01838885430227092,1:0.05068404879399544,2:0.002110418548647133,3:0.005566514441743756}
Key: 2: Value: {0:0.018575587065216153,1:1.236602313900785E-5,2:8.654629660837919E-6,3:5.820637306957196E-6}
...
...
 
The Key is a document id and the Value holds the topic id:weight pairs 
assigned to that document.
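
If you want to post-process that text dump rather than read it by eye, here is a minimal Python sketch. It assumes the dump looks like the sample above ("Key: <id>: Value: {index:weight,...}"); the names parse_dump and top_topics are made up for illustration and are not part of Mahout.

```python
import re

# One "Key: <id>: Value: {idx:weight,...}" record. The \s* between fields
# tolerates a line break between "Value:" and the vector.
RECORD = re.compile(r"Key:\s*(\d+):\s*Value:\s*\{([^}]*)\}")


def parse_dump(text):
    """Return {doc_id: {topic_id: weight}} parsed from seqdumper-style text."""
    docs = {}
    for key, body in RECORD.findall(text):
        weights = {}
        for pair in body.split(","):
            if not pair.strip():
                continue  # empty vector body
            idx, weight = pair.split(":")
            weights[int(idx)] = float(weight)
        docs[int(key)] = weights
    return docs


def top_topics(weights, n=3):
    """Return the n (topic_id, weight) pairs with the largest weights."""
    return sorted(weights.items(), key=lambda kv: kv[1], reverse=True)[:n]


# Two records copied from the sample output above.
sample = """
Key: 0: Value:
{0:0.06475650422868284,1:0.010728747158503565,2:0.005463535698651016,3:0.023451709705466457}
Key: 2: Value:
{0:0.018575587065216153,1:1.236602313900785E-5,2:8.654629660837919E-6,3:5.820637306957196E-6}
"""

docs = parse_dump(sample)
for doc_id, weights in sorted(docs.items()):
    print(doc_id, top_topics(weights, n=2))
```

An empty part file (Count: 0) simply yields an empty dict here, which is a quick way to confirm that nothing was written.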
 
So you need to figure out where things went wrong.  I assume folder 
/user/sgeadmin/text_lda also has empty part files?  Assuming the part files are 
there, run seqdumper on one.  It should have data like the above, except in this 
case the key will be a topic id and the vector will be term ids:weights.
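
Once the topic vectors are there, turning term ids into readable terms is just a lookup against the dictionary. A rough Python sketch of that join follows; the dictionary and weights are invented sample data (not output from a real run), and top_terms is a hypothetical helper, not a Mahout function.

```python
def top_terms(term_weights, dictionary, n=5):
    """Return the n highest-weighted (term, weight) pairs for one topic.

    term_weights: {term_id: weight} for a single topic
    dictionary:   {term_id: term}
    """
    ranked = sorted(term_weights.items(), key=lambda kv: kv[1], reverse=True)
    # Fall back to a placeholder if a term id is missing from the dictionary.
    return [(dictionary.get(term_id, "<unknown:%d>" % term_id), weight)
            for term_id, weight in ranked[:n]]


# Invented sample data: 4 terms, 2 topics.
dictionary = {0: "hadoop", 1: "mahout", 2: "topic", 3: "vector"}
topics = {
    0: {0: 0.41, 1: 0.07, 2: 0.02, 3: 0.50},
    1: {0: 0.05, 1: 0.60, 2: 0.30, 3: 0.05},
}

for topic_id, weights in sorted(topics.items()):
    print(topic_id, top_terms(weights, dictionary, n=2))
```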
 
You can also check folder /user/sgeadmin/text_vec/tf-vectors to make sure 
sparse vectors were generated for your input to cvb.
 
Dan


________________________________
 From: Folcon Red <[email protected]>
To: DAN HELM <[email protected]> 
Cc: Jake Mannix <[email protected]>; "[email protected]" 
<[email protected]> 
Sent: Sunday, July 29, 2012 3:35 PM
Subject: Re: Using Mahout to train an CVB and retrieve it's topics
  

Thanks Dan and Jake,

The output I got from $MAHOUT_HOME/bin/mahout seqdumper -i 
/user/sgeadmin/text_cvb_document/part-m-00000 is: 

Input Path: /user/sgeadmin/text_cvb_document/part-m-00000
Key class: class org.apache.hadoop.io.IntWritable Value Class: class org.apache.mahout.math.VectorWritable
Count: 0 

I'm not certain what went wrong.

Kind Regards,
Folcon

On 29 July 2012 18:49, DAN HELM <[email protected]> wrote:

>Folcon,
>  
>I'm still using Mahout 0.6 so don't know much about changes in 0.7.
> 
>Your output folder for "dt" looks correct.  The relevant data would be in  
>/user/sgeadmin/text_cvb_document/part-m-00000 which is what I would be passing 
>to a "-s" option.  But I see it says size is only 97 so that looks 
>suspicious.  So you can just view file (for starters) as: mahout seqdumper -s 
>/user/sgeadmin/text_cvb_document/part-m-00000.  And the vector dumper 
>command (as Jake pointed out) has a lot more options to post-process the data 
>but you may want to first just see what is in that file. 
> 
>Dan
>
> 
> From: Folcon Red <[email protected]>
>To: Jake Mannix <[email protected]> 
>Cc: [email protected]; DAN HELM <[email protected]> 
>Sent: Sunday, July 29, 2012 1:08 PM
>Subject: Re: Using Mahout to train an CVB and retrieve it's topics
>  
>
>
>Hi Guys,
>
>
>Thanks for replying, the problem is whenever I use any -s flag I get the error 
>"Unexpected -s while processing Job-Specific Options:" 
>
>
>Also I'm not sure if this is supposed to be the output of -dt 
>
>
>sgeadmin@master:~$ dumbo ls /user/sgeadmin/text_cvb_document -hadoop starcluster
>Found 3 items
>-rw-r--r--   3 sgeadmin supergroup          0 2012-07-29 16:51 /user/sgeadmin/text_cvb_document/_SUCCESS
>drwxr-xr-x   - sgeadmin supergroup          0 2012-07-29 16:50 /user/sgeadmin/text_cvb_document/_logs
>-rw-r--r--   3 sgeadmin supergroup         97 2012-07-29 16:51 /user/sgeadmin/text_cvb_document/part-m-00000
>
>
>Should I be using a newer version of mahout? I've just been using the 0.7 
>distribution so far as apparently the compiled versions are missing parts that 
>the distributed ones have.
>
>
>Kind Regards,
>Folcon
>
>
>PS: Thanks for the help so far!
>
>
>On 29 July 2012 04:52, Jake Mannix <[email protected]> wrote:
>
>
>>
>>
>>On Sat, Jul 28, 2012 at 6:40 PM, DAN HELM <[email protected]> wrote:
>>
>>>Hi Folcon,
>>> 
>>>In the folder you specified for the -dt option of the cvb command
>>>there should be sequence files with the document-to-topic associations (Key:
>>>IntWritable, Value: VectorWritable).  
>>
>>
>>Yeah, this is correct, although this:
>>
>>
>>>You can dump in text format as: mahout seqdumper -s <sequence file>
>>>
>>
>>
>>is not as good as using vectordumper: 
>>
>>
>>   mahout vectordump -s <sequence file> --dictionary <path to dictionary.file-0> \
>>       --dictionaryType seqfile --vectorSize <num entries per topic you want to see> -sort
>>
>>
>>This joins your topic vectors with the dictionary, then picks out the top k 
>>terms (with their 
>>probabilities) for each topic and prints them to the console (or to the file 
>>you specify with 
>>an --output option).
>>
>>
>>*although* I notice now that in trunk when I just checked, VectorDumper.java 
>>had a bug 
>>in it for "vectorSize" - line 175 asks for cmdline option 
>>"numIndexesPerVector" not  
>>vectorSize, ack!  So I took the liberty of fixing that, but you'll need to 
>>"svn up" and rebuild
>>your jar before using vectordump like this. 
>>
>>>So in text output from seqdumper, the key is a document id and the vector
>>>contains the topics and their scores for the document.  I think
>>>all topics are listed for each
>>>document but many with near zero score.
>>>In my case I used rowid to convert keys of original sparse
>>>document vectors from Text to Integer before running cvb and this generates 
>>>a mapping file so I know the textual
>>>keys that correspond to the numeric document ids (since my original document 
>>>ids were file names and I created named vectors).
>>>Hope this helps.
>>>Dan
>>>
>>>
________________________________
>>>
>>> From: Folcon <[email protected]>
>>>To: [email protected]
>>>Sent: Saturday, July 28, 2012 8:28 PM
>>>Subject: Using Mahout to train an CVB and retrieve it's topics
>>>
>>>
>>>Hi Everyone,
>>>
>>>I'm posting this as my original message did not seem to appear on the mailing
>>>list, I'm very sorry if I have done this in error.
>>>
>>>I'm doing this to then use the topics to train a maxent algorithm to predict
>>>the classes of documents given their topic mixtures. Any further aid in this
>>>direction would be appreciated!
>>>
>>>I've been trying to extract the topics out of my run of cvb. Here's what I
>>>did so far.
>>>
>>>Ok, so I still don't know how to output the topics, but I have worked out
>>>how to get the cvb and what I think are the document vectors, however I'm not
>>>having any luck dumping them, so help here would still be appreciated!
>>>
>>>I set the values of:
>>>    export MAHOUT_HOME=/home/sgeadmin/mahout
>>>    export HADOOP_HOME=/usr/lib/hadoop
>>>    export JAVA_HOME=/usr/lib/jvm/java-6-openjdk
>>>    export HADOOP_CONF_DIR=$HADOOP_HOME/conf
>>>on the master otherwise none of this works.
>>>
>>>So first I uploaded the documents using starcluster's put:
>>>    starcluster put mycluster text_train /home/sgeadmin/
>>>    starcluster put mycluster text_test /home/sgeadmin/
>>>
>>>Then I added them to Hadoop's filesystem (HDFS):
>>>    dumbo put /home/sgeadmin/text_train /user/sgeadmin/ -hadoop starcluster
>>>
>>>Then I called Mahout's seqdirectory to turn the text into sequence files
>>>    $MAHOUT_HOME/bin/mahout seqdirectory --input /user/sgeadmin/text_train --output /user/sgeadmin/text_seq -c UTF-8 -ow
>>>
>>>Then I called Mahout's seq2sparse to turn them into vectors
>>>    $MAHOUT_HOME/bin/mahout seq2sparse -i text_seq -o /user/sgeadmin/text_vec -wt tf -a org.apache.lucene.analysis.WhitespaceAnalyzer -ow
>>>
>>>Finally I called cvb, I believe that the -dt flag states where the inferred
>>>topics should go, but because I haven't yet been able to dump them I can't
>>>confirm this.
>>>    $MAHOUT_HOME/bin/mahout cvb -i /user/sgeadmin/text_vec/tf-vectors -o /user/sgeadmin/text_lda -k 100 -nt 29536 -x 20 -dict /user/sgeadmin/text_vec/dictionary.file-0 -dt /user/sgeadmin/text_cvb_document -mt /user/sgeadmin/text_states
>>>
>>>The -k flag is the number of topics, the -nt flag is the size of the
>>>dictionary (I computed this by counting the number of entries in the
>>>dictionary.file-0 inside the vectors folder, in this case under
>>>/user/sgeadmin/text_vec), and -x is the number of iterations.
>>>
>>>If you know how to get what the document topic probabilities are from here,
>>>help would be most appreciated!
>>>
>>>Kind Regards,
>>>Folcon
>>
>>
>>
>>-- 
>>
>>
>>  -jake
>>
> 
>
>
