Folcon, I'm still using Mahout 0.6, so I don't know much about the changes in
0.7. Your output folder for "-dt" looks correct. The relevant data would be in
/user/sgeadmin/text_cvb_document/part-m-00000, which is what I would pass to
the "-s" option. But I see its size is only 97 bytes, which looks suspicious.

So for starters you can just view the file as:

  mahout seqdumper -s /user/sgeadmin/text_cvb_document/part-m-00000

The vectordump command (as Jake pointed out) has a lot more options to
post-process the data, but you may want to first just see what is in that
file.

Dan
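[Editor's note] If seqdumper runs successfully, its text output for the -dt folder should have lines of the shape `Key: <docid>: Value: {topic:score,...}` (one per document). A quick way to pick each document's highest-scoring topic is sketched below. This is a hedged sketch: the sample `doctopics.txt` lines are fabricated stand-ins, not Folcon's data, and the exact seqdumper line layout is assumed from Mahout 0.6-era output. (Also, if 0.7 rejects `-s` with "Unexpected -s", the input flag may have been renamed to `-i` in that release; worth trying.)

```shell
# Fabricated stand-in for seqdumper text output; the real file would come from
# something like:
#   mahout seqdumper -s /user/sgeadmin/text_cvb_document/part-m-00000 -o doctopics.txt
cat > doctopics.txt <<'EOF'
Key: 0: Value: {0:0.91,1:0.05,2:0.04}
Key: 1: Value: {0:0.02,1:0.88,2:0.10}
EOF

# Print the highest-probability topic for each document.
awk '/^Key:/ {
  docid = $2; sub(/:$/, "", docid)          # field 2 is "0:" -> "0"
  split($0, parts, "Value: ")
  vec = parts[2]; gsub(/[{}]/, "", vec)     # strip braces around the vector
  n = split(vec, pairs, ",")
  best = ""; bestp = -1
  for (i = 1; i <= n; i++) {                # scan topic:score pairs
    split(pairs[i], kv, ":")
    if (kv[2] + 0 > bestp) { bestp = kv[2] + 0; best = kv[1] }
  }
  print "doc " docid " -> topic " best
}' doctopics.txt > top_topics.txt
cat top_topics.txt
```

For the sample above this prints `doc 0 -> topic 0` and `doc 1 -> topic 1`; the same one-liner would work on a real dump once the file format is confirmed.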
________________________________
From: Folcon Red <[email protected]>
To: Jake Mannix <[email protected]>
Cc: [email protected]; DAN HELM <[email protected]>
Sent: Sunday, July 29, 2012 1:08 PM
Subject: Re: Using Mahout to train a CVB and retrieve its topics

Hi Guys,

Thanks for replying. The problem is that whenever I use any -s flag I get the
error "Unexpected -s while processing Job-Specific Options:".

Also, I'm not sure if this is supposed to be the output of -dt:

  sgeadmin@master:~$ dumbo ls /user/sgeadmin/text_cvb_document -hadoop starcluster
  Found 3 items
  -rw-r--r--   3 sgeadmin supergroup    0 2012-07-29 16:51 /user/sgeadmin/text_cvb_document/_SUCCESS
  drwxr-xr-x   - sgeadmin supergroup    0 2012-07-29 16:50 /user/sgeadmin/text_cvb_document/_logs
  -rw-r--r--   3 sgeadmin supergroup   97 2012-07-29 16:51 /user/sgeadmin/text_cvb_document/part-m-00000

Should I be using a newer version of Mahout? I've just been using the 0.7
distribution so far, as apparently the self-compiled versions are missing
parts that the distributed ones have.

Kind Regards,
Folcon

PS: Thanks for the help so far!

On 29 July 2012 04:52, Jake Mannix <[email protected]> wrote:
>
> On Sat, Jul 28, 2012 at 6:40 PM, DAN HELM <[email protected]> wrote:
>
>> Hi Folcon,
>>
>> In the folder you specified for the -dt option of the cvb command there
>> should be sequence files with the document-to-topic associations
>> (Key: IntWritable, Value: VectorWritable).
>
> Yeah, this is correct, although this:
>
>> You can dump in text format as: mahout seqdumper -s <sequence file>
>
> is not as good as using vectordump:
>
>   mahout vectordump -s <sequence file> --dictionary <path to dictionary.file-0> \
>     --dictionaryType seqfile --vectorSize <num entries per topic you want to see> -sort
>
> This joins your topic vectors with the dictionary, then picks out the top k
> terms (with their probabilities) for each topic and prints them to the
> console (or to the file you specify with an --output option).
> *although* I notice now that in trunk, when I just checked, VectorDumper.java
> had a bug in it for "vectorSize" - line 175 asks for the cmdline option
> "numIndexesPerVector", not vectorSize, ack! So I took the liberty of fixing
> that, but you'll need to "svn up" and rebuild your jar before using
> vectordump like this.
>
>> So in the text output from seqdumper, the key is a document id and the
>> vector contains the topics and scores associated with that document. I
>> think all topics are listed for each document, but many with near-zero
>> score.
>>
>> In my case I used rowid to convert the keys of the original sparse document
>> vectors from Text to Integer before running cvb. This generates a mapping
>> file, so I know the textual keys that correspond to the numeric document
>> ids (since my original document ids were file names and I created named
>> vectors).
>>
>> Hope this helps.
>> Dan
>>
>> ________________________________
>> From: Folcon <[email protected]>
>> To: [email protected]
>> Sent: Saturday, July 28, 2012 8:28 PM
>> Subject: Using Mahout to train a CVB and retrieve its topics
>>
>> Hi Everyone,
>>
>> I'm posting this as my original message did not seem to appear on the
>> mailing list; I'm very sorry if I have done this in error.
>>
>> I'm doing this to then use the topics to train a maxent algorithm to
>> predict the classes of documents given their topic mixtures. Any further
>> aid in this direction would be appreciated!
>>
>> I've been trying to extract the topics out of my run of cvb. Here's what I
>> have done so far.
>>
>> Ok, so I still don't know how to output the topics, but I have worked out
>> how to get the cvb output and what I think are the document vectors.
>> However, I'm not having any luck dumping them, so help here would still be
>> appreciated!
>> I set the values of:
>>   export MAHOUT_HOME=/home/sgeadmin/mahout
>>   export HADOOP_HOME=/usr/lib/hadoop
>>   export JAVA_HOME=/usr/lib/jvm/java-6-openjdk
>>   export HADOOP_CONF_DIR=$HADOOP_HOME/conf
>> on the master, otherwise none of this works.
>>
>> So first I uploaded the documents using starcluster's put:
>>   starcluster put mycluster text_train /home/sgeadmin/
>>   starcluster put mycluster text_test /home/sgeadmin/
>>
>> Then I added them to Hadoop's HDFS filesystem:
>>   dumbo put /home/sgeadmin/text_train /user/sgeadmin/ -hadoop starcluster
>>
>> Then I called Mahout's seqdirectory to turn the text into sequence files:
>>   $MAHOUT_HOME/bin/mahout seqdirectory --input /user/sgeadmin/text_train \
>>     --output /user/sgeadmin/text_seq -c UTF-8 -ow
>>
>> Then I called Mahout's seq2sparse to turn them into vectors:
>>   $MAHOUT_HOME/bin/mahout seq2sparse -i text_seq -o /user/sgeadmin/text_vec \
>>     -wt tf -a org.apache.lucene.analysis.WhitespaceAnalyzer -ow
>>
>> Finally I called cvb. I believe the -dt flag states where the inferred
>> topics should go, but because I haven't yet been able to dump them I can't
>> confirm this.
>>   $MAHOUT_HOME/bin/mahout cvb -i /user/sgeadmin/text_vec/tf-vectors \
>>     -o /user/sgeadmin/text_lda -k 100 -nt 29536 -x 20 \
>>     -dict /user/sgeadmin/text_vec/dictionary.file-0 \
>>     -dt /user/sgeadmin/text_cvb_document -mt /user/sgeadmin/text_states
>>
>> The -k flag is the number of topics, the -nt flag is the size of the
>> dictionary (I computed this by counting the number of entries in
>> dictionary.file-0 inside the vectors folder, in this case under
>> /user/sgeadmin/text_vec), and -x is the number of iterations.
>>
>> If you know how to get the document topic probabilities from here, help
>> would be most appreciated!
>>
>> Kind Regards,
>> Folcon
>
>
> --
>   -jake
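[Editor's note] On computing the value for -nt: rather than counting dictionary entries by hand, the size can be read off a seqdumper dump of dictionary.file-0, since each entry appears as one `Key: <term>: Value: <index>` line. The sketch below is hedged: the `dict.txt` contents are fabricated stand-ins for such a dump, and the key/value layout is assumed from Mahout's seqdumper text output.

```shell
# In practice you would first dump the dictionary to local text, e.g.:
#   mahout seqdumper -s /user/sgeadmin/text_vec/dictionary.file-0 -o dict.txt
# Fabricated stand-in for such a dump (term -> integer index):
cat > dict.txt <<'EOF'
Key: hello: Value: 0
Key: world: Value: 1
Key: topic: Value: 2
EOF

# Each "Key:" line is one dictionary entry; the count is what gets passed
# as -nt to cvb.
grep -c '^Key:' dict.txt > nt.txt
cat nt.txt
```

For this sample the count is 3; on a real dump the number should match the largest index plus one, which is a useful sanity check before running cvb.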
