Hi guys,

Thanks for replying. The problem is that whenever I use any -s flag I get the error "Unexpected -s while processing Job-Specific Options:".
Also, I'm not sure if this is supposed to be the output of -dt:

sgeadmin@master:~$ dumbo ls /user/sgeadmin/text_cvb_document -hadoop starcluster
Found 3 items
-rw-r--r--   3 sgeadmin supergroup     0 2012-07-29 16:51 /user/sgeadmin/text_cvb_document/_SUCCESS
drwxr-xr-x   - sgeadmin supergroup     0 2012-07-29 16:50 /user/sgeadmin/text_cvb_document/_logs
-rw-r--r--   3 sgeadmin supergroup    97 2012-07-29 16:51 /user/sgeadmin/text_cvb_document/part-m-00000

Should I be using a newer version of Mahout? I've just been using the 0.7 distribution so far, as apparently the compiled versions are missing parts that the distributed ones have.

Kind Regards,
Folcon

PS: Thanks for the help so far!

On 29 July 2012 04:52, Jake Mannix <[email protected]> wrote:
>
> On Sat, Jul 28, 2012 at 6:40 PM, DAN HELM <[email protected]> wrote:
>
>> Hi Folcon,
>>
>> In the folder you specified for the -dt option of the cvb command there
>> should be sequence files with the document-to-topic associations
>> (Key: IntWritable, Value: VectorWritable).
>
> Yeah, this is correct, although this:
>
>> You can dump in text format as: mahout seqdumper -s <sequence file>
>
> is not as good as using vectordump:
>
> mahout vectordump -s <sequence file> --dictionary <path to dictionary.file-0> \
>     --dictionaryType seqfile --vectorSize <num entries per topic you want to see> -sort
>
> This joins your topic vectors with the dictionary, then picks out the top
> k terms (with their probabilities) for each topic and prints them to the
> console (or to the file you specify with an --output option).
>
> *although* I notice now that in trunk, when I just checked,
> VectorDumper.java had a bug in it for "vectorSize": line 175 asks for the
> command-line option "numIndexesPerVector", not vectorSize, ack! So I took
> the liberty of fixing that, but you'll need to "svn up" and rebuild your
> jar before using vectordump like this.
>
>
>> So in the text output from seqdumper, the key is a document id and the
>> vector contains the topics and the scores associated with the document.
>> I think all topics are listed for each document, but many with a
>> near-zero score.
>>
>> In my case I used rowid to convert the keys of the original sparse
>> document vectors from Text to Integer before running cvb. This generates
>> a mapping file, so I know the textual keys that correspond to the
>> numeric document ids (since my original document ids were file names and
>> I created named vectors).
>>
>> Hope this helps.
>>
>> Dan
>>
>> ________________________________
>> From: Folcon <[email protected]>
>> To: [email protected]
>> Sent: Saturday, July 28, 2012 8:28 PM
>> Subject: Using Mahout to train a CVB model and retrieve its topics
>>
>> Hi everyone,
>>
>> I'm posting this as my original message did not seem to appear on the
>> mailing list; I'm very sorry if I have done this in error.
>>
>> I'm doing this to then use the topics to train a maxent algorithm to
>> predict the classes of documents given their topic mixtures. Any further
>> aid in this direction would be appreciated!
>>
>> I've been trying to extract the topics out of my run of cvb. Here's what
>> I've done so far.
>>
>> OK, so I still don't know how to output the topics, but I have worked
>> out how to get the cvb model and what I think are the document vectors.
>> However, I'm not having any luck dumping them, so help here would still
>> be appreciated!
>>
>> I set the following on the master, otherwise none of this works:
>>
>> export MAHOUT_HOME=/home/sgeadmin/mahout
>> export HADOOP_HOME=/usr/lib/hadoop
>> export JAVA_HOME=/usr/lib/jvm/java-6-openjdk
>> export HADOOP_CONF_DIR=$HADOOP_HOME/conf
>>
>> So first I uploaded the documents using starcluster's put:
>>
>> starcluster put mycluster text_train /home/sgeadmin/
>> starcluster put mycluster text_test /home/sgeadmin/
>>
>> Then I added them to Hadoop's filesystem (HDFS):
>>
>> dumbo put /home/sgeadmin/text_train /user/sgeadmin/ -hadoop starcluster
>>
>> Then I called Mahout's seqdirectory to turn the text into sequence
>> files:
>>
>> $MAHOUT_HOME/bin/mahout seqdirectory --input /user/sgeadmin/text_train \
>>     --output /user/sgeadmin/text_seq -c UTF-8 -ow
>>
>> Then I called Mahout's seq2sparse to turn them into vectors:
>>
>> $MAHOUT_HOME/bin/mahout seq2sparse -i text_seq -o /user/sgeadmin/text_vec \
>>     -wt tf -a org.apache.lucene.analysis.WhitespaceAnalyzer -ow
>>
>> Finally I called cvb. I believe the -dt flag states where the inferred
>> topics should go, but because I haven't yet been able to dump them I
>> can't confirm this:
>>
>> $MAHOUT_HOME/bin/mahout cvb -i /user/sgeadmin/text_vec/tf-vectors \
>>     -o /user/sgeadmin/text_lda -k 100 -nt 29536 -x 20 \
>>     -dict /user/sgeadmin/text_vec/dictionary.file-0 \
>>     -dt /user/sgeadmin/text_cvb_document -mt /user/sgeadmin/text_states
>>
>> The -k flag is the number of topics and -x is the number of iterations.
>> The -nt flag is the size of the dictionary; I computed it by counting
>> the entries in dictionary.file-0 inside the vectors directory (in this
>> case under /user/sgeadmin/text_vec).
>>
>> If you know how to get the document-topic probabilities from here, help
>> would be most appreciated!
>>
>> Kind Regards,
>> Folcon
>
>
> --
>
>   -jake
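[Editor's note] A sketch that may help with the -nt computation described above: counting dictionary entries via a seqdumper text dump. The input file below is synthetic, and the one-"Key:"-line-per-entry layout is an assumption about seqdumper's text output, not something verified against the 0.7 release.

```shell
# Synthetic seqdumper-style dump of a dictionary file; the assumption is
# that seqdumper prints exactly one "Key: ..." line per dictionary entry.
cat > dict_dump.txt <<'EOF'
Key: hello: Value: 0
Key: topic: Value: 1
Key: model: Value: 2
EOF

# Count the entries; the result is the value to pass to cvb as -nt.
grep -c '^Key:' dict_dump.txt
```

On a real cluster the equivalent would be piping `$MAHOUT_HOME/bin/mahout seqdumper -i /user/sgeadmin/text_vec/dictionary.file-0` into the same `grep -c`. Note the input flag: my reading is that 0.7's seqdumper takes `-i/--input` where older releases used `-s`, which would also explain the "Unexpected -s" error at the top of the thread; worth double-checking against your installed version.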

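[Editor's note] Once the -dt dump works, pulling out each document's top topic from the text output can be sketched with awk. Again, the input is synthetic: the `Key: <id>: Value: {topic:prob,...}` layout is an assumption about how seqdumper renders IntWritable/VectorWritable pairs, so the field handling may need adjusting against real output.

```shell
# Hypothetical seqdumper-style doc-topic dump; the exact layout is an
# assumption, so adjust the parsing to match your actual output.
cat > doc_topics.txt <<'EOF'
Key: 0: Value: {0:0.01,1:0.85,2:0.14}
Key: 1: Value: {0:0.70,1:0.05,2:0.25}
EOF

# For each document line, pick the topic with the highest probability.
cat > top_topic.awk <<'EOF'
/^Key:/ {
    split($0, halves, "Value: ")
    split(halves[1], k, " ")          # k[2] is the doc id plus a trailing colon
    sub(/:[[:space:]]*$/, "", k[2])
    gsub(/[{}]/, "", halves[2])       # strip the braces around the vector
    n = split(halves[2], pairs, ",")
    best = ""; bestp = -1
    for (i = 1; i <= n; i++) {
        split(pairs[i], kv, ":")
        if (kv[2] + 0 > bestp) { bestp = kv[2] + 0; best = kv[1] }
    }
    print "doc " k[2] " -> topic " best
}
EOF

awk -f top_topic.awk doc_topics.txt
```

Printing the full sorted list instead of only the argmax, or thresholding on `bestp`, is a small extension if you want to drop the near-zero topics Dan mentioned.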