On Sun, Jul 29, 2012 at 10:08 AM, Folcon Red <[email protected]> wrote:
> Hi Guys,
>
> Thanks for replying, the problem is whenever I use any -s flag I get the
> error "Unexpected -s while processing Job-Specific Options:"

-s is the old way of specifying input (short for "sequencefile"); it's now
--input or -i.

> Also I'm not sure if this is supposed to be the output of -dt:
>
> sgeadmin@master:~$ dumbo ls /user/sgeadmin/text_cvb_document -hadoop starcluster
> Found 3 items
> -rw-r--r--   3 sgeadmin supergroup   0 2012-07-29 16:51 /user/sgeadmin/text_cvb_document/_SUCCESS
> drwxr-xr-x   - sgeadmin supergroup   0 2012-07-29 16:50 /user/sgeadmin/text_cvb_document/_logs
> -rw-r--r--   3 sgeadmin supergroup  97 2012-07-29 16:51 /user/sgeadmin/text_cvb_document/part-m-00000
>
> Should I be using a newer version of Mahout? I've just been using the 0.7
> distribution so far, as apparently the compiled versions are missing parts
> that the distributed ones have.
>
> Kind Regards,
> Folcon
>
> PS: Thanks for the help so far!
>
> On 29 July 2012 04:52, Jake Mannix <[email protected]> wrote:
>
>> On Sat, Jul 28, 2012 at 6:40 PM, DAN HELM <[email protected]> wrote:
>>
>>> Hi Folcon,
>>>
>>> In the folder you specified for the -dt option of the cvb command,
>>> there should be sequence files with the document-to-topic associations
>>> (Key: IntWritable, Value: VectorWritable).
>>
>> Yeah, this is correct, although this:
>>
>>> You can dump in text format as: mahout seqdumper -s <sequence file>
>>
>> is not as good as using vectordump:
>>
>>   mahout vectordump -s <sequence file> \
>>     --dictionary <path to dictionary.file-0> \
>>     --dictionaryType seqfile \
>>     --vectorSize <num entries per topic you want to see> -sort
>>
>> This joins your topic vectors with the dictionary, then picks out the top
>> k terms (with their probabilities) for each topic and prints them to the
>> console (or to the file you specify with an --output option).
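To make the join-and-top-k step concrete, here is a toy shell sketch of what
that amounts to. The plain-text file formats below are invented purely for
illustration -- vectordump itself reads binary SequenceFiles, not files like
these:

```shell
# Toy stand-ins for one topic's term distribution and the dictionary.
# These plain-text formats are invented for illustration; real Mahout
# topic models and dictionaries are binary SequenceFiles.
cat > topic0.txt <<'EOF'
0 0.01
1 0.62
2 0.05
3 0.30
EOF
cat > dict.txt <<'EOF'
0 apple
1 banana
2 cherry
3 damson
EOF

# Join term ids with their words, sort by probability descending, and
# keep the top 2 -- roughly the effect of --dictionary, -sort, and
# --vectorSize applied per topic.
join topic0.txt dict.txt | sort -k2 -rn | head -n 2
```

The last line prints the two most probable terms for the toy topic, each with
its probability, which is the shape of output you should expect from
vectordump per topic.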
>>
>> *although* I notice now that in trunk, when I just checked,
>> VectorDumper.java had a bug in it for "vectorSize": line 175 asks for the
>> cmdline option "numIndexesPerVector", not vectorSize, ack! So I took the
>> liberty of fixing that, but you'll need to "svn up" and rebuild your jar
>> before using vectordump like this.
>>
>>> So in the text output from seqdumper, the key is a document id and the
>>> vector contains the topics and the scores associated with that
>>> document. I think all topics are listed for each document, but many
>>> with a near-zero score.
>>>
>>> In my case I used rowid to convert the keys of the original sparse
>>> document vectors from Text to Integer before running cvb. This
>>> generates a mapping file, so I know the textual keys that correspond to
>>> the numeric document ids (since my original document ids were file
>>> names and I created named vectors).
>>>
>>> Hope this helps.
>>>
>>> Dan
>>>
>>> ________________________________
>>>
>>> From: Folcon <[email protected]>
>>> To: [email protected]
>>> Sent: Saturday, July 28, 2012 8:28 PM
>>> Subject: Using Mahout to train a CVB and retrieve its topics
>>>
>>> Hi Everyone,
>>>
>>> I'm posting this as my original message did not seem to appear on the
>>> mailing list; I'm very sorry if I have done this in error.
>>>
>>> I'm doing this to then use the topics to train a maxent algorithm to
>>> predict the classes of documents given their topic mixtures. Any
>>> further aid in this direction would be appreciated!
>>>
>>> I've been trying to extract the topics out of my run of cvb. Here's
>>> what I've done so far.
>>>
>>> Ok, so I still don't know how to output the topics, but I have worked
>>> out how to run cvb and get what I think are the document vectors.
>>> However, I'm not having any luck dumping them, so help here would still
>>> be appreciated!
>>>
>>> I set the values of:
>>>   export MAHOUT_HOME=/home/sgeadmin/mahout
>>>   export HADOOP_HOME=/usr/lib/hadoop
>>>   export JAVA_HOME=/usr/lib/jvm/java-6-openjdk
>>>   export HADOOP_CONF_DIR=$HADOOP_HOME/conf
>>> on the master, otherwise none of this works.
>>>
>>> So first I uploaded the documents using StarCluster's put:
>>>   starcluster put mycluster text_train /home/sgeadmin/
>>>   starcluster put mycluster text_test /home/sgeadmin/
>>>
>>> Then I added them to Hadoop's filesystem (HDFS):
>>>   dumbo put /home/sgeadmin/text_train /user/sgeadmin/ -hadoop starcluster
>>>
>>> Then I called Mahout's seqdirectory to turn the text into sequence
>>> files:
>>>   $MAHOUT_HOME/bin/mahout seqdirectory --input /user/sgeadmin/text_train \
>>>     --output /user/sgeadmin/text_seq -c UTF-8 -ow
>>>
>>> Then I called Mahout's seq2sparse to turn them into vectors:
>>>   $MAHOUT_HOME/bin/mahout seq2sparse -i text_seq -o /user/sgeadmin/text_vec \
>>>     -wt tf -a org.apache.lucene.analysis.WhitespaceAnalyzer -ow
>>>
>>> Finally I called cvb. I believe the -dt flag states where the inferred
>>> document/topic distributions should go, but because I haven't yet been
>>> able to dump them I can't confirm this.
>>>   $MAHOUT_HOME/bin/mahout cvb -i /user/sgeadmin/text_vec/tf-vectors \
>>>     -o /user/sgeadmin/text_lda -k 100 -nt 29536 -x 20 \
>>>     -dict /user/sgeadmin/text_vec/dictionary.file-0 \
>>>     -dt /user/sgeadmin/text_cvb_document -mt /user/sgeadmin/text_states
>>>
>>> The -k flag is the number of topics and -nt is the size of the
>>> dictionary; I computed -nt by counting the number of entries in
>>> dictionary.file-0 inside the vectors directory (in this case
>>> /user/sgeadmin/text_vec). -x is the number of iterations.
>>>
>>> If you know how to get the document topic probabilities from here,
>>> help would be most appreciated!
>>>
>>> Kind Regards,
>>> Folcon
>>
>>
>> --
>>   -jake

--
  -jake
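A small follow-up on computing -nt: rather than counting dictionary entries
by hand, you can count them from seqdumper's text output. The sketch below
uses a fake two-entry dump file in place of a real seqdumper run, and the
one-entry-per-"Key:"-line format is an assumption worth checking against your
Mahout version:

```shell
# Stand-in for the text that `mahout seqdumper` prints for a dictionary
# file (format assumed here; verify against your Mahout version).
cat > dict_dump.txt <<'EOF'
Key: 0: Value: apple
Key: 1: Value: banana
EOF

# Each dictionary entry occupies one "Key:" line, so the entry count
# (the value to pass to cvb as -nt) is:
grep -c '^Key:' dict_dump.txt
```

With real data you would pipe the dump straight through instead of using a
file, along the lines of `$MAHOUT_HOME/bin/mahout seqdumper -i
/user/sgeadmin/text_vec/dictionary.file-0 | grep -c '^Key:'` (using -i for
input, per Jake's note above).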
