On Sat, Jul 28, 2012 at 6:40 PM, DAN HELM <[email protected]> wrote:

> Hi Folcon,
>
> In the folder you specified for the -dt option of the cvb command
> there should be sequence files with the document-to-topic associations
> (Key: IntWritable, Value: VectorWritable).


Yeah, this is correct, although this:


> You can dump in text format as: mahout seqdumper -s <sequence file>
>

is not as good as using vectordumper:

   mahout vectordump -s <sequence file> \
       --dictionary <path to dictionary.file-0> \
       --dictionaryType seqfile \
       --vectorSize <num entries per topic you want to see> -sort

This joins your topic vectors with the dictionary, then picks out the top k
terms (with their
probabilities) for each topic and prints them to the console (or to the
file you specify with
an --output option).
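
To make the join concrete, here is a rough Python sketch of what that
invocation does conceptually for a single topic. It is not Mahout code:
the function name top_terms, the toy dictionary, and the toy topic vector
are all made up for illustration, and the real tool reads both from
sequence files.

```python
import heapq

def top_terms(topic_vector, dictionary, k):
    """Return the k highest-weighted (term, weight) pairs for one topic.

    topic_vector: dict mapping term index -> weight (one row of the topic model)
    dictionary:   dict mapping term index -> term string
    """
    # Pick the k largest weights, then replace each index with its term.
    best = heapq.nlargest(k, topic_vector.items(), key=lambda kv: kv[1])
    return [(dictionary[idx], weight) for idx, weight in best]

# Toy data, purely hypothetical.
dictionary = {0: "hadoop", 1: "mahout", 2: "topic", 3: "vector"}
topic = {0: 0.05, 1: 0.40, 2: 0.35, 3: 0.20}

print(top_terms(topic, dictionary, 2))
```

Here k plays the role of the --vectorSize value, and the descending order
is what the sort option gives you.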

*although* I notice now, having just checked trunk, that VectorDumper.java
had a bug in its handling of "vectorSize": line 175 asks for the cmdline
option "numIndexesPerVector", not "vectorSize", ack!  So I took the liberty
of fixing that, but you'll need to "svn up" and rebuild your jar before
using vectordump like this.


> So in the text output from seqdumper, the key is a document id and the
> vector contains the topics and their scores for that document.  I think
> all topics are listed for each document, but many have near-zero scores.
> In my case I used rowid to convert the keys of the original sparse
> document vectors from Text to Integer before running cvb. This
> generates a mapping file, so I know the textual keys that correspond to
> the numeric document ids (my original document ids were file names and
> I created named vectors).
> Hope this helps.
> Dan
>
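
If you want the document/topic scores in a form you can feed to a
classifier, something like the following Python sketch could turn the text
dump into one dense score list per document. The line shape it parses
("Key: <doc id>: Value: {<topic>:<score>,...}") is my assumption about what
the dump looks like, so check it against your actual output; the names
parse_doc_topics and doc_names, and the sample lines, are hypothetical.

```python
import re

# Assumed line shape: "Key: <doc id>: Value: {<topic>:<score>,...}"
LINE_RE = re.compile(r"^Key: (\d+): Value: \{(.*)\}\s*$")

def parse_doc_topics(lines, num_topics):
    """Parse dumped lines into {doc_id: [score for each topic]}."""
    docs = {}
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skip anything that is not a record line
        doc_id = int(match.group(1))
        scores = [0.0] * num_topics  # topics missing from the dump stay 0.0
        body = match.group(2)
        if body:
            for pair in body.split(","):
                topic, score = pair.split(":")
                scores[int(topic)] = float(score)
        docs[doc_id] = scores
    return docs

# Made-up sample of dump output for a 3-topic model.
dump = [
    "Key class: class org.apache.hadoop.io.IntWritable",
    "Key: 0: Value: {0:0.9,1:0.05,2:0.05}",
    "Key: 1: Value: {0:1.0E-4,2:0.9999}",
    "Count: 2",
]
# Hypothetical id -> file name mapping, as recovered from the rowid mapping file.
doc_names = {0: "a.txt", 1: "b.txt"}

topics = parse_doc_topics(dump, num_topics=3)
for doc_id, scores in sorted(topics.items()):
    print(doc_names[doc_id], scores)
```

Each list has one slot per topic, so the rows line up as feature vectors
even when a topic is absent from a document's dumped vector.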
> ________________________________
>  From: Folcon <[email protected]>
> To: [email protected]
> Sent: Saturday, July 28, 2012 8:28 PM
> Subject: Using Mahout to train an CVB and retrieve it's topics
>
> Hi Everyone,
>
> I'm posting this as my original message did not seem to appear on the
> mailing list; I'm very sorry if I have done this in error.
>
> I'm doing this to then use the topics to train a maxent algorithm to
> predict the
> classes of documents given their topic mixtures. Any further aid in this
> direction would be appreciated!
>
> I've been trying to extract the topics out of my run of cvb. Here's what I
> did
> so far.
>
> Ok, so I still don't know how to output the topics, but I have worked out
> how to
> get the cvb and what I think are the document vectors, however I'm not
> having
> any luck dumping them, so help here would still be appreciated!
>
> I set the values of:
>     export MAHOUT_HOME=/home/sgeadmin/mahout
>     export HADOOP_HOME=/usr/lib/hadoop
>     export JAVA_HOME=/usr/lib/jvm/java-6-openjdk
>     export HADOOP_CONF_DIR=$HADOOP_HOME/conf
> on the master otherwise none of this works.
>
> So first I uploaded the documents using starcluster's put:
>     starcluster put mycluster text_train /home/sgeadmin/
>     starcluster put mycluster text_test /home/sgeadmin/
>
> Then I added them to Hadoop's filesystem (HDFS):
>     dumbo put /home/sgeadmin/text_train /user/sgeadmin/ -hadoop starcluster
>
> Then I called Mahout's seqdirectory to turn the text into sequence files:
>     $MAHOUT_HOME/bin/mahout seqdirectory --input /user/sgeadmin/text_train \
>         --output /user/sgeadmin/text_seq -c UTF-8 -ow
>
> Then I called Mahout's seq2sparse to turn them into vectors:
>     $MAHOUT_HOME/bin/mahout seq2sparse -i text_seq -o /user/sgeadmin/text_vec \
>         -wt tf -a org.apache.lucene.analysis.WhitespaceAnalyzer -ow
>
> Finally I called cvb. I believe that the -dt flag states where the inferred
> topics should go, but because I haven't yet been able to dump them I can't
> confirm this.
>     $MAHOUT_HOME/bin/mahout cvb -i /user/sgeadmin/text_vec/tf-vectors \
>         -o /user/sgeadmin/text_lda -k 100 -nt 29536 -x 20 \
>         -dict /user/sgeadmin/text_vec/dictionary.file-0 \
>         -dt /user/sgeadmin/text_cvb_document \
>         -mt /user/sgeadmin/text_states
>
> The -k flag is the number of topics and the -nt flag is the size of the
> dictionary, which I computed by counting the number of entries in the
> dictionary.file-0 inside the vectors directory (in this case under
> /user/sgeadmin/text_vec); -x is the number of iterations.
>
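
For the -nt count, one way to avoid counting by hand is to dump the
dictionary to text and count the record lines. A minimal Python sketch,
assuming each record is dumped as a line of the form
"Key: <term>: Value: <index>" (verify this against your own dump; the
function name and sample lines are made up):

```python
def count_dictionary_entries(lines):
    """Count record lines in a text dump of the dictionary.

    Only lines starting with "Key: " are counted, so header lines such as
    "Key class: ..." and a trailing "Count: N" line are ignored.
    """
    return sum(1 for line in lines if line.startswith("Key: "))

# Made-up sample of a dumped dictionary.
dump = [
    "Key class: class org.apache.hadoop.io.Text",
    "Key: hadoop: Value: 0",
    "Key: mahout: Value: 1",
    "Key: topic: Value: 2",
    "Count: 3",
]

print(count_dictionary_entries(dump))
```

The result is the value you would pass as -nt.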
> If you know how to get what the document topic probabilities are from
> here, help
> would be most appreciated!
>
> Kind Regards,
> Folcon
>



-- 

  -jake
