Understanding the output of cvb

Natalia Connolly Thu, 20 Mar 2014 13:09:51 -0700

Hello,

   I am using Mahout 0.9 and Hadoop 1.2.1.  I've just run cvb on a bunch of
documents, and I can output the top N words per topic using vectordump.
What I don't understand is how to get the actual topics as strings.  When
I do something like this:


./bin/mahout vectordump -i doc-topics/part-m-00000 -vs 10 -p true -d
/tmp/vectors/dictionary.file-0 -dt sequencefile -sort doc-topics/part-m-00000

   where doc-topics was the argument to -dt in cvb, I get long strings like
this:

0
{0440:0.9939875638122325,1.4:0.001275482384898655,1.030:5.288568269923396E-4,0.70:4.4729086334168653E-4,1.1:1.4463856875947443E-4,0425:1.377610623100449E-4,0540:1.1816575291393311E-4,1.50:1.0024999339110217E-4,0400:9.126784636797175E-5,0623:9.1173928106031E-5}

  and what I would like to get is the actual list of topics, as words.  Is
there any way I could do that?  A long and exhaustive Google search did not
show anything.

    Thanks!

    Natalia
