Hello,
I am using Mahout 0.9 and Hadoop 1.2.1. I've just run cvb on a bunch of
documents, and I can output the top N words per topic using vectordump.
What I don't understand is how to get the actual topics as strings. When
I do something like this:
./bin/mahout vectordump -i doc-topics/part-m-00000 -vs 10 -p true -d
/tmp/vectors/dictionary.file-0 -dt sequencefile -sort doc-topics/part-m-00000
where doc-topics was the argument to -dt in cvb, I get long strings like
this:
0
{0440:0.9939875638122325,1.4:0.001275482384898655,1.030:5.288568269923396E-4,0.70:4.4729086334168653E-4,1.1:1.4463856875947443E-4,0425:1.377610623100449E-4,0540:1.1816575291393311E-4,1.50:1.0024999339110217E-4,0400:9.126784636797175E-5,0623:9.1173928106031E-5}
and what I would like to get is the actual list of topics, as words. Is
there any way I could do that? A long and exhaustive Google search did not
turn up anything.
Thanks!
Natalia