Hi Charly,

I've been playing around with cvb, too. I have a few thoughts on b, vectordump:
What are you running vectordump on? If you're running it on your cvb output, you're getting something like a dictionary per topic, with <input-word-key>:<probability-it's-in-this-cluster>. If you're running it on the cvb-topics output, then for each document you're getting the likelihood that it belongs to each of your topics.

I wonder if your problem is that you read the same book I did, "Hadoop MapReduce Cookbook," which advises running vectordump with the dictionary flag set to your dictionary from s2s. Don't do that: it translates your document or topic keys as if they were your vocab keys, and the result is completely nonsensical.

Best,
Liz Merkhofer

On Fri, Aug 23, 2013 at 11:18 AM, Charly Lizarralde <[email protected]> wrote:

> Hi everyone, I am experimenting with cvb algorithm and I have a few
> questions....
>
> a) Is there any updated documentation? I have been collecting info from
> mail lists, blogs, etc. I have been writing a small beginner's tutorial, if
> you like I'll send it.
>
> b) Should I remove "stop-words" before building the feature vectors? I am
> having some trouble "reading" the results....
>
> c) Vectordump is not sorting well...is this a reported bug? (I am
> building mahout from trunk now)
>
> d) Any considerations on performance? It took 10 hours on a 5 node cluster
> and I've set 20 iterations on less than 10.000 docs and it took
>
> Thanks!
> Charly
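P.S. In case the dictionary mix-up is hard to picture, here is a toy sketch in plain Python. It is not Mahout code and does not call vectordump; the names `vocab`, `doc_topics`, and `label_with_dictionary` and all the numbers are made up for illustration. It assumes the dump simply looks up each vector key in the dictionary, which is how I understand the problem.

```python
# Toy illustration only; this is not Mahout code and all data is made up.

# Hypothetical vocabulary dictionary: vocabulary index -> word.
vocab = {0: "apple", 1: "banana", 2: "cherry"}

# Hypothetical per-document vector: topic index -> probability.
# The keys here are topic ids, NOT vocabulary indices.
doc_topics = {0: 0.7, 1: 0.2, 2: 0.1}


def label_with_dictionary(vector, dictionary):
    """Relabel every key through the dictionary (assumed dump behaviour)."""
    return {dictionary.get(key, key): value for key, value in vector.items()}


# Topic 0 now shows up as "apple", topic 1 as "banana", and so on.
mislabeled = label_with_dictionary(doc_topics, vocab)
print(mislabeled)
```

The output looks like word probabilities, but the "words" are only whichever vocab entries happen to share an index with your topic ids.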
