Charly,

If the documentation isn't clear, I would look at the CVB example in examples/bin/cluster_reuters.sh for the correct sequence of steps and the various parameters.
________________________________
From: Charly Lizarralde <[email protected]>
To: [email protected]
Sent: Friday, August 23, 2013 11:45 AM
Subject: Re: lda + vector dump

I think I am doing it on the cvb output ( 1 record per topic ) so dictionary is used to output the topic most relevant terms....but I'll check!

On Fri, Aug 23, 2013 at 12:37 PM, Liz Merkhofer <[email protected]> wrote:

> Hi Charly,
>
> I've been playing around with cvb, too. I have a few thoughts on b,
> vectordump:
>
> What are you doing vectordump on? If you're doing it on your cvb output,
> you're getting something like a dictionary per topic, with
> <input-word-key>:<probability-it's-in-this-cluster>. If you're doing it on
> cvb-topics output, for each document, you're getting the likelihood that it
> belongs to each of your topics.
>
> I wonder if your problem is that you read the same book I did, "Hadoop
> MapReduce Cookbook," that advised to use vectordump with the dictionary
> flag as your dictionary from s2s. Don't do that - that translates your
> document or topic keys as if they were your vocab keys, and it's just
> completely nonsensical.
>
> Best,
> Liz Merkhofer
>
> On Fri, Aug 23, 2013 at 11:18 AM, Charly Lizarralde <[email protected]> wrote:
>
> > Hi everyone, I am experimenting with cvb algorithm and I have a few
> > questions....
> >
> > a) Is there any updated documentation? I have been collecting info from
> > mail lists, blogs, etc. I have been writing a small beginner's tutorial, if
> > you like I'll send it.
> >
> > b) Should I remove "stop-words" before building the feature vectors ? I am
> > having some trouble "reading" the results....
> >
> > c) Vectordump is not sorting well...is this a reported bug ? ( I am
> > building mahout from trunk now )
> >
> > d) Any considerations on performance? It took 10 hours on a 5 node cluster
> > and I've set 20 iterations on less than 10.000 docs and it took
> >
> > Thanks!
> > Charly
