Charly,

The documents I've used for CVB are quite noisy, so removing stopwords is key.
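To make the idea of pruning terms by document frequency concrete (the idea behind the --minDF and --maxDFPercent options mentioned below), here is a minimal Python sketch. It is not Mahout's code: prune_by_df and its parameter names are made up for illustration, and the exact threshold semantics in seq2sparse may differ.

```python
from collections import Counter


def prune_by_df(docs, min_df=2, max_df_percent=50):
    """Return the set of terms kept after document-frequency pruning.

    Hypothetical helper for illustration only; not part of Mahout.
    """
    n_docs = len(docs)
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # count each term at most once per document
    return {
        term
        for term, count in df.items()
        # drop very rare terms and terms that appear in too many documents
        if count >= min_df and 100.0 * count / n_docs <= max_df_percent
    }


docs = [
    "the topic model groups the words".split(),
    "the cluster ran for the night".split(),
    "the topic words look noisy".split(),
    "the cluster needs more memory".split(),
]
kept = prune_by_df(docs, min_df=2, max_df_percent=50)
print(sorted(kept))  # ['cluster', 'topic', 'words']
```

"the" appears in every document and is dropped by the upper bound; terms seen in only one document are dropped by the lower bound. A smaller vocabulary also means smaller vectors going into CVB.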
To remove the stopwords, seq2sparse provides a few options for the analyzer to choose what the stopwords are:

--minDF
--maxDFPercent

You can write your own sequenceFile parser, which will give you more control, but that's a lot more work.

I believe removing extra "stopwords" has performance implications in terms of memory utilization of the CVB algorithm. It can mean a drastic reduction in memory allocated.

As for using vectordump: some of the output files are more easily read with 'mahout seqdumper'.

- Corey

On Fri, Aug 23, 2013 at 10:24 AM, Suneel Marthi <[email protected]> wrote:

> Charly,
>
> If the documentation isn't clear, I would look at the CVB example in
> examples/bin/cluster_reuters.sh for the correct sequence of steps and the
> various parameters.
>
> ________________________________
> From: Charly Lizarralde <[email protected]>
> To: [email protected]
> Sent: Friday, August 23, 2013 11:45 AM
> Subject: Re: lda + vector dump
>
> I think I am doing it on the cvb output ( 1 record per topic ) so
> dictionary is used to output the topic most relevant terms....but I'll
> check!
>
> On Fri, Aug 23, 2013 at 12:37 PM, Liz Merkhofer <
> [email protected]> wrote:
>
> > Hi Charly,
> >
> > I've been playing around with cvb, too. I have a few thoughts on b,
> > vectordump:
> >
> > What are you doing vectordump on? If you're doing it on your cvb output,
> > you're getting something like a dictionary per topic, with
> > <input-word-key>:<probability-it's-in-this-cluster>. If you're doing it on
> > cvb-topics output, for each document, you're getting the likelihood that it
> > belongs to each of your topics.
> >
> > I wonder if your problem is that you read the same book I did, "Hadoop
> > MapReduce Cookbook," that advised to use vectordump with the dictionary
> > flag as your dictionary from s2s. Don't do that - that translates your
> > document or topic keys as if they were your vocab keys, and it's just
> > completely nonsensical.
> >
> > Best,
> > Liz Merkhofer
> >
> > On Fri, Aug 23, 2013 at 11:18 AM, Charly Lizarralde <
> > [email protected]> wrote:
> >
> > > Hi everyone, I am experimenting with cvb algorithm and I have a few
> > > questions....
> > >
> > > a) Is there any updated documentation? I have been collecting info from
> > > mail lists, blogs, etc. I have been writing a small beginner's tutorial, if
> > > you like I'll send it.
> > >
> > > b) Should I remove "stop-words" before building the feature vectors ? I am
> > > having some trouble "reading" the results....
> > >
> > > c) Vectordump is not sorting well...is this a reported bug ? ( I am
> > > building mahout from trunk now )
> > >
> > > d) Any considerations on performance? It took 10 hours on a 5 node cluster
> > > and I've set 20 iterations on less than 10.000 docs and it took
> > >
> > > Thanks!
> > > Charly
