On Fri, Aug 23, 2013 at 10:49 AM, Corey Hyllested <[email protected]> wrote:
> Charly,
>
> The documents I've used for CVB are quite noisy; removing stopwords is
> key.
>
> To remove the stopwords, seq2sparse provides a few options to choose
> what the stopwords are:
>   --minDF
>   --maxDFPercent
>
> You can write your own sequence-file parser, which will give you more
> control, but that's a lot more work.
>
> I believe removing extra "stopwords" has performance implications in
> terms of memory utilization of the CVB algorithm. It can mean a
> drastic reduction in memory allocated.

Actually, the thing which reduces memory overhead is the reverse of the
stopwords: the tail of the language model, those terms which occur only
1-5 times. "--minDF 5" will chop off those rare terms; "--maxDFPercent
90" will chop out those pesky stop words which occur in 90+% of your
documents. There's a sketch of a full seq2sparse invocation below.

> As for using vectordump: some of the output files are more easily read
> with 'mahout seqdumper'.

Use vectordump for vector sequence files, as it knows how to join
together your numeric vectors with your textual dictionary. But yes,
seqdumper is more general purpose and will print out the string form of
any sequence file you have handy. Both dump commands are sketched below
as well.
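For concreteness, a minimal sketch of such a seq2sparse run, assuming a
Mahout 0.8-era CLI on the PATH. The paths (reuters-seqdir,
reuters-sparse) are placeholders, and option spellings can differ
slightly between Mahout versions:

    # Turn SequenceFile<Text,Text> documents into sparse term-frequency
    # vectors, trimming both ends of the document-frequency spectrum:
    #   --minDF 5          drop terms in fewer than 5 docs (the rare tail,
    #                      which is what actually shrinks CVB's memory use)
    #   --maxDFPercent 90  drop terms in 90+% of docs (the stop words)
    # -wt tf because CVB wants raw term counts, not tf-idf weights.
    mahout seq2sparse \
      -i reuters-seqdir \
      -o reuters-sparse \
      --minDF 5 \
      --maxDFPercent 90 \
      -wt tf \
      --namedVector

The output directory then holds the tf-vectors to feed into CVB, along
with dictionary.file-0, which the vectordump step needs.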
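And a sketch of the two dump commands. The CVB output paths
(reuters-cvb-topics for the topic-term model, reuters-cvb-doctopics for
the per-document topic distributions) are placeholders, and the
--sortVectors/--vectorSize options may be spelled differently in your
Mahout version (check mahout vectordump --help):

    # vectordump joins each vector's numeric indices against the
    # seq2sparse dictionary - only sensible for term-indexed vectors
    # like the topic-term model, not for doc-topic output (see Liz's
    # warning in the quoted thread).
    mahout vectordump \
      -i reuters-cvb-topics \
      -d reuters-sparse/dictionary.file-0 \
      -dt sequencefile \
      --sortVectors \
      --vectorSize 10 \
      -o topics.txt

    # seqdumper is general purpose: it prints the string form of any
    # sequence file, e.g. the per-document topic distributions, whose
    # keys are document ids (the dictionary would be meaningless there).
    mahout seqdumper -i reuters-cvb-doctopics -o doc-topics.txt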
> - Corey
>
> On Fri, Aug 23, 2013 at 10:24 AM, Suneel Marthi <[email protected]> wrote:
>
> > Charly,
> >
> > If the documentation isn't clear, I would look at the CVB example in
> > examples/bin/cluster_reuters.sh for the correct sequence of steps
> > and the various parameters.
> >
> > ________________________________
> > From: Charly Lizarralde <[email protected]>
> > To: [email protected]
> > Sent: Friday, August 23, 2013 11:45 AM
> > Subject: Re: lda + vector dump
> >
> > I think I am doing it on the cvb output (one record per topic), so
> > the dictionary is used to output each topic's most relevant
> > terms... but I'll check!
> >
> > On Fri, Aug 23, 2013 at 12:37 PM, Liz Merkhofer
> > <[email protected]> wrote:
> >
> > > Hi Charly,
> > >
> > > I've been playing around with cvb, too. I have a few thoughts on
> > > (b), vectordump:
> > >
> > > What are you running vectordump on? If you're running it on your
> > > cvb output, you're getting something like a dictionary per topic,
> > > with <input-word-key>:<probability-it's-in-this-cluster>. If
> > > you're running it on the cvb doc-topics output, you're getting,
> > > for each document, the likelihood that it belongs to each of your
> > > topics.
> > >
> > > I wonder if your problem is that you read the same book I did,
> > > "Hadoop MapReduce Cookbook," which advised using vectordump with
> > > the dictionary flag pointed at your dictionary from seq2sparse.
> > > Don't do that - it translates your document or topic keys as if
> > > they were your vocab keys, and it's just completely nonsensical.
> > >
> > > Best,
> > > Liz Merkhofer
> > >
> > > On Fri, Aug 23, 2013 at 11:18 AM, Charly Lizarralde
> > > <[email protected]> wrote:
> > >
> > > > Hi everyone, I am experimenting with the cvb algorithm and I
> > > > have a few questions...
> > > >
> > > > a) Is there any updated documentation? I have been collecting
> > > > info from mailing lists, blogs, etc. I have been writing a small
> > > > beginners' tutorial; if you like, I'll send it.
> > > >
> > > > b) Should I remove "stop-words" before building the feature
> > > > vectors? I am having some trouble "reading" the results...
> > > >
> > > > c) Vectordump is not sorting well... is this a reported bug? (I
> > > > am building mahout from trunk now.)
> > > >
> > > > d) Any considerations on performance? It took 10 hours on a
> > > > 5-node cluster with 20 iterations on fewer than 10,000 docs.
> > > >
> > > > Thanks!
> > > > Charly

--
-jake
