Re: lda + vector dump

Corey Hyllested Fri, 23 Aug 2013 10:49:53 -0700

Charly,

The documents I've used for CVB are quite noisy, so removing stopwords is key.



To remove the stopwords, seq2sparse provides a few options that determine
which terms are treated as stopwords:
--minDF
--maxDFPercent
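
As a rough illustration of what document-frequency pruning along the lines of
--minDF and --maxDFPercent does, here is a minimal sketch in plain Python. It
is not Mahout's implementation: the function name, the sample documents, and
the exact boundary handling (>= for the minimum, <= for the maximum) are
assumptions made for the example.

```python
from collections import Counter


def prune_by_document_frequency(docs, min_df=1, max_df_percent=100):
    """Drop terms that appear in too few or too many documents.

    docs is a list of token lists. A term is kept when its document
    frequency is at least min_df and at most max_df_percent percent of
    all documents (boundary handling is an assumption of this sketch).
    """
    n_docs = len(docs)
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # count each term once per document

    keep = {
        term
        for term, count in df.items()
        if count >= min_df and 100.0 * count / n_docs <= max_df_percent
    }
    pruned = [[t for t in tokens if t in keep] for tokens in docs]
    return pruned, keep


# Hypothetical toy corpus: "the" occurs in every document.
docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the bird sang".split(),
    "the dog barked at the bird".split(),
]
pruned, vocab = prune_by_document_frequency(docs, min_df=2, max_df_percent=75)
print(sorted(vocab))
```

With these toy settings, "the" is dropped for being in every document and the
one-off words are dropped for being too rare, which is the effect you want on
a noisy corpus.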

You can write your own sequenceFile parser which will give you more
control, but that's a lot more work.

I believe removing extra "stopwords" has performance implications in terms
of memory utilization of the CVB algorithm.  It can mean a drastic
reduction in memory allocated.
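
A back-of-the-envelope sketch of why vocabulary size matters, assuming the
model is held as a dense topics-by-terms matrix of 8-byte doubles (that layout
is an assumption here, and the topic and vocabulary counts are made-up
numbers, not measurements):

```python
def topic_term_matrix_bytes(num_topics, vocab_size, bytes_per_value=8):
    """Estimate the size of a dense topics-by-terms matrix in bytes."""
    return num_topics * vocab_size * bytes_per_value


# Hypothetical figures: 100 topics, vocabulary pruned from 500k to 50k terms.
before = topic_term_matrix_bytes(100, 500_000)
after = topic_term_matrix_bytes(100, 50_000)

print(f"before pruning: {before / 2**20:.1f} MiB")
print(f"after pruning:  {after / 2**20:.1f} MiB")
```

Under that assumption the estimate scales linearly with vocabulary size, so
cutting the vocabulary tenfold cuts the matrix tenfold as well.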

As for using vectordump: some of the output files are more easily read
with 'mahout seqdumper'.



- Corey


On Fri, Aug 23, 2013 at 10:24 AM, Suneel Marthi <[email protected]> wrote:

> Charly,
>
> If the documentation isn't clear, I would look at the CVB example in
> examples/bin/cluster_reuters.sh for the correct sequence of steps and the
> various parameters.
>
>
>
>
> ________________________________
>  From: Charly Lizarralde <[email protected]>
> To: [email protected]
> Sent: Friday, August 23, 2013 11:45 AM
> Subject: Re: lda + vector dump
>
>
> I think I am doing it on the cvb output ( 1 record per topic ) so
> dictionary is used to output the topic most relevant terms....but I'll
> check!
>
>
> On Fri, Aug 23, 2013 at 12:37 PM, Liz Merkhofer <
> [email protected]> wrote:
>
> > Hi Charly,
> >
> > I've been playing around with cvb, too. I have a few thoughts on b,
> > vectordump:
> >
> > What are you doing vectordump on? If you're doing it on your cvb output,
> > you're getting something like a dictionary per topic, with
> > <input-word-key>:<probability-it's-in-this-cluster>. If you're doing it on
> > cvb-topics output, for each document, you're getting the likelihood that it
> > belongs to each of your topics.
> >
> > I wonder if your problem is that you read the same book I did, "Hadoop
> > MapReduce Cookbook," that advised to use vectordump with the dictionary
> > flag as your dictionary from s2s. Don't do that - that translates your
> > document or topic keys as if they were your vocab keys, and it's just
> > completely nonsensical.
> >
> > Best,
> > Liz Merkhofer
> >
> >
> >
> > On Fri, Aug 23, 2013 at 11:18 AM, Charly Lizarralde <
> > [email protected]> wrote:
> >
> > > Hi everyone, I am experimenting with cvb algorithm and I have a few
> > > questions....
> > >
> > > a) Is there any updated documentation? I have been collecting info from
> > > mail lists, blogs, etc. I have been writing a small beginner's tutorial; if
> > > you like I'll send it.
> > >
> > > b) Should I remove "stop-words" before building the feature vectors? I am
> > > having some trouble "reading" the results....
> > >
> > > c) Vectordump is not sorting well...is this a reported bug ? ( I am
> > > building mahout from trunk now )
> > >
> > > d) Any considerations on performance? It took 10 hours on a 5 node cluster
> > > and I've set 20 iterations on less than 10.000 docs and it took
> > >
> > > Thanks!
> > > Charly
> > >
> >
>
