Thanks for all the feedback!  You've been a big help.

On Tue, Jan 24, 2012 at 4:04 PM, Jake Mannix <[email protected]> wrote:

> On Tue, Jan 24, 2012 at 3:41 PM, John Conwell <[email protected]> wrote:
>
> > Hey Jake,
> > Thanks for the tips.  That will definitely help.
> >
> > One more question, do you know if the topic model quality will be
> > affected by the document length?
>
>
> Yes, very much so.
>
>
> >  I'm thinking lengths ranging from tweets (~20 words),
>
>
> Tweets suck.  Trust me on this. ;)
>
>
> > to emails (hundreds of words),
>
>
> Fantastic size.
>
>
> > to whitepapers (thousands of words)
> >
>
> Can be pretty great too.
>
>
> > to books (boat loads of words).
>
>
> This is too long.  There will be tons and tons of topics in a book, often.
> But, frankly, I have not tried with huge documents personally, so I can't
> say from experience that it won't work.  I'd just not be terribly surprised
> if it didn't work well at all.  If I had a bunch of books I wanted to run
> LDA
> on, I'd maybe treat each page or each chapter as a separate document.
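
A rough sketch of that per-chapter split, assuming plain-text books whose
chapters begin with a heading line like "Chapter ..." (the pattern is just a
guess and would need to match the real layout):

  # split at each chapter heading into xx00, xx01, ...
  # (xx00 holds any front matter before the first chapter)
  csplit book.txt '/^Chapter /' '{*}'

The resulting per-chapter files could then go through the usual Mahout
seqdirectory step to become a SequenceFile corpus.
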
>
>  -jake
>
> > What lengths'ish would degrade topic model quality?
> >
> > I would think tweets would kind'a suck, but what about longer docs?
> > Should they be segmented into sub-documents?
> >
> > Thanks,
> > JohnC
> >
> >
> > On Tue, Jan 24, 2012 at 12:33 PM, Jake Mannix <[email protected]>
> > wrote:
> >
> > > Hi John,
> > >
> > >  I'm not an expert in the field, but I have done a bit of work building
> > > topic
> > > models with LDA, and here are some of the "tricks" I've used:
> > >
> > >  1) yes, remove stop words; in fact, remove all words occurring in
> > > more than (say) half (or, more conservatively, 90%) of your documents,
> > > as they'll be noise and just dominate your topics.
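
For what it's worth, that document-frequency pruning can be done while
building the vectors; a minimal sketch using seq2sparse, assuming the usual
trunk flag names (worth verifying against seq2sparse --help):

  # drop terms appearing in more than 50% of documents
  # (90 would be the more conservative cutoff)
  $MAHOUT_HOME/bin/mahout seq2sparse \
    -i corpus-seqfiles -o corpus-vectors \
    --maxDFPercent 50
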
> > >
> > >  2) more features are better, if you have the memory for it (note
> > > that Mahout's LDA currently holds numTopics * numFeatures in memory in
> > > the mapper tasks, which means you are usually bounded to a few hundred
> > > thousand features, maybe up as high as a million, currently).  So
> > > don't stem, and throw in commonly occurring (or more importantly:
> > > high log-likelihood) bigrams and trigrams as independent features.
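
The bigram/trigram suggestion also maps onto seq2sparse options, if I have
the flag names right; the LLR cutoff of 50 is only a starting guess:

  # keep unigrams plus bigrams/trigrams whose log-likelihood ratio
  # clears the cutoff
  $MAHOUT_HOME/bin/mahout seq2sparse \
    -i corpus-seqfiles -o corpus-vectors \
    --maxNGramSize 3 --minLLR 50
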
> > >
> > >  3) violate the underlying assumption of LDA, that you're talking
> > > about "token occurrences", and weight your vectors not as "tf" but as
> > > "tf*idf", which makes rarer features more prominent and ends up making
> > > your topics look a lot nicer.
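
And the tf*idf weighting is likewise a seq2sparse switch, assuming I
remember the option correctly (in practice this and the pruning/n-gram flags
above would all go on a single seq2sparse run):

  # weight features as tf*idf rather than raw term frequency
  $MAHOUT_HOME/bin/mahout seq2sparse \
    -i corpus-seqfiles -o corpus-vectors \
    -wt tfidf
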
> > >
> > > Those are the main tricks I can think of right now.
> > >
> > > If you're using Mahout trunk, try the new LDA impl:
> > >
> > >  $MAHOUT_HOME/bin/mahout cvb0 --help
> > >
> > > It operates on the same kind of input as the last one (i.e. a corpus
> > > which is a SequenceFile<IntWritable, VectorWritable>).
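
One wrinkle, as far as I understand it: seq2sparse writes Text-keyed
vectors, and the rowid job is what turns them into the IntWritable-keyed
matrix cvb0 wants (the tf-vectors path below is just the usual seq2sparse
output layout, so double-check against your own directories):

  # convert <Text, VectorWritable> vectors to <IntWritable, VectorWritable>
  $MAHOUT_HOME/bin/mahout rowid \
    -i corpus-vectors/tf-vectors -o corpus-matrix
  # corpus-matrix/matrix is then the input to cvb0
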
> > >
> > >  -jake
> > >
> > > On Tue, Jan 24, 2012 at 12:14 PM, John Conwell <[email protected]>
> > > wrote:
> > >
> > > > I'm trying to find out if there are any standard best practices for
> > > > document tokenization when prepping your data for LDA in order to
> > > > get a higher quality topic model, and to understand how the feature
> > > > space affects topic model quality.
> > > >
> > > > For example, will the topic model be "better" if there is a richer
> > > > feature space by not stemming terms, or is it better to have a more
> > > > normalized feature space by applying stemming?
> > > >
> > > > Is it better to filter out stop words, or keep them in?
> > > >
> > > > Is it better to include bigrams and/or trigrams of highly correlated
> > > > terms in the feature space?
> > > >
> > > > In essence, what characteristics of the feature space that LDA uses
> > > > for input will create a higher quality topic model?
> > > >
> > > > Thanks,
> > > > JohnC
> > > >
> > >
> >
> >
> >
> > --
> >
> > Thanks,
> > John C
> >
>



-- 

Thanks,
John C
