On Thu, Jan 26, 2012 at 4:37 PM, John Conwell <[email protected]> wrote:
> One more question.  What about vector normalization when you vectorize your
> documents?  Would this help with topic model quality?

No, unless you have reason to feel that document length is definitely *not*
an indicator of how much topical information is being provided.  So if
you're building topic models off of webpages, and a page has only 20 words
on it, do you *want* it to have the same impact on the overall topic model
as a big page with 2000 words on it?  Maybe you do, if you've got a good
reason, but I can't think of a domain-independent reason to do that.

> > On Tue, Jan 24, 2012 at 4:11 PM, John Conwell <[email protected]> wrote:
> >
> > > Thanks for all the feedback!  You've been a big help.
> > >
> > > On Tue, Jan 24, 2012 at 4:04 PM, Jake Mannix <[email protected]> wrote:
> > >
> >> On Tue, Jan 24, 2012 at 3:41 PM, John Conwell <[email protected]> wrote:
> >>
> >> > Hey Jake,
> >> > Thanks for the tips.  That will definitely help.
> >> >
> >> > One more question, do you know if the topic model quality will be
> >> > affected by the document length?
> >>
> >> Yes, very much so.
> >>
> >> > I'm thinking lengths ranging from tweets (~20 words),
> >>
> >> Tweets suck.  Trust me on this. ;)
> >>
> >> > to emails (hundreds of words),
> >>
> >> Fantastic size.
> >>
> >> > to whitepapers (thousands of words)
> >>
> >> Can be pretty great too.
> >>
> >> > to books (boat loads of words).
> >>
> >> This is too long.  There will be tons and tons of topics in a book,
> >> often.  But, frankly, I have not tried with huge documents personally,
> >> so I can't say from experience that it won't work.  I'd just not be
> >> terribly surprised if it didn't work well at all.  If I had a bunch of
> >> books I wanted to run LDA on, I'd maybe treat each page or each chapter
> >> as a separate document.
> >>
> >>   -jake
> >>
> >> > What lengths'ish would degrade topic model quality?
> >> >
> >> > I would think tweets would kind'a suck, but what about longer docs?
> >> > Should they be segmented into sub-documents?
> >> >
> >> > Thanks,
> >> > JohnC
> >> >
> >> > On Tue, Jan 24, 2012 at 12:33 PM, Jake Mannix <[email protected]>
> >> > wrote:
> >> >
> >> > > Hi John,
> >> > >
> >> > > I'm not an expert in the field, but I have done a bit of work
> >> > > building topic models with LDA, and here are some of the "tricks"
> >> > > I've used:
> >> > >
> >> > > 1) Yes, remove stop words; in fact, remove all words occurring in
> >> > > more than (say) half (or, more conservatively, 90%) of your
> >> > > documents, as they'll be noise and just dominate your topics.
> >> > >
> >> > > 2) More features is better, if you have the memory for it (note
> >> > > that Mahout's LDA currently holds numTopics * numFeatures in memory
> >> > > in the mapper tasks, which means that you are usually bounded to a
> >> > > few hundred thousand features, maybe up as high as a million,
> >> > > currently).  So don't stem, and throw in commonly occurring (or,
> >> > > more importantly, high log-likelihood) bigrams and trigrams as
> >> > > independent features.
> >> > >
> >> > > 3) Violate the underlying assumption of LDA, that you're talking
> >> > > about "token occurrences", and weight your vectors not as "tf" but
> >> > > as "tf*idf", which makes rarer features more prominent, which ends
> >> > > up making your topics look a lot nicer.
> >> > >
> >> > > Those are the main tricks I can think of right now.
> >> > >
> >> > > If you're using Mahout trunk, try the new LDA impl:
> >> > >
> >> > >   $MAHOUT_HOME/bin/mahout cvb0 --help
> >> > >
> >> > > It operates on the same kind of input as the last one (i.e. a corpus
> >> > > which is a SequenceFile<IntWritable, VectorWritable>).
> >> > >
> >> > >   -jake
> >> > >
> >> > > On Tue, Jan 24, 2012 at 12:14 PM, John Conwell <[email protected]>
> >> > > wrote:
> >> > >
> >> > > > I'm trying to find out if there are any standard best practices
> >> > > > for document tokenization when prepping your data for LDA in
> >> > > > order to get a higher quality topic model, and to understand how
> >> > > > the feature space affects topic model quality.
> >> > > >
> >> > > > For example, will the topic model be "better" if there is a
> >> > > > richer feature space by not stemming terms, or is it better to
> >> > > > have a more normalized feature space by applying stemming?
> >> > > >
> >> > > > Is it better to filter out stop words, or keep them in?
> >> > > >
> >> > > > Is it better to include bi- and/or tri-grams of highly correlated
> >> > > > terms in the feature space?
> >> > > >
> >> > > > In essence, what characteristics of the feature space that LDA
> >> > > > uses for input will create a higher quality topic model?
> >> > > >
> >> > > > Thanks,
> >> > > > JohnC
> >> >
> >> > --
> >> >
> >> > Thanks,
> >> > John C
> >
> > --
> >
> > Thanks,
> > John C
>
> --
>
> Thanks,
> John C
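As a rough sketch of how tricks (1)-(3) above map onto Mahout's seq2sparse
vectorizer: assuming the raw corpus has already been converted to a
SequenceFile<Text, Text> under a hypothetical corpus-seq directory, something
along these lines should produce tf-idf vectors with no stemming, n-grams up
to trigrams, and high-document-frequency terms pruned (the directory names
and threshold values here are illustrative, not taken from the thread):

  # Tokenize with the default (non-stemming) Lucene analyzer, keep bigrams
  # and trigrams that pass a log-likelihood ratio test, drop any term that
  # appears in more than half of the documents, and weight by tf*idf.
  $MAHOUT_HOME/bin/mahout seq2sparse \
    -i corpus-seq \
    -o corpus-vectors \
    --weight tfidf \
    --maxDFPercent 50 \
    --maxNGramSize 3 \
    --minLLR 50 \
    --namedVector

Leaving the --norm option unset skips length normalization of the document
vectors, which matches the webpage-length reasoning at the top of the thread;
set it only if short and long documents really should carry equal weight.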

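And since seq2sparse keys its vectors by document name (Text), one way to get
the SequenceFile<IntWritable, VectorWritable> that cvb0 expects is Mahout's
rowid job before running the new LDA impl; the paths continue the
hypothetical corpus-vectors layout above, the topic and iteration counts are
arbitrary, and the flag names are from memory, so check cvb0 --help for the
authoritative list:

  # Re-key the tf-idf vectors from Text document names to sequential ints;
  # rowid writes a 'matrix' (IntWritable -> VectorWritable) plus a 'docIndex'
  # mapping those ints back to the original document names.
  $MAHOUT_HOME/bin/mahout rowid \
    -i corpus-vectors/tfidf-vectors \
    -o corpus-vectors/matrix

  # Train 20 topics for 20 iterations, writing topic-term distributions to
  # lda-topics and document-topic distributions to lda-doc-topics.
  $MAHOUT_HOME/bin/mahout cvb0 \
    -i corpus-vectors/matrix/matrix \
    -dict corpus-vectors/dictionary.file-0 \
    -o lda-topics \
    -dt lda-doc-topics \
    -k 20 \
    -x 20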